Abstract: |
Quorum Planted Motif Search (qPMS) is a specialized variant of PMS that reports a motif only if it appears in at least q% of the input sequences. Designing qPMS models is a multidomain task that involves collecting application-specific datasets, pre-processing these datasets to identify frequent patterns, matching these patterns, and contextual post-processing. Because the sequences are long, the search process is highly complex and requires dataset-specific optimizations. To perform these optimizations, researchers have developed a wide variety of tools, which vary in their qualitative and quantitative characteristics. Most of these models are non-reconfigurable and can be used only for specific datasets, while others use highly complex search mechanisms, which limits their applicability. To overcome these limitations, this text proposes the design of a Map Reduce model for solving Quorum Planted Motif Search in high-speed deployments. The proposed model initially stores input genomic sequences via a Map Reduce framework, which speeds up search through unique entity-level keys for different sequence types. These keys are stored via the Apache Hadoop framework, which improves search performance on large datasets. Owing to the use of Map Reduce, the model offers higher scalability, better flexibility, low delay, and security via parallel processing operations. This is made possible by pre-processing the input DNA sequences and reducing them into index-based searchable formats. The model also deploys a Genetic Algorithm (GA) to identify optimum Q values for enhanced accuracy under different use cases. It was tested on protein and DNA sequences, and its performance was evaluated in terms of accuracy, retrieval delay, precision, and throughput, and compared with various state-of-the-art models under different use case scenarios. 
Based on this comparison, it was observed that the proposed model achieved 3.5% higher accuracy, 9.4% lower delay, 2.9% higher precision, and 8.5% higher throughput under different scenarios. Owing to these advantages, the proposed model can be deployed for a wide variety of real-time use cases. |
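
To make the quorum condition concrete, the following is a minimal in-memory sketch of a map/reduce-style quorum motif search. It is not the Hadoop implementation described in the abstract: the function names (`neighbors`, `map_phase`, `reduce_phase`, `qpms`), the Hamming-distance motif model with parameters `l` and `d`, and the DNA alphabet are assumptions chosen for illustration, and the GA-based tuning of the quorum value is omitted. The map step emits each candidate motif once per sequence, and the reduce step keeps candidates seen in at least a fraction `q` of the sequences.

```python
from collections import defaultdict
from itertools import combinations, product
from math import ceil

ALPHABET = "ACGT"  # assumed DNA alphabet; a protein alphabet would also work


def neighbors(kmer, d):
    """Return every string within Hamming distance d of kmer."""
    result = {kmer}
    for dist in range(1, d + 1):
        for positions in combinations(range(len(kmer)), dist):
            for letters in product(ALPHABET, repeat=dist):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def map_phase(seq_id, sequence, l, d):
    """Emit (candidate_motif, seq_id) pairs, one per candidate per sequence."""
    seen = set()
    for i in range(len(sequence) - l + 1):
        seen |= neighbors(sequence[i:i + l], d)
    return [(motif, seq_id) for motif in seen]


def reduce_phase(pairs, n_sequences, q):
    """Keep motifs that occur in at least a fraction q of the sequences."""
    by_motif = defaultdict(set)
    for motif, seq_id in pairs:
        by_motif[motif].add(seq_id)
    quorum = ceil(q * n_sequences)
    return {motif for motif, ids in by_motif.items() if len(ids) >= quorum}


def qpms(sequences, l, d, q):
    """Run the map step over every sequence, then a single reduce step."""
    pairs = []
    for seq_id, sequence in enumerate(sequences):
        pairs.extend(map_phase(seq_id, sequence, l, d))
    return reduce_phase(pairs, len(sequences), q)


if __name__ == "__main__":
    toy = ["ACGTAC", "TTACGA", "GGGGGG", "CACGTT"]
    print(sorted(qpms(toy, l=3, d=0, q=0.75)))
```

In a distributed setting the candidate motif would serve as the shuffle key, so each reducer sees all sequence identifiers for one candidate; the sketch mimics that grouping with a dictionary. The neighborhood enumeration grows quickly with `l` and `d`, so this brute-force form is only suitable for small illustrative inputs.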