dc.description.abstract |
A-mer counting is the process of counting k length substrings in a sequence. It is an important step in many bioinformatics applications including genome assembly, sequence error correction, and sequence alignment. Even though generating A-mer histograms seems simple and straightforward, processing large datasets efficiently with limited resources, especially memory, is very challenging. As the advancements in next-generation sequencing technologies have resulted in a tremendous growth of genomic data, it is inevitable for /r-mer counters to be faster and more efficient. A lot of work has been done in the past decade to optimize A-mer counting. Frigate, a fast and efficient tool capable of counting and querying A-mers is presented. Its inmemory design utilizes multithreaded, lock-free data structures to improve performance. Thread synchronization is handled using the compare-and-swap technique. The parallel processing pipeline of Frigate is the result of careful performance engineering and design. Frigate was developed with the emphasis on values of k less than 20, aiming to maximize performance by employing different algorithms for different ranges of k values. The performance of Frigate was compared with six state-of-the-art A-mer counters: Jellyfish, DSK, Gerbil, CHTKC, KMC2, and KMC3, using two real-world datasets. The experiments were carried out for k values of 10, 15, and 17 using a different number of threads in the range [1, 32]. The results show that Frigate achieves a comparable performance or up to 2-3x speedup compared to its competitors, especially for large datasets. The A-mer counters were analyzed based on the running time, amount of memory used, and scalability. The correctness of Frigate was evaluated by comparing the A-mer frequency histogram with those of other A-mer counters. Frigate is written in C and freely available at https: github.com Gunavaran, frigate under MIT license. |
en_US |