QuasiAlign: Position Sensitive P-Mer Frequency Clustering with Applications to Classification and Differentiation

Recent advances in metagenomics and human microbiome research require dealing with a multitude of genomes at once. One of the many challenges in this field is classifying the genomes present in a sample. Effective metagenomic classification and diversity analysis require complex representations of taxa. We are developing a suite of tools based on novel quasi-alignment techniques, which will be applied to environmental metagenomics samples as well as human microbiome samples. Being able to classify organisms rapidly on a laptop computer, instead of on several multi-processor servers, will facilitate the development of fast and inexpensive devices for microbiome-based health screening in the near future. Quasi-alignment for genetic sequences was first studied by R. Kotamarti in the EMMSA project and uses the high-throughput stream mining ideas of TRACDS.
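To make the idea of position-sensitive p-mer frequencies in the title more concrete, the following Python sketch splits a sequence into consecutive segments and counts overlapping p-mers within each segment. It is only an illustration and not the QuasiAlign implementation: the function names, the segment length of 100, the choice p = 3, and the DNA alphabet are all assumptions made for this example. The clustering of the resulting count vectors with TRACDS is not shown.

```python
from collections import Counter
from itertools import product


def pmer_profile(segment, p=3, alphabet="ACGT"):
    """Count overlapping p-mers in one segment.

    Returns the counts in a fixed order (all p-mers over the alphabet,
    in lexicographic order) so that profiles are comparable vectors.
    P-mers containing other symbols (e.g. N) are ignored.
    """
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    return [counts["".join(k)] for k in product(alphabet, repeat=p)]


def positional_profiles(sequence, segment_length=100, p=3):
    """Split a sequence into consecutive segments and profile each one.

    Keeping one profile per segment (instead of one per sequence) is what
    makes the representation position sensitive. The segment length and p
    used here are illustrative assumptions, not values from the project.
    """
    sequence = sequence.upper()
    return [
        pmer_profile(sequence[start:start + segment_length], p)
        for start in range(0, len(sequence), segment_length)
    ]
```

For example, calling positional_profiles(seq, segment_length=100, p=3) on a DNA string seq returns one vector of 64 counts per 100-base segment, in sequence order. These ordered vectors are the kind of input a stream clustering method could then process.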

Team

CSE at SMU: Matthew Bolaños, Margaret Dunham (Co-PI), Michael Hahsler (Co-PI and team lead), Mallik Rao Kotamarti (until 2010), Anurag Nagar (lead developer)
Stat at SMU: Monnie McGee (Co-PI and team lead), Sylvia Zhu
CS at U. Montana: Doug Raiford (team lead), Russell Kaehler, Patrick Kujawa

Software

R package QuasiAlign and QuasiAlign white paper (under development)
BioTools: For comparison, we have implemented the R package BioTools, which provides an interface to sequence alignment tools (clustalw, kalign) and is based on Biostrings from Bioconductor. The packages are available for download (zip for Windows and tar.gz for Linux/OSX). Information about installing the additional software can be found in the README file.
R package rEMM implements TRACDS, which is the basis of quasi-alignment.

Related Publications

Anurag Nagar and Michael Hahsler. Genomic sequence fragment identification using quasi-alignment. In Proceedings of the ACM BCB Conference 2013, Washington D.C., September 2013.
Anurag Nagar and Michael Hahsler. Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment. BMC Bioinformatics, 14(Suppl. 11), 2013.
Anurag Nagar and Michael Hahsler. A novel quasi-alignment-based method for discovering conserved regions in genetic sequences. In Proceedings from the IEEE BIBM 2012 Workshop on Data-Mining of Next-Generation Sequencing, October 2012.
Maya El Dayeh and Michael Hahsler. Biological pathway completion using network motifs and random walks on graphs. In IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2012), pages 229-236. IEEE, May 2012.
Maya El Dayeh and Michael Hahsler. Analyzing incomplete biological pathways using network motifs. In 27th Symposium On Applied Computing (SAC 2012), volume 2, pages 1355-1360. ACM, 2012.
Michael Hahsler and Margaret H. Dunham. Temporal structure learning for clustering massive data streams in real-time. In SIAM Conference on Data Mining (SDM11). SIAM, 2011.
R.M. Kotamarti, M. Hahsler, D.W. Raiford, M. McGee, and M.H. Dunham. Analyzing classification using extensible Markov models. Bioinformatics, 26(18):2235-2241, 2010.
Michael Hahsler and Margaret H. Dunham. rEMM: Extensible Markov model for data stream clustering in R. Journal of Statistical Software, 35(5):1-31, 2010.

Acknowledgement of Support

This research is supported by research grant no. R21HG005912 from the National Human Genome Research Institute (NHGRI / NIH).

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the supporting organizations.