QuasiAlign: Position Sensitive P-Mer Frequency Clustering with Applications to Classification and Differentiation

Recent advances in metagenomics and human microbiome research have created a complex landscape in which many genomes must be handled at once. One of the main challenges in this field is classifying the genomes present in a sample. Effective metagenomic classification and diversity analysis require complex representations of taxa. In this project we develop a suite of tools, based on novel quasi-alignment techniques, that will be applied to environmental metagenomics samples as well as human microbiome samples. Methods that classify organisms rapidly on a laptop computer, instead of on several multi-processor servers, will facilitate the development of fast and inexpensive devices for microbiome-based health screening in the near future. Quasi-alignment for genetic sequences was first studied by R. Kotamarti in the EMMSA project and uses the high-throughput stream mining ideas of TRACDS.
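As a rough illustration of what "position sensitive p-mer frequency" in the project title could mean in practice, the sketch below splits a sequence into consecutive segments and counts p-mers within each segment, so that the resulting vectors keep their order along the sequence. The function name `pmer_profiles` and the defaults `p=3` and `segment_length=100` are assumptions made for this example; they are not taken from the project's software, and the clustering step itself is not shown.

```python
from collections import Counter
from itertools import product


def pmer_profiles(sequence, p=3, segment_length=100):
    """Count p-mers separately in each consecutive segment of a DNA sequence.

    Hypothetical helper for illustration only; the parameter defaults
    are assumptions, not values confirmed by the project.
    Returns (pmers, profiles): the ordered list of all 4**p p-mers and
    one count vector per segment, in segment order.
    """
    alphabet = "ACGT"
    pmers = ["".join(t) for t in product(alphabet, repeat=p)]
    profiles = []
    for start in range(0, len(sequence), segment_length):
        segment = sequence[start:start + segment_length].upper()
        # Slide a window of width p over the segment; windows that contain
        # symbols outside A/C/G/T are counted but never read back below.
        counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
        profiles.append([counts[m] for m in pmers])
    return pmers, profiles


if __name__ == "__main__":
    pmers, profiles = pmer_profiles("ACGTACGTTTGA", p=2, segment_length=4)
    for index, vector in enumerate(profiles):
        present = {m: c for m, c in zip(pmers, vector) if c}
        print(f"segment {index}: {present}")
```

Keeping one vector per segment, rather than a single count vector for the whole sequence, is what preserves positional information here; a downstream step could then cluster or compare the vectors in order.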

Team

Software

Related Publications

  1. Anurag Nagar and Michael Hahsler. Genomic sequence fragment identification using quasi-alignment. In Proceedings of the ACM BCB Conference 2013, Washington D.C., September 2013.
  2. Anurag Nagar and Michael Hahsler. Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment. BMC Bioinformatics, 14(Suppl. 11), 2013.
  3. Anurag Nagar and Michael Hahsler. A novel quasi-alignment-based method for discovering conserved regions in genetic sequences. In Proceedings from the IEEE BIBM 2012 Workshop on Data-Mining of Next-Generation Sequencing, October 2012.
  4. Maya El Dayeh and Michael Hahsler. Biological pathway completion using network motifs and random walks on graphs. In IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2012), pages 229-236. IEEE, May 2012.
  5. Maya El Dayeh and Michael Hahsler. Analyzing incomplete biological pathways using network motifs. In 27th Symposium On Applied Computing (SAC 2012), volume 2, pages 1355-1360. ACM, 2012.
  6. Michael Hahsler and Margaret H. Dunham. Temporal structure learning for clustering massive data streams in real-time. In SIAM Conference on Data Mining (SDM11). SIAM, 2011.
  7. R.M. Kotamarti, M. Hahsler, D.W. Raiford, M. McGee, and M.H. Dunham. Analyzing classification using extensible Markov models. Bioinformatics, 26(18):2235-2241, 2010.
  8. Michael Hahsler and Margaret H. Dunham. rEMM: Extensible Markov model for data stream clustering in R. Journal of Statistical Software, 35(5):1-31, 2010.

Acknowledgement of Support

This research is supported by research grant no. R21HG005912 from the National Human Genome Research Institute (NHGRI / NIH).

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the supporting organizations.
