Name: Waleed Amjad

Contact Information: wamja001@ucr.edu

Course: CS 234 (Professor Stefano Lonardi)

Research Interests: Databases, Information Retrieval, Text/Data Mining, and Machine Learning

Project Selected:

Metagenomics binning

Language Selected for Implementation:

Java and MATLAB

Progress (as of March 10 2016)

Finished implementing the shuffling (mixing) of the n reads before they are passed to the clustering component.

Completed the implementation of K-means clustering and a Naive Bayes classifier.

Experiments are in progress.
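
To illustrate the K-means step mentioned above, here is a minimal sketch of Lloyd's algorithm over numeric feature vectors (for example, one k-mer count vector per read). It is not the project's actual code: the class name KMeans, the method names, the random choice of initial centroids, and the iteration cap are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Random;

class KMeans {
    /** Runs Lloyd's algorithm; returns the cluster index assigned to each point. */
    static int[] cluster(double[][] points, int k, int maxIterations, Random rng) {
        int n = points.length;
        int d = points[0].length;

        // Assumption: initial centroids are k distinct input points (requires k <= n).
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(n - i);      // partial Fisher-Yates shuffle
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            centroids[i] = points[order[i]].clone();
        }

        int[] labels = new int[n];
        Arrays.fill(labels, -1);                  // -1 means "not yet assigned"
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = nearest(points[i], centroids);
                if (best != labels[i]) changed = true;
                labels[i] = best;
            }
            if (!changed) break;                  // assignments are stable

            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][d];
            int[] sizes = new int[k];
            for (int i = 0; i < n; i++) {
                sizes[labels[i]]++;
                for (int j = 0; j < d; j++) sums[labels[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (sizes[c] == 0) continue;      // an empty cluster keeps its old centroid
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / sizes[c];
            }
        }
        return labels;
    }

    /** Index of the centroid with the smallest squared Euclidean distance to p. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

A call such as KMeans.cluster(vectors, m, 100, new Random()) would return one label per read. K-means depends on its starting centroids, so in practice several random restarts are usually compared.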

Progress (as of February 25 2016)

Finished implementing the generation of n reads of length l with a 1% sequencing error rate. Currently implementing the shuffling (mixing) of the n reads before they are passed to the clustering component.

Decided to use K-means clustering.

Collecting experimental data from GenBank to be used in the project.
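
The read generation and shuffling steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: it assumes reads start at uniformly random positions in a genome string, that every sequencing error is a substitution by a different base, and that reads from several genomes are pooled before shuffling. The names ReadSimulator, sample, substitute, and mix are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ReadSimulator {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    /** Draws n reads of length l; each base is substituted with probability errorRate. */
    static List<String> sample(String genome, int n, int l, double errorRate, Random rng) {
        if (l > genome.length()) {
            throw new IllegalArgumentException("read length exceeds genome length");
        }
        List<String> reads = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            int start = rng.nextInt(genome.length() - l + 1);  // uniform start position
            char[] read = genome.substring(start, start + l).toCharArray();
            for (int j = 0; j < l; j++) {
                if (rng.nextDouble() < errorRate) {
                    read[j] = substitute(read[j], rng);
                }
            }
            reads.add(new String(read));
        }
        return reads;
    }

    /** Returns a uniformly chosen base that differs from the original one. */
    static char substitute(char original, Random rng) {
        char replacement;
        do {
            replacement = BASES[rng.nextInt(BASES.length)];
        } while (replacement == original);
        return replacement;
    }

    /** Pools the reads of all genomes and shuffles them so the source order is lost. */
    static List<String> mix(List<List<String>> readsPerGenome, Random rng) {
        List<String> pooled = new ArrayList<>();
        for (List<String> reads : readsPerGenome) {
            pooled.addAll(reads);
        }
        Collections.shuffle(pooled, rng);
        return pooled;
    }
}
```

With errorRate set to 0.01, roughly one base in a hundred is altered. Passing a seeded Random makes a simulated data set reproducible across experiments.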

Progress (as of February 11 2016)

Selected the first approach described below (in the update of January 27 2016).

Started implementing the generation of n reads of length l with a 1% sequencing error rate.

Investigating different clustering algorithms for high-dimensional data, including K-means.

Also looking at dimensionality reduction using SVD.

Progress (as of January 27 2016)

Reading and evaluating approaches, including:

Suggestion provided as part of the project description: use the distribution of k-mers in each read, typically 4-mers. Represent the number of occurrences of each of the 256 possible 4-mers in the read as a 256-dimensional vector, then run a clustering algorithm on these vectors to decide where to assign the reads (e.g., k-means where k=m).

Machine learning for metagenomics: methods and tools (2015)

http://arxiv.org/pdf/1510.06621.pdf

MBBC: an efficient approach for metagenomic binning based on clustering (2015)

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0473-8
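
The k-mer count representation suggested in the project description can be sketched as below. This is an illustration, not the project's implementation: the class name KmerProfile, the 2-bit encoding (A=0, C=1, G=2, T=3), and the choice to skip any window containing a character other than A, C, G, or T are assumptions. Over a four-letter alphabet the vector has 4^k entries, which is 256 for k = 4.

```java
class KmerProfile {
    /** Maps a nucleotide to a 2-bit code, or -1 for any other character (e.g. N). */
    static int code(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    /** Counts all overlapping k-mers of the read; the result has 4^k entries. */
    static int[] count(String read, int k) {
        int[] counts = new int[1 << (2 * k)];
        int mask = counts.length - 1;  // keeps only the last k bases (2 bits each)
        int index = 0;                 // packed code of the current window
        int valid = 0;                 // consecutive valid bases ending at position i
        for (int i = 0; i < read.length(); i++) {
            int c = code(read.charAt(i));
            if (c < 0) {               // ambiguous base: restart the window
                valid = 0;
                index = 0;
                continue;
            }
            index = ((index << 2) | c) & mask;
            valid++;
            if (valid >= k) {
                counts[index]++;
            }
        }
        return counts;
    }
}
```

For a read of length l with no ambiguous bases, the counts sum to l - k + 1. Because that total depends on read length, dividing each vector by its sum before clustering is a common option when reads differ in length.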