Progress of Final Project, CS 234

Zhu, Qiang
Email: a@b.c (a=qzhu; b=cs; c=ucr.edu)

1, Chose the project: gene expression analysis

Downloaded the data file and read the requirements and the data format.
Since the computation consists mostly of matrix operations, decided to implement it in MATLAB.
Had an intuition that a better method could be used to cluster the given gene expression data. I will search for related papers and try to design an effective algorithm for it (if time allows).

2, Started to write code

Normalizing the data and computing the correlation coefficients was easy.
For building a graph over all nodes (genes), the project guide suggests using "a boolean adjacency matrix of 2882 by 2882". However:
1. This graph is sparse (e/n^2 < 0.0017).
2. To compute the degree of a node, we need to scan its whole row.
3. Likewise, to check whether a node is a peak, we need to scan its whole row.
So, taking space and time complexity into account, an adjacency matrix is not a good choice. The main alternative to the adjacency matrix is the adjacency list.
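The normalization and correlation step mentioned above can be sketched roughly as follows. This is a minimal illustration in Python, not the MATLAB code written for the project. The particular normalization (zero mean, unit length), the function names, the toy profiles, and the threshold of 0.9 are all assumptions made for the example; the project guide may define them differently.

```python
import math

def normalize(profile):
    """Shift a profile to zero mean and scale it to unit length (assumed scheme)."""
    mean = sum(profile) / len(profile)
    centered = [x - mean for x in profile]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        return [0.0] * len(profile)  # constant profile: no shape to compare
    return [x / norm for x in centered]

def correlation(a, b):
    """Pearson correlation, computed as the dot product of normalized profiles."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def build_edges(profiles, threshold):
    """Return undirected edges (i, j) with i < j whose correlation >= threshold."""
    unit = [normalize(p) for p in profiles]
    edges = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            r = sum(x * y for x, y in zip(unit[i], unit[j]))
            if r >= threshold:
                edges.append((i, j))
    return edges

# Toy data (hypothetical): three genes measured under four conditions.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same shape as gene 0, different scale
    [4.0, 3.0, 2.0, 1.0],  # opposite shape
]
edges = build_edges(profiles, threshold=0.9)  # hypothetical threshold
```

With this normalization, genes 0 and 1 have the same shape and are connected, while gene 2 is anti-correlated with both and stays isolated.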

3, Constructed a data structure called "Star"

An adjacency list is hard to implement in MATLAB (as there are no pointers). Since we never delete or add edges after constructing the graph, a data structure called "Star" (the idea is a little similar to an adjacency list) might be the best choice for this project in MATLAB.
We use two arrays. The array "edge" stores the neighbors of node i at positions point(i) to point(i+1)-1, and the array "point" stores the position in "edge" of the first neighbor of each node.
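The two-array layout described above can be sketched as follows. This is an illustrative Python version, not the project's MATLAB code; it uses 0-based indices (so the neighbors of node i sit at positions point[i] to point[i+1]-1), and the function names and the toy graph are assumptions for the example.

```python
def build_star(n, edges):
    """Build the "point" and "edge" arrays from undirected edges (i, j)."""
    counts = [0] * n
    for i, j in edges:
        counts[i] += 1
        counts[j] += 1
    # point[i] is the position in `edge` of the first neighbor of node i;
    # the extra entry point[n] marks the end of the last node's neighbors.
    point = [0] * (n + 1)
    for i in range(n):
        point[i + 1] = point[i] + counts[i]
    edge = [0] * (2 * len(edges))  # each undirected edge is stored twice
    cursor = point[:n]             # next free slot for each node
    for i, j in edges:
        edge[cursor[i]] = j
        cursor[i] += 1
        edge[cursor[j]] = i
        cursor[j] += 1
    return point, edge

def neighbors(point, edge, i):
    """Neighbors of node i: one contiguous slice of `edge`."""
    return edge[point[i]:point[i + 1]]

def degree(point, i):
    """Degree of node i: a difference of two entries of `point`."""
    return point[i + 1] - point[i]

# Toy graph (hypothetical): 4 nodes, 4 undirected edges.
point, edge = build_star(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
```

In this sketch the degree of one node is read off in constant time from "point", and the storage is n+1 entries for "point" plus 2e entries for "edge".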

Space and time complexity of the three data structures (n is the number of nodes, e is the number of edges):

                  Adjacency Matrix   Adjacency List   Star
Space             n^2                n+4e             n+2e
Time(degree)      n^2                2e               n
Time(peak)        n^2                n+2e             n+c
Time(construct)   cn^2               cn^2+c'e         cn^2+c'e

We can see that Star wins in everything except the time to construct the graph (with an adjacency matrix, we can simply set G(j,i)=G(i,j) instead of searching all edges connected to j).
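To illustrate why the peak check is cheap with Star, here is a small Python sketch (not the project's MATLAB code). The report does not spell out the definition of a peak, so the sketch assumes that a peak is a node with at least one neighbor whose degree is strictly greater than the degree of every neighbor; the arrays below encode a made-up 4-node graph.

```python
# "Star" arrays for a hypothetical 4-node graph with edges
# 0-1, 0-2, 1-2, 2-3 (0-based indices).
point = [0, 2, 4, 7, 8]
edge = [1, 2, 0, 2, 0, 1, 3, 2]

def degree(i):
    """Degree of node i, read directly from `point`."""
    return point[i + 1] - point[i]

def is_peak(i):
    """Assumed definition: degree strictly above that of every neighbor."""
    if degree(i) == 0:
        return False  # an isolated node is not treated as a peak here
    return all(degree(i) > degree(j) for j in edge[point[i]:point[i + 1]])

peaks = [i for i in range(len(point) - 1) if is_peak(i)]
```

Each check only visits the neighbors of the node in question, so no full row of an n by n matrix has to be scanned.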

4, Query and Visualization

To check the results, our program takes the name of a gene and outputs its degree and all of its neighbors. Because MATLAB's support for character strings is not very good, we adjusted the gene names to a common length so that they could be stored in a matrix.
Generated a plot of the expression profiles of each gene cluster. We used a color vector to give each gene in a plot a different color.
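The name-based query described above could look roughly like this in Python, where a dictionary replaces the fixed-length name matrix used in MATLAB. The gene names, the toy graph, and the function name query are placeholders invented for the example, not data from the project.

```python
# Hypothetical gene names and "Star" arrays for a toy graph with
# edges 0-1, 0-2, 1-2, 2-3 (0-based indices).
names = ["geneA", "geneB", "geneC", "geneD"]
point = [0, 2, 4, 7, 8]
edge = [1, 2, 0, 2, 0, 1, 3, 2]

# Map each gene name to its node index for direct lookup.
index_of = {name: i for i, name in enumerate(names)}

def query(name):
    """Return (degree, neighbor names) for the gene with the given name."""
    i = index_of[name]
    neighbor_names = [names[j] for j in edge[point[i]:point[i + 1]]]
    return len(neighbor_names), neighbor_names

deg, neighbor_names = query("geneC")
```

Looking up an unknown name raises a KeyError in this sketch; a real tool would probably want to report that more gracefully.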

5, Test

15 gene clusters were found.
Found the gene cluster with the peak YOL071W; YOL052C-A was among its 23 other genes.
Plotted the 15 gene clusters; the figure for the YOL071W cluster was the same as the one in h3cluster.doc (the sample graph).
The whole analysis took about 80 seconds on my laptop (CPU: Intel Core2 Duo T7300, 2.00GHz; RAM: 3GB; Platform: Matlab 7.0, R14; OS: WinXP SP2).

6, Further improvements and attempts

Although it is easier to code in MATLAB, its low efficiency makes the analysis time a little long. We can expect a speedup from rewriting it in another language.
The normalization step in the guide only considers the X dimension (the shape or "vibration" of a gene under the 17 conditions). We may also consider the Y dimension (the value of yeast cell cycles) if we are also interested in the gene expression levels, or a combination of the two.
We may cluster the sequences directly instead of transforming them into a graph. Another way is to define a distance or similarity between vertices of the graph based on their connectivity (each vertex is regarded as a vector in the adjacency matrix).
Use other graph clustering algorithms (e.g. the Markov clustering algorithm, the Frey and Dueck algorithm, etc.).
Use certain criteria to measure the quality of different clusterings.
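As one possible reading of the connectivity-based similarity idea in the list above, the Python sketch below treats each vertex as its 0/1 row in the adjacency matrix and compares two vertices by cosine similarity. For 0/1 vectors this reduces to the number of shared neighbors divided by the square root of the product of the two degrees. The choice of cosine similarity, the function name, and the toy graph are assumptions for illustration; the report does not commit to a particular measure.

```python
import math

def cosine_similarity(neighbors_u, neighbors_v):
    """Cosine similarity of two 0/1 adjacency rows, given as neighbor lists.

    Assumes each list contains no duplicate entries.
    """
    if not neighbors_u or not neighbors_v:
        return 0.0  # an isolated vertex has an all-zero row
    shared = len(set(neighbors_u) & set(neighbors_v))
    return shared / math.sqrt(len(neighbors_u) * len(neighbors_v))

# Toy graph (hypothetical) with edges 0-1, 0-2, 1-2, 2-3.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

sim_01 = cosine_similarity(adjacency[0], adjacency[1])
sim_03 = cosine_similarity(adjacency[0], adjacency[3])
```

Vertices that share many neighbors score close to 1 even if they are not directly connected, which is the kind of information a direct edge test cannot provide.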

(last updated on Mar 18th, 2008)
