Two Species Coherent Subset

Our Two Species Coherent Subset algorithm is an extension of the Consensus Tight Clustering algorithm. Its goal is to find the genes that are clustered together in both species. It first identifies tight clustering using the Consensus Tight Clustering algorithm; and then extracts the othologous genes for each tight cluster in the second species. Instead of performing another round of tight clustering because that condition might be too strong, the algorithm constructs a distance matrix of these genes, where distance between two genes is defined as Eucledian distance. And similarly a distance graph is constructed and the algorithm will find the cliques from this graph.

 

Parameters Description

exprFile: microarray expression data for the non-reference species

annotFile: annotation file for the non-reference species, skipped if expression file already has the annotations

refClusterFile: the clusteirng result on the reference species expression data

orthologFile: the same as before

outputFile: output file containing the comparative clustering results

CVThr: the same as before, applied to non-reference species

maxExpr: the same as before, applied to non-reference species

sampleSize: the same as before

percentile: the same as before

 

 

Consensus Tight Clustering

The goal of our Consensus Tight Clustering algorithm is to find "tight" clusters: the genes whose expression patterns are particularly similar to each other. This is based on "resampling" of data. Suppose we have a basic clustering algorithm, say K-means algorithm, the idea is that if two genes belong to a tight cluster, then this basic algorithm will tend to assign these two genes to the same cluster if run on different subsets of data. Specifically, it has two stages: in the first stage, the algorithm runs K-means for a certain number of times and for multiple values of K (both are program parameters). In each run, a different subset of data is chosen for analysis: a certain ratio of genes will be randomly chosen from the origianl data set with replacement. During this stage, the algorithm records, for any pair of genes, how often they are assigned to the same cluster. This frequency is called the "comembership" between two genes. In the second stage, the algorithm constructs the comembership graph from the comembership matrix using a user-specified threshold and then performs search of cliques. It recursively finds the maximum cliques using a simple greedy heuristic [1]. The output of the algorithm is these cliques.

 

Reference:

[1] On the Performance of Polynomial-time CLIQUE Approximation Algorithms on Very Large Graphs, http://www.cs.bu.edu/techreports/pdf/1994-001-maxclique.pdf