Two Species Coherent Subset
Our Two Species Coherent Subset algorithm is an extension of the Consensus Tight Clustering algorithm. Its goal is to find the genes that are clustered together in both species. It first identifies tight clustering using the Consensus Tight Clustering algorithm; and then extracts the othologous genes for each tight cluster in the second species. Instead of performing another round of tight clustering because that condition might be too strong, the algorithm constructs a distance matrix of these genes, where distance between two genes is defined as Eucledian distance. And similarly a distance graph is constructed and the algorithm will find the cliques from this graph.
Parameters Description
exprFile: microarray expression data for the non-reference species
annotFile: annotation file for the non-reference species, skipped if expression file already has the annotations
refClusterFile: the clusteirng result on the reference species expression data
orthologFile: the same as before
outputFile: output file containing the comparative clustering results
CVThr: the same as before, applied to non-reference species
maxExpr: the same as before, applied to non-reference species
sampleSize: the same as before
percentile: the same as before
Consensus Tight Clustering
The goal of our Consensus Tight Clustering algorithm is to find "tight" clusters: the genes whose expression patterns are particularly similar to each other. This is based on "resampling" of data. Suppose we have a basic clustering algorithm, say K-means algorithm, the idea is that if two genes belong to a tight cluster, then this basic algorithm will tend to assign these two genes to the same cluster if run on different subsets of data. Specifically, it has two stages: in the first stage, the algorithm runs K-means for a certain number of times and for multiple values of K (both are program parameters). In each run, a different subset of data is chosen for analysis: a certain ratio of genes will be randomly chosen from the origianl data set with replacement. During this stage, the algorithm records, for any pair of genes, how often they are assigned to the same cluster. This frequency is called the "comembership" between two genes. In the second stage, the algorithm constructs the comembership graph from the comembership matrix using a user-specified threshold and then performs search of cliques. It recursively finds the maximum cliques using a simple greedy heuristic [1]. The output of the algorithm is these cliques.
Reference:
[1] On the Performance of Polynomial-time CLIQUE Approximation Algorithms on Very Large Graphs, http://www.cs.bu.edu/techreports/pdf/1994-001-maxclique.pdf