Two Species Coherent Clustering
Our Two Species Coherent Clustering algorithm is an extension of the Consensus Tight Clustering algorithm. Its goal is to find the genes that are clustered together in both species. It first identifies tight clustering using the Consensus Tight Clustering algorithm; and then extracts the othologous genes for each tight cluster in the second species. Instead of performing another round of tight clustering because that condition might be too strong, the algorithm constructs a distance matrix of these genes, where distance between two genes is defined as Eucledian distance. And similarly a distance graph is constructed and the algorithm will find the cliques from this graph.
Parameters Description
exprFile: microarray expression data for one species (called non-reference speices)
annotFile: gene annotation file for the non-reference species (will be skipped if the expression data includes annotations)
refExprFile: microarray expression data for the second species (called reference speices)
refAnnotFile: gene annotation file for the reference species (will be skipped if the expression data includes annotations)
orthologFile: the mapping from genes of reference species to genes of non-reference species
outputFile: output file containing the comparative clustering results
CVThr: the same as before, applied to non-reference species
maxExpr: the same as before, applied to non-reference species refCVThr: the same as before, applied to reference species
refMaxExpr: the same as before, applied to reference species
kmin: the same as before, applied to reference species (reference species will be clustered first, see the algorithm description)
kmax: the same as before stepSize: the same as before
stoppingThr: the same as before kmeansSamplingRatio: the same as before numIterations: the same as before
comemThr: the same as before
minClusterSize: the same as before
sampleSize: the number of genes in the non-reference species to be sampled for determining the pairwise distance threshold, which will be used for analyzing the coherence of clusters in non-reference species (see the algorithm description)
percentile: used for determinign the pairwise distance threshold: choose the threshold so that this pentage of gene pairs in the sample will have distance less than this threshold
Consensus Tight Clustering
The goal of our Consensus Tight Clustering algorithm is to find "tight" clusters: the genes whose expression patterns are particularly similar to each other. This is based on "resampling" of data. Suppose we have a basic clustering algorithm, say K-means algorithm, the idea is that if two genes belong to a tight cluster, then this basic algorithm will tend to assign these two genes to the same cluster if run on different subsets of data. Specifically, it has two stages: in the first stage, the algorithm runs K-means for a certain number of times and for multiple values of K (both are program parameters). In each run, a different subset of data is chosen for analysis: a certain ratio of genes will be randomly chosen from the origianl data set with replacement. During this stage, the algorithm records, for any pair of genes, how often they are assigned to the same cluster. This frequency is called the "comembership" between two genes. In the second stage, the algorithm constructs the comembership graph from the comembership matrix using a user-specified threshold and then performs search of cliques. It recursively finds the maximum cliques using a simple greedy heuristic [1]. The output of the algorithm is these cliques.
Reference:
[1] On the Performance of Polynomial-time CLIQUE Approximation Algorithms on Very Large Graphs, http://www.cs.bu.edu/techreports/pdf/1994-001-maxclique.pdf