Consensus Tight Clustering
The goal of our Consensus Tight Clustering algorithm is to find "tight" clusters: the genes whose expression patterns are particularly similar to each other. This is based on "resampling" of data. Suppose we have a basic clustering algorithm, say K-means algorithm, the idea is that if two genes belong to a tight cluster, then this basic algorithm will tend to assign these two genes to the same cluster if run on different subsets of data. Specifically, it has two stages: in the first stage, the algorithm runs K-means for a certain number of times and for multiple values of K (both are program parameters). In each run, a different subset of data is chosen for analysis: a certain ratio of genes will be randomly chosen from the origianl data set with replacement. During this stage, the algorithm records, for any pair of genes, how often they are assigned to the same cluster. This frequency is called the "comembership" between two genes. In the second stage, the algorithm constructs the comembership graph from the comembership matrix using a user-specified threshold and then performs search of cliques. It recursively finds the maximum cliques using a simple greedy heuristic [1]. The output of the algorithm is these cliques.
Parameters Description
dataFile: the microarray expression data
annotFile: the gene annotation file, may be skippend (leave it blank) if the data file has already included annotations
outputFile: output file containing clustering results
CVThr: all genes whose coefficient of variance is less than this value will be filtered out
maxExpr: all genes whose maximum expression level is less than this value will be filtered out kmin: the mininum number of clusters used in the resampling part of the algorithm kmax: the maximum number of clusters used in the resampling part of the algorithm
stepSize: step size of K, K is iterated from kmin to kmax with step size equal to this number
stoppingThr: the threshold used in assessing stopping criteria - if the relative incrase of score in K-means iteration is less than this value, the K-means algorithm will stop
kmeansSamplingRatio: the ratio of data points that will be resampled (see the description of the algorithm)
numIterations: number of runs for each value of K and each sampled subset of data
comemThr: if two genes are clustered together in less than this percentage of time during resampling, then these two genes will not be connected in comembership graph (thus cannot belong to the same tight cluster)
minClusterSize: only clusters whose size is great than this value will be considered
Reference:
[1] On the Performance of Polynomial-time CLIQUE Approximation Algorithms on Very Large Graphs, http://www.cs.bu.edu/techreports/pdf/1994-001-maxclique.pdf