K-means algorithm
From Wikipedia, the free encyclopedia
The K-means algorithm is an algorithm to cluster objects based on attributes into k partitions. It is a variant of the expectation-maximization algorithm in which the goal is to determine the k means of data generated from gaussian distributions. It assumes that the object attributes form a vector space. The objective it tries to achieve is to minimize total intra-cluster variance, or, the squared error function
where there are k clusters Si, i =
1,2,...,k and μi is
the centroid or mean point of all the points
.
The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and algorithm repeated by alternate application of these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively centroids are no longer changed).
The algorithm has remained extremely popular because it converges extremely
quickly in practice. In fact, many have observed that the number of iterations
is typically much less than the number of points. Recently, however, David
Arthur and Sergei Vassilvitskii showed that there exist certain point sets on
which k-means takes superpolynomial time:
to converge.
In terms of performance the algorithm is not guaranteed to return a global optimum. The quality of the final solution depends largely on the initial set of clusters, and may, in practice, be much poorer than the global optimum. Since the algorithm is extremely fast, a common method is to run the algorithm several times and return the best clustering found.
Another main drawback of the algorithm is that it has to be told the number of clusters (i.e. k) to find. If the data is not naturally clustered, you get some strange results. Also, the algorithm works well only when spherical clusters are naturally available in data.
Contents |
[edit] Demonstration of the algorithm
The following images demonstrate the k-means clustering algorithm in action, for the two-dimensional case. The initial centers are generated randomly to demonstrate the stages in more detail.
Shows the initial
randomized centers and a number of points (actual
size here) |
Centers have been
associated with the points and have been moved to the respective centroids
(actual
size here) |
Now, the
association is shown in more detail, once the centroids have been moved.
(actual
size here) |
Again, the centers
are moved to the centroids of the corresponding associated points. (actual
size here) |
|---|
[edit] Relation to PCA
It has been shown recently [1] [2] that the relaxed solution of K-means clustering, specified by the cluster indicators, are given by the PCA (principal component analysis) principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix.
[edit] Variations
The set of squared error minimizing cluster functions also includes the K-medoids algorithm, an approach which forces the center point of each cluster to be one of the actual points.
[edit] References
- J. B. MacQueen (1967): "Some Methods for classification and Analysis of Multivariate Observations", Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297
- D. Arthur, S. Vassilvitskii (2006): "How Slow is the k-means Method?," Proceedings of the 2006 Symposium on Computational Geometry (SoCG).
- ^ H. Zha, C. Ding, M. Gu, X. He and H.D. Simon. "Spectral Relaxation for K-means Clustering", Neural Information Processing Systems vol.14 (NIPS 2001). pp. 1057-1064, Vancouver, Canada. Dec. 2001.
- ^ Chris Ding and Xiaofeng He. "K-means Clustering via Principal Component Analysis". Proc. of Int'l Conf. Machine Learning (ICML 2004), pp 225-232. July 2004.
[edit] External links
- Numerical Example of K means clustering
- Application example which uses K means clustering to reduce the number of colors in images
- An example application which uses K Means in Java
- kmeans application in php





