SQL Server Gems

Thursday, January 26, 2006

Clustering (Part 2c of 6) - Simple K-means

Let us see how the simple k-means clustering algorithm works..

The input to the k-means clustering algorithm is k, the number of clusters to be identified. Critics of k-means often feel that the need to specify k is a disadvantage. Other disadvantages includes how the initial k cluster centers are selected. Many DBMS researchers have derived variants of k-means to address some of these problems.

K-means work as follows.

Given N data objects,

Step1 :
Randomly select k objects and make them the cluster centers.
(i.e. These initial k clusters each have 1 object, which is also the center of the cluster!)

Step2:
Consider the remaining (N - k) data objects. Assign each of these data objects to one of the k-clusters of which it is nearest to (e.g. based on its Euclidean distance).

Re-compute the cluster center for the cluster assigned.
Continue the process until all the (N-k) data objects are assigned.

Step3:
For all the k clusters, update their new cluster mean (i.e. center).
Re-assign the N to the clusters based on proximity to the cluster center.
Repeat until there are no more changes to the assignment (i.e. the algorithm converges!)

0 Comments:

Post a Comment

<< Home