Clustering (Part 2b of 6)
A cluster consists of data objects that are similar to each other. Using clustering, we can identify dense and sparse regions in the data, and discover interesting patterns and correlations among the data attributes.
When a person attempts to group objects, he/she is doing clustering. It is a natural activity which we sometimes engage in. An object is usually described by a set of attributes (e.g. weight, height, age, income, etc.). As discussed in Part 2a (below), most clustering algorithms use a distance-based metric (e.g. the 2-norm, also known as Euclidean distance). The goal is to assign each data object to the cluster it is nearest to, based on the distance from the object to the center of the cluster. We will describe a simplified version of k-means (one type of clustering algorithm) in Part 2c. A scalable version of k-means is implemented in SQL Server 2005.
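To make the "nearest center" idea concrete, here is a minimal Python sketch of the assignment step only. It is not the SQL Server 2005 implementation, and it is not the full k-means algorithm. The function names, the two cluster centers and the (weight, height) sample values are all made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean (2-norm) distance between two equal-length attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(obj, centers):
    """Return the index of the center closest to obj."""
    return min(range(len(centers)), key=lambda i: euclidean(obj, centers[i]))

# Hypothetical cluster centers, described by (weight in kg, height in cm).
centers = [(60.0, 165.0), (85.0, 185.0)]

# Hypothetical data objects described by the same two attributes.
people = [(58.0, 160.0), (90.0, 190.0), (70.0, 172.0)]

# Assign every object to the cluster whose center is nearest.
assignments = [nearest_cluster(p, centers) for p in people]
print(assignments)
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range (e.g. income) would otherwise dominate the distance.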
If we consider just one or two attributes, clustering is easy; a human can do it quite effectively. However, when there are many data objects, each with a large number of attributes that must be taken into account to find the clusters, it is no longer simple to do manually. Thus, we need a tool to help us identify the clusters in an efficient and effective manner.
Many statistical packages (S-Plus, SPSS, SAS, etc.) have implemented various types of clustering methods. Previously, clustering always seemed to be an out-of-DBMS process. Data residing in a DBMS had to be massaged into a form compatible with the specific tool you were using before you could perform clustering. The output from the clustering methods often could not easily be fed back into the DBMS to be used in other queries. In SQL Server 2005, we can do clustering "in-the-DBMS". This allows us to take advantage of the many pre-defined DBMS features that would otherwise be unavailable.