SIBM B- Business Analytics: Session 7/8

K-Means Clustering

SPSS has three different procedures for clustering data: hierarchical cluster analysis, k-means cluster, and two-step cluster.

K-means clustering was originally designed to allow very large data sets to be clustered in a feasible amount of time, when computers were much slower than they are today. This explains its other name, "quick clustering".

The k-means method is very different from hierarchical clustering methods such as Ward's method, which are applied when there is no prior knowledge of how many clusters there may be or what characterizes them. K-means clustering is used when you already have a hypothesis about the number of clusters in your cases or variables, that is, when you know how many clusters you want, and you have a moderately sized data set (more than 50 objects).

It differs from hierarchical clustering in several ways. You have to know in advance the number of clusters you want. You can't get solutions for a range of cluster numbers unless you rerun the analysis for each different number of clusters. The algorithm repeatedly reassigns cases to clusters, so the same case can move from cluster to cluster during the analysis. In agglomerative hierarchical clustering, on the other hand, cases are only ever added to existing clusters and never leave them. The algorithm is called k-means, where k is the number of clusters you want, because each case is assigned to the cluster whose mean is closest to it. The work of the algorithm centers on finding the k means.

If the sample is large enough, it can be split in half, with clustering performed on each half and the results compared to check that the solution is stable.

Initial Cluster Centers

The first step in k-means clustering is finding the k centers. This is done iteratively. You start with an initial set of centers and then modify them until the change between two iterations is small enough. K-means clustering is very sensitive to outliers, since they will usually be selected as initial cluster centers. This will result in outliers forming clusters with small numbers of cases. Before you start a cluster analysis, screen the data for outliers and remove them from the initial analysis. The solution may also depend on the order of the cases in the file.
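The sensitivity to outliers can be illustrated with a small sketch. It assumes a simple "farthest-first" rule for picking well-spaced starting cases; this rule is a stand-in chosen for illustration and is not SPSS's own selection procedure. The function name and the data are hypothetical.

```python
import math

def well_spaced_centers(cases, k):
    """Pick k well-spaced cases: start from the first case, then repeatedly
    add the case that is farthest from every center chosen so far.
    Illustrative stand-in only, not SPSS's selection procedure."""
    centers = [cases[0]]
    while len(centers) < k:
        # Score each case by its distance to the nearest chosen center.
        farthest = max(cases, key=lambda c: min(math.dist(c, m) for m in centers))
        centers.append(farthest)
    return centers

# Hypothetical data: two tight groups plus one outlier at (50, 50).
cases = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (50, 50)]
centers = well_spaced_centers(cases, 3)
print(centers)
```

Because the outlier is the case farthest from everything else, a spacing-based rule picks it as a starting center, and it then tends to end up as a cluster of its own. The result also depends on which case comes first in the file.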

After the initial cluster centers have been selected, each case is assigned to the closest cluster, based on its distance from the cluster centers. After all of the cases have been assigned to clusters, the cluster centers are recomputed, based on all of the cases in the cluster. Case assignment is done again, using these updated cluster centers. You keep assigning cases and recomputing the cluster centers until no cluster center changes.
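The assign-and-recompute loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS implementation: the function names, the `max_iter` safety limit, the handling of empty clusters, and the sample data are all assumptions made for the example.

```python
import math

def nearest(case, centers):
    """Index of the center closest to the case (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(case, centers[j]))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(cases, centers, max_iter=100):
    """Alternate case assignment and center updates until no center changes."""
    centers = [tuple(map(float, c)) for c in centers]
    for _ in range(max_iter):
        labels = [nearest(case, centers) for case in cases]
        updated = []
        for j, old in enumerate(centers):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            # Assumption: an empty cluster simply keeps its previous center.
            updated.append(mean_point(members) if members else old)
        if updated == centers:
            break
        centers = updated
    # Final assignment based on the last set of centers.
    labels = [nearest(case, centers) for case in cases]
    return centers, labels

# Hypothetical data and starting centers.
cases = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = k_means(cases, [(1, 1), (8, 8)])
print(labels)
print(centers)
```

Each pass can move a case to a different cluster, which is the behaviour that separates k-means from agglomerative hierarchical clustering.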

Final Cluster Centers

After iteration stops, all cases are assigned to clusters, based on the last set of cluster centers. After all of the cases are clustered, the cluster centers are computed one last time. Using the final cluster centers, you can describe the clusters.

Differences between Clusters

By examining the differences between the clusters, each cluster can be defined and profiled. The company can then set its strategies according to these profiles.

The main disadvantage of k-means is that choosing the number of clusters requires a certain amount of trial and error.
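One common way to organise this trial and error is to rerun the analysis for several values of k and compare the total within-cluster sum of squared distances. The sketch below assumes a plain k-means that uses the first k cases as starting centers (a simplification, not SPSS's default), and the helper names and data are hypothetical.

```python
import math

def k_means(cases, k, max_iter=100):
    """Plain k-means; the first k cases serve as starting centers (assumption)."""
    centers = [tuple(map(float, c)) for c in cases[:k]]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for case in cases:
            j = min(range(k), key=lambda i: math.dist(case, centers[i]))
            clusters[j].append(case)
        # Recompute each center; an empty cluster keeps its old center.
        updated = [
            tuple(sum(v) / len(m) for v in zip(*m)) if m else centers[j]
            for j, m in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, clusters

def within_ss(centers, clusters):
    """Total squared distance from each case to its own cluster center."""
    return sum(
        math.dist(case, center) ** 2
        for center, members in zip(centers, clusters)
        for case in members
    )

# Hypothetical data containing two visible groups.
cases = [(1, 1), (8, 8), (1, 2), (8, 9), (2, 1), (9, 8)]
wss = {k: within_ss(*k_means(cases, k)) for k in (1, 2, 3)}
for k, value in wss.items():
    print(k, round(value, 2))
```

The within-cluster sum of squares always tends to shrink as k grows, so the useful signal is the point where the improvement levels off rather than the smallest value itself.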

K-means Cluster Analysis Options

Statistics: You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

· Initial cluster centers: First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

· ANOVA table: Displays an analysis-of-variance table which includes univariate F tests for each clustering variable. Because the clusters were formed to maximize the differences between them, the F tests are only descriptive and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

· Cluster information for each case: Displays for each case the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distances between final cluster centers.
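The quantities behind the ANOVA table and the cluster information output can be reproduced by hand. The sketch below is only an illustration of the arithmetic, assuming a hypothetical final solution with six cases, two variables, and two clusters; the variable and function names are made up, and it does not reproduce SPSS's output layout.

```python
import math

# Hypothetical final solution: six cases, two variables, two clusters.
cases = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
centers = [(4 / 3, 4 / 3), (25 / 3, 25 / 3)]

# Cluster information for each case: membership and distance to its center.
case_info = [(lab, math.dist(case, centers[lab])) for case, lab in zip(cases, labels)]

# Euclidean distance between each pair of final cluster centers.
center_dist = {
    (a, b): math.dist(centers[a], centers[b])
    for a in range(len(centers))
    for b in range(a + 1, len(centers))
}

def f_statistic(values, labels, k):
    """Univariate one-way ANOVA F for one clustering variable.
    Assumes every cluster is non-empty and k is at least 2."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = [[v for v, lab in zip(values, labels) if lab == j] for j in range(k)]
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (between / (k - 1)) / (within / (n - k))

# One F value per clustering variable (per column of the data).
f_values = [f_statistic(column, labels, len(centers)) for column in zip(*cases)]

for (lab, dist), case in zip(case_info, cases):
    print(case, "cluster", lab, "distance", round(dist, 3))
print(center_dist)
print(f_values)
```

A large F value indicates a variable on which the clusters differ strongly, which is useful for describing the clusters even though the associated probabilities are not to be interpreted as significance tests.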

Final Output

Saraniya Subramaniam

Group F

14047
