K-Means Clustering
SPSS has three different procedures for clustering data: hierarchical cluster
analysis, k-means cluster analysis and two-step cluster analysis.
K-means clustering was
originally designed as a method that allowed very large data sets to be
clustered in a feasible amount of time, when computers were rather slower than
they are today. This explains its other name of "quick clustering".
The k-means method of clustering is very different from hierarchical
clustering methods such as Ward's method, which are applied when there is no
prior knowledge of how many clusters there may be or what characterizes them.
K-means clustering is used when you already have hypotheses concerning the
number of clusters in your cases or variables, so you know how many clusters you
want, and you have a moderately sized data set (greater than 50 objects).
It differs from
hierarchical clustering in several ways. You have to know in advance the number
of clusters you want. You can’t get solutions for a range of cluster numbers
unless you rerun the analysis for each different number of clusters. The
algorithm repeatedly reassigns cases to clusters, so the same case can move
from cluster to cluster during the analysis. In agglomerative hierarchical
clustering, on the other hand, cases are added only to existing clusters. The
algorithm is called k-means, where k is the number of clusters you want, since
a case is assigned to the cluster for which its distance to the cluster mean is
the smallest. The action in the algorithm centers on finding the k-means.
If the sample is large enough, it can be split in half, with clustering
performed on each half and the results compared to check that the solution is
stable.
Initial Cluster Centers
The first step in
k-means clustering is finding the k centers. This is done iteratively. You start
with an initial set of centers and then modify them until the change between two
iterations is small enough. K-means clustering is very sensitive to outliers,
since they will usually be selected as initial cluster centers. This will
result in outliers forming clusters with small numbers of cases. Before you
start a cluster analysis, screen the data for outliers and remove them from the
initial analysis. The solution may also depend on the order of the cases in the
file.
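The text above does not say how to screen for outliers, so the following is only a minimal sketch of one common approach: flag any case whose z-score on any variable exceeds a cutoff. The function name `flag_outliers`, the cutoff of 2.5 and the data are all assumptions made for illustration; they are not part of SPSS.

```python
import statistics

def flag_outliers(cases, threshold=2.5):
    """Return indices of cases whose |z-score| exceeds threshold on any variable.

    The z-score rule and the default cutoff are illustrative assumptions.
    """
    columns = list(zip(*cases))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample standard deviation
    flagged = []
    for i, case in enumerate(cases):
        z = [abs(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(case, means, sds)]
        if max(z) > threshold:
            flagged.append(i)
    return flagged

# Made-up data: eleven ordinary cases and one extreme value on the first variable.
cases = [(10, 5), (11, 6), (9, 5), (10, 4), (12, 6), (11, 5),
         (10, 5), (9, 4), (11, 6), (10, 5), (12, 6), (60, 5)]

flagged = flag_outliers(cases)
screened = [c for i, c in enumerate(cases) if i not in flagged]
print(flagged)        # indices of suspected outliers
print(len(screened))  # number of cases kept for the initial analysis
```

In a small sample a single extreme case inflates the standard deviation, which limits how large its own z-score can get, so the cutoff may need to be lower than the usual 3.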
After the initial cluster centers have been selected, each case is assigned to
the closest cluster, based on its distance from the cluster centers. After all
of the cases have been assigned to clusters, the cluster centers are recomputed,
based on all of the cases in the cluster. Case assignment is done again, using
these updated cluster centers. You keep assigning cases and recomputing the
cluster centers until no cluster center changes.
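The assign-and-recompute loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the SPSS implementation: the function names, the choice of the first two cases as initial centers, and the data are all assumptions.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of the nearest center (Euclidean)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def recompute(points, labels, centers):
    """Return the mean of each cluster; an empty cluster keeps its old center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(tuple(old))
    return new_centers

def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and center updates until no center changes."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = recompute(points, labels, centers)
        if new_centers == centers:  # stopping rule: no center moved
            break
        centers = new_centers
    return centers, assign(points, centers)

# Made-up data: two visibly separate groups measured on two variables.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Initial centers are simply the first two cases (an assumption of this sketch).
centers, labels = k_means(data, initial_centers=[data[0], data[1]])
print(centers)
print(labels)
```

In this run the second case starts out as its own center and ends up in the first cluster, which illustrates how a case can move between clusters during the analysis.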
Final Cluster Centers
After iteration stops, all cases are assigned to clusters, based on the last
set of cluster centers. After all of the cases are clustered, the cluster
centers are computed one last time. Using the final cluster centers, you can
describe the clusters.
Differences between Clusters
Examining the differences between the clusters lets you define and profile
each cluster. A company can then set its strategies for each cluster
accordingly.
The main
disadvantage is that there needs to be a certain amount of trial and error in
choosing the number of clusters.
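One way to organize that trial and error is to rerun the analysis for several values of k and compare how tightly the cases sit around their cluster centers. The sketch below does this with a bare-bones k-means and the within-cluster sum of squared distances. The helper names, the use of the first k cases as initial centers, and the data are assumptions made for illustration, and the comparison measure is one common choice rather than something SPSS prescribes.

```python
import math

def k_means(points, k, max_iter=100):
    """Basic k-means; the first k cases serve as initial centers (a simplification)."""
    centers = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[j])  # empty cluster keeps its center
        if updated == centers:
            break
        centers = updated
    return centers, labels

def within_ss(points, centers, labels):
    """Sum of squared distances from each case to its cluster center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Made-up data with three loose groups.
data = [(1, 1), (5, 5), (9, 1), (1.2, 0.8), (5.2, 4.9),
        (9.1, 1.2), (0.9, 1.1), (4.8, 5.1), (8.8, 0.9)]

# One separate run per candidate number of clusters.
wss = {}
for k in range(1, 5):
    centers, labels = k_means(data, k)
    wss[k] = within_ss(data, centers, labels)
    print(k, round(wss[k], 3))
```

The value of k after which the sum of squares stops dropping sharply is a reasonable candidate, but the final choice still depends on whether the clusters can be interpreted.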
K-Means Cluster Analysis Options
Statistics: You can select the following statistics: initial cluster centers,
ANOVA table, and cluster information for each case.
· Initial cluster centers: First estimate of the variable means for each of the
clusters. By default, a number of well-spaced cases equal to the number of
clusters is selected from the data. Initial cluster centers are used for a
first round of classification and are then updated.
· ANOVA table: Displays an analysis-of-variance table which includes
univariate F tests for each clustering variable. The F tests are only
descriptive and the resulting probabilities should not be interpreted. The
ANOVA table is not displayed if all cases are assigned to a single cluster.
· Cluster information for each case: Displays for each case the final cluster
assignment and the Euclidean distance between the case and the cluster center
used to classify the case. Also displays the Euclidean distance between final
cluster centers.
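To make these statistics concrete, the sketch below computes them by hand for a tiny made-up data set with known cluster assignments: a univariate F ratio per variable, each case's Euclidean distance to its cluster center, and the distance between the two centers. The function names and data are assumptions for illustration, and the output layout will not match what SPSS prints.

```python
import math

def cluster_means(cases, labels):
    """Return {label: mean vector} computed from the cases in each cluster."""
    groups = {}
    for case, lab in zip(cases, labels):
        groups.setdefault(lab, []).append(case)
    return {lab: tuple(sum(col) / len(rows) for col in zip(*rows))
            for lab, rows in groups.items()}

def univariate_f(values, labels):
    """One-way ANOVA F ratio for one variable; None when it is undefined."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:  # a single cluster gives no between-cluster variance
        return None
    grand = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                  for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    if within == 0:
        return None
    return (between / (k - 1)) / (within / (n - k))

# Made-up cases on two variables, with assumed final cluster assignments.
cases = [(1, 10), (2, 11), (3, 10), (8, 11), (9, 10), (10, 11)]
labels = [0, 0, 0, 1, 1, 1]

centers = cluster_means(cases, labels)
f_ratios = [univariate_f(col, labels) for col in zip(*cases)]
case_distances = [math.dist(c, centers[lab]) for c, lab in zip(cases, labels)]
center_distance = math.dist(centers[0], centers[1])

print(f_ratios)         # one F ratio per clustering variable
print(case_distances)   # distance of each case to its own cluster center
print(center_distance)  # distance between the two final centers
```

The first variable separates the clusters and gets a large F ratio, while the second barely differs between them. Because the clusters were formed to maximize exactly these differences, the ratios are useful only for describing which variables drive the solution.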
Final Output
By Saraniya Subramaniam, Group F, 14047, HR