Hierarchical Clustering:
Hierarchical clustering is a type of cluster analysis. Today we studied two methods of hierarchical clustering, which is a second-level analytic technique. Being multivariate, it gives us more insight into the data than first-level techniques do. It is typically used when clustering fewer than 50 objects.
The two types of hierarchical clustering are divisive and agglomerative.
Divisive clustering starts with all objects in one big cluster and divides it into smaller and smaller clusters, until each object is a cluster by itself.
Agglomerative clustering starts with each object as a single cluster and keeps merging clusters according to different criteria, till we end up with one big cluster.
Clustering Process:
1. Selection of variables: This depends on the objective of the analysis.
2. Distance measurement: This is a measure of the difference between two individual objects, which tells us whether they are closer together or farther apart. (Similarity can also be measured, for example with correlation.) Distance measurement may be for interval, count, or binary data; a code sketch of these measures follows this list.
a) For interval data, the available methods are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or a customized measure. We used Euclidean distance, the straight-line distance between two objects; it is symmetric, so the distance from a to b is the same as from b to a. It differs from the city-block distance, where a straight path cannot be taken and distance is measured along a grid.
b) For binary data, simple matching or Jaccard methods can be used. Other methods include Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (For these, we enter values for Present and Absent to specify which two values are meaningful.)
c) The counts method is not usually found in a business context. It uses methods like the chi-square measure or the phi-square measure. We used the chi-square measure.
3. Clustering criteria: These may be between a cluster and an object, or between two clusters.
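As a concrete illustration of the distance measures above, here is a minimal Python sketch using scipy.spatial.distance. SciPy is an assumption on my part (the class used SPSS), and the vectors are made up purely for illustration.

```python
# Minimal sketch of the distance measures above, using scipy.spatial.distance.
# The vectors a, b, u, v are made-up values purely for illustration.
import numpy as np
from scipy.spatial import distance

# Interval data
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(distance.euclidean(a, b))       # straight-line distance: 5.0
print(distance.sqeuclidean(a, b))     # squared Euclidean: 25.0
print(distance.cityblock(a, b))       # block (city-block) distance: 7.0
print(distance.chebyshev(a, b))       # Chebychev: largest single gap, 4.0
print(distance.minkowski(a, b, p=3))  # Minkowski with p = 3

# Binary (presence/absence) data
u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])
print(distance.jaccard(u, v))  # Jaccard distance = 1 - Jaccard similarity
print(distance.hamming(u, v))  # mismatch rate; 1 minus this is simple matching
```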
First the variables are mapped on a graph. Then a matrix is created that shows the distances between objects. The least distance is found from the matrix to form the first cluster, so the first cluster consists of the nearest objects; say we name it Cluster A. The next step is to make a matrix of A with the other objects, in which the distance of cluster A from each remaining object is measured. This is the agglomerative technique.
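Here is a bare-bones Python sketch of that first agglomerative step, assuming four made-up 2-D points; for the cluster-to-object distances it uses the nearest member, which is the single linkage idea described in the next section. This is illustrative code, not the class exercise.

```python
# Bare-bones sketch of the first agglomerative steps on four made-up
# 2-D points: build the distance matrix, merge the closest pair, then
# measure the new cluster's distance to the remaining objects.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [0.5, 0.2], [4.0, 4.0], [4.2, 3.9]])
dist = squareform(pdist(points, metric="euclidean"))  # full distance matrix

np.fill_diagonal(dist, np.inf)  # ignore zero self-distances
i, j = np.unravel_index(np.argmin(dist), dist.shape)
print(f"Cluster A = objects {i} and {j}")  # the nearest pair

# Distance of cluster A from each remaining object, using the nearest
# member (single linkage)
for k in set(range(len(points))) - {i, j}:
    print(k, min(dist[i, k], dist[j, k]))
```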
Clustering Algorithms:
1. Nearest neighbour method (single linkage method)
In this method the distance between two clusters is defined to be the distance between the two closest members, or neighbours. This method is relatively simple but is often criticised because it doesn't take account of cluster structure and can result in a problem called chaining, whereby clusters end up being long and straggly. However, it is better than the other methods when the natural clusters are not spherical or elliptical in shape.
2. Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum distance between members, i.e. the distance between the two subjects that are furthest apart. This method tends to produce compact clusters of similar size but, as for the nearest neighbour method, does not take account of cluster structure. It is also quite sensitive to outliers.
3. Average (between groups) linkage method (sometimes referred to as UPGMA)
The distance between two clusters is calculated as the average distance between all pairs of subjects in the two clusters. This is considered to be a fairly robust method.
4. Centroid method
Here the centroid (mean value for each variable) of each cluster is calculated and the distance between centroids is used. Clusters whose centroids are closest together are merged. This method is also fairly robust.
5. Ward's method
In this method all possible pairs of clusters are combined and the sum of the squared distances within each cluster is calculated. This is then summed over all clusters. The combination that gives the lowest sum of squares is chosen. This method tends to produce clusters of approximately equal size, which is not always desirable. It is also quite sensitive to outliers. Despite this, it is one of the most popular methods, along with the average linkage method.
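All five criteria can be compared directly in Python: SciPy's linkage function implements each of them. The sketch below uses random toy data rather than the class data set, so the exact output is illustrative only.

```python
# Sketch comparing the five linkage criteria with SciPy on random toy
# data; the class data set is not reproduced here.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 objects measured on 3 variables

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # full merge history
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

On data like these, Ward's method tends to give the most balanced cluster sizes and single linkage the most straggly ones, matching the descriptions above.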
Dendrogram:
The dendrogram is critical in hierarchical clustering.
When carrying out a hierarchical cluster analysis, the process can be represented on a diagram known as a dendrogram. This diagram illustrates which clusters have been joined at each stage of the analysis and the distance between clusters at the time of joining. If there is a large jump in the distance between clusters from one stage to another, this suggests that at one stage clusters that are relatively close together were joined whereas, at the following stage, the clusters that were joined were relatively far apart. The number of clusters depends on where you cut off the dendrogram; the optimum number of clusters is the number present just before that large jump in distance.
For example, in the example taken in class, if we take the cut-off line at 4, we get the optimum number of clusters, as we see that variable 4 joins the cluster only after a significant gap. Thus we form a cluster using the variables funused0, funused5, funused1, and funused7, which stand for SMS, games, alarm, and date & time.
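A minimal sketch of drawing such a dendrogram and a cut-off line in Python, using SciPy and matplotlib. The funused* labels mirror the class example, but the data here are random, so the resulting tree will not match the one from class.

```python
# Sketch of drawing a dendrogram and a cut-off line with SciPy and
# matplotlib. The funused* labels mirror the class example, but the
# data are random, so this tree will not match the one from class.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))                 # 8 made-up objects
labels = [f"funused{i}" for i in range(8)]

Z = linkage(X, method="average")
dendrogram(Z, labels=labels)

# Cut midway between the last two joins; the number of branches the
# line crosses is the number of clusters at that cut-off.
cut = (Z[-2, 2] + Z[-1, 2]) / 2
plt.axhline(y=cut, linestyle="--")
plt.ylabel("distance at joining")
plt.show()
```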
Proximity Matrix:
A square matrix in which the entry in cell (j, k) is a measure of the similarity (or distance) between the items to which row j and column k correspond.
The matrix is symmetric in our case, meaning that the numbers in the lower half are the same as those in the top half, so only the lower half of a symmetric matrix is displayed. The diagonal shows 1s; since the entries here are similarities rather than distances, the highest value indicates the most proximity, and those objects are clustered first.
Proximity Matrix

| Case | SMS | Alarm | Camera | Scheduler | Music / Radio Playback | Games | Internet | Time and Date | Download | Other |
|---|---|---|---|---|---|---|---|---|---|---|
| SMS | 1.000 | .835 | .310 | .553 | .395 | .862 | .211 | .798 | .487 | .048 |
| Alarm | .835 | 1.000 | .263 | .629 | .342 | .807 | .206 | .798 | .455 | .041 |
| Camera | .310 | .263 | 1.000 | .243 | .520 | .304 | .469 | .249 | .364 | .038 |
| Scheduler | .553 | .629 | .243 | 1.000 | .388 | .571 | .228 | .614 | .486 | .063 |
| Music / Radio Playback | .395 | .342 | .520 | .388 | 1.000 | .414 | .370 | .384 | .417 | .060 |
| Games | .862 | .807 | .304 | .571 | .414 | 1.000 | .222 | .798 | .511 | .044 |
| Internet | .211 | .206 | .469 | .228 | .370 | .222 | 1.000 | .209 | .353 | .094 |
| Time and Date | .798 | .798 | .249 | .614 | .384 | .798 | .209 | 1.000 | .478 | .056 |
| Download | .487 | .455 | .364 | .486 | .417 | .511 | .353 | .478 | 1.000 | .072 |
| Other | .048 | .041 | .038 | .063 | .060 | .044 | .094 | .056 | .072 | 1.000 |
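To see where such a table comes from, here is a small Python sketch that builds a symmetric proximity matrix. The class output above came from SPSS; the correlation measure and the made-up usage data below are my own assumptions for illustration only.

```python
# Sketch of building a symmetric proximity matrix like the one above.
# The class table came from SPSS; the correlation measure and the
# made-up usage data below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
usage = rng.integers(0, 2, size=(100, 10))  # 100 respondents x 10 features

proximity = np.corrcoef(usage, rowvar=False)  # 10 x 10 similarity matrix
print(np.round(proximity, 3))                 # symmetric, 1.000 diagonal
```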
Agglomeration Schedule:
This table tells us in what sequence objects are added to a cluster.
Agglomeration Schedule

| Stage | Cluster Combined: Cluster 1 | Cluster Combined: Cluster 2 | Coefficients | Cluster 1 First Appears at Stage | Cluster 2 First Appears at Stage | Next Stage |
|---|---|---|---|---|---|---|
| 1 | 1 | 6 | .862 | 0 | 0 | 2 |
| 2 | 1 | 2 | .821 | 1 | 0 | 3 |
| 3 | 1 | 8 | .798 | 2 | 0 | 4 |
| 4 | 1 | 4 | .592 | 3 | 0 | 6 |
| 5 | 3 | 5 | .520 | 0 | 0 | 7 |
| 6 | 1 | 9 | .484 | 4 | 0 | 8 |
| 7 | 3 | 7 | .419 | 5 | 0 | 8 |
| 8 | 1 | 3 | .306 | 6 | 7 | 9 |
| 9 | 1 | 10 | .057 | 8 | 0 | 0 |
The agglomeration schedule starts off using the case numbers. Once cases are added to clusters, the cluster number is always the lowest of the case numbers in the cluster: a cluster formed by merging cases 3 and 4 would be known as cluster 3. The columns labeled Stage Cluster First Appears tell us the step at which each of the two clusters being joined first appears. For a small data set, it is more convenient to look at the dendrogram than to try to follow the step-by-step clustering summarized in the agglomeration schedule. The agglomeration schedule is more useful if you want to see the coefficient at which clusters are combined.
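For comparison, SciPy's linkage matrix plays the same role as SPSS's agglomeration schedule, with one difference in convention: SPSS labels a merged cluster with the lowest case number, while SciPy assigns fresh indices to newly formed clusters. A small sketch, on made-up data:

```python
# SciPy's linkage matrix plays the same role as the agglomeration
# schedule: row i describes stage i + 1. Data are made up. Note the
# numbering convention differs from SPSS: instead of reusing the
# lowest case number, SciPy gives each new cluster a fresh index
# (n, n + 1, ... for n original cases).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))  # 10 made-up cases
Z = linkage(X, method="average")

print("stage  cluster 1  cluster 2  coefficient")
for stage, (c1, c2, coeff, size) in enumerate(Z, start=1):
    print(f"{stage:5d}  {int(c1):9d}  {int(c2):9d}  {coeff:11.3f}")
```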
An icicle plot is a visual representation of the agglomeration schedule.
Vertical Icicle
(an X between two case columns means those cases belong to the same cluster at that level)

| Number of clusters | Other | | Internet | | Music / Radio Playback | | Camera | | Download | | Scheduler | | Time and Date | | Alarm | | Games | | SMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| 2 | X | | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| 3 | X | | X | X | X | X | X | | X | X | X | X | X | X | X | X | X | X | X |
| 4 | X | | X | | X | X | X | | X | X | X | X | X | X | X | X | X | X | X |
| 5 | X | | X | | X | X | X | | X | | X | X | X | X | X | X | X | X | X |
| 6 | X | | X | | X | | X | | X | | X | X | X | X | X | X | X | X | X |
| 7 | X | | X | | X | | X | | X | | X | | X | X | X | X | X | X | X |
| 8 | X | | X | | X | | X | | X | | X | | X | | X | X | X | X | X |
| 9 | X | | X | | X | | X | | X | | X | | X | | X | | X | X | X |
Finally, we saw some basic rules for reaching a better analysis:
1. Look at the data from different perspectives.
2. Try to combine either the results of different analysis techniques, or the same technique performed on different variables.
References:
1. Mathematics Learning Support Centre, chapter on Cluster Analysis: http://mlsc.lboro.ac.uk/resources/statistics/Clusteranalysis.pdf
2. http://www.norusis.com/pdf/SPC_v13.pdf
By:
Padmini Santi
14032