Sunday, 9 September 2012

Session 5/6 : Group F


Hierarchical Clustering: 
Hierarchical Clustering is a type of cluster analysis. Today we studied two methods of Hierarchical Clustering, which is a 2nd-level analytic technique. It is a multivariate technique, and hence gives us more insight into the data than 1st-level techniques. It is typically used when clustering fewer than 50 objects.
The two types of Hierarchical Clustering are Divisive and Agglomerative.
Divisive Clustering starts with all objects in one big cluster and divides it into smaller clusters, until each object is a cluster by itself.
Agglomerative Clustering starts with each object as a single cluster and keeps merging clusters according to a chosen criterion, until we end up with one big cluster.
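
To make the agglomerative direction concrete, here is a minimal sketch in Python (not part of the class exercise; it assumes the SciPy library and invented data). SciPy's hierarchy module, like most packages, implements only the agglomerative direction:

```python
# Minimal agglomerative clustering sketch with SciPy (illustrative only;
# these five objects and their two variables are made up).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0],
              [1.2, 1.9],
              [5.0, 6.1],
              [5.2, 5.8],
              [9.0, 0.5]])

# Start from singletons and merge step by step until one cluster remains.
Z = linkage(X, method="single", metric="euclidean")

# Cut the finished hierarchy back to, say, 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 3]
```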

      Clustering Process:
1.      Selection of variables: This depends on the objective of the analysis.
2.      Distance measurement: This quantifies how similar or different two individual objects are, i.e. whether they are closer together or farther apart. The data may be interval, counts or binary, and each type has its own set of measures (a small numeric sketch follows this list).
a) For interval data, the available measures are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or a customized measure. We used Euclidean distance, the straight-line distance between two objects; it is symmetric, so the distance from a to b is the same as from b to a. It contrasts with city-block distance, where a grid-like path is followed instead of a straight line.
b) For binary data, the simple matching or Jaccard measures can be used. Other options include Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, and Lance and Williams. (Here we enter values for Present and Absent to specify which two values are meaningful.)
c) The counts type is not usually found in a business context. It uses measures like the chi-square or phi-square measure. We used the chi-square measure.
3.      Clustering criteria: These may be defined between a cluster and an object, or between two clusters.
First the objects are plotted on a graph. Then a matrix is created that shows the distances between all pairs of objects. The smallest distance in the matrix identifies the first cluster, so the first cluster consists of the two nearest objects; say we name it Cluster A. The next step is to build a new matrix measuring the distance of Cluster A from the remaining objects, and the merging continues from there. This is the agglomerative technique.
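
Here is that small numeric sketch of the distance step and the first merge (my own illustration, assuming SciPy; the three objects are invented):

```python
# Build the pairwise distance matrix and find the closest pair,
# which forms the first cluster. The data are made up, not the class data.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.5, 0.5]])

d_euclid = squareform(pdist(X, metric="euclidean"))  # straight-line distance
d_block = squareform(pdist(X, metric="cityblock"))   # grid-path distance

# From object 0 to object 1: Euclidean is 5.0, city-block is 3 + 4 = 7.0.
print(d_euclid[0, 1], d_block[0, 1])

# The smallest off-diagonal distance picks the first pair to merge.
np.fill_diagonal(d_euclid, np.inf)
i, j = np.unravel_index(np.argmin(d_euclid), d_euclid.shape)
print("first cluster:", (i, j))  # objects 0 and 2 are nearest
```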


Clustering Algorithms:

1.      Nearest neighbour method (single linkage method)
In this method the distance between two clusters is defined to be the distance between the two closest members, or neighbours. This method is relatively simple but is often criticised because it doesn't take account of cluster structure and can result in a problem called chaining, whereby clusters end up being long and straggly. However, it is better than the other methods when the natural clusters are not spherical or elliptical in shape.

2.      Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum distance between members, i.e. the distance between the two subjects that are furthest apart. This method tends to produce compact clusters of similar size but, as for the nearest neighbour method, does not take account of cluster structure. It is also quite sensitive to outliers.

3.      Average (between groups) linkage method (sometimes referred to as UPGMA)
The distance between two clusters is calculated as the average distance between all pairs of subjects in the two clusters. This is considered to be a fairly robust method.

4.      Centroid method
Here the centroid (mean value for each variable) of each cluster is calculated and the distance between centroids is used. Clusters whose centroids are closest together are merged. This method is also fairly robust.

5.      Ward's method
In this method all possible pairs of clusters are combined and the sum of the squared distances within each cluster is calculated. This is then summed over all clusters. The combination that gives the lowest sum of squares is chosen. This method tends to produce clusters of approximately equal size, which is not always desirable. It is also quite sensitive to outliers. Despite this, it is one of the most popular methods, along with the average linkage method.
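
The five criteria above map directly onto SciPy's linkage methods. A hedged sketch with made-up data (not the class data set):

```python
# Sketch comparing the five linkage criteria above via SciPy (made-up
# data; in SciPy, 'centroid' and 'ward' require Euclidean distances).
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(8, 2))

for method in ["single", "complete", "average", "centroid", "ward"]:
    # Each row of the result is one merge: the two cluster ids, the
    # distance between them under this criterion, and the new size.
    Z = linkage(X, method=method)
    print(method, "first merge at distance", round(Z[0, 2], 3))
```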

Dendrogram:

The dendrogram is critical in hierarchical clustering.
When carrying out a hierarchical cluster analysis, the process can be represented on a diagram known as a dendrogram. This diagram illustrates which clusters have been joined at each stage of the analysis and the distance between the clusters at the time of joining. If there is a large jump in the joining distance from one stage to the next, it suggests that relatively close clusters were joined at one stage while, at the following stage, the clusters being joined were relatively far apart. The number of clusters depends on where you cut off the dendrogram: the optimum number is the number present just before that large jump in distance.
For example, in the class example below, if we take the cutoff line at 4 we get the optimum number of clusters, as the next variable joins the cluster only after a significant gap. Thus we form a cluster from the variables funused0, funused5, funused1 and funused7, which stand for SMS, games, alarm, and date & time respectively.

[Dendrogram from the class example]
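
For readers without SPSS, roughly the same dendrogram-and-cutoff workflow can be sketched with SciPy and matplotlib (illustrative only; the data here are random, so the sensible cutoff will differ from the class value):

```python
# Draw a dendrogram and cut it at a chosen distance.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(1).normal(size=(10, 3))
Z = linkage(X, method="average")

dendrogram(Z, labels=[f"obj{i}" for i in range(10)])
plt.axhline(y=4, linestyle="--")  # candidate cutoff line, as in the class example
plt.ylabel("distance at which clusters join")
plt.show()

# Objects below the cutoff line end up in the same cluster; the best
# cutoff sits just before the largest jump in joining distance.
print(fcluster(Z, t=4, criterion="distance"))
```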






Proximity Matrix:
A square matrix in which the entry in cell (j, k) is a measure of the similarity (or distance) between the items to which row j and column k correspond.
In our case the matrix is symmetric, meaning that the numbers in the lower half are the same as those in the upper half, so usually only the lower half is displayed. The entries here are similarities: the diagonal shows 1s, and the highest off-diagonal value indicates the greatest proximity. Those objects are clustered first.
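
As an aside, a proximity matrix like the one below can be computed from binary usage data. Which measure SPSS used in class is not recorded here, so this sketch assumes Jaccard similarity, one common choice for binary data; the respondents are invented:

```python
# Compute a proximity (similarity) matrix between features from
# binary usage data (Jaccard similarity assumed; data invented).
import numpy as np
from scipy.spatial.distance import pdist, squareform

features = ["SMS", "Alarm", "Camera"]
# Rows = respondents, columns = features (1 = respondent uses the feature).
usage = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [1, 0, 0],
                  [0, 1, 1]])

# pdist compares rows, so transpose to compare features with one another.
jaccard_distance = squareform(pdist(usage.T, metric="jaccard"))
similarity = 1 - jaccard_distance  # diagonal becomes 1, as in the table below
print(features)
print(np.round(similarity, 3))
```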
                                                                                                                                                       

Proximity Matrix (entries are similarities; larger = more alike)

Case                      SMS  Alarm  Camera  Sched.  Music  Games  Internet  T&Date  Downld  Other
SMS                     1.000   .835    .310    .553   .395   .862      .211    .798    .487   .048
Alarm                    .835  1.000    .263    .629   .342   .807      .206    .798    .455   .041
Camera                   .310   .263   1.000    .243   .520   .304      .469    .249    .364   .038
Scheduler                .553   .629    .243   1.000   .388   .571      .228    .614    .486   .063
Music / Radio Playback   .395   .342    .520    .388  1.000   .414      .370    .384    .417   .060
Games                    .862   .807    .304    .571   .414  1.000      .222    .798    .511   .044
Internet                 .211   .206    .469    .228   .370   .222     1.000    .209    .353   .094
Time and Date            .798   .798    .249    .614   .384   .798      .209   1.000    .478   .056
Download                 .487   .455    .364    .486   .417   .511      .353    .478   1.000   .072
Other                    .048   .041    .038    .063   .060   .044      .094    .056    .072  1.000


Agglomeration Schedule:
This table tells us the sequence in which objects are added to the clusters.
Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficients   First Appears: C1   First Appears: C2   Next Stage
  1         1           6           .862               0                   0                2
  2         1           2           .821               1                   0                3
  3         1           8           .798               2                   0                4
  4         1           4           .592               3                   0                6
  5         3           5           .520               0                   0                7
  6         1           9           .484               4                   0                8
  7         3           7           .419               5                   0                8
  8         1           3           .306               6                   7                9
  9         1          10           .057               8                   0                0

The agglomeration schedule starts off using the case numbers. Once cases are added to clusters, the cluster number is always the lowest of the case numbers in the cluster; a cluster formed by merging cases 3 and 4 would be known as cluster 3.
The columns labeled Stage Cluster First Appears tell the step at which each of the two clusters being joined first appears. For a small data set, it is more convenient to look at the dendrogram than to try to follow the step-by-step clustering summarized in the agglomeration schedule. The agglomeration schedule is more useful if you want to see the coefficient at which clusters are combined.
An icicle plot is a visual representation of the agglomeration schedule.
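
For comparison, SciPy's linkage matrix plays the role of SPSS's agglomeration schedule. One difference worth noting: the SPSS coefficients above are similarities (they decrease across stages), while SciPy reports distances (which increase). A sketch with made-up data:

```python
# SciPy's linkage matrix is the analogue of the agglomeration schedule:
# one row per stage showing which two clusters were combined, the distance
# at which they merged, and the size of the new cluster (made-up data).
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(2).normal(size=(6, 2))
Z = linkage(X, method="average")

print("stage | combined | distance | new size")
for stage, (c1, c2, dist, size) in enumerate(Z, start=1):
    # Ids 0..5 are original cases; ids >= 6 refer to clusters formed
    # at earlier stages (SPSS instead reuses the lowest case number).
    print(f"{stage:5d} | {int(c1)} + {int(c2)} | {dist:8.3f} | {int(size)}")
```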

                                                                                     


Vertical Icicle

Cases (columns): Oth = Other, Int = Internet, Mus = Music / Radio Playback, Cam = Camera, Dwn = Download, Sch = Scheduler, T&D = Time and Date, Alm = Alarm, Gam = Games, SMS = SMS. An X in the gap between two adjacent cases means they belong to the same cluster at that level; a blank gap separates clusters.

No. of clusters   Oth   Int   Mus   Cam   Dwn   Sch   T&D   Alm   Gam   SMS
       1          X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
       2          X     X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
       3          X     X  X  X  X  X     X  X  X  X  X  X  X  X  X  X  X
       4          X     X     X  X  X     X  X  X  X  X  X  X  X  X  X  X
       5          X     X     X  X  X     X     X  X  X  X  X  X  X  X  X
       6          X     X     X     X     X     X  X  X  X  X  X  X  X  X
       7          X     X     X     X     X     X     X  X  X  X  X  X  X
       8          X     X     X     X     X     X     X     X  X  X  X  X
       9          X     X     X     X     X     X     X     X     X  X  X



Finally, we saw some basic rules for reaching a better analysis:
1.      Look at the data from different perspectives.
2.      Try to combine the results of different analysis techniques, or of the same technique performed on different variables.

References:
Mathematics Learning Support Centre, Loughborough University, chapter on Cluster Analysis: http://mlsc.lboro.ac.uk/resources/statistics/Clusteranalysis.pdf
SPSS Statistical Procedures Companion: http://www.norusis.com/pdf/SPC_v13.pdf


By-
Padmini Santi
14032

