Hierarchical Clustering:
Hierarchical clustering is a type of cluster analysis. Today we studied two methods of hierarchical clustering, which is a second-level analytic technique. Being multivariate, it gives us more insight into the data than first-level techniques do. It is typically used when clustering fewer than 50 objects.
The two types of hierarchical clustering are divisive and agglomerative.
Divisive clustering starts with all objects in one big cluster and divides it into smaller and smaller clusters, until each object is a cluster by itself.
Agglomerative clustering starts with each object as a single cluster and keeps merging clusters according to different criteria, till we end up with one big cluster.
Clustering Process:
1. Selection of variables: This depends on the objective of the analysis.
2. Distance measurement: This is a measure of the difference between two individual objects, which tells us whether they are closer together or farther apart. (Similarity can also be measured, for example with correlation.) Distance measurement may be for interval, count, or binary data; a code sketch of these measures follows this list.
a) For interval data, the available methods are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or a customized measure. We used Euclidean distance, the straight-line distance between two objects; it is symmetric, so the distance from a to b is the same as from b to a. It differs from the city-block distance, where a straight path cannot be taken and distance is measured along a grid.
b) For binary data, simple matching or Jaccard methods can be used. Other methods include Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (For these, we enter values for Present and Absent to specify which two values are meaningful.)
c) The counts method is not usually found in a business context. It uses methods like the chi-square measure or the phi-square measure. We used the chi-square measure.
3. Clustering criteria: These may be between a cluster and an object, or between two clusters.
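As a concrete illustration of the distance measures above, here is a minimal Python sketch using scipy.spatial.distance. SciPy is an assumption on my part (the class used SPSS), and the vectors are made up purely for illustration.

```python
# Minimal sketch of the distance measures above, using scipy.spatial.distance.
# The vectors a, b, u, v are made-up values purely for illustration.
import numpy as np
from scipy.spatial import distance

# Interval data
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(distance.euclidean(a, b))       # straight-line distance: 5.0
print(distance.sqeuclidean(a, b))     # squared Euclidean: 25.0
print(distance.cityblock(a, b))       # block (city-block) distance: 7.0
print(distance.chebyshev(a, b))       # Chebychev: largest single gap, 4.0
print(distance.minkowski(a, b, p=3))  # Minkowski with p = 3

# Binary (presence/absence) data
u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])
print(distance.jaccard(u, v))  # Jaccard distance = 1 - Jaccard similarity
print(distance.hamming(u, v))  # mismatch rate; 1 minus this is simple matching
```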
First the variables are mapped on a graph. Then a matrix is created that shows the distances between objects. The least distance is found from the matrix to form the first cluster, so the first cluster consists of the nearest objects; say we name it Cluster A. The next step is to make a matrix of A with the other objects, in which the distance of cluster A from each remaining object is measured. This is the agglomerative technique.
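Here is a bare-bones Python sketch of that first agglomerative step, assuming four made-up 2-D points; for the cluster-to-object distances it uses the nearest member, which is the single linkage idea described in the next section. This is illustrative code, not the class exercise.

```python
# Bare-bones sketch of the first agglomerative steps on four made-up
# 2-D points: build the distance matrix, merge the closest pair, then
# measure the new cluster's distance to the remaining objects.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [0.5, 0.2], [4.0, 4.0], [4.2, 3.9]])
dist = squareform(pdist(points, metric="euclidean"))  # full distance matrix

np.fill_diagonal(dist, np.inf)  # ignore zero self-distances
i, j = np.unravel_index(np.argmin(dist), dist.shape)
print(f"Cluster A = objects {i} and {j}")  # the nearest pair

# Distance of cluster A from each remaining object, using the nearest
# member (single linkage)
for k in set(range(len(points))) - {i, j}:
    print(k, min(dist[i, k], dist[j, k]))
```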
Clustering Algorithms:
1. Nearest neighbour method (single linkage method)
In this method the distance between two clusters is defined to be the distance between the two closest members, or neighbours. This method is relatively simple but is often criticised because it doesn't take account of cluster structure and can result in a problem called chaining, whereby clusters end up being long and straggly. However, it is better than the other methods when the natural clusters are not spherical or elliptical in shape.
2. Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum distance between members, i.e. the distance between the two subjects that are furthest apart. This method tends to produce compact clusters of similar size but, as for the nearest neighbour method, does not take account of cluster structure. It is also quite sensitive to outliers.
3. Average (between groups) linkage method (sometimes referred to as UPGMA)
The distance between two clusters is calculated as the average distance between all pairs of subjects in the two clusters. This is considered to be a fairly robust method.
4. Centroid method
Here the centroid (mean value for each variable) of each cluster is calculated and the distance between centroids is used. Clusters whose centroids are closest together are merged. This method is also fairly robust.
5. Ward's method
In this method all possible pairs of clusters are combined and the sum of the squared distances within each cluster is calculated. This is then summed over all clusters. The combination that gives the lowest sum of squares is chosen. This method tends to produce clusters of approximately equal size, which is not always desirable. It is also quite sensitive to outliers. Despite this, it is one of the most popular methods, along with the average linkage method.
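All five criteria can be compared directly in Python: SciPy's linkage function implements each of them. The sketch below uses random toy data rather than the class data set, so the exact output is illustrative only.

```python
# Sketch comparing the five linkage criteria with SciPy on random toy
# data; the class data set is not reproduced here.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 objects measured on 3 variables

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # full merge history
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

On data like these, Ward's method tends to give the most balanced cluster sizes and single linkage the most straggly ones, matching the descriptions above.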
Dendrogram:
The dendrogram is critical in hierarchical clustering.
When carrying out a hierarchical cluster analysis, the process can be represented on a diagram known as a dendrogram. This diagram illustrates which clusters have been joined at each stage of the analysis and the distance between clusters at the time of joining. If there is a large jump in the distance between clusters from one stage to another, this suggests that at one stage clusters that are relatively close together were joined whereas, at the following stage, the clusters that were joined were relatively far apart. The number of clusters depends on where you cut off the dendrogram; the optimum number of clusters is the number present just before that large jump in distance.
For example, in the example taken in class, if we take the cut-off line at 4, we get the optimum number of clusters, as we see that variable 4 joins the cluster only after a significant gap. Thus we form a cluster using the variables funused0, funused5, funused1, and funused7, which stand for SMS, games, alarm, and date & time.
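A minimal sketch of drawing such a dendrogram and a cut-off line in Python, using SciPy and matplotlib. The funused* labels mirror the class example, but the data here are random, so the resulting tree will not match the one from class.

```python
# Sketch of drawing a dendrogram and a cut-off line with SciPy and
# matplotlib. The funused* labels mirror the class example, but the
# data are random, so this tree will not match the one from class.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))                 # 8 made-up objects
labels = [f"funused{i}" for i in range(8)]

Z = linkage(X, method="average")
dendrogram(Z, labels=labels)

# Cut midway between the last two joins; the number of branches the
# line crosses is the number of clusters at that cut-off.
cut = (Z[-2, 2] + Z[-1, 2]) / 2
plt.axhline(y=cut, linestyle="--")
plt.ylabel("distance at joining")
plt.show()
```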
Proximity Matrix:
A square matrix in which the entry in cell (j, k) is a measure of the similarity (or distance) between the items to which row j and column k correspond.
The matrix is symmetric in our case, meaning that the numbers in the lower half are the same as those in the top half, so only the lower half of a symmetric matrix is displayed. The diagonal shows 1s; since the entries here are similarities rather than distances, the highest value indicates the most proximity, and those objects are clustered first.
Proximity Matrix

| Case | SMS | Alarm | Camera | Scheduler | Music / Radio Playback | Games | Internet | Time and Date | Download | Other |
|---|---|---|---|---|---|---|---|---|---|---|
| SMS | 1.000 | .835 | .310 | .553 | .395 | .862 | .211 | .798 | .487 | .048 |
| Alarm | .835 | 1.000 | .263 | .629 | .342 | .807 | .206 | .798 | .455 | .041 |
| Camera | .310 | .263 | 1.000 | .243 | .520 | .304 | .469 | .249 | .364 | .038 |
| Scheduler | .553 | .629 | .243 | 1.000 | .388 | .571 | .228 | .614 | .486 | .063 |
| Music / Radio Playback | .395 | .342 | .520 | .388 | 1.000 | .414 | .370 | .384 | .417 | .060 |
| Games | .862 | .807 | .304 | .571 | .414 | 1.000 | .222 | .798 | .511 | .044 |
| Internet | .211 | .206 | .469 | .228 | .370 | .222 | 1.000 | .209 | .353 | .094 |
| Time and Date | .798 | .798 | .249 | .614 | .384 | .798 | .209 | 1.000 | .478 | .056 |
| Download | .487 | .455 | .364 | .486 | .417 | .511 | .353 | .478 | 1.000 | .072 |
| Other | .048 | .041 | .038 | .063 | .060 | .044 | .094 | .056 | .072 | 1.000 |
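To see where such a table comes from, here is a small Python sketch that builds a symmetric proximity matrix. The class output above came from SPSS; the correlation measure and the made-up usage data below are my own assumptions for illustration only.

```python
# Sketch of building a symmetric proximity matrix like the one above.
# The class table came from SPSS; the correlation measure and the
# made-up usage data below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
usage = rng.integers(0, 2, size=(100, 10))  # 100 respondents x 10 features

proximity = np.corrcoef(usage, rowvar=False)  # 10 x 10 similarity matrix
print(np.round(proximity, 3))                 # symmetric, 1.000 diagonal
```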
Agglomeration Schedule:
This table tells us in what sequence objects are added to a cluster.
Agglomeration Schedule

| Stage | Cluster Combined: Cluster 1 | Cluster Combined: Cluster 2 | Coefficients | Cluster 1 First Appears at Stage | Cluster 2 First Appears at Stage | Next Stage |
|---|---|---|---|---|---|---|
| 1 | 1 | 6 | .862 | 0 | 0 | 2 |
| 2 | 1 | 2 | .821 | 1 | 0 | 3 |
| 3 | 1 | 8 | .798 | 2 | 0 | 4 |
| 4 | 1 | 4 | .592 | 3 | 0 | 6 |
| 5 | 3 | 5 | .520 | 0 | 0 | 7 |
| 6 | 1 | 9 | .484 | 4 | 0 | 8 |
| 7 | 3 | 7 | .419 | 5 | 0 | 8 |
| 8 | 1 | 3 | .306 | 6 | 7 | 9 |
| 9 | 1 | 10 | .057 | 8 | 0 | 0 |
The agglomeration schedule starts off using the case numbers. Once cases are added to clusters, the cluster number is always the lowest of the case numbers in the cluster: a cluster formed by merging cases 3 and 4 would be known as cluster 3. The columns labeled Stage Cluster First Appears tell us the step at which each of the two clusters being joined first appears. For a small data set, it is more convenient to look at the dendrogram than to try to follow the step-by-step clustering summarized in the agglomeration schedule. The agglomeration schedule is more useful if you want to see the coefficient at which clusters are combined.
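For comparison, SciPy's linkage matrix plays the same role as SPSS's agglomeration schedule, with one difference in convention: SPSS labels a merged cluster with the lowest case number, while SciPy assigns fresh indices to newly formed clusters. A small sketch, on made-up data:

```python
# SciPy's linkage matrix plays the same role as the agglomeration
# schedule: row i describes stage i + 1. Data are made up. Note the
# numbering convention differs from SPSS: instead of reusing the
# lowest case number, SciPy gives each new cluster a fresh index
# (n, n + 1, ... for n original cases).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))  # 10 made-up cases
Z = linkage(X, method="average")

print("stage  cluster 1  cluster 2  coefficient")
for stage, (c1, c2, coeff, size) in enumerate(Z, start=1):
    print(f"{stage:5d}  {int(c1):9d}  {int(c2):9d}  {coeff:11.3f}")
```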
An icicle plot is a visual representation of the agglomeration schedule.
Vertical Icicle
(an X between two case columns means those cases belong to the same cluster at that level)

| Number of clusters | Other | | Internet | | Music / Radio Playback | | Camera | | Download | | Scheduler | | Time and Date | | Alarm | | Games | | SMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| 2 | X | | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| 3 | X | | X | X | X | X | X | | X | X | X | X | X | X | X | X | X | X | X |
| 4 | X | | X | | X | X | X | | X | X | X | X | X | X | X | X | X | X | X |
| 5 | X | | X | | X | X | X | | X | | X | X | X | X | X | X | X | X | X |
| 6 | X | | X | | X | | X | | X | | X | X | X | X | X | X | X | X | X |
| 7 | X | | X | | X | | X | | X | | X | | X | X | X | X | X | X | X |
| 8 | X | | X | | X | | X | | X | | X | | X | | X | X | X | X | X |
| 9 | X | | X | | X | | X | | X | | X | | X | | X | | X | X | X |
Finally, we saw some basic rules for reaching a better analysis:
1. Look at the data from different perspectives.
2. Try to combine either the results of different analysis techniques, or the same technique performed on different variables.
References:
1. Mathematics Learning Support Centre, chapter on Cluster Analysis: http://mlsc.lboro.ac.uk/resources/statistics/Clusteranalysis.pdf
2. http://www.norusis.com/pdf/SPC_v13.pdf
By:
Padmini Santi
14032