Tuesday, 11 September 2012

Session 7 & 8 - Group B (Ashim A Ekka)

Written by Ashim Abhinav Ekka (14133)

Cluster Analysis: Also called clustering, this is the task of partitioning a set of cases into clusters so that the cases in the same cluster are more similar to each other than to cases in other clusters.

[Figure: four different clusters, each differing from the others in some aspect]

The objective of clustering is to determine the intrinsic grouping in a set of unlabelled data with a large number of variables and observations.

Hierarchical clustering: A method of cluster analysis which seeks to build a hierarchy of clusters. It is most appropriate for small samples; when the sample is large, the algorithm may be very slow to reach a solution. In general, users should consider K-Means clustering when the sample size is larger than 200. Hierarchical clustering is of two types:

·         Agglomerative (bottom-up approach): every observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
·         Divisive (top-down approach): all observations start in a single cluster, and splits are performed recursively as one moves down the hierarchy.

The results of hierarchical clustering can be presented in a dendrogram.
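The agglomerative bottom-up process can be sketched with SciPy. This is a minimal example on a made-up toy dataset (the points, the choice of average linkage, and the cut into two clusters are all illustrative assumptions, not from the original notes):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separate clumps (made-up example values)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# Agglomerative clustering: each point starts in its own cluster and
# the closest pair of clusters is merged at every step.
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` is exactly what `scipy.cluster.hierarchy.dendrogram(Z)` draws, so the same object yields both the flat cluster labels and the dendrogram mentioned above.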

Distance measure of hierarchical clusters:

·         Interval – Euclidean
·         Counts – Chi-squared
·         Binary - Jaccard
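These measures can be computed directly. A small sketch with SciPy: the Euclidean and Jaccard distances use SciPy's own functions, while the chi-squared distance is written out by hand in one common cell-wise form (an assumption on my part; software packages such as SPSS define the counts measure somewhat differently):

```python
import numpy as np
from scipy.spatial.distance import euclidean, jaccard

# Interval data -> Euclidean distance (3-4-5 right triangle, distance 5)
d_euc = euclidean([0.0, 0.0], [3.0, 4.0])

# Binary data -> Jaccard distance: disagreements divided by positions
# where at least one vector is non-zero (here 2 of 4 -> 0.5)
d_jac = jaccard([1, 0, 1, 1], [1, 1, 0, 1])

# Count data -> one common form of the chi-squared distance
# (assumed form: sqrt of sum of (x - y)^2 / (x + y) over cells)
def chi2_distance(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2 / (x + y)))

d_chi = chi2_distance([10, 20, 30], [12, 18, 30])
print(d_euc, d_jac, d_chi)
```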

Different cluster methods:

·         Nearest neighbour: In this method, the distance between two clusters is taken to be the distance between their closest neighbouring objects. This method is recommended if plotted clusters are elongated.
·         Furthest neighbour: In this method, the distance between two clusters is the maximum distance between two objects in different clusters. This method is recommended if the plotted clusters form distinct clumps (not elongated chains).
·         Group average: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the different clusters. This method is usually recommended as it makes use of more information.
·         Centroid: The pair of clusters merged is the one whose centroids (computed over all variables) are closest. The centroid of a cluster is the average point in the multidimensional space.
·         Median: This method is identical to the Centroid method but is unweighted. It should not be used when cluster sizes vary markedly. 
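The first four criteria above differ only in how they turn the pairwise point distances into a single between-cluster distance, which a few lines of NumPy make concrete. The two clusters below are made-up points chosen so the numbers are easy to check:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters (made-up points) to illustrate the linkage criteria
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(A, B)  # all pairwise distances between the clusters: 4, 6, 3, 5

single   = D.min()    # nearest neighbour: closest pair of points -> 3.0
complete = D.max()    # furthest neighbour: farthest pair -> 6.0
average  = D.mean()   # group average: (4 + 6 + 3 + 5) / 4 -> 4.5
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # 0.5 vs 5.0 -> 4.5
print(single, complete, average, centroid)
```

For these particular points the group-average and centroid distances happen to coincide; in general they do not.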

Example of hierarchical clustering:

In marketing: finding groups of customers with similar behaviour, given a large database of customer data containing their properties and past buying records.

