Wednesday, 5 September 2012

Group B, Day 3, Session 5-6


Today we learnt about clustering methods and used various distance measures:

Two common clustering techniques are hierarchical clustering and k-means clustering.
Hierarchical clustering requires a distance or similarity matrix between all pairs of cases. That's a humongous matrix if we have tens of thousands of cases trapped in our data file: even today's computers will take pause, as will we, waiting for results. A sketch of the idea follows below.
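To make the distance-matrix point concrete, here is a minimal sketch in Python using SciPy (my choice purely for illustration, not necessarily the software we used in the session). Note that pdist has to compute all n(n-1)/2 pairwise distances before the clustering itself can even start, which is exactly why large files hurt:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 3)               # 100 cases, 3 variables
d = pdist(X, metric='euclidean')         # condensed matrix of all pairwise distances
Z = linkage(d, method='average')         # agglomerative merge tree (average linkage)
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 clusters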
K-means clustering is a method that doesn't require computing all possible distances, and it differs from hierarchical clustering in several ways. We have to know in advance the number of clusters we want, so we can't get solutions for a range of cluster counts without rerunning the analysis for each one. The algorithm also repeatedly reassigns cases to clusters, so the same case can move from cluster to cluster during the analysis. It is called k-means, where k is the number of clusters we want, because a case is assigned to the cluster whose mean it is closest to; the action in the algorithm centers around finding those k means.
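As a rough illustration of the algorithm (a hand-rolled sketch in Python with NumPy; the function name kmeans and its parameters are mine, not from the session), each pass assigns every case to its nearest cluster mean and then recomputes the k means, until the assignments stop changing:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Pick k cases at random as the initial cluster means.
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Distance of every case to every cluster mean: shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)        # reassign each case
        if np.array_equal(new_labels, labels):
            break                                # assignments stabilised
        labels = new_labels
        # Recompute each cluster mean from its current members.
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return labels, means

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, means = kmeans(X, k=2)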

The next segment is about the Jaccard distance, which we used today:

The Jaccard Distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1.


 J_{\delta}(A,B) = 1 - J(A,B) = \frac{ |A \cup B| - |A \cap B| }{ |A \cup B| }.

where J(A,B) = \frac{|A \cap B|}{|A \cup B|} is the Jaccard coefficient and A and B are two sample sets.
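A tiny Python sketch of the formula (the helper name jaccard_distance is my own):

def jaccard_distance(A, B):
    # Jaccard distance: 1 - |A intersect B| / |A union B|.
    A, B = set(A), set(B)
    union = A | B
    if not union:
        return 0.0            # both sets empty; define the distance as 0
    return 1.0 - len(A & B) / len(union)

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 1 - 2/4 = 0.5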

Team B
Author: Vivek Agarwal
