Wednesday, 5 September 2012

Group B, Session 5/6


Today we learned about cluster analysis, its applications, and OLAP cubes. We started with the “Mobile Services” example, which contained data on mobile usage among customers. Cluster analysis classifies a set of observations into two or more mutually exclusive groups that are not known in advance, based on combinations of interval variables. Its purpose is to discover a system for organizing observations, usually people, into groups whose members share properties in common.
We covered two types of cluster analysis: hierarchical and K-means. As a rule of thumb from class, we use hierarchical clustering when there are fewer than 50 data points and K-means clustering when there are more. Generally, we use hierarchical clustering for the variables and K-means for the cases.
The basic hierarchical clustering algorithm is:
1.      Compute the proximity matrix
2.      Repeat
        a.      Merge the two closest clusters
        b.      Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
3.      Until only one cluster remains
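The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in class: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and the function names and sample points are made up for the example. For brevity it recomputes cluster proximities from the point-to-point matrix on each pass instead of updating a cluster-level matrix in place.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def agglomerative(points):
    """Single-linkage agglomerative clustering (an assumed linkage choice).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of point indices.
    """
    n = len(points)
    # Step 1: proximity matrix between every pair of points.
    proximity = {
        frozenset((i, j)): euclidean(points[i], points[j])
        for i, j in combinations(range(n), 2)
    }

    def linkage(a, b):
        # Single linkage: distance between the closest members of a and b.
        return min(proximity[frozenset((i, j))] for i in a for j in b)

    clusters = [(i,) for i in range(n)]  # start with one cluster per point
    merges = []
    # Steps 2-3: keep merging the closest pair until one cluster remains.
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical 2-D observations.
points = [(0, 0), (0, 1), (5, 5), (5, 7), (20, 20)]
for a, b, d in agglomerative(points):
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The merge history is the information a dendrogram displays: which clusters joined, and at what distance.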
Dendrogram
We learned how to interpret the dendrogram. A dendrogram that clearly differentiates groups of objects will have small distances in the far branches of the tree and large differences in the near branches.
Figure: http://www.psychstat.missouristate.edu/multibook/Images/mlt0406.gif
We draw a line across the dendrogram to determine the clusters. The line is drawn at the point where the distance at which clusters merge suddenly increases. The dendrogram, together with the proximity matrix, gives us enough information to identify the different clusters.
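One simple way to make the "sudden increase" rule concrete is to look for the largest gap between successive merge distances and cut there. The sketch below assumes that heuristic; the function name and the merge distances are hypothetical, and in practice the cut is usually chosen by inspecting the dendrogram as well.

```python
def clusters_at_largest_jump(merge_distances):
    """Suggest a cluster count by cutting at the largest jump in merge distance.

    merge_distances lists the distance at which each successive merge
    happened, in order. With n observations there are n - 1 merges.
    """
    if len(merge_distances) < 2:
        raise ValueError("need at least two merges to compare distances")
    n_points = len(merge_distances) + 1
    gaps = [
        later - earlier
        for earlier, later in zip(merge_distances, merge_distances[1:])
    ]
    # Number of merges performed before the cut line is reached.
    merges_before_cut = gaps.index(max(gaps)) + 1
    return n_points - merges_before_cut


# Hypothetical merge distances for five observations.
print(clusters_at_largest_jump([1.0, 2.0, 6.4, 19.85]))  # 2
```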
Distance Measure
The interpretation of the clusters depends on the distance measure we choose, so the measure of distance between data points is an important component of any clustering algorithm. If the components of the data vectors are all in the same physical units, the simple Euclidean distance may be enough to group similar data instances successfully. Even in this case, however, the Euclidean distance can sometimes be misleading.
For interval data, Euclidean distance is the simplest measure we can use for clustering. It is simply the length of the straight line drawn between two points. There are many other measures, such as Minkowski and Chebyshev distance.
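To make the differences between these measures concrete, here is a small sketch using only the Python standard library. The function names and the sample points are illustrative, not taken from the class material.

```python
import math


def euclidean(p, q):
    """Length of the straight line between p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def minkowski(p, q, order):
    """Minkowski distance; order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


def chebyshev(p, q):
    """Largest absolute difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (0, 0), (3, 4)
print(euclidean(p, q))     # 5.0
print(minkowski(p, q, 1))  # 7.0
print(chebyshev(p, q))     # 4
```

The same pair of points is 5, 7, or 4 units apart depending on the measure, which is why the resulting clusters can differ.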
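For binary data, the Jaccard measure was introduced in class. The Jaccard similarity is the ratio of the number of people who said yes to both variables to the number of people who said yes to at least one of them; people who said no to both are excluded. The Jaccard distance is one minus this similarity.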
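A small sketch of the Jaccard measure as it is usually defined: joint "yes" answers divided by the cases with at least one "yes", so joint "no" answers are ignored. The variable names and survey answers below are hypothetical, and returning 0.0 when nobody said yes to either variable is an assumed convention.

```python
def jaccard_similarity(x, y):
    """Jaccard similarity between two equal-length yes/no (1/0) variables."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: similarity is 0.0 if no one said yes to either.
    return both / either if either else 0.0


def jaccard_distance(x, y):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(x, y)


# Hypothetical answers from six customers (1 = yes, 0 = no).
owns_smartphone = [1, 1, 0, 0, 1, 0]
uses_data_plan = [1, 0, 0, 1, 1, 0]
print(jaccard_similarity(owns_smartphone, uses_data_plan))  # 0.5
print(jaccard_distance(owns_smartphone, uses_data_plan))    # 0.5
```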

OLAP Cubes
We were also introduced to OLAP cubes. Online analytical processing, or OLAP, is an approach to answering multi-dimensional analytical queries swiftly. OLAP tools enable users to interactively analyze multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.
A slice is a subset of a multi-dimensional array obtained by fixing a single value for one or more of the dimensions, so that those dimensions no longer appear in the subset. The dice operation selects specific values on two or more dimensions of a data cube, producing a smaller sub-cube. Drilling down or up is an analytical technique whereby the user navigates among levels of data, ranging from the most summarized (up) to the most detailed (down). A roll-up (also called aggregation or consolidation) involves computing all of the data relationships for one or more dimensions; to do this, a computational relationship or formula might be defined.
Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. In contrast, the drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region’s sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.
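The operations above can be illustrated on a tiny in-memory "cube" using plain Python. This is only a sketch: the sales rows, dimension names, and helper functions are hypothetical, and real OLAP tools precompute and index these aggregates instead of scanning rows.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions and one measure (amount).
SALES = [
    {"region": "North", "product": "A", "quarter": "Q1", "amount": 100},
    {"region": "North", "product": "B", "quarter": "Q1", "amount": 150},
    {"region": "North", "product": "A", "quarter": "Q2", "amount": 120},
    {"region": "South", "product": "A", "quarter": "Q1", "amount": 80},
    {"region": "South", "product": "B", "quarter": "Q2", "amount": 60},
]


def aggregate(rows, dims):
    """Sum the amount for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose values fall in the allowed set per dimension."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


# Roll-up: summarize everything to the region level.
by_region = aggregate(SALES, ["region"])
# Drill-down: break each region's total out by product.
by_region_product = aggregate(SALES, ["region", "product"])
# Slice: only the first quarter.
q1_only = slice_cube(SALES, "quarter", "Q1")
# Dice: product A in the North region.
north_a = dice_cube(SALES, region={"North"}, product={"A"})

print(by_region)          # {('North',): 370, ('South',): 140}
print(by_region_product)  # North/A is 220, North/B is 150, and so on
print(len(q1_only), sum(r["amount"] for r in north_a))  # 3 220
```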

Author: Manu Joseph
