Today we learned about cluster analysis, its applications,
and OLAP cubes. We started with the “Mobile Services” example, which had data
about mobile usage among customers. Cluster analysis classifies a set of
observations into two or more mutually exclusive groups that are not known in
advance, based on combinations of interval variables. The purpose of cluster
analysis is to discover a system for organizing observations, usually people,
into groups whose members share properties in common.
We covered two types of cluster analysis: hierarchical and
K-means. As a rule of thumb from class, we use hierarchical cluster analysis if
the number of data points is less than 50, and K-means clustering if it is
greater than 50. Generally we use hierarchical clustering for
the variables and K-means for the cases.
The basic hierarchical clustering algorithm is:
1. Compute the proximity matrix
2. Repeat
   a. Merge the closest two clusters
   b. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
3. Until only one cluster remains
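To make the steps above concrete, here is a minimal sketch in plain Python. It rests on a few assumptions that were not part of the class material: Euclidean distance as the proximity, single linkage (the proximity of two clusters is the distance between their closest members), and made-up sample points. The function name `agglomerative` is also just illustrative.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def agglomerative(points):
    """Merge the two closest clusters until only one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of indices into `points`.
    Assumption: single linkage.
    """
    # Step 1: proximity matrix between the original points.
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

    clusters = [(i,) for i in range(n)]
    history = []

    # Steps 2-3: repeat until only one cluster remains.
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # "Updating" the proximities: instead of storing a
                # cluster-level matrix, this sketch recomputes each cluster
                # pair's proximity from the point-level matrix every round.
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        history.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return history


# Made-up 2-D observations, purely for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
history = agglomerative(points)
for a, b, d in history:
    print(a, b, round(d, 3))
```

The merge history is the information a dendrogram plots: each row is one join, and the distance is the height at which the join happens. A different linkage rule (complete, average) would only change the `min(...)` line.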
Dendrogram
We learned how to interpret the
dendrogram. A dendrogram that clearly differentiates groups of objects will have
small distances in the far branches of the tree and large distances in the
near branches.
We draw a line across the dendrogram to
determine the clusters. The line is drawn at the point where the distance at
which clusters merge suddenly increases.
The dendrogram, along with the proximity matrix, provides us with enough data
to identify the different clusters.
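The "sudden increase" rule can be sketched in a few lines. This assumes we already have the distances at which successive merges happened, in merge order; the numbers below are made up, and the helper name `clusters_at_largest_jump` is hypothetical. It is one simple heuristic for placing the line, not the only one.

```python
def clusters_at_largest_jump(merge_distances, n_points):
    """Suggest where to cut: just before the biggest jump in merge distance.

    Returns (number_of_clusters, cut_height). Needs at least two merges.
    """
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    cut = jumps.index(max(jumps))      # index of the last merge we keep
    merges_kept = cut + 1
    # Each merge reduces the number of clusters by one.
    n_clusters = n_points - merges_kept
    # Draw the line halfway between the last kept merge and the next one.
    cut_height = (merge_distances[cut] + merge_distances[cut + 1]) / 2
    return n_clusters, cut_height


# Hypothetical merge heights for five observations (four merges).
merge_distances = [1.0, 1.0, 5.0, 5.66]
k, height = clusters_at_largest_jump(merge_distances, n_points=5)
print(k, height)
```

Here the first two merges happen at a distance of 1.0 and the third only at 5.0, so the line falls between them and leaves three clusters.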
Distance Measure
The interpretation of the
clustering will vary with the distance measure we choose. An important component of
a clustering algorithm is the distance measure between data points. If the
components of the data instance vectors are all in the same physical units, then
the simple Euclidean distance metric may be sufficient to
successfully group similar data instances. However, even in this case the
Euclidean distance can sometimes be misleading.
For interval data,
Euclidean distance is the simplest of the measures we can use for clustering. Euclidean
distance is simply the length of the straight line drawn between two
points. There are many other measures as well, such as Minkowski and Chebyshev.
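As a quick comparison of the interval measures named above, here is a small sketch using their standard textbook definitions. The two points are made up, and the parameter name `r` for the Minkowski order is my own choice.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length points."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def euclidean(p, q):
    """Euclidean distance is the Minkowski distance with r = 2."""
    return minkowski(p, q, 2)


def chebyshev(p, q):
    """Chebyshev distance: the largest difference along any one coordinate."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))     # straight-line length
print(minkowski(p, q, 1))  # r = 1 sums the coordinate differences
print(chebyshev(p, q))     # only the biggest coordinate difference counts
```

The same pair of points is 5, 7, or 4 apart depending on the measure, which is why the choice of measure changes how the clusters come out.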
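For binary data, the Jaccard
measure was introduced in class. The Jaccard similarity is the ratio of the number of
people who said yes to both variables to the number of people who said yes to
at least one of them; people who said no to both are ignored. The Jaccard
distance is one minus this similarity.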
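A small worked example of the Jaccard measure for two yes/no variables follows. The survey answers are made up, and returning 0.0 when nobody said yes to either variable is a convention chosen for this sketch, not something from class.

```python
def jaccard_similarity(x, y):
    """Jaccard similarity for two equal-length lists of 0/1 answers."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: define the similarity as 0.0 when there are no "yes" answers.
    return both / either if either else 0.0


def jaccard_distance(x, y):
    """Jaccard distance is one minus the similarity."""
    return 1.0 - jaccard_similarity(x, y)


# Hypothetical answers from six people (1 = yes, 0 = no).
uses_sms = [1, 1, 0, 0, 1, 0]
uses_data = [1, 0, 0, 1, 1, 0]
print(jaccard_similarity(uses_sms, uses_data))
print(jaccard_distance(uses_sms, uses_data))
```

Two people said yes to both and four said yes to at least one, so the similarity is 2/4; the two people who said no to both do not affect the result.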
OLAP Cubes
We were also introduced to OLAP
cubes. Online analytical processing, or OLAP, is an approach to swiftly answering
multi-dimensional analytical queries. OLAP tools enable users to interactively
analyze multidimensional data from multiple perspectives. OLAP consists of
three basic analytical operations: consolidation (roll-up), drill-down, and
slicing and dicing.
A slice is a subset of a
multi-dimensional array obtained by fixing a single value for one or more of
the dimensions, which then drop out of the subset. The dice operation selects
specific values on two or more dimensions of a data cube, producing a smaller
sub-cube. Drilling
down or up is a specific analytical technique whereby the user navigates among
levels of data ranging from the most summarized (up) to the most detailed
(down). A roll-up (also called aggregation or consolidation) involves computing
all of the data relationships for one or more dimensions. To do this, a computational
relationship or formula might be defined.
Consolidation involves the
aggregation of data that can be accumulated and computed in one or more
dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends. In contrast, the
drill-down is a technique that allows users to navigate through the details.
For instance, users can view the sales by individual products that make up a
region’s sales. Slicing and dicing is a feature whereby users can take out
(slicing) a specific set of data from the OLAP cube and view (dicing) the slices
from different viewpoints.
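The OLAP operations above can be imitated on a tiny in-memory "cube" to see what each one does. This is only a sketch: real OLAP tools work on pre-aggregated cubes, and the sales rows, dimension names, and helper functions (`slice_cube`, `dice_cube`, `roll_up`) are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical fact table: four dimensions and one measure ("amount").
sales = [
    {"region": "North", "office": "Delhi", "product": "Phone", "year": 2010, "amount": 120},
    {"region": "North", "office": "Delhi", "product": "Plan", "year": 2010, "amount": 80},
    {"region": "North", "office": "Jaipur", "product": "Phone", "year": 2010, "amount": 50},
    {"region": "South", "office": "Chennai", "product": "Phone", "year": 2010, "amount": 200},
    {"region": "South", "office": "Chennai", "product": "Plan", "year": 2011, "amount": 90},
    {"region": "South", "office": "Kochi", "product": "Plan", "year": 2011, "amount": 60},
]


def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose value on each named dimension is in the allowed set."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def roll_up(rows, dims, measure="amount"):
    """Roll-up: total the measure over every combination of the given dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r[measure]
    return dict(totals)


by_region = roll_up(sales, ["region"])                  # offices rolled up to regions
north = slice_cube(sales, "region", "North")
north_by_office = roll_up(north, ["office"])            # drill down into one region
sales_2010 = slice_cube(sales, "year", 2010)            # slice on year
south_plans = dice_cube(sales, region={"South"}, product={"Plan"})  # dice on two dimensions

print(by_region)
print(north_by_office)
print(len(sales_2010), len(south_plans))
```

Drill-down is the reverse of roll-up here: grouping by a finer dimension (office) inside one region shows the detail behind that region's total.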
Group B
Author: Manu Joseph