SIBM B- Business Analytics: Group A Session 4

Cluster Analysis

Today we learned about the concept of cluster analysis and its various applications.

Clustering, or segmentation, refers to methods for grouping similar objects into categories.
It is one of the most useful tasks in the data mining process for discovering groups and
identifying interesting distributions and patterns in the underlying data. The clustering problem
is about partitioning a given data set into groups (clusters) such that the data points in a
cluster are more similar to each other than to points in different clusters.
For example, consider a retail database containing records of the items purchased by customers.
A clustering procedure could group the customers so that those with similar
buying patterns are in the same cluster. Thus, the main concern in the clustering process
is to reveal the organization of patterns into “sensible” groups, which allows us to discover
similarities and differences, as well as to derive useful conclusions about them. This idea is
applicable in many fields, such as the life sciences, medical sciences and engineering.

Two commonly used clustering methods are hierarchical clustering and k-means clustering.

Hierarchical Clustering

Connectivity based clustering, also known as hierarchical clustering, is based on the idea of objects being more related to nearby objects than
to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely
by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using
a dendrogram. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge
with each other at certain distances. As a rule of thumb, this method is mostly used when there are fewer than 50 objects.
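
To make the idea of "different clusters at different distances" concrete, here is a minimal sketch using only the Python standard library. It assumes single linkage (two objects end up in the same cluster when a chain of links no longer than the chosen distance connects them); the function name clusters_at_distance and the sample points are made up for illustration.

```python
from itertools import combinations
from math import dist


def clusters_at_distance(points, threshold):
    """Return clusters (as lists of point indices) formed by linking
    every pair of points whose distance is <= threshold."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)  # join the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
for threshold in (0.5, 1.5, 6.0, 15.0):
    print(threshold, clusters_at_distance(points, threshold))
```

Raising the threshold step by step gives five, then three, then two, then one cluster, which is the same information a dendrogram shows when it is cut at different heights.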

K-means Clustering

In k-means clustering, also known as centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set.
When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects
to the nearest cluster center, such that the sum of squared distances from each object to its assigned cluster center is minimized.

Most k-means-type algorithms require the number of clusters, k, to be specified in advance, which is considered one of the biggest drawbacks of these algorithms.
Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut
borders between clusters (which is not surprising, as the algorithm optimizes cluster centers, not cluster borders).
As a rule of thumb, this method is mostly used when there are more than 50 objects.
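
The optimization described above is commonly approached with Lloyd's algorithm, which alternates between assigning objects to the nearest center and moving each center to the mean of its objects. The sketch below is one simple way to write it with the standard library only. It assumes the starting centers are passed in by hand (real implementations usually pick them randomly), and the names k_means and within_cluster_ss and the sample data are made up for illustration.

```python
from math import dist


def k_means(points, centers, max_iter=100):
    """Lloyd's algorithm: repeat assignment and update steps until the
    centers stop moving (or max_iter is reached)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: clusters match the returned centers
        centers = new_centers
    return centers, clusters


def within_cluster_ss(centers, clusters):
    """The quantity k-means tries to minimize: the sum of squared
    distances from each point to its cluster center."""
    return sum(dist(p, c) ** 2 for c, members in zip(centers, clusters) for p in members)


# Hypothetical data: two well-separated groups, k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
print(centers)
print(within_cluster_ss(centers, clusters))
```

Note that the centers it finds are means of the members and are not themselves points in the data set, which matches the remark that the central vector need not be a member of the data.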

Hierarchical clustering is further divided into two types:

Agglomerative & Divisive Clustering

The agglomerative hierarchical clustering method works bottom-up.
In this method, each object starts as its own cluster. These single-object
clusters are merged into larger clusters, and the merging continues until all
of them have been merged into one big cluster that contains all the objects.
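
The bottom-up merging can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: it uses single linkage (the distance between two clusters is the distance between their closest members), which is one of several possible choices, and the function name agglomerate and the one-dimensional sample points are made up.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Merge the two closest clusters until one remains.
    Returns the merge history as (cluster_a, cluster_b, distance),
    where clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    history = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        history.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return history


# Hypothetical one-dimensional data, written as 1-tuples.
points = [(1.0,), (2.0,), (6.0,), (7.5,), (15.0,)]
for a, b, d in agglomerate(points):
    print(a, "+", b, "at distance", d)
```

The merge history, read from first to last, is exactly what a dendrogram draws from the bottom up: each merge is a junction, and its distance is the height of that junction.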

The divisive hierarchical clustering method works top-down. In this
method, all the objects start in one big cluster, and that cluster is
repeatedly divided into smaller clusters until each cluster contains a single object.
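
A top-down split can be sketched in a similar way. The splitting rule below is a deliberately simple assumption, not a standard named algorithm: the widest cluster is split by taking its two farthest-apart members as seeds and sending every member to the nearer seed. The sketch also assumes that all points are distinct; the names diameter, split and divide and the sample data are made up for illustration.

```python
from itertools import combinations
from math import dist


def diameter(cluster):
    """Largest distance between any two members (0.0 for a single object)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def split(cluster):
    """Split around the two farthest-apart members (assumed heuristic)."""
    seed_a, seed_b = max(combinations(cluster, 2), key=lambda pair: dist(*pair))
    left = [p for p in cluster if dist(p, seed_a) <= dist(p, seed_b)]
    right = [p for p in cluster if dist(p, seed_a) > dist(p, seed_b)]
    return left, right


def divide(points):
    """Start with one big cluster and keep splitting the widest one until
    every cluster holds a single object. Returns the clustering at each level."""
    clusters = [list(points)]
    levels = [[list(c) for c in clusters]]
    while any(len(c) > 1 for c in clusters):
        widest = max(clusters, key=diameter)
        clusters.remove(widest)
        clusters.extend(split(widest))
        levels.append([list(c) for c in clusters])
    return levels


# Hypothetical one-dimensional data, written as 1-tuples (all distinct).
points = [(1.0,), (2.0,), (6.0,), (7.5,), (15.0,)]
for level in divide(points):
    print(level)
```

Each printed line has one more cluster than the line before it, ending with five single-object clusters.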

Clustering Process
It is defined by the answer to the question "Why am I forming the groups?"

Following are the steps involved in the clustering process:

Selection of variables
Distance Measurement
Clustering criteria
Mapping
Distance matrix
Dendrogram
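
As a rough illustration of the first steps in this list (selecting variables, measuring distance, building the distance matrix), here is a minimal standard-library sketch. It assumes Euclidean distance as the distance measure; the customer records, the variable names and the function name distance_matrix are all hypothetical.

```python
from math import dist

# Hypothetical customer records; only some fields are used for clustering.
customers = [
    {"name": "A", "visits": 1, "spend": 1.0},
    {"name": "B", "visits": 4, "spend": 5.0},
    {"name": "C", "visits": 1, "spend": 13.0},
]


def distance_matrix(records, variables):
    """Select the given variables from each record and return the
    matrix of pairwise Euclidean distances between records."""
    rows = [[float(r[v]) for v in variables] for r in records]
    return [[dist(a, b) for b in rows] for a in rows]


matrix = distance_matrix(customers, variables=["visits", "spend"])
for row in matrix:
    print([round(d, 2) for d in row])
```

The matrix is symmetric with zeros on the diagonal. In practice, variables measured on very different scales are usually standardized before distances are computed, so that no single variable dominates; that step is left out here to keep the sketch short.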

Applications of clustering include the following:

Data reduction
Hypothesis generation
Hypothesis testing
Prediction based on groups

More specifically, clustering finds applications in the following areas:

Business
Biology
Spatial data analysis
Web Mining

Supriya Suman (14171)
Operations, Group A

Wednesday, 5 September 2012
