Ankur Goel (Group D)
Different Types of Clustering Techniques:
Connectivity-based clustering (hierarchical clustering)
Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea that objects are more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, and the result can be represented as a dendrogram. This is where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.
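As a minimal sketch (assuming SciPy is installed and X is a small toy array of observations, both of which are illustrative assumptions), single-linkage hierarchical clustering and a flat cut of the hierarchy can be obtained like this:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data: two loose groups of 2-D points (illustrative only)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Build the merge hierarchy using single linkage (minimum pairwise distance)
Z = linkage(X, method="single")

# Cut the hierarchy at a chosen distance threshold to obtain flat clusters
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)

# dendrogram(Z) would draw the merge tree, with merge distance on the y-axis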
K-means Clustering
In data mining, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms are commonly employed that converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, as both take an iterative refinement approach. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization approach allows clusters to have different shapes.
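A minimal sketch using scikit-learn's KMeans (assuming scikit-learn is installed; the data and the choice of k are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Partition the observations into k = 2 clusters by nearest mean
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index of each observation
print(km.cluster_centers_)  # the two cluster means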
Mean shift clustering
Basic mean shift clustering algorithms maintain a set of data points the same size as the input data set. Initially, this set is a copy of the input set. Then each point in this set is iteratively replaced by the mean of those points that lie within a given distance of it. By contrast, k-means restricts the updated set to k points, usually far fewer than the number of points in the input data set, and replaces each of these points by the mean of all input points that are closer to it than to any other (i.e. within the Voronoi partition of each updating point). A mean shift variant that is similar to k-means, called likelihood mean shift, replaces each point in the set being updated by the mean of all input points within a given distance of the changing set.
One advantage of mean shift over k-means is that the number of clusters does not need to be chosen in advance, because mean shift will find only a few clusters if only a small number exist. However, mean shift can be much slower than k-means, and it still requires the selection of a bandwidth parameter. Mean shift has soft variants, much as k-means does.
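A minimal sketch using scikit-learn's MeanShift (assuming scikit-learn is installed; the bandwidth value is an illustrative assumption):

import numpy as np
from sklearn.cluster import MeanShift

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Bandwidth is the kernel radius used when shifting points toward local means
ms = MeanShift(bandwidth=2.0).fit(X)
print(ms.labels_)           # cluster assignments
print(ms.cluster_centers_)  # modes found; their number is not fixed in advance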
Consensus clustering
Consensus clustering has emerged as an important elaboration of the classical clustering problem. Also called aggregation of clusterings (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering that is a better fit, in some sense, than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and it has been shown to be NP-complete. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.
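There is no single standard algorithm, but one common approach builds a co-association matrix (how often each pair of objects is grouped together across the input clusterings) and then clusters that matrix. The sketch below assumes this co-association approach and uses hypothetical toy labelings:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Three hypothetical input clusterings of the same 6 objects
clusterings = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
])

n = clusterings.shape[1]
# Co-association: fraction of input clusterings that place each pair together
co = np.zeros((n, n))
for labels in clusterings:
    co += (labels[:, None] == labels[None, :]).astype(float)
co /= len(clusterings)

# Treat (1 - co-association) as a distance and extract a consensus partition
dist = squareform(1.0 - co, checks=False)
consensus = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")
print(consensus)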
Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects that most likely belong to the same distribution. A nice property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.
While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model can usually explain the data better, which makes choosing the appropriate model complexity inherently difficult.
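A common distribution-based approach is a Gaussian mixture model fit by expectation-maximization; a minimal sketch with scikit-learn's GaussianMixture (assuming scikit-learn is installed; data and number of components are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Fit a mixture of two Gaussians; each component defines one cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))        # hard assignment to the most likely component
print(gmm.predict_proba(X))  # soft membership probabilities
# A criterion such as gmm.bic(X) can help constrain model complexity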
Density-based clustering
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points.
The most popular density-based clustering method is DBSCAN. In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, defined in the original variant as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low - it requires a linear number of range queries on the database - and that it will discover essentially the same results in each run (it is deterministic for core and noise points, but not for border points), so there is no need to run it multiple times.
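A minimal sketch using scikit-learn's DBSCAN (assuming scikit-learn is installed; the eps and min_samples values are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1],
              [4.5, 4.5]])  # an isolated point, expected to be labeled noise

# eps is the neighborhood radius; min_samples is the density threshold
db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(db.labels_)  # cluster indices; -1 marks noise points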