Wednesday, 5 September 2012

Clustering: another name for Brainstorming

Clustering is a technique used to find structure in a collection of unlabeled data.
Another simple definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. In distance-based clustering, the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance measure.
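To make the distance-based idea concrete, here is a minimal sketch in Python. It assumes Euclidean distance and a hand-picked threshold; the function names (`euclidean`, `group_by_distance`) and the sample points are only illustrative and do not come from the research paper. Two points end up in the same cluster when they are linked by a chain of neighbours that are each within the threshold of one another.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def group_by_distance(points, threshold):
    # Illustrative helper: builds clusters one point at a time
    clusters = []
    for p in points:
        # Existing clusters that contain at least one point close to p
        near = [c for c in clusters
                if any(euclidean(p, q) <= threshold for q in c)]
        # Merge p with every cluster it is close to
        merged = [p]
        for c in near:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Assumed sample data: two visibly separate groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = group_by_distance(points, threshold=1.5)
print(clusters)
```

With this sample data the three points near the origin form one cluster and the two points near (10, 10) form another. A different threshold would give a different grouping, which is why the criterion has to come from the user.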
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if that cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
It can be shown that there is no absolute “best” criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
Requirements
The main requirements that a clustering algorithm should satisfy are:
  • scalability;
  • dealing with different types of attributes;
  • discovering clusters with arbitrary shape;
  • minimal requirements for domain knowledge to determine input parameters;
  • ability to deal with noise and outliers;
  • insensitivity to the order of input records;
  • ability to handle high dimensionality;
  • interpretability and usability.


Problems
There are a number of problems with clustering. Among them:
  • Current clustering techniques do not address all the requirements adequately (and concurrently);
  • Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
  • The effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
  • If an obvious distance measure doesn’t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;
  • The result of the clustering algorithm (which in many cases can be arbitrary itself) can be interpreted in different ways.
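The point about the definition of “distance” can be shown with a small Python sketch. It compares two common measures, Euclidean and Manhattan distance; the helper `nearest` and the sample points are assumptions chosen for illustration. The same query point has a different nearest neighbour depending on which measure is used.

```python
import math

def euclidean(p, q):
    # Straight-line distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(query, candidates, distance):
    # Illustrative helper: candidate with the smallest distance to query
    return min(candidates, key=lambda c: distance(query, c))

query = (0, 0)
candidates = [(3, 3), (0, 5)]

nearest_euclidean = nearest(query, candidates, euclidean)
nearest_manhattan = nearest(query, candidates, manhattan)
print(nearest_euclidean, nearest_manhattan)
```

Here (3, 3) is closer under Euclidean distance (about 4.24 against 5), while (0, 5) is closer under Manhattan distance (5 against 6). A distance-based clustering algorithm could therefore group these points differently depending on the measure chosen.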

Clustering algorithms may be classified as listed below:

Exclusive Clustering - data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it cannot be included in another cluster.
Overlapping Clustering - uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, each datum is associated with an appropriate membership value.
Hierarchical Clustering - based on the union of the two nearest clusters. The starting condition is set by treating every datum as its own cluster. After a number of iterations it reaches the desired final clusters.
Probabilistic Clustering - uses a completely probabilistic approach.
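The hierarchical case described above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the distance between two clusters is taken to be the distance between their closest members (single linkage), Euclidean distance is used, and merging stops once a chosen number of clusters `k` remains. The names `single_link` and `agglomerative` and the sample data are hypothetical.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    # Assumed linkage: distance between the two closest members
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, k):
    # Starting condition: every datum is its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters that are nearest to each other
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # Union of the two nearest clusters (i < j, so deleting j is safe)
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Assumed sample data
data = [(1, 1), (1.5, 1), (5, 5), (5, 5.5), (9, 1)]
result = agglomerative(data, k=3)
print(result)
```

On this data the two tight pairs are merged first and the isolated point (9, 1) stays on its own. Since every pair of clusters is compared on each pass, this naive version becomes slow for large data sets, which ties back to the time complexity problem listed earlier.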


Research Paper by: Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore & Henry Lin
Submitted By :
Vishwanath Nishad
Roll no. 14118
