Cluster Analysis

  • common name for a whole collection of computational statistical procedures
  • aim: to decompose the data into several homogeneous groups – clusters.
  • the objects inside a cluster should be as similar as possible;
  • the objects from different clusters should resemble each other as little as possible.

Definition

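Let \(\mathbf X = \{\mathbf x_1, \mathbf x_2, \dotsc, \mathbf x_n \}\) be a set of objects, and let \(D\) be a coefficient of dissimilarity between objects. A cluster is a subset \(C \subseteq \mathbf X\) of objects such that \[\max_{\mathbf x_i, \mathbf x_j \in C} D(\mathbf x_i, \mathbf x_j) < D(\mathbf x_k, \mathbf x_l)\] for each \(\mathbf x_l \in C\) and each \(\mathbf x_k \in \mathbf X \setminus C\). In words: any two objects inside the cluster are less dissimilar to each other than any object inside the cluster is to any object outside it.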

  • the definition is not constructive:
    • describes the property which the cluster has to satisfy
    • but does not explain how the cluster should be constructed.
  • many clustering methods (see below)
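
Although the definition does not say how to build a cluster, it can be checked directly for a given candidate subset. The following is a minimal sketch in Python, not part of the original notes: the function name `is_cluster`, the helper `abs_diff`, and the one-dimensional sample data are all illustrative assumptions, and the absolute difference merely stands in for the dissimilarity coefficient \(D\).

```python
from itertools import combinations


def is_cluster(objects, cluster, dissimilarity):
    """Check the cluster property from the definition for a candidate subset.

    `objects` is the whole set X, `cluster` is the candidate subset C, and
    `dissimilarity` plays the role of D. All names here are hypothetical.
    """
    inside = [x for x in objects if x in cluster]
    outside = [x for x in objects if x not in cluster]

    # Largest dissimilarity between two members of the candidate cluster
    # (taken as 0 when the cluster has fewer than two members).
    max_within = max(
        (dissimilarity(a, b) for a, b in combinations(inside, 2)),
        default=0.0,
    )

    # Every (outside, inside) pair must be strictly more dissimilar.
    return all(
        max_within < dissimilarity(k, l)
        for k in outside
        for l in inside
    )


def abs_diff(a, b):
    """Assumed dissimilarity for one-dimensional objects."""
    return abs(a - b)


points = [1.0, 1.5, 2.0, 8.0, 9.0]
print(is_cluster(points, {1.0, 1.5, 2.0}, abs_diff))  # True
print(is_cluster(points, {1.0, 1.5, 8.0}, abs_diff))  # False
```

The check only verifies a subset that is already given. Testing every possible subset this way quickly becomes infeasible as the number of objects grows, which illustrates why constructive clustering methods are needed.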