Applied Multivariate Statistical Analysis Chapter 12 - Clustering, Distance Methods, and Ordination
by Arpon Sarker
Introduction
This chapter covers exploratory procedures. The difference between clustering and classification is that classification assumes a known number of groups and some group structure; neither assumption is made in this chapter. Clustering, or grouping, is based on similarities measured through distance methods.
Similarity Measures
We can use the Euclidean distance defined in Chapter 1, which is a special case of the Minkowski metric; other popular measures are the Canberra metric and the Czekanowski coefficient. If the items cannot be represented by $p$-dimensional measurements, we can instead use binary variables recording the presence or absence of characteristics. We may not want to weight matches and mismatches equally, since the joint presence of a characteristic in two items is often a much stronger indication of similarity than its joint absence, so we can weight the frequencies of matches and mismatches from a contingency table with $a$ 1-1 matches, $d$ 0-0 matches, $b$ and $c$ mismatches, and $p = a+b+c+d$ variables in total. For example, \(\frac{a+d}{p} \quad \text{(equal weights for 1-1 and 0-0 matches)}, \qquad \frac{2a}{2a+b+c} \quad \text{(no 0-0 matches and double weight for 1-1 matches)}\)
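The sketch below (with a made-up helper `binary_similarities` and made-up presence/absence vectors) computes these two coefficients from the match/mismatch counts:

```python
import numpy as np

def binary_similarities(x, y):
    """Similarity coefficients for two binary presence/absence vectors.

    a = number of 1-1 matches, d = 0-0 matches, b and c = mismatches,
    p = a + b + c + d = total number of variables.
    """
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)
    b = np.sum(x & ~y)
    c = np.sum(~x & y)
    d = np.sum(~x & ~y)
    p = a + b + c + d
    return {
        "equal_weights": (a + d) / p,                    # (a + d) / p
        "double_weight_no_00": 2 * a / (2 * a + b + c),  # 2a / (2a + b + c)
    }

# Presence/absence of six characteristics on two items (made-up data)
print(binary_similarities([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
```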
For comparing the similarities of two variables, rather than items, we can use the product moment correlation; for binary variables this coefficient is related to the chi-square statistic of the $2 \times 2$ contingency table.
Hierarchical Clustering Methods
Hierarchical clustering is based on a series of successive divisions (divisive hierarchical methods) or successive mergers (agglomerative hierarchical methods). This chapter focuses on the agglomerative methods. The algorithm for grouping $N$ objects is the same regardless of the linkage method chosen; a SciPy sketch follows the steps below. Single linkage takes the minimum (nearest-neighbour) distance between points as the distance between two clusters, complete linkage takes the maximum (farthest-neighbour) distance, and average linkage takes the average distance over all pairs of points, one from each cluster.
- Start with $N$ clusters and a symmetric matrix of distances $D$
- Search the distance matrix for the nearest pair of clusters
- Merge the clusters and update the entries in the distance matrix
- Repeat steps 2 and 3 a total of $N-1$ times, recording the identity of the clusters that are merged and the levels at which the mergers take place
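As a concrete sketch of this algorithm, the snippet below runs SciPy's agglomerative routines with single, complete, and average linkage on made-up data and cuts each tree into three clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Made-up data: N = 10 items measured on p = 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Symmetric matrix of Euclidean distances D (condensed form)
D = pdist(X, metric="euclidean")

# Agglomerative clustering under three linkage rules
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)  # records which clusters merge and at what level
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)
```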
Single linkage is useful for nonelliptical configurations because of its tendency to chain, picking out long, string-like clusters. However, when elliptical configurations are quite close to each other or nearly overlap, single linkage has difficulty distinguishing them as two closely knit clusters.
Ward’s hierarchical clustering method is based on minimising the ‘loss of information’ from joining two groups, measured by the error sum of squares (ESS) criterion: the sum of squared deviations of every item in a cluster from the cluster mean (centroid). At each step, the pair of clusters whose merger gives the smallest increase in total ESS is joined.
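A small sketch of the criterion, with two made-up clusters A and B and a hypothetical helper `ess`; the final call uses SciPy's built-in Ward linkage on the same observations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def ess(cluster):
    """Error sum of squares: squared deviations of items from the cluster centroid."""
    cluster = np.asarray(cluster, dtype=float)
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

# Two small made-up clusters A and B
A = np.array([[1.0, 2.0], [1.5, 1.8]])
B = np.array([[5.0, 8.0], [6.0, 8.5], [5.5, 7.5]])

# 'Loss of information' from joining A and B: the increase in total ESS
increase = ess(np.vstack([A, B])) - (ess(A) + ess(B))
print("Increase in ESS from merging A and B:", increase)

# SciPy's built-in Ward linkage on the raw observations uses the same criterion
# (its reported merge heights are a related, rescaled distance)
Z = linkage(np.vstack([A, B]), method="ward")
print(Z)
```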
Nonhierarchical Clustering Methods
Nonhierarchical clustering methods start from either an initial partition of the items into groups or an initial set of seed points. For the $K$-means method, we can either partition the items into $K$ initial clusters or specify $K$ initial centroids; a NumPy sketch follows the steps below.
- Partition the items into $K$ initial clusters
- Proceed through the items, assigning each item to the cluster whose centroid is nearest, and recalculate the centroid of any cluster whose membership has changed
- Repeat step 2 until no more reassignments take place
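The following is a minimal NumPy sketch of the batch variant of $K$-means (all items are reassigned before the centroids are recomputed), with a made-up `kmeans` function and made-up two-dimensional data:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Batch K-means: assign every item to the nearest centroid, then
    recompute the centroids, until no assignments change."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # K initial seed points
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Euclidean distance of every item to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # no reassignments -> stop
            break
        labels = new_labels
        # Recompute each centroid (keep the old one if a cluster becomes empty)
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids

# Two well-separated made-up clusters in two dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```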
You can also use $F$-ratios that compare the between-cluster variability with the within-cluster variability.
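One such pseudo $F$-statistic is the Calinski-Harabasz index, the ratio of between-cluster to within-cluster dispersion; the sketch below uses scikit-learn on made-up data to compare a few choices of $K$:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Made-up data with two well-separated groups
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Calinski-Harabasz index: ratio of between-cluster to within-cluster
# dispersion (a pseudo F-statistic); larger values favour the clustering.
for K in (2, 3, 4):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, calinski_harabasz_score(X, labels))
```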
Clustering Based on Statistical Models
For $K$ clusters, the observation vector for a single object is modelled as coming from the mixture distribution \(f_{\text{Mix}}(x) = \sum_{k=1}^K p_k f_k(x)\), where $p_k$ is the expected proportion of objects from cluster $k$ and $f_k$ is the probability density function for cluster $k$. Inferences are based on the likelihood, with the maximum likelihood estimates obtained numerically using applications software. To compare models with different numbers of parameters we use a criterion of the form \(-2 \ln L_{\max} + \text{Penalty},\) where the penalty grows with the number of parameters estimated and, for some criteria, with the number of observations $N$; smaller values of the criterion are preferred. Either the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) can be used, both of which penalise models with many parameters.
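A short sketch of this idea, assuming normal component densities and scikit-learn's `GaussianMixture`, which fits each mixture by maximum likelihood (via EM) and reports AIC and BIC:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data drawn from two normal populations
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

# Fit normal mixtures with K = 1..4 components by maximum likelihood (EM)
# and compare the penalised criteria; smaller AIC/BIC is preferred.
for K in range(1, 5):
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)
    print(K, round(gm.aic(X), 1), round(gm.bic(X), 1))
```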
Multidimensional Scaling
Multidimensional scaling is used for displaying (transformed) multivariate data in a low-dimensional space while preserving, as far as possible, the rank order of the original similarities or distances from largest to smallest. Such a display is called an ordination of the data. The numerical measure of closeness used to compare configurations is called the stress. Metric multidimensional scaling uses the actual magnitudes of the original similarities, while nonmetric multidimensional scaling uses only the ordinal information (the rank order) to obtain a geometrical representation.
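The sketch below, on made-up data, fits both a metric and a nonmetric two-dimensional scaling with scikit-learn's `MDS` and reports the stress of each configuration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Made-up data: 15 items on 5 variables, reduced to a 2-D ordination
rng = np.random.default_rng(4)
X = rng.normal(size=(15, 5))
D = squareform(pdist(X))  # observed distances between the items

# Metric MDS uses the actual distance magnitudes; nonmetric MDS uses only
# their rank order. Both report the stress of the fitted configuration.
for is_metric in (True, False):
    mds = MDS(n_components=2, metric=is_metric, dissimilarity="precomputed",
              random_state=0)
    coords = mds.fit_transform(D)
    print("metric" if is_metric else "nonmetric", "stress:", mds.stress_)
```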
Correspondence Analysis
Correspondence analysis is a graphical procedure for representing associations in a table of frequencies or counts. The analysis also reports the inertia, the amount of information (variation in the table) retained by each dimension.
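A minimal sketch of correspondence analysis on a hypothetical two-way table of counts: the singular value decomposition of the standardised residuals yields the principal inertia of each dimension and the row coordinates:

```python
import numpy as np

# Hypothetical two-way table of counts (rows and columns are the categories
# of two categorical variables)
N = np.array([[24, 14, 10],
              [11, 20, 25],
              [ 8, 12, 30]], dtype=float)

P = N / N.sum()                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses

# Standardised residuals and their singular value decomposition
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv ** 2                     # principal inertia of each dimension
print("proportion of total inertia:", inertia / inertia.sum())

# Principal coordinates of the row categories in the first two dimensions
row_coords = (U * sv) / np.sqrt(r)[:, None]
print(row_coords[:, :2])
```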
Biplots for Viewing Sampling Units and Variables
A biplot adds information about the variables to the principal component graph. Each variable appears as a vector overlaid on the two-dimensional principal component plot, relating the items to the variables (items lie closer to the vectors of the variables that most influence them). An alternative biplot, in which each variable's vector carries its own scaled axis, can also be used.
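A rough sketch of such a plot built from scratch: principal component scores for the sampling units with one overlaid vector per variable (data, variable labels, and arrow scaling are all made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 30 items on 4 variables
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)               # centre each variable

# Principal components from the SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]             # item coordinates on the first two PCs
arrows = Vt[:2].T * 2.0               # variable directions (scaled for visibility)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=15)  # the sampling units
for j, (vx, vy) in enumerate(arrows):         # one overlaid vector per variable
    ax.arrow(0, 0, vx, vy, color="red", head_width=0.1, length_includes_head=True)
    ax.annotate(f"var{j + 1}", (vx, vy))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()
```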
Procrustes Analysis
Procrustes analysis is a method for comparing configurations, where $X^*$ contains the plotting coordinates from technique 1 and $Y^*$ contains the coordinates from technique 2. If one configuration has a smaller dimension, its columns are padded with zeros. To see how compatible the configurations are, each point $y_j$ is shifted by a vector $b$ and rotated/reflected about the coordinate axes by an orthogonal matrix $Q$, giving \(Q y_j + b\).

We then minimise, over $Q$ and $b$, the sum over all $n$ points of the squared distances $d_j^2(x_j, Q y_j + b)$ between the two configurations. The minimising $Q$ is the Procrustes rotation, and the minimised value is the residual sum of squares.
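A small sketch using SciPy's `procrustes`, which translates, rotates/reflects, and also rescales both configurations (an extra step beyond the version described here) and reports the residual sum of squared distances; the example configurations $X^*$ and $Y^*$ are made up:

```python
import numpy as np
from scipy.spatial import procrustes

# Configuration X* from technique 1 (made-up coordinates)
rng = np.random.default_rng(6)
X_star = rng.normal(size=(12, 2))

# Configuration Y* from technique 2: a rotated, shifted, slightly noisy copy
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y_star = X_star @ Q.T + np.array([2.0, -1.0]) + rng.normal(scale=0.05, size=(12, 2))

# SciPy standardises both configurations, rotates/reflects Y* to best match X*,
# and returns the residual sum of squared distances (the disparity)
mtx1, mtx2, disparity = procrustes(X_star, Y_star)
print("Residual sum of squares:", disparity)
```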
Data Mining
Data mining refers to the process of discovering patterns and relationships in extremely large data sets. It does not support formal inference; typical tasks include classification, prediction, association analysis, clustering, and description.
The process is described below:
- Define the problem and identify objectives
- Gather and prepare appropriate data
- Explore data for suspected associations, unanticipated characteristics, and obvious anomalies to gain understanding
- Clean the data and perform any variable transformation, if deemed appropriate
- Divide the data into training, validation, and test sets
- Build the model on the training set
- Modify the model based on performance with validation data
- Assess model by checking performance on validation or test data
- Use the model
- Monitor model performance - are the results reliable, cost effective?