13 January 2025

Applied Multivariate Statistical Analysis Chapter 10 - Canonical Correlation Analysis

by Arpon Sarker

Introduction

Canonical correlation analysis seeks to identify and quantify the associations between two sets of variables. This focuses on comparing one linear combination of variables as a set with another linear combination of variables as a set.

Determine the pair of linear combinations with the highest correlation among all pairs
Determine the next best pair of linear combinations having the highest correlation but is uncorrelated with the previously selected pairs

The pairs of linear combinations are called canonical variables and their correlations are called canonical correlations.

Canonical Variates and Canonical Correlations

We split up two sets of variables $X^{(1)}$ and $X^{(2)}$ and we partition them as we have done previously to get the overall $\mu$ and $\Sigma$. The main task of canonical correlation analysis is to summarise the associations between $X^{(1)}$ and $X^{(2)}$ sets in terms of a few carefully chosen covariances/correlations rather than having to analyse the tremendous amount of $pq$ values in $\Sigma_{12}$. A further restriction is that the pari of linear combinations $U_i$ and $V_i$ for the ith pair of canonical variables must have unit variances.

To maximise the correlation between these two, we use their eigenvalues and eigenvectors where $\max_{a,b} Corr(U_k,V_k) = \rho_1* \\ U_k = e_k^T \Sigma_{11}^{-1/2}X^{(1)} \quad V_k = f_k^T \Sigma_{22}^{-1/2}X^{(2)}$ Both the bottom equations maximise the first where the $e$ and $f$ are the eigenvectors of some long expression of partitioned covariance matrices and $\rho$ is the associated eigenvalues. Standardising the original variables into $Z$ not $X$, givest the same value but unlike PCA, we can infer some values as being from $\Sigma$ and $\rho$ based on standardisation.

Interpreting the Population Canonical Variables

Even though these canonical variables are artificial, they can be identified in terms of the subject-matter variables. The correlations, however only give univariate information since they do not indicate how the original variables contribute jointly to the canonical analyses. Here we use matrices to be able to use linear combinations for the entire population $U = AX^{(1)} \quad V = BX^{(2)}$

The canonical correlations can also act as multiple correlation coefficients and thus the kth squared canonical correlation $\rho_k^2$ is the proportion of variance of canonical variate $U_k$ explained by the set $X^{(2)}$.

The Sample Canonical Variates and Sample Canonical Correlations

This also partitions $X$ into $\begin{bmatrix} X^{(1)} \vdots X^{(2)} \end{bmatrix}$ and using the sample summary statistics of $\bar{x}$ and $S$. The maximum correlation between the pairs of $\hat{U}$ and $\hat{V}$ is the correlations between the transformations we found earlier which maximises the correlation but this time is for a random sample.

Looking at a pair of sample canonical variate pairs and looking at the coefficients for each variable, you can identify which variables it represents and the associations/correlations between the two as an interpretation.

Additional Sample Descriptive Measures

You can do matrices of errors of approximations which is how well the first r sample canonical variates reproduce the covariance matrices. Standardising the variables, you can also find the proportion of total standardised sample variance in teh first set explained by the linear combinations $\hat{U}_1, \dots, \hat{U}_r$. The same can be done for the second set.

Large Sample Inferences

To test if all the canonical correlations are 0 - to see if there is a point in pursuing a canonical correlation analysis. we test if $\Sigma_{12} = 0$. We do this using a likelihood ratio test where $H_0: \Sigma_{12} =0$ and uses $-2 \ln \Lambda = n \ln\left(\frac{|S_{11}||S_{22}|}{|S|}\right) = -n \ln \prod_{i=1}^p(1 -\widehat{\rho_i^{*2}})$

tags: mathematics - statistics

Hacker theme

Hacker is a theme for GitHub Pages.