2 January 2025

Applied Multivariate Statistical Analysis Chapter 2 - Matrix Algebra and Random Vectors

by Arpon Sarker

Introduction

This chapter goes through the basics of matrix algebra and expounds on the notation and techniques used to work with data points and random variables as vectors and matrices, which is especially important for computer implementations and for stating statistical models concisely. Much of this chapter doesn't really speak of statistics or give a good idea of how matrix algebra and statistics are related and used together, but I hope this is addressed in later chapters. My one gripe with this chapter is that it refers to variance as both $\sigma_i^2$ and $\sigma_{ii}$.

Some Basics of Matrix and Vector Algebra

To calculate the angle $\theta$ between vectors $x^T = \begin{bmatrix} x_1 & x_2\end{bmatrix}$ and $y^T = \begin{bmatrix} y_1 & y_2 \end{bmatrix}$, which make angles $\theta_1$ and $\theta_2$ with the first axis respectively, use basic trigonometry such as $\cos(\theta_1) = \frac{x_1}{L_x}$ together with the angle-difference identity: \(\cos(\theta) = \cos(\theta_2 - \theta_1) = \cos(\theta_2) \cos(\theta_1) + \sin(\theta_2) \sin(\theta_1) \\ \cos(\theta) = \left(\frac{y_1}{L_y}\right) \left(\frac{x_1}{L_x}\right) + \left(\frac{y_2}{L_y}\right) \left(\frac{x_2}{L_x}\right) \\ \cos(\theta) = \frac{x_1 y_1 + x_2 y_2}{L_x L_y}\) where $L_x$ and $L_y$ are the lengths of $\textbf{x}$ and $\textbf{y}$.

The projection of vector $\textbf{x}$ onto vector $\textbf{y}$ is $\frac{x^Ty}{y^Ty}y$, and its length can be rewritten as $L_x \frac{x^Ty}{L_x L_y} = L_x \cos(\theta)$.
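As a quick numerical sketch (the vectors and the NumPy usage here are my own illustration, not from the book), the angle and projection formulas can be checked directly:

```python
import numpy as np

# Made-up 2-D vectors, purely for illustration.
x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0])

L_x = np.linalg.norm(x)              # L_x = sqrt(x^T x)
L_y = np.linalg.norm(y)              # L_y = sqrt(y^T y)

cos_theta = (x @ y) / (L_x * L_y)    # cos(theta) = x^T y / (L_x L_y)
theta = np.arccos(cos_theta)         # angle between x and y, in radians

proj = (x @ y) / (y @ y) * y         # projection of x onto y
proj_length = L_x * cos_theta        # its length, L_x cos(theta)

print(theta, proj, proj_length)
```

Here `np.linalg.norm(proj)` should agree with `proj_length` (when the angle is acute).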

A set of vectors $a_1, \ldots, a_k$ is linearly independent if $c_1 a_1 + \ldots + c_k a_k = 0$ only when $c_1 = \ldots = c_k = 0$; this is the condition used to say that the $k$ columns of $\textbf{A}$ are linearly independent.

Eigenvectors and eigenvalues satisfy the equation \(\textbf{Ax} = \lambda \textbf{x}\). A symmetric $\text{k x k}$ matrix $\textbf{A}$ has $k$ pairs of eigenvalues and eigenvectors; the normalised eigenvectors are mutually perpendicular, satisfy $e_1^Te_1 = \ldots = e_k^Te_k = 1$, and (once normalised) are unique unless two or more eigenvalues are equal.
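A small NumPy check of these eigenvalue properties (the matrix below is made up for illustration):

```python
import numpy as np

# Arbitrary symmetric matrix, chosen only for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# For a symmetric matrix, eigh returns real eigenvalues and
# orthonormal eigenvectors (as the columns of the second output).
eigvals, eigvecs = np.linalg.eigh(A)

for lam, e in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ e, lam * e)   # A e = lambda e
    assert np.isclose(e @ e, 1.0)        # e^T e = 1 (normalised)

# The normalised eigenvectors are mutually perpendicular.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(2))
```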

Orthogonal and diagonal matrices have inverses that are easy to find: the inverse of an orthogonal matrix is its transpose (its columns are mutually perpendicular unit vectors), and the inverse of a diagonal matrix is just the diagonal matrix of reciprocals of its elements.

Note: always check inverses are correct due to near-zero rounding errors on the computer.
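A minimal sketch of those inverse facts and the rounding-error check (matrices made up for illustration):

```python
import numpy as np

# Diagonal matrix: the inverse is the diagonal matrix of reciprocals.
D = np.diag([2.0, 4.0, 5.0])
D_inv = np.diag(1.0 / np.diag(D))
assert np.allclose(D @ D_inv, np.eye(3))

# Orthogonal matrix (a rotation): the inverse is the transpose.
t = 0.3
Q = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
assert np.allclose(Q @ Q.T, np.eye(2))

# General matrix: verify A A^{-1} is (numerically) the identity, since
# floating-point rounding means it is only approximately so.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
assert np.allclose(A @ np.linalg.inv(A), np.eye(2))
```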

Positive Definite Matrices

Squared distances (from chapter 1) and the multivariate normal density can be expressed in terms of quadratic forms. This is important since the study of the variation and interrelationships in multivariate data is often based upon distances and assuming that the data is multivariate normally distributed.

The spectral decomposition of a symmetric $\text{k x k}$ matrix $\textbf{A}$ is \(\textbf{A} = \lambda_1 e_1 e_1^T + \ldots + \lambda_k e_k e_k^T\)

The above is useful for a matrix explanation of distance.
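As a hedged sketch (the matrix is chosen arbitrarily), the spectral decomposition can be reproduced in NumPy by summing the rank-one pieces $\lambda_i e_i e_i^T$:

```python
import numpy as np

# Arbitrary symmetric matrix for illustration.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(A)

# Rebuild A as lambda_1 e_1 e_1^T + ... + lambda_k e_k e_k^T.
A_rebuilt = sum(lam * np.outer(e, e) for lam, e in zip(eigvals, eigvecs.T))
assert np.allclose(A, A_rebuilt)
```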

A quadratic form is $\boldsymbol{x^TAx}$, since it only contains $x_i^2$ terms and $x_ix_k$ cross-product terms. $\textbf{A}$ and the quadratic form are nonnegative definite if $0 \le \boldsymbol{x^TAx}$ for all $\textbf{x}$, and positive definite if $0 < \boldsymbol{x^TAx}$ for all $\textbf{x} \neq \textbf{0}$.

Matrix Explanation of Distance: Assume that the p elements of vector $\textbf{x}$ are p random variables $X_1, \ldots, X_p$. As with any vector, we can look at this as the coordinates of a point in p-dimensional space, and the distance of the point should be interpreted in terms of standard deviation units, which accounts for the variability in observations. From the $(\text{distance})^2$ formula from last chapter:

$$
(\text{distance})^2 = a_{11}x_1^2 + \ldots + a_{pp}x_p^2 + 2(a_{12}x_1x_2 + \ldots + a_{p-1,p}x_{p-1}x_p) \\
0 < (\text{distance})^2 = \begin{bmatrix} x_1 & x_2 & \ldots & x_p \end{bmatrix} \begin{bmatrix} a_{11} & \ldots & a_{1p} \\ a_{12} & \ldots & a_{2p} \\ \vdots & & \vdots \\ a_{1p} & \ldots & a_{pp} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} \\
0 < (\text{distance})^2 = \boldsymbol{x^TAx}
$$

We can see that distance is determined from a positive definite quadratic form and conversely, a positive definite quadratic form can be interpreted as squared distance. Expressing distance as the square root of a positive definite quadratic form allows a geometrical interpretation based on the eigenvalues and eigenvectors of $\textbf{A}$.

\[\textbf{A} = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T + \ldots + \lambda_p e_p e_p^T \\ \boldsymbol{x^TAx} = \lambda_1 (x^T e_1 e_1^T x) + \lambda_2 (x^T e_2 e_2^T x) + \ldots + \lambda_p (x^T e_p e_p^T x) \\ \boldsymbol{x^TAx} = \lambda_1 (x^T e_1)^2 + \lambda_2 (x^T e_2)^2 + \ldots + \lambda_p (x^T e_p)^2, \text{ since } x^Te_i = e_i^Tx \text{ is a scalar}\]

This is the equation of an ellipse (an ellipsoid for $p > 2$): the point $\textbf{x} = c \lambda_i^{-\frac{1}{2}}e_i$ gives the appropriate distance in the $e_i$ direction, since substituting it in (with the terms for all other eigenvectors equal to 0) results in $c^2$, so it gives one of the axes of the ellipse. Thus, the points at distance c lie on an ellipse whose axes are given by the eigenvectors, with lengths proportional to the reciprocals of the square roots of the eigenvalues multiplied by c.
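A quick numerical illustration of the axis claim, with a made-up positive definite $\textbf{A}$ and constant c: each point $c\lambda_i^{-1/2}e_i$ should satisfy $x^TAx = c^2$.

```python
import numpy as np

# Made-up positive definite matrix playing the role of A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(A)

c = 2.0
for lam, e in zip(eigvals, eigvecs.T):
    x = c * lam ** -0.5 * e              # point on the axis in the e_i direction
    assert np.isclose(x @ A @ x, c ** 2) # lies on the ellipse x^T A x = c^2
```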

Square-Root Matrix

The spectral decomposition allows us to express the inverse, square root and the inverse square root. \(\textbf{A} = \sum_{i=1}^k \lambda_i e_i e_i^T = \boldsymbol{P \Lambda P^T} \\ \textbf{A}^{-1} = \sum_{i=1}^k \frac{1}{\lambda_i} e_i e_i^T = \boldsymbol{P \Lambda^{-1} P^T} \\ \textbf{A}^{1/2} = \sum_{i=1}^k \sqrt{\lambda_i} e_i e_i^T = \boldsymbol{P \Lambda^{1/2} P^T}\)

where $\Lambda$ is the diagonal matrix of eigenvalues and $\textbf{P}$ is the matrix with the normalised eigenvectors as its columns.
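A short NumPy sketch of these three formulas (the matrix is made up, and must be positive definite for the square root to be real):

```python
import numpy as np

# Made-up positive definite matrix.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)
P, Lam = eigvecs, np.diag(eigvals)             # A = P Lambda P^T
assert np.allclose(P @ Lam @ P.T, A)

A_inv  = P @ np.diag(1.0 / eigvals)    @ P.T   # P Lambda^{-1} P^T
A_half = P @ np.diag(np.sqrt(eigvals)) @ P.T   # P Lambda^{1/2} P^T

assert np.allclose(A @ A_inv, np.eye(2))
assert np.allclose(A_half @ A_half, A)         # square root: A^{1/2} A^{1/2} = A
```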

Random Vectors and Matrices

A random vector (or random matrix) is a vector (or matrix) whose elements are random variables $X_i$.

Mean Vectors and Covariance Matrices

Each element is a random variable that has its own marginal probability distribution.

Mean: $\mu_i = E(X_i)$ or, \(\mu_i = \begin{cases} \int_{-\infty}^{\infty} x_if_i(x_i) dx_i & \text{if X is cts.} \\ \sum_{\text{all } x_i} x_i p_i(x_i) & \text{if X is discrete}\end{cases}\)

Variance: $\sigma_i^2 = E(X_i - \mu_i)^2$ or, \(\sigma_i^2 = \begin{cases} \int_{-\infty}^{\infty} (x_i - \mu_i)^2 f_i(x_i) dx_i & \text{if X is cts.} \\ \sum_{\text{all } x_i} (x_i - \mu_i)^2 p_i(x_i) & \text{if X is discrete}\end{cases}\)

Covariance: This describes the joint behaviour of $X_i$ and $X_k$, which is given by their joint probability function, $\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k)$ \(\sigma_{ik} = \begin{cases} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_i - \mu_i) (x_k - \mu_k) f_{ik}(x_i,x_k) dx_i dx_k & \text{if } X_i \text{ and } X_k \text{ are cts.} \\ \sum_{\text{all } x_i} \sum_{\text{all } x_k} (x_i - \mu_i)(x_k - \mu_k) p_{ik}(x_i,x_k) & \text{if } X_i \text{ and } X_k \text{ are discrete}\end{cases}\)

Statistically independent random variables are those whose joint probability function can be expressed as the product of their marginal functions; if all p variables can be expressed this way, they are mutually statistically independent. The covariance is 0 if the two variables are independent.
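For the discrete cases above, a tiny worked example (the joint probability table below is invented for illustration) computes the means, variances and covariance directly from the definitions:

```python
import numpy as np

# Invented joint probability table p(x1, x2): rows index x1, columns index x2.
x1_vals = np.array([0.0, 1.0])
x2_vals = np.array([0.0, 1.0, 2.0])
p = np.array([[0.1, 0.2, 0.1],
              [0.2, 0.1, 0.3]])
assert np.isclose(p.sum(), 1.0)

mu1 = np.sum(x1_vals[:, None] * p)                     # E[X1]
mu2 = np.sum(x2_vals[None, :] * p)                     # E[X2]

sigma11 = np.sum((x1_vals[:, None] - mu1) ** 2 * p)    # Var(X1)
sigma22 = np.sum((x2_vals[None, :] - mu2) ** 2 * p)    # Var(X2)
sigma12 = np.sum((x1_vals[:, None] - mu1)
                 * (x2_vals[None, :] - mu2) * p)       # Cov(X1, X2)

print(mu1, mu2, sigma11, sigma22, sigma12)
```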

Population Correlation Matrix: $\rho$ is just the matrix with 1s on its main diagonal, with the rest of the elements populated using the correlation formula defined before.

Standard Deviation Matrix: $V^{1/2}$ has the standard deviations (the square roots of the main diagonal elements of the variance-covariance matrix) on its main diagonal, and the off-diagonal elements are all 0. Two equations relate the correlation matrix, the standard deviation matrix, and the variance-covariance matrix: \(V^{1/2}\rho V^{1/2} = \Sigma \\ \rho = (V^{1/2})^{-1} \Sigma (V^{1/2})^{-1}\) The first is easily verified since it just reverses the correlation formula, multiplying back the standard deviations from the denominator to recover the original variances/covariances. The second equation is just the first rearranged.
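These two relations are easy to verify numerically; the covariance matrix below is made up for illustration:

```python
import numpy as np

# Made-up variance-covariance matrix Sigma.
Sigma = np.array([[ 4.0,  1.0,  2.0],
                  [ 1.0,  9.0, -3.0],
                  [ 2.0, -3.0, 25.0]])

V_half = np.diag(np.sqrt(np.diag(Sigma)))      # standard deviation matrix V^{1/2}
V_half_inv = np.linalg.inv(V_half)

rho = V_half_inv @ Sigma @ V_half_inv          # rho = (V^{1/2})^{-1} Sigma (V^{1/2})^{-1}

assert np.allclose(np.diag(rho), 1.0)              # 1s on the main diagonal
assert np.allclose(V_half @ rho @ V_half, Sigma)   # V^{1/2} rho V^{1/2} = Sigma
```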

Partitioning Covariance Matrix (and sample mean and covariance matrix)

If we want to group the random variables into two or more groups and analyse them group by group, this can be done by partitioning the random vector into subsets. \(\left[ \begin{array}{c} X_1 \\ \vdots \\ X_q \\ \hline X_{q+1} \\ \vdots \\ X_p \end{array} \right] = \left[ \begin{array}{c} X^{(1)} \\ \hline X^{(2)} \end{array} \right] \\ \mu = E(X) = \left[ \begin{array}{c} \mu^{(1)} \\ \hline \mu^{(2)} \end{array} \right] \\ \Sigma = E(X-\mu)(X-\mu)^T = \left[ \begin{array}{c|c} \Sigma_{11} & \Sigma_{12} \\ \hline \Sigma_{21} & \Sigma_{22}\end{array} \right]\) The variance-covariance matrix splits into blocks: $\Sigma_{11}$ contains the covariances between the random variables within group $X^{(1)}$, and similarly for $\Sigma_{22}$, while $\Sigma_{12} = \Sigma_{21}^T$ contains the covariances between random variables from the two different groups. This model can be extended to $n$ groups, with the diagonal blocks being intra-group covariance matrices and the off-diagonal blocks being inter-group covariance matrices.

The above was all for population statistics. For sample statistics, $\bar{x}$ is the vector of sample averages from n observations on p random variables, and the sample variance-covariance matrix $S_n$ has elements $s_{ik} = \frac{1}{n}\sum_{j=1}^n(x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)$. Note: the sample statistics don't require calculus/expectation. Partitioning works the same way here as for the population versions.
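A brief sketch of the sample versions, using an invented data matrix of n = 5 observations on p = 3 variables (note the divisor n rather than n - 1, to match the chapter's $S_n$):

```python
import numpy as np

# Invented data matrix: n = 5 observations (rows) on p = 3 variables (columns).
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 2.0],
              [5.0, 5.0, 3.0]])
n = X.shape[0]

x_bar = X.mean(axis=0)          # vector of sample means

centred = X - x_bar
S_n = centred.T @ centred / n   # s_ik = (1/n) sum_j (x_ji - xbar_i)(x_jk - xbar_k)

# Same result via np.cov with the biased (1/n) divisor.
assert np.allclose(S_n, np.cov(X, rowvar=False, bias=True))
```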

Mean Vector and Covariance Matrix for Linear Combinations of Random Variables

The linear combination $c^TX = c_1X_1 + \ldots + c_pX_p$ has \(\text{mean} = E(c^TX) = c^T\mu \\ \text{variance} = Var(c^TX) = c^T \Sigma c \\ \text{For } Z=CX, \\ \mu_Z = E(Z) = E(CX) = C\mu_X \\ \Sigma_Z = Cov(CX) = C\Sigma_X C^T\) The vector $c$ is just the coefficients of a single linear combination; for several combinations, each row of $C$ holds one set of coefficients, so $X_1 - X_2$ and $X_1 + X_2$ together correspond to $C = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$.
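A numerical sketch with made-up $\mu_X$ and $\Sigma_X$, using the two-row $C$ from the example above:

```python
import numpy as np

# Made-up mean vector and covariance matrix of X = (X1, X2).
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 3.0]])

# Rows of C give the combinations X1 - X2 and X1 + X2.
C = np.array([[1.0, -1.0],
              [1.0,  1.0]])

mu_Z = C @ mu_X                 # E(CX) = C mu_X        -> [-1.  3.]
Sigma_Z = C @ Sigma_X @ C.T     # Cov(CX) = C Sigma_X C^T

# e.g. Var(X1 - X2) = sigma_11 + sigma_22 - 2 sigma_12 = 4 + 3 - 2 = 5
assert np.isclose(Sigma_Z[0, 0], 5.0)
print(mu_Z, Sigma_Z)
```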

Matrix Inequalities and Maximisation

Maximisation is important since it is used in Linear Discriminant Analysis, which allocates observations to predetermined groups by maximising inter-group separation relative to intra-group variability, and in principal components, which are linear combinations of measurements with maximum variability.

Cauchy-Schwarz Inequality: \((b^Td)^2 \le (b^Tb)(d^Td) \\ (b^Td)^2 \le (b^TBb)(d^TB^{-1}d) \quad \text{(extended version)}\) The proof entails using $0 < (b-xd)^T(b-xd)$, with equality only if $b=cd$ or $d=cb$ for some constant c, which I've proved starting with $0 = (b-cd)^T(b-cd)$. The extended version gives the following lemma: \(\max_{x \neq 0} \frac{(x^Td)^2}{x^TBx} = d^TB^{-1}d\)
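The lemma can be checked numerically: with a made-up positive definite $B$ and vector $d$, the ratio never exceeds $d^TB^{-1}d$ and attains it at $x \propto B^{-1}d$ (a sketch, not the book's proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up positive definite B and vector d.
B = np.array([[3.0, 1.0],
              [1.0, 2.0]])
d = np.array([1.0, 4.0])

bound = d @ np.linalg.inv(B) @ d      # d^T B^{-1} d

# The maximum is attained at x proportional to B^{-1} d ...
x_star = np.linalg.solve(B, d)
assert np.isclose((x_star @ d) ** 2 / (x_star @ B @ x_star), bound)

# ... and random nonzero x never exceed the bound.
for _ in range(1000):
    x = rng.standard_normal(2)
    assert (x @ d) ** 2 / (x @ B @ x) <= bound + 1e-9
```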

The last proof that uses the above lemma is the maximisation of quadratic forms for points on the unit sphere, where we assume $B$ is a positive definite matrix with eigenvalues $\lambda_1 \ge \ldots \ge \lambda_p \ge 0$ and corresponding normalised eigenvectors $e_1, \ldots, e_p$; then \(\max_{x \neq 0} \frac{x^TBx}{x^Tx} = \lambda_1 \\ \min_{x \neq 0} \frac{x^TBx}{x^Tx} = \lambda_p \\ \max_{x \perp e_1, \ldots, e_k} \frac{x^TBx}{x^Tx} = \lambda_{k+1}\) Unfortunately, I didn't understand the proof entirely, but these equations give the largest and smallest values of $x^TBx$ for points whose distance from the origin is 1, i.e. the extreme values of $x^TBx$ on the unit sphere. The intermediate eigenvalues can also be interpreted as extreme values when x is restricted to be perpendicular to the previous eigenvectors.
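Again as a numerical sketch rather than a proof (B is made up), the quotient $x^TBx / x^Tx$ is bounded by the extreme eigenvalues, attains them at the corresponding eigenvectors, and drops to the next eigenvalue once x is forced perpendicular to the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up positive definite B.
B = np.array([[5.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [0.0, 1.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order

rayleigh = lambda x: (x @ B @ x) / (x @ x)

assert np.isclose(rayleigh(eigvecs[:, -1]), eigvals[-1])   # max = largest eigenvalue
assert np.isclose(rayleigh(eigvecs[:, 0]), eigvals[0])     # min = smallest eigenvalue

for _ in range(1000):
    x = rng.standard_normal(3)
    assert eigvals[0] - 1e-9 <= rayleigh(x) <= eigvals[-1] + 1e-9

# Restricting x to be perpendicular to the top eigenvector caps the
# quotient at the second-largest eigenvalue.
e_top = eigvecs[:, -1]
x = rng.standard_normal(3)
x_perp = x - (x @ e_top) * e_top
assert rayleigh(x_perp) <= eigvals[-2] + 1e-9
```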

Supplement

For linearly dependent vectors, one vector can be rewritten as a linear combination of the others (by moving them to the RHS).

Two differences between the algebra of real numbers and the algebra of matrices are that, in general, $AB \neq BA$, and that $AB = 0$ does not necessarily mean that $A=0$ or $B=0$.

The row rank is equal to the column rank; each is the maximum number of linearly independent rows (or columns).

If $Ax=0$ implies $x=0$, then A is nonsingular.

$\text{tr}(B^{-1}AB) = \text{tr}(A)$
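A quick check of this trace identity with made-up matrices (B just needs to be invertible):

```python
import numpy as np

# Made-up matrices; B is invertible (det = 13).
A = np.array([[1.0, 4.0, 2.0],
              [0.0, 5.0, 3.0],
              [2.0, 1.0, 6.0]])
B = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])

assert np.isclose(np.trace(np.linalg.inv(B) @ A @ B), np.trace(A))
```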

An eigenvalue has only one unique normalised eigenvector, provided no eigenvalues are repeated.

tags: mathematics - statistics