8 January 2025

Applied Multivariate Statistical Analysis Chapter 4 - The Multivariate Normal Distribution

by Arpon Sarker

Introduction

This is - in my humble opinion - the most important chapter as everything before this was set up to explain this distribution and everything after will investigate or use this distribution in some way. Why a multivariate normal distribution?

  1. serves as a bona fide population model
  2. the sampling distributions of many multivariate statistics are approximate regardless of parent population distribution (central limit effect)

Much of the logic here is a direct extension of the univariate normal density.

The Multivariate Normal Density

The multivariate density function: \(f(\textbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} (x-\mu)^T\Sigma^{-1}(x-\mu)}\)

The constant $(2\pi)^{p/2} |\Sigma|^{1/2}$ makes it so that the volume under the multivariate density function's surface is unity for any $p$.
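As a quick sanity check, here is a minimal sketch (with made-up $\mu$, $\Sigma$ and $\textbf{x}$, purely for illustration) that evaluates the density directly from the formula above and compares it against scipy's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up p = 2 parameters, for illustration only
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 2.5])

p = len(mu)
diff = x - mu
Sigma_inv = np.linalg.inv(Sigma)

# Density evaluated directly from the formula above
quad_form = diff @ Sigma_inv @ diff                                  # (x - mu)^T Sigma^{-1} (x - mu)
norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
f_manual = np.exp(-0.5 * quad_form) / norm_const

# Cross-check against scipy
f_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(f_manual, f_scipy)   # the two values should agree
```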

The graphs of the multivariate densities (for $p=2$ or $p=3$) show that uncorrelated variables give a distribution spread symmetrically around the mean, whereas correlated variables cause the probability to concentrate along a line.

The term $(x-\mu)^T\Sigma^{-1}(x-\mu)$ in the exponent is just the squared statistical distance from some point $\textbf{x}$ to the mean, measured in standard deviation units, which is analogous to the univariate case: $\frac{(x-\mu)^2}{\sigma^2}$. This is just the quadratic form of the distance function previously discussed.

\(\text{Constant probability density contour} = \{\textbf{x}: (x-\mu)^T\Sigma^{-1}(x-\mu) = c^2\}\)

The above is just the surface of an ellipsoid centred at the mean, where the axes point in the directions of the eigenvectors of $\Sigma^{-1}$ and the lengths are proportional to $\frac{c}{\sqrt{\lambda_i}}$, with $\lambda_i$ the eigenvalues of $\Sigma^{-1}$. The properties below help us with calculating the pdf, the contours, or any other additional properties; a short numerical check of a few of them follows the list.

Properties

  1. The first property is not of the multivariate density function itself but of the $\Sigma^{-1}$ found in its exponent. Calculating the eigenvectors and eigenvalues of $\Sigma^{-1}$ directly is challenging, so let's calculate them in relation to $\Sigma$:
    \(\Sigma e = \lambda e \implies \Sigma^{-1}e = (\frac{1}{\lambda})e\)
    Hence, to calculate the contour of constant probability, just use the same eigenvectors and the reciprocal eigenvalues of $\Sigma$. So this time, the directions are based on the eigenvectors of $\Sigma$ but the lengths of the axes are now $\pm c\sqrt{\lambda_i}$.

  2. The above was for contours of constant probability; now we want contours containing some percentage of the probability. This leads us to the next property: the solid ellipsoid of $\textbf{x}$ values satisfying \((x-\mu)^T\Sigma^{-1}(x-\mu) \le \chi^2_p(\alpha)\) has probability $1-\alpha$, where $\chi^2_p(\alpha)$ is the upper $(100\alpha)$th percentile of the $\chi^2_p$ distribution.

  3. Uncorrelated ($\rho_{12} = 0$) variables let us write the multivariate density function as a product of univariate density functions. We show this using an example. The normal density in terms of the correlation for $p=2$ is
    $$ f(x_1,x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}} \exp \left\{ -\frac{1}{2(1-\rho_{12}^2)} \left[ \left( \frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left( \frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2 \rho_{12}\left( \frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)\left( \frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right) \right] \right\} $$
    Setting the correlation term $\rho_{12}$ to 0 gives
    $$ f(x_1,x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}} \exp \left\{ -\frac{1}{2} \left[ \left( \frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left( \frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2 \right] \right\} = \frac{1}{\sqrt{2\pi\sigma_{11}}} e^{-\frac{1}{2}\left( \frac{x_1-\mu_1}{ \sqrt{\sigma_{11}}}\right)^2} \cdot \frac{1}{\sqrt{2\pi\sigma_{22}}} e^{-\frac{1}{2}\left( \frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2} $$
    $$ \therefore f(x_1,x_2) = f(x_1)f(x_2) $$

  4. Linear combinations of the components of $X$ are normally distributed.

  5. All subsets (partitions) of the components of $X$ have a multivariate normal distribution.

  6. Zero covariance implies that the corresponding components are independently distributed.

  7. The conditional distributions of the components are (multivariate) normal: $f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}$
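As promised above, here is a minimal numerical sketch of a few of these properties, using a made-up $\mu$ and $\Sigma$ (purely illustrative): the eigenvalues of $\Sigma^{-1}$ are the reciprocals of those of $\Sigma$ (property 1), roughly $1-\alpha$ of samples fall inside the $\chi^2_p(\alpha)$ ellipsoid (property 2), and a linear combination $a^T X$ has mean $a^T\mu$ and variance $a^T\Sigma a$ (property 4).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Made-up parameters, for illustration only
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
p = len(mu)

# Property 1: Sigma^{-1} has the same eigenvectors as Sigma and reciprocal eigenvalues
vals = np.linalg.eigvalsh(Sigma)
inv_vals = np.linalg.eigvalsh(np.linalg.inv(Sigma))
print(np.allclose(np.sort(1 / vals), np.sort(inv_vals)))    # True

# Property 2: the chi-square ellipsoid contains about 1 - alpha of the probability
alpha = 0.05
X = rng.multivariate_normal(mu, Sigma, size=100_000)
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)
print((d2 <= chi2.ppf(1 - alpha, df=p)).mean())              # ~0.95

# Property 4: a linear combination a^T X has mean a^T mu and variance a^T Sigma a
a = np.array([1.0, 2.0, -1.0])
lin = X @ a
print(lin.mean(), a @ mu)                # close
print(lin.var(ddof=1), a @ Sigma @ a)    # close
```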

Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation (MLE)

The joint density of $X_1 , \ldots , X_n$ is \(\frac{1}{(2\pi)^{np/2} |\Sigma|^{n/2}}e^{-\frac{1}{2} \sum_{j=1}^n (x_j - \mu)^T \Sigma^{-1} (x_j - \mu)}\)

Once the observations $x_1, \ldots, x_n$ are obtained, this is just a function of $\mu$ and $\Sigma$ and becomes the likelihood.

Note: probability describes how likely a value is to be observed given a fixed distribution. Likelihood, on the other hand, is the plausibility of the distribution's parameters given an observation, and maximum likelihood estimation finds the "best" parameters that explain the data, which is done by maximising the joint density function above.

The maximum likelihood estimators are $\hat{\mu} = \bar{X}$ and $\hat{\Sigma} = \frac{1}{n}\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T=\frac{(n-1)}{n}S$. The maximised likelihood can also be written in the form $L(\hat{\mu}, \hat{\Sigma}) = \text{constant} \cdot (\text{generalised variance})^{-n/2}$: \(L(\hat{\mu}, \hat{\Sigma}) = \frac{1}{(2\pi)^{np/2}}e^{-np/2} \frac{1}{|\hat{\Sigma}|^{n/2}}\) This expression is the maximum of the likelihood.
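A minimal sketch of these estimators on a made-up data matrix (data and names are illustrative only); the maximised likelihood is computed on the log scale to avoid numerical underflow:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data matrix: n observations of a p-variate vector
n, p = 200, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

mu_hat = X.mean(axis=0)                  # MLE of mu is the sample mean
S = np.cov(X, rowvar=False)              # unbiased sample covariance (divides by n - 1)
Sigma_hat = (n - 1) / n * S              # MLE of Sigma

# Maximum of the likelihood, on the log scale
_, logdet = np.linalg.slogdet(Sigma_hat)
log_L_max = -(n * p / 2) * np.log(2 * np.pi) - n * p / 2 - (n / 2) * logdet
print(mu_hat, log_L_max)
```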

$\bar{X}$ and $S$ are deemed sufficient statistics, as all of the information about $\mu$ and $\Sigma$ in the data matrix is contained in $\bar{X}$ and $S$.

The Sampling Distributions of $\bar{X}$ and $S$

The sampling distribution of $\bar{X}$ has mean $\mu$ and covariance matrix $\frac{1}{n}\Sigma$. For the sample covariance matrix we have $(n-1)S = \sum_{j=1}^n (X_j - \bar{X})(X_j - \bar{X})^T$, and its sampling distribution is called the Wishart distribution, which is the distribution of a sum of independent products of multivariate normal random vectors.
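A quick simulation sketch of the first claim, with made-up $\mu$ and $\Sigma$: the empirical covariance of repeated sample means should be close to $\Sigma / n$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up population parameters, for illustration
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
n, reps = 50, 20_000

# Draw many samples of size n and record each sample mean
means = np.array([rng.multivariate_normal(mu, Sigma, size=n).mean(axis=0)
                  for _ in range(reps)])

print(np.cov(means, rowvar=False))   # should be close to Sigma / n
print(Sigma / n)
```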

Large-Sample Behaviour of $\bar{X}$ and $S$

Note: the conclusions found here hold regardless of the form of the parent population and only require it to have a mean $\mu$ and covariance $\Sigma$.

The law of large numbers states that \(\bar{X} = \frac{X_1 + \ldots + X_n}{n} \text{ converges in probability to } \mu \text{ as } n \text{ increases.}\)

The central limit theorem states: \(\sqrt{n}(\bar{X} - \mu) \text{has an approximate } N_p(0, \Sigma) \text{ distribution for large sample sizes.}\)
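A rough sketch of the central limit effect under an assumed non-normal parent (independent standard exponential components, chosen only for illustration): the standardised sample mean $\sqrt{n}(\bar{X} - \mu)$ behaves approximately like $N_p(0, \Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Non-normal parent: independent standard exponential components
p, n, reps = 2, 500, 20_000
mu = np.ones(p)       # a standard exponential has mean 1
Sigma = np.eye(p)     # ... and variance 1; components are independent

# sqrt(n) (X_bar - mu) over many repetitions
Z = np.array([np.sqrt(n) * (rng.exponential(1.0, size=(n, p)).mean(axis=0) - mu)
              for _ in range(reps)])

print(Z.mean(axis=0))            # close to the zero vector
print(np.cov(Z, rowvar=False))   # close to Sigma (the identity here)
```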

Assessing the Assumption of Normality

To assess whether the observations in a data matrix follow a normal distribution, we can construct Q-Q plots, which compare the quantiles of the data against the expected quantiles under normality; if the data are normally distributed, the points should fall on a straight line. Using the correlation coefficient of the points in the plot, we can test the hypothesis of normality at a given level of significance, since the 'straightness' of a Q-Q plot can be measured by correlation. Another, more formal, method for multivariate normality is the $\chi_p^2$ plot, which is based on the squared distances from the mean.
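A minimal sketch of both checks, assuming a made-up data matrix `X` (all names here are illustrative): a normal Q-Q plot for one variable via scipy, and a $\chi_p^2$ plot of the ordered squared distances.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)

# Made-up data matrix, for illustration
n, p = 100, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot of the first variable against normal quantiles
stats.probplot(X[:, 0], dist="norm", plot=ax1)
ax1.set_title("Normal Q-Q plot (variable 1)")

# Chi-square plot: ordered squared distances vs chi^2_p quantiles
diff = X - X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.sort(np.einsum("ij,jk,ik->i", diff, S_inv, diff))
q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
ax2.plot(q, d2, "o")
ax2.set_xlabel(r"$\chi^2_p$ quantiles")
ax2.set_ylabel("ordered squared distances")
ax2.set_title("Chi-square plot")

plt.tight_layout()
plt.show()
```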

Detecting Outliers and Cleaning Data

A simple way of detecting outliers is to tabulate the standardised values and the squared distances from the mean, then pick out standardised values that are unusually small or large and squared distances that are unusually large.
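A rough sketch of this check, reusing the squared distances from the previous section with a $\chi^2_p$ quantile as an (assumed, illustrative) cutoff:

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers(X, alpha=0.005):
    """Return indices of rows whose squared distance from the mean exceeds the
    chi-square (1 - alpha) quantile; the cutoff choice here is illustrative."""
    n, p = X.shape
    diff = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    return np.where(d2 > chi2.ppf(1 - alpha, df=p))[0], d2
```

Any flagged observations would then be examined (and corrected or set aside) before re-checking normality.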

Transformations to Near Normality

With the Q-Q plots or $\chi_p^2$ plots, we may see the potential to transform the data into another representation that follows the normal distribution and its properties much more closely. The book lists some helpful transformations: taking the square root of counts, applying the logit function to proportions - where $\hat{p}_{ij}$ is just the proportion of observations falling in the interval $(\bar{x}_i - j\sqrt{s_{ii}}, \bar{x}_i + j\sqrt{s_{ii}})$ - and applying Fisher's z-transformation to correlations.

Another method is to let the data suggest a transformation by using power transformations: fractional exponents (below 1) "pull in" large values, while exponents above 1 spread them out, improving the symmetry about the mean. Plotting $\ell(\lambda)$ - the log-likelihood associated with the Box-Cox transformation - versus $\lambda$ and choosing the $\hat{\lambda}$ that maximises it improves the approximation to normality, but this is no guarantee. For multivariate observations, just transform each variable (each marginal distribution) to near normality. If this doesn't work, take the $\hat{\lambda}_1, \ldots, \hat{\lambda}_p$ from the previous step and maximise the joint expression $\ell(\lambda_1, \lambda_2, \ldots, \lambda_p)$, although this second method is unlikely to yield significantly better results.
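A minimal sketch of the power-transformation step, using scipy's Box-Cox helpers on made-up positive data (all values here are illustrative): `boxcox` returns the transformed data together with the $\hat{\lambda}$ that maximises the log-likelihood, and `boxcox_llf` lets us plot $\ell(\lambda)$ versus $\lambda$.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)

# Made-up positive, right-skewed data
x = rng.lognormal(mean=0.0, sigma=0.7, size=200)

# Box-Cox: lam_hat maximises the log-likelihood l(lambda)
x_transformed, lam_hat = stats.boxcox(x)

# Plot l(lambda) versus lambda and mark the maximiser
lambdas = np.linspace(-2, 2, 200)
ll = [stats.boxcox_llf(lam, x) for lam in lambdas]
plt.plot(lambdas, ll)
plt.axvline(lam_hat, linestyle="--")
plt.xlabel(r"$\lambda$")
plt.ylabel(r"$\ell(\lambda)$")
plt.show()

# For multivariate data, transform each variable marginally, e.g.
# X_t = np.column_stack([stats.boxcox(X[:, i])[0] for i in range(X.shape[1])])
```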

tags: mathematics - statistics