Chapter 1
by Arpon Sarker
Applied Multivariate Statistical Analysis Chapter 1: Aspects of Multivariate Analysis
Introduction
As stated in my previous article, I will be focusing more heavily on mathematics, especially statistics and the linear algebra that underpins it. The starting point is multivariate statistics, whose proofs require little calculus, following the textbook Applied Multivariate Statistical Analysis, 6th ed. (Pearson International Version) by Richard A. Johnson and Dean W. Wichern. This first chapter covers aspects of multivariate analysis: a discussion of the basic features of analysing data, namely application, organisation, data representation, and distance. The book presents data analysis as part of the scientific endeavour of investigating phenomena. The authors describe scientific inquiry as an iterative learning process, in which objectives/hypotheses are specified and then tested via the collection of measurements on many different variables. The objectives of scientific inquiry are:
- Data reduction/Structural simplification
- Sorting and Grouping
- Investigation of the dependence among variables
- Prediction
- Hypothesis construction and testing
An interesting quote by F. H. C. Marriott suggests that the results of a data analysis are secondary to informed opinion.
The Organisation of Data
Multivariate data can be represented using an array of $p \ge 1$ variables recorded for each of a number of distinct observations, which may be called items, individuals, or experimental units. Let $x_{jk} = \text{measurement of the kth variable on the jth item}$. These measurements can be displayed as a rectangular array for $n$ items on $p$ variables, where the variables are the columns and the items are the rows: \(\textbf{X} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \\ \end{bmatrix}\)
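As a quick illustration, here is a minimal NumPy sketch of the data array $\textbf{X}$ as an $n \times p$ matrix with items as rows and variables as columns (the measurements are made up for illustration, not taken from the book):

```python
import numpy as np

# Hypothetical data: n = 4 items (rows) measured on p = 2 variables (columns)
X = np.array([
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
])

n, p = X.shape          # n = 4 items, p = 2 variables
print(X[1, 0])          # x_{21}: variable 1 measured on item 2 (NumPy is 0-indexed)
```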
Descriptive Statistics
It is hard to extract pertinent information at a glance from a large data set. Descriptive statistics summarise most of this information by calculating certain summary numbers. For the rest of this chapter, and perhaps the rest of the book, we look at sample statistics, where the $n$ measurements represent a subset of the full set of measurements.
Sample Mean: \(\bar{x}_k = \frac{1}{n} \sum_{j=1}^n x_{jk}, \quad \text{k = 1, 2,..., p}\)
This is a measure of location. It is just the simple arithmetic mean of each variable (column) in the main data array $\textbf{X}$. The means can also be collected into a vector whose kth entry is the mean of the kth variable: \(\bar{x} = \begin{bmatrix} \bar{x_1} \\ \bar{x_2} \\ \vdots \\ \bar{x_p} \end{bmatrix}\)
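A minimal NumPy sketch of the sample mean vector, using the same hypothetical data as above (a column-wise mean over the items):

```python
import numpy as np

X = np.array([[42.0, 4.0],
              [52.0, 5.0],
              [48.0, 4.0],
              [58.0, 3.0]])   # hypothetical n = 4 items, p = 2 variables

x_bar = X.mean(axis=0)        # mean of each column (variable); shape (p,)
print(x_bar)                  # [50.  4.]
```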
Sample Variance: \(s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^n (x_{jk} - \bar{x_k})^2, \quad \text{k = 1, 2,..., p}\)
This is a measure of spread. The denominator can also be $n-1$, which has its advantages, especially when $n$ is small. The $s_{kk}$ values lie on the main diagonal of the variance-covariance matrix, in the kth row and kth column for variable $k$. This matrix/array is below:
$$
S_n = \begin{bmatrix} s_{11} & s_{12} & \ldots & s_{1p} \\ s_{21} & s_{22} & \ldots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \ldots & s_{pp} \end{bmatrix}
$$
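A minimal NumPy sketch of $S_n$ with the divisor $n$, on the same hypothetical data (note that `np.cov` defaults to the $n-1$ divisor, so `bias=True` is passed to match the formula above):

```python
import numpy as np

X = np.array([[42.0, 4.0],
              [52.0, 5.0],
              [48.0, 4.0],
              [58.0, 3.0]])                   # hypothetical n = 4 items, p = 2 variables

x_bar = X.mean(axis=0)                        # sample mean vector
S_n = (X - x_bar).T @ (X - x_bar) / len(X)    # divisor n, as in the formula above

# Same result via NumPy's covariance routine (bias=True selects the n divisor)
assert np.allclose(S_n, np.cov(X, rowvar=False, bias=True))

print(np.diag(S_n))                           # sample variances s_kk
print(S_n)                                    # full variance-covariance matrix
```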
Sample Standard Deviation: \(\sqrt{s_{kk}}\)
This is just the square root of the sample variance. The neat thing about using the standard deviation is that it has the same units as the measurements, since taking the square root converts $\text{unit}^2$ back to $\text{unit}$, which makes it much more useful for describing and comparing the spread of the data on the original measurement scale.
Sample Covariance: \(s_{ik} = \frac{1}{n} \sum_{j=1}^n (x_{ji} - \bar{x_i})(x_{jk} - \bar{x_k}), \quad \text{i,k = 1, 2,..., p}\)
This measures the linear association between two variables $i$ and $k$. If the two variables tend to take large values together and small values together, the linear association is positive; if one tends to be large when the other is small (and vice versa), the association is negative; and if there is no particular pattern, the covariance is approximately 0. These values are also stored in the variance-covariance matrix.
Sample Correlation Coefficient: \(r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{1}{n} \sum_{j=1}^n \left(\frac{x_{ji}- \bar{x_i}}{\sqrt{s_{ii}}}\right) \left(\frac{x_{jk} - \bar{x_k}}{\sqrt{s_{kk}}}\right)\)
This is also called Pearson's product-moment correlation coefficient. It is another measure of linear association between two variables, but it is more useful because it does not depend on the units of measurement: it is a standardised version of the sample covariance. The sample covariance carries units of $\text{unit}_1 \cdot \text{unit}_2$, which is not intuitive to interpret, so one covariance cannot be compared directly with another. However, dividing by the respective standard deviations turns the formula into the one on the right, where we compute the Z-scores first and then compute their covariance. Hence, the sample correlation coefficient is the sample covariance of the standardised variables. Correlation is easier to interpret because:
- $r \in [-1,1]$
- r measures strength of linear association
- r = 0: no linear association
- r < 0: when one variable is above its average, the other tends to be below its average
- r > 0: the two variables tend to be above (or below) their averages together
- $r_{ik}$ remains unchanged under the linear transformations below, provided $a$ and $c$ have the same sign:
- ith var: $y_{ji} = ax_{ji} + b$
- kth var: $y_{jk} = cx_{jk} + d$
Proof that corr(aX+b, cY+d) = corr(X,Y)
\[corr(aX+b, cY+d) = \frac{cov(aX+b, cY+d)}{\sqrt{Var(aX+b)}\sqrt{Var(cY+d)}}\] \[= \frac{ac[cov(X,Y)]}{\sqrt{a^2Var(X)}\sqrt{c^2Var(Y)}} = \frac{ac[cov(X,Y)]}{|a||c|\sqrt{Var(X)}\sqrt{Var(Y)}}\] \[= corr(X,Y), \quad \text{since } ac = |a||c| \text{ when } a \text{ and } c \text{ have the same sign.}\] The sample correlations are collected in the matrix: \(R_n = \begin{bmatrix} 1 & r_{12} & \ldots & r_{1p} \\ r_{21} & 1 & \ldots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \ldots & 1 \end{bmatrix}\)
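A minimal NumPy sketch (hypothetical data, as before) of the sample correlation matrix, together with a numerical check that $r_{ik}$ is unchanged by a linear transformation of each variable when $a$ and $c$ have the same sign:

```python
import numpy as np

X = np.array([[42.0, 4.0],
              [52.0, 5.0],
              [48.0, 4.0],
              [58.0, 3.0]])            # hypothetical n = 4 items, p = 2 variables

R = np.corrcoef(X, rowvar=False)       # sample correlation matrix (p x p)

# Transform each variable: y_j1 = a*x_j1 + b and y_j2 = c*x_j2 + d with a, c > 0
a, b, c, d = 2.0, 10.0, 0.5, -3.0
Y = np.column_stack([a * X[:, 0] + b, c * X[:, 1] + d])

assert np.allclose(R, np.corrcoef(Y, rowvar=False))   # correlation is unchanged
print(R)
```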
Data Displays and Pictorial Representation
A scatterplot matrix, arranged like the variance-covariance array with a scatterplot of each pair of variables in the off-diagonal panels and a marginal boxplot of each variable on the main diagonal, is a graphical technique for analysing the dataset visually. The data can be viewed either as n points in p dimensions (the axes being the variables) or as p points in n dimensions (each axis being an item); the latter view measures the closeness of points in n dimensions and is related to measures of association between the corresponding variables.
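A minimal sketch of a scatterplot matrix using pandas (hypothetical random data; note that `scatter_matrix` puts histograms or density estimates on the diagonal rather than boxplots):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)),
                  columns=["x1", "x2", "x3"])   # hypothetical data: n = 50, p = 3

scatter_matrix(df, diagonal="hist")             # off-diagonal panels: pairwise scatterplots
plt.show()
```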
Other data displays include: linked 2D scatterplots with brushing to highlight the points corresponding to a chosen range of one of the variables; rotating plots in a 3D graph; growth curves; stars; and Chernoff faces.
Distance
Euclidean distance is just the Pythagorean theorem applied between two points to get the straight-line distance between them; however, it does not take into account the variances and covariances, i.e. the statistical structure or clustering of the dataset. For example, think of a cloud of points forming an ellipse: under Euclidean distance, a point that lies outside the cluster but happens to sit near the cluster's mean gets a smaller distance than a point inside the cluster that is geometrically further from the mean. This makes little sense, since points in the cluster should count as closer because they are more similar to the rest of the data, but Euclidean distance does not capture this. Hence, we use statistical distance.
Euclidean Distance: \(d(P,Q) = \sqrt{(x_1-y_1)^2 + \ldots + (x_p-y_p)^2}\) Squaring both sides gives the equation below of a hypersphere (a circle if p = 2) centred at Q, consisting of all points a constant distance $c$ from the centre. \(d^2(P,Q) = c^2 = (x_1-y_1)^2 + \ldots + (x_p-y_p)^2\)
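A minimal NumPy sketch of the Euclidean distance between two points P and Q (hypothetical coordinates):

```python
import numpy as np

P = np.array([1.0, 4.0, 2.0])
Q = np.array([3.0, 1.0, 0.0])

d = np.sqrt(np.sum((P - Q) ** 2))          # the definition above
assert np.isclose(d, np.linalg.norm(P - Q))
print(d)                                   # sqrt(4 + 9 + 4) = sqrt(17)
```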
Statistical Distance (x1 uncorrelated/independent of x2): \(d(P,Q) = \sqrt{(\frac{x_1-y_1}{\sqrt{s_{11}}})^2 + \ldots + (\frac{x_p-y_p}{\sqrt{s_{pp}}})^2} = \sqrt{\frac{(x_1-y_1)^2}{s_{11}} + \ldots + \frac{(x_p-y_p)^2}{s_{pp}}}\)
For this statistical distance formula, coordinates with lower variability are weighted more heavily, since a large deviation in a low-variance variable is much more unexpected than the same deviation in a variable with high variance. Dividing each difference by the corresponding standard deviation (equivalently, each squared difference by the sample variance) produces this weighting.
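A minimal NumPy sketch of this standardised distance for the uncorrelated case, using hypothetical sample variances for the two coordinates:

```python
import numpy as np

P = np.array([1.0, 4.0])
Q = np.array([3.0, 1.0])
s_diag = np.array([4.0, 9.0])                      # hypothetical sample variances s_11, s_22

d_stat = np.sqrt(np.sum((P - Q) ** 2 / s_diag))    # divide each squared gap by s_kk
print(d_stat)                                      # sqrt(4/4 + 9/9) = sqrt(2)
```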
Statistical Distance ($x_1$ correlated with/dependent on $x_2$): \(\begin{aligned} d^2(P,Q) = {} & a_{11}(x_1 - y_1)^2 + a_{22}(x_2-y_2)^2 + \ldots + a_{pp}(x_p-y_p)^2 \\ & + 2a_{12}(x_1-y_1)(x_2-y_2) + 2a_{13}(x_1-y_1)(x_3-y_3) + \ldots + 2a_{p-1,p}(x_{p-1}-y_{p-1})(x_p-y_p) \end{aligned}\)
This corresponds to rotating the coordinate axes of the dataset through an angle $\theta$, so that the new axes align with the spread of the points: the rotated coordinate $\tilde{x}_1$ is written in terms of the original coordinates as $x_1 \cos{\theta} + x_2 \sin{\theta}$, and $\tilde{x}_2$ as $-x_1 \sin{\theta} + x_2 \cos{\theta}$. Formally substituting the rotated coordinates, with their corresponding rotated sample variances $\tilde{s}_{kk} = \frac{1}{n} \sum_{j=1}^n (\tilde{x}_{jk} - \bar{\tilde{x}}_k)^2$, into the standardised distance yields $a_{11}$ and the other coefficients in the equation above; these must be such that the distance is nonnegative for all possible values of the x's. The proof is quite long, but you start with the rotated sample variances and group the constants into the $a$ coefficients.
The coefficients $a$ can be described as an array: \(\begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1p} \\ a_{12} & a_{22} & \ldots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{1p} & a_{2p} & \ldots & a_{pp} \\ \end{bmatrix}\) The array is symmetric ($a_{ik} = a_{ki}$), which accounts for the factor of 2 multiplying each cross term between two different variables.
Setting $d^2(P,Q) = c^2$, where $c$ is the constant distance from the centre, gives the analogue of the hypersphere equation above: for statistical distance, the set of all points at distance $c$ from Q is an ellipse (a hyperellipsoid for general p), whose axes are rotated relative to the coordinate axes whenever the cross-product terms are present.
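A minimal NumPy sketch of this general statistical distance written as a quadratic form $d^2(P,Q) = (\mathbf{x}-\mathbf{y})^{\top} A (\mathbf{x}-\mathbf{y})$, where $A$ is the symmetric, positive definite array of $a_{ik}$ coefficients above. The particular matrix here is made up for illustration; one standard choice, developed later in the book, is to take $A$ as the inverse of the sample covariance matrix, which gives the usual statistical (Mahalanobis) distance:

```python
import numpy as np

P = np.array([1.0, 4.0])
Q = np.array([3.0, 1.0])

# Hypothetical symmetric, positive definite coefficient array [a_ik]
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

diff = P - Q
d2 = diff @ A @ diff          # quadratic form (x - y)^T A (x - y)
d = np.sqrt(d2)

# Expanded form, matching the equation above term by term (p = 2)
d2_expanded = (A[0, 0] * diff[0] ** 2 + A[1, 1] * diff[1] ** 2
               + 2 * A[0, 1] * diff[0] * diff[1])
assert np.isclose(d2, d2_expanded)
print(d)
```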
tags: mathematics - statistics