Applied Multivariate Statistical Analysis Chapter 7 - Multivariate Linear Regression Models
by Arpon Sarker
Introduction
Multivariate linear regression is mainly applied for prediction. This chapter goes through different examples of linear regression models with one response variable or many, and with predictor variables that are either fixed or jointly distributed with the response. Response variables are the $Y$ variables, the dependent variables you’re trying to predict, whereas predictor variables are the $z_1, \ldots, z_r$ variables, the independent variables you are given.
The Classical Linear Regression Model
The linear regression model with a single response variable is \(Y = \beta_0 + \beta_1z_1 + \ldots + \beta_rz_r + \varepsilon \\ \text{(Response) = (mean of Y, depending on the predictors) + (error)}\)
The error terms have the following properties of their expectations being 0, a constant variance, and 0 covariance between different trials/observations. In matrix form this is \(\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n\end{bmatrix} = \begin{bmatrix}1 & z_{11} & \ldots & z_{1r} \\ 1 & z_{21} & \ldots & z_{2r} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & z_{n1} & \ldots & z_{nr}\end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_r\end{bmatrix} + \begin{bmatrix}\varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n\end{bmatrix}\) Here, the multivariate properties are $E(\varepsilon)=0$ and $\text{Cov}(\varepsilon) = \sigma^2I$. The $\textbf{Z}$ matrix is a design matrix and has ones in the first column to be able to multiply the intercept or constant $\beta_0$.
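As a quick sanity check of the matrix form, here is a minimal numpy sketch that builds a design matrix $\textbf{Z}$ with a leading column of ones and simulates $Y = Z\beta + \varepsilon$ with uncorrelated errors; the sample size, predictors, and coefficients are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, r = 20, 2                            # n observations, r predictor variables (made up)
z = rng.uniform(0, 10, size=(n, r))     # raw predictor values
Z = np.column_stack([np.ones(n), z])    # design matrix: leading column of ones for beta_0
beta = np.array([1.0, 2.0, -0.5])       # (beta_0, beta_1, beta_2), arbitrary
sigma = 1.0

eps = rng.normal(0, sigma, size=n)      # E(eps) = 0, Cov(eps) = sigma^2 * I
y = Z @ beta + eps                      # Y = Z beta + eps
```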
Least Squares Estimation
To “fit” the model to the observed $y$ values, we must determine the regression coefficients $\beta$ and the error variance $\sigma^2$ consistent with the available data. For a trial coefficient vector $\textbf{b}$, the difference between the observed response and the value the model would predict is $y_j - b_0 - b_1z_{j1} - \ldots - b_rz_{jr}$. The method of least squares selects $\textbf{b}$ to minimise the sum of squares of these differences: \(S(b) = \boldsymbol{(y-Zb)^T(y-Zb)}\)
The estimates $\hat{\beta}$ give fitted mean responses $\hat{y}_j = \hat{\beta}_0 + \hat{\beta}_1z_{j1} + \ldots + \hat{\beta}_rz_{jr}$ for which the sum of squared differences from the observed $y_j$ is as small as possible. The deviations $\hat{\varepsilon}_j = y_j - \hat{\beta}_0 - \hat{\beta}_1z_{j1} - \ldots - \hat{\beta}_rz_{jr}$ are called residuals, and the residual vector $\hat{\varepsilon} = y - Z\hat{\beta}$ contains the information about $\sigma^2$.
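A minimal numpy sketch of the least squares fit on the same kind of toy data, solving the normal equations $(Z^TZ)\hat{\beta} = Z^Ty$ directly (in practice `np.linalg.lstsq` is the numerically safer choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# beta_hat solves the normal equations (Z^T Z) b = Z^T y
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
y_hat = Z @ beta_hat                  # fitted mean responses
resid = y - y_hat                     # residuals; these carry the information about sigma^2
s2 = resid @ resid / (n - r - 1)      # usual unbiased estimate of sigma^2
print(beta_hat, s2)
```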
Sum-of-Squares Decomposition
The decomposition of the sum of squares about the mean suggests that the model’s quality of fit can be measured by the coefficient of determination $R^2$, a metric I’ve used in basically all my high school reports in Excel (intuitively knowing it measured correlation and the quality of the “line of best fit”). \(\sum_{j=1}^n (y_j - \bar{y})^2 = \sum_{j=1}^n(\hat{y}_j - \bar{y})^2 + \sum_{j=1}^n\hat{\varepsilon}_j^2 \\ \text{(total sum of squares about mean) = (regression sum of squares) + (residual/error sum of squares)} \\ R^2 = \frac{\sum_{j=1}^n(\hat{y}_j - \bar{y})^2}{\sum_{j=1}^n (y_j - \bar{y})^2}\)
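A small sketch (toy data again) that checks the decomposition numerically and computes $R^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
y_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

ss_total = np.sum((y - y.mean()) ** 2)       # total sum of squares about the mean
ss_reg   = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
ss_res   = np.sum((y - y_hat) ** 2)          # residual / error sum of squares

assert np.isclose(ss_total, ss_reg + ss_res)  # the decomposition holds (up to rounding)
r_squared = ss_reg / ss_total                 # equivalently 1 - ss_res / ss_total
print(r_squared)
```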
Geometry of Least Squares
From the model, the structure of the mean response vector is $$\text{Mean Response Vector} = E(Y) = Z\beta = \beta_0\begin{bmatrix}1 \\ 1 \\ \vdots \\ 1\end{bmatrix} + \beta_1\begin{bmatrix}z_{11} \\ z_{21} \\ \vdots \\ z_{n1}\end{bmatrix} + \ldots + \beta_r\begin{bmatrix}z_{1r} \\ z_{2r} \\ \vdots \\ z_{nr}\end{bmatrix}$$
The columns of $Z$ span the model plane of all linear combinations $Z\beta$, and the observation vector $y$ usually will not lie in this plane because of the random error $\varepsilon$ in $Y = Z\beta + \varepsilon$. Given the model plane, least squares works with the deviation vector $\textbf{y - Zb}$ between the observation vector and a vector in the model plane; its squared length is the sum of squares $S(b)$. Choosing $b = \hat{\beta}$ makes $\textbf{Zb}$ the point in the model plane closest to $y$, so $S(\hat{\beta})$ is the smallest value attainable. In other words, $\hat{y} = Z\hat{\beta}$ is the projection of $y$ onto the plane consisting of all linear combinations of the columns of $Z$ (their span), and the residual vector $\hat{\varepsilon} = y - \hat{y}$ is perpendicular to that plane, which is exactly what gives the minimum distance.
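The projection picture is easy to check numerically. In this sketch (toy data; the hat matrix is formed explicitly, which is fine for small examples) $Z^T\hat{\varepsilon}$ comes out as zero up to rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # projection ("hat") matrix onto the column space of Z
y_hat = H @ y                          # orthogonal projection of y onto span(Z)
resid = y - y_hat

# The residual vector is perpendicular to every column of Z, i.e. to the whole model plane
print(np.allclose(Z.T @ resid, 0.0))   # True (up to floating point error)
```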
Inferences about the Regression Model
The sampling distribution of $\hat{\beta}$ and of the residual sum of squares $\hat{\varepsilon}^T\hat{\varepsilon}$ must be found to make inferences. As in the last chapter, the F-distribution is used here to get the confidence region and simultaneous confidence intervals, but this can also be replaced by the one-at-a-time t value for intervals on the components of $\hat{\beta}$.
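A minimal sketch of the one-at-a-time t intervals, using the standard results $\text{Cov}(\hat{\beta}) = \sigma^2(Z^TZ)^{-1}$ and $s^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-r-1)$; the data are simulated and scipy is assumed for the t quantile:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, r = 20, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ y
resid = y - Z @ beta_hat
s2 = resid @ resid / (n - r - 1)                  # estimate of sigma^2

# One-at-a-time 95% intervals: beta_hat_k +/- t_{n-r-1}(0.025) * sqrt(s2 * [(Z^T Z)^{-1}]_{kk})
se = np.sqrt(s2 * np.diag(ZtZ_inv))
t_crit = stats.t.ppf(0.975, df=n - r - 1)
intervals = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(intervals)
```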
In order to investigate the effect of a particular set of predictor variables on the response, we conduct a hypothesis test with null hypothesis $H_0: \beta_{q+1} = \beta_{q+2} = \ldots = \beta_r = 0$, or $H_0: \beta_{(2)} = 0$, where $z_{q+1}, \ldots, z_r$ are the predictors under investigation, hypothesised not to influence $Y$. To test this, we conduct a likelihood ratio test comparing the maximum of the likelihood function under the reduced model containing only the remaining predictors $\beta_{(1)}$ with the maximum under the full model with all predictors, maximising over $\sigma^2$ in both cases.
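Under normal errors this likelihood ratio test is equivalent to an extra-sum-of-squares F test, which is easier to sketch; here is a version on simulated data (the split into $q$ kept and $r-q$ tested predictors is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, r, q = 20, 3, 1                 # full model has r predictors; H0 says the last r - q are 0
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)   # the tested betas really are 0 here

def rss(X, y):
    """Residual sum of squares from a least squares fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

rss_full    = rss(Z, y)                 # all r predictors
rss_reduced = rss(Z[:, :q + 1], y)      # intercept plus z_1, ..., z_q only

# Extra-sum-of-squares F statistic; large values are evidence against H0
F = ((rss_reduced - rss_full) / (r - q)) / (rss_full / (n - r - 1))
p_value = stats.f.sf(F, r - q, n - r - 1)
print(F, p_value)
```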
Inferences from the Estimated Regression Function
The fitted regression model can be used to solve two problems: (1) estimating the regression function at $z_0$ and (2) predicting the value of the response $Y$ at $z_0$. To estimate the regression function at $z_0$, we use the expectation $E(Y_0|z_0) = z_0^T\beta$, estimated by the least squares value $z_0^T\hat{\beta}$. The corresponding confidence interval is based on the t-distribution (one-at-a-time). For forecasting a new observation at $z_0$, the forecast-error variance and the prediction interval, also based on the t-distribution, were given. The interval for predicting a new value is much wider than the one for estimating the regression function because of the additional unknown error term $\varepsilon_0$.
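A sketch of both intervals at a made-up point $z_0$. The only difference is the extra “+1” inside the square root of the prediction interval, which is where the unknown $\varepsilon_0$ enters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, r = 20, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ y
resid = y - Z @ beta_hat
s2 = resid @ resid / (n - r - 1)
t_crit = stats.t.ppf(0.975, df=n - r - 1)

z0 = np.array([1.0, 5.0, 5.0])          # new point, with the leading 1 for the intercept (made up)
fit0 = z0 @ beta_hat                    # estimate of E(Y_0 | z_0)
h0 = z0 @ ZtZ_inv @ z0

ci = (fit0 - t_crit * np.sqrt(s2 * h0), fit0 + t_crit * np.sqrt(s2 * h0))              # mean response
pi = (fit0 - t_crit * np.sqrt(s2 * (1 + h0)), fit0 + t_crit * np.sqrt(s2 * (1 + h0)))  # new observation
print(ci, pi)                           # the prediction interval is visibly wider
```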
Model Checking and Other Aspects of Regression
Once we have our model, we need to check whether it is “correct” enough to make inferences. Here we use studentised residuals. These residuals can be plotted in a number of ways to check for possible anomalies (a small plotting sketch follows the list):
- Plot the residuals $\hat{\varepsilon}_j$ against the predicted values
- Plot the residuals $\hat{\varepsilon}_j$ against a predictor variable. If there is some pattern here, it suggests the need for more terms in the model
- Q-Q plots and histograms
- Plot the residuals versus time
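A minimal plotting sketch for the first three checks, on simulated data and assuming matplotlib; the time-order plot would simply put the observation index on the x-axis:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, r = 50, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
y_hat = H @ y
resid = y - y_hat
s2 = resid @ resid / (n - r - 1)
stud = resid / np.sqrt(s2 * (1 - np.diag(H)))   # (internally) studentised residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(y_hat, stud)                     # residuals vs fitted values
axes[0].set(xlabel="fitted value", ylabel="studentised residual")
axes[1].scatter(Z[:, 1], stud)                   # residuals vs a predictor
axes[1].set(xlabel="predictor z1")
axes[2].hist(stud, bins=10)                      # rough normality check (a Q-Q plot also works)
axes[2].set(xlabel="studentised residual")
plt.tight_layout()
plt.show()
```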
Leverage and Influence
This checks whether some outliers in the response or explanatory variables are hidden, since they may have a considerable effect on the analysis; such outliers can determine the fit before we even make plots to check correctness. The check uses leverage, which is just the diagonal entry of the hat matrix, to figure out which observations significantly affect inferences; these observations are said to be influential.
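A small sketch computing leverages from the hat matrix on toy data with one deliberately extreme predictor value; the $2(r+1)/n$ cutoff used below is just a common rule of thumb, not something from this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
Z[5, 1] = 40.0                          # plant one extreme predictor value

# Leverage depends only on the design matrix, not on the responses
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T    # hat matrix
leverage = np.diag(H)                   # h_jj: leverage of observation j

# Flag high-leverage points; 2(r+1)/n is a common rule-of-thumb cutoff
print(np.where(leverage > 2 * (r + 1) / n)[0])   # should include observation 5
```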
Selecting Predictor Variables from a Large Set
A $C_p$ plot for each subset of predictors will indicate models that forecast the observed responses well, and good models lie near the 45-degree line. An algorithm for selecting predictor variables out of many candidates is:
1. All possible simple linear regressions are considered. The predictor variable that explains the largest proportion of the variation in $Y$ is the first variable to enter the regression function.
2. The next largest significant contributor to the regression sum of squares (not already in the function) is entered, based on its F-test. The F-to-enter value is the F-statistic that must be exceeded before the contribution of a variable is deemed significant.
3. Once an additional variable is added to the equation, the individual contributions to the regression sum of squares of the other variables are checked using F-tests, and if a variable's F-statistic falls below the F-to-remove level, it is deleted from the function.
4. Repeat steps 2 and 3 until all possible additions are nonsignificant and all possible deletions are significant.
Another popular criterion for selecting predictor variables is Akaike’s Information Criterion (AIC), which rewards a small residual sum of squares while its second term penalises having too many parameters. We want to select models with smaller values of AIC.
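A sketch of subset selection by AIC on simulated data, using the common form $\text{AIC} = n\ln(\text{RSS}/n) + 2p$ (the additive constant differs between references, but the ranking of models is what matters); the stepwise F-to-enter/F-to-remove procedure above is not implemented here:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, r = 40, 4
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
# only the first two predictors matter in this made-up example
y = Z @ np.array([1.0, 2.0, -0.5, 0.0, 0.0]) + rng.normal(size=n)

def aic(X, y):
    """AIC in the form n*ln(RSS/n) + 2p, where p is the number of fitted coefficients."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return len(y) * np.log(e @ e / len(y)) + 2 * X.shape[1]

results = []
for k in range(1, r + 1):
    for subset in combinations(range(1, r + 1), k):   # which predictor columns to include
        cols = [0] + list(subset)                     # always keep the intercept column
        results.append((aic(Z[:, cols], y), subset))

best_aic, best_subset = min(results)                  # smaller AIC is preferred
print(best_subset, best_aic)
```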
Multivariate Regression Model
The multivariate regression model collects several multiple linear regression models, one per response variable $Y_1, \ldots, Y_m$, into a single matrix equation. The multivariate version of the likelihood ratio test for regression parameters uses Wilks’ lambda $\Lambda$ to test whether the predictors in $\beta_{(2)}$ influence the multiple responses. Other multivariate test statistics may be used, such as Pillai’s trace, the Hotelling-Lawley trace, and Roy’s greatest root. Predictions (for two responses) come with a confidence ellipse based on a $T^2$-statistic referred to the F-distribution, and the simultaneous prediction intervals and confidence ellipsoids are likewise based on the F-distribution. The prediction ellipse is bigger than the confidence ellipse because, as stated before, the unknown error term $\varepsilon_0$ adds more uncertainty to the prediction.
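A sketch of Wilks’ lambda for testing $H_0: \beta_{(2)} = 0$ with $m = 2$ responses, computed as the ratio of determinants of the residual sum-of-squares-and-cross-products matrices from the full and reduced fits; the data are simulated and the F approximation for the reference distribution is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, q, m = 40, 3, 1, 2                # m = 2 responses; H0: the last r - q rows of beta are 0
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, r))])
B = np.array([[1.0,  0.5],              # true coefficients (rows: intercept, z1, z2, z3)
              [2.0, -1.0],
              [0.0,  0.0],
              [0.0,  0.0]])
Y = Z @ B + rng.normal(size=(n, m))

def resid_sscp(X, Y):
    """Residual sums-of-squares-and-cross-products matrix from a multivariate LS fit."""
    Bhat = np.linalg.lstsq(X, Y, rcond=None)[0]
    E = Y - X @ Bhat
    return E.T @ E

E_full    = resid_sscp(Z, Y)             # n * Sigma_hat under the full model
E_reduced = resid_sscp(Z[:, :q + 1], Y)  # n * Sigma_hat_1 without the tested predictors

wilks_lambda = np.linalg.det(E_full) / np.linalg.det(E_reduced)
print(wilks_lambda)                      # small values are evidence against H0
```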
The Concept of Linear Regression
This discusses the linear regression model in a different setting where all the variables $Y, Z_1, Z_2, \ldots, Z_r$ are random and have a joint distribution. The best linear predictor can be obtained by partitioning the mean vector and covariance matrix, though I skipped most of this subchapter.
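For reference, the best linear predictor falls out of the partitioned mean and covariance as $\beta = \Sigma_{ZZ}^{-1}\sigma_{ZY}$ and $\beta_0 = \mu_Y - \beta^T\mu_Z$; a tiny sketch with a made-up joint mean and covariance matrix:

```python
import numpy as np

# Made-up joint mean vector and covariance matrix for (Y, Z1, Z2),
# partitioned into (mu_Y, mu_Z) and blocks sigma_YY, sigma_ZY, Sigma_ZZ
mu = np.array([5.0, 2.0, 1.0])
Sigma = np.array([[4.0, 1.5, 1.0],
                  [1.5, 2.0, 0.3],
                  [1.0, 0.3, 1.0]])

mu_Y, mu_Z = mu[0], mu[1:]
sigma_ZY = Sigma[1:, 0]
Sigma_ZZ = Sigma[1:, 1:]

beta = np.linalg.solve(Sigma_ZZ, sigma_ZY)   # beta = Sigma_ZZ^{-1} sigma_ZY
beta0 = mu_Y - beta @ mu_Z                   # intercept of the best linear predictor

z = np.array([2.5, 0.8])                     # best linear predictor of Y at Z = z
print(beta0 + beta @ z)
```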
Comparing Two Formulations of the Regression Model
This compares the joint-distribution formulation introduced just before with the earlier one in which fixed variables $z_j$ are used to predict one or several response variables. The two formulations reach exactly the same conclusions and have the same best linear predictors. Conceptually, however, they are quite different: in the fixed-input model the values of the input variables are assumed to be set by the experimenter, whereas in the conditional (joint) version the values of the predictor variables are random variables observed alongside the values of the response variables.
Multiple Regression Models with Time Dependent Errors
When data are collected over time, observations at different times are often related, or autocorrelated. Such time dependence can invalidate inferences made under the independence assumption. The solution here is to examine the autocorrelations and construct a regression model with a possibly dependent series of noise terms $N_j$, in the form of an autoregressive model.
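As a rough illustration rather than the book’s exact procedure, here is a Cochrane-Orcutt-style sketch for AR(1) noise: fit ordinary least squares, estimate the autoregressive coefficient from the lag-1 autocorrelation of the residuals, quasi-difference the data, and refit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Z = np.column_stack([np.ones(n), np.arange(n, dtype=float)])   # one trending predictor (made up)

# Simulate AR(1) noise: N_j = phi * N_{j-1} + eps_j
phi_true, N = 0.6, np.zeros(n)
for j in range(1, n):
    N[j] = phi_true * N[j - 1] + rng.normal()
y = Z @ np.array([1.0, 0.5]) + N

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: ordinary least squares, then estimate phi from the lag-1 residual autocorrelation
e = y - Z @ ols(Z, y)
phi_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 2: quasi-difference so the transformed noise is (approximately) uncorrelated, then refit
y_star = y[1:] - phi_hat * y[:-1]
Z_star = Z[1:] - phi_hat * Z[:-1]
print(phi_hat, ols(Z_star, y_star))
```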
tags: mathematics - statistics