The Normal Equations, represented in matrix form as
\[ (X^{T}X)\hat{\beta} = X^{T}y \]
are used to determine the coefficient estimates of a linear regression model. The matrix form is a compact representation of the model commonly specified as
\[ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} + \varepsilon \]
where \(\varepsilon\) represents the error term, assumed to have mean zero:
\[ E(\varepsilon_{i}) = 0. \]
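As a concrete illustration, here is a minimal Python/NumPy sketch that simulates data from this specification with \(k = 2\) explanatory variables; the sample size, coefficient values, and error variance are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 100, 2                            # n records, k explanatory variables
X_vars = rng.normal(size=(n, k))         # explanatory variables x1, x2
beta_true = np.array([1.0, 2.0, -0.5])   # illustrative beta_0, beta_1, beta_2
eps = rng.normal(0.0, 1.0, size=n)       # zero-mean error term

# y = beta_0 + beta_1*x1 + beta_2*x2 + eps
y = beta_true[0] + X_vars @ beta_true[1:] + eps
```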
For a dataset with \(n\) records and \(k\) explanatory variables per record, the components of the Normal Equations are:
- \(\hat{\beta} = (\hat{\beta}_{0},\hat{\beta}_{1},\cdots,\hat{\beta}_{k})^{T}\), a vector of \((k+1)\) coefficients (one for each of the \(k\) explanatory variables plus one for the intercept term)
- \(X\), an \(n \times (k+1)\) matrix of explanatory variables, with the first column consisting entirely of 1’s
- \(y = (y_{1}, y_{2}, \cdots, y_{n})^{T}\), the \(n\)-dimensional vector of responses
The task is to find the \((k+1)\) estimates \(\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots, \hat{\beta}_{k}\) that minimize
\[ \sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2. \]
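In code, this objective is simply the sum of squared residuals evaluated at a candidate coefficient vector. A minimal sketch, continuing the simulated data above (the candidate value is arbitrary):

```python
# Design matrix: a leading column of 1's for the intercept, then x1, ..., xk
X = np.column_stack([np.ones(n), X_vars])

def sum_squared_residuals(beta_hat):
    """Objective to be minimized: sum over i of (y_i - x_i^T beta_hat)^2."""
    resid = y - X @ beta_hat
    return resid @ resid

print(sum_squared_residuals(np.zeros(k + 1)))  # objective at an arbitrary candidate
```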
The Normal Equations can be derived via either Least-Squares or Maximum Likelihood Estimation.
Least-Squares Derivation
Unlike the Maximum Likelihood derivation, the Least-Squares approach requires no distributional assumption. We seek estimators \(\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots, \hat{\beta}_{k}\) that minimize the sum of squared deviations between the \(n\) observed responses and the predicted values \(\hat{y}\). The objective is to minimize
\[ \sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2. \]
Using matrix notation, our model can be represented as \(y = X\beta + \varepsilon\). Writing the residual vector as \(\hat{\varepsilon} = y - X\hat{\beta}\), the sum of squared residuals becomes
\[ \hat{\varepsilon}^{T}\hat{\varepsilon} = (y - X\hat{\beta})^{T}(y - X\hat{\beta}). \]
Expanding the right-hand side and combining terms results in
\[ \hat{\varepsilon}^{T}\hat{\varepsilon} = y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}. \]
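This expansion can be spot-checked numerically against the direct residual computation, reusing \(X\) and \(y\) from the sketches above (the candidate \(\hat{\beta}\) is arbitrary):

```python
b = np.array([0.5, 1.5, -1.0])  # arbitrary candidate beta_hat

direct = (y - X @ b) @ (y - X @ b)                  # (y - Xb)^T (y - Xb)
expanded = y @ y - 2 * y @ X @ b + b @ X.T @ X @ b  # expanded quadratic form
assert np.isclose(direct, expanded)
```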
To find the value of \(\hat{\beta}\) that minimizes \(\hat \varepsilon^T \hat \varepsilon\), we differentiate \(\hat \varepsilon^T \hat \varepsilon\) with respect to \(\hat{\beta}\), and set the result to zero:
\[ \frac{\partial \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0 \]
This recovers the Normal Equations, \((X^{T}X)\hat{\beta} = X^{T}y\), which, provided \(X^{T}X\) is invertible, can be solved for \(\hat{\beta}\):
\[ \hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y \]
Since \(\hat{\beta}\) minimizes the sum of squares, \(\hat{\beta}\) is called the Least-Squares Estimator.
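A minimal NumPy sketch of this estimator, reusing \(X\) and \(y\) from above. Rather than forming the explicit inverse, it is numerically preferable to solve the linear system \((X^{T}X)\hat{\beta} = X^{T}y\) directly, or to call a dedicated least-squares routine; both give the same result here.

```python
# Solve the Normal Equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# The derivative -2 X^T y + 2 X^T X beta_hat vanishes at the minimizer
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat
assert np.allclose(grad, 0.0)
```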
Maximum Likelihood Derivation
For the Maximum Likelihood derivation, \(X\), \(y\) and \(\hat{\beta}\) are the same as described in the Least-Squares derivation, and the model still follows the form
\[ y = X\beta + \varepsilon, \]
but now we assume the \(\varepsilon_{i}\) are independent and identically distributed (\(iid\)) normal random variables with mean zero and variance \(\sigma^{2}\):
\[ f(\varepsilon_{i}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{\varepsilon_{i}^{2}}{2\sigma^{2}}} = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(y_{i} - x_{i}^{T}\beta)^{2}}{2\sigma^{2}}}, \]
where \(x_{i}^{T}\) denotes the \(i\)-th row of \(X\). Equivalently, each response \(y_{i}\) follows a normal distribution with mean \(x_{i}^{T}\beta\) and variance \(\sigma^{2}\). For \(n\) independent observations, the likelihood function is
\[ L(\beta) = \Big(\frac{1}{\sqrt{2\pi\sigma^{2}}}\Big)^{n} e^{-(y-X\beta)^{T}(y-X\beta)/2\sigma^{2}}. \]
The Log-Likelihood is then
\[ \mathrm{Ln}(L(\beta)) = -\frac{n}{2}\mathrm{Ln}(2\pi) -\frac{n}{2}\mathrm{Ln}(\sigma^{2})-\frac{1}{2\sigma^{2}}(y-X\beta)^{T}(y-X\beta). \]
Taking derivatives with respect to \(\beta\) and setting the result equal to zero yields
\[ \frac{\partial\, \mathrm{Ln}(L(\beta))}{\partial \beta} = -\frac{1}{2\sigma^{2}}\left(-2X^{T}y + 2X^{T}X\beta\right) = 0. \]
Rearranging and solving for \(\beta\), we obtain
\[ \hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y, \]
which is the same result obtained via Least Squares.
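The equivalence can also be checked numerically by minimizing the negative log-likelihood with a general-purpose optimizer and comparing with the closed-form estimator computed above; the use of scipy.optimize and a fixed \(\sigma^{2} = 1\) are assumptions made purely for this illustration.

```python
from scipy.optimize import minimize

def neg_log_likelihood(beta, sigma2=1.0):
    """Negative Gaussian log-likelihood of beta (sigma^2 held fixed)."""
    resid = y - X @ beta
    return (n / 2) * np.log(2 * np.pi * sigma2) + (resid @ resid) / (2 * sigma2)

result = minimize(neg_log_likelihood, x0=np.zeros(k + 1), method="BFGS")

# The numerical maximum-likelihood estimate matches the Least-Squares estimator
assert np.allclose(result.x, beta_hat, atol=1e-4)
```

Because the log-likelihood is quadratic in \(\beta\) for fixed \(\sigma^{2}\), the optimizer converges to the same solution as the closed form.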