Linear regression [session notes]
Introduction
Some details about fitting a line through data points as the canonical example, but also the relationship to more complex problems that can be cast as linear regression. MvR referred to Numerical Recipes (Press et al., 1992) for really good theoretical background as well as practical advice; you can look online or google for a PDF copy of the relevant chapter 15.0 on Modelling of Data:
- fitting polynomials is also linear problem
- reality check: any function that is linear in the parameters is also ok, e.g. \(y(t; a, b, c) = a\cdot e^{-t} + b\cdot t + c\) (see the sketch below)
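A minimal sketch of the \(y(t) = a\cdot e^{-t} + b\cdot t + c\) example (not from the session; assumes NumPy, and the true parameter values and noise level are made up): because each parameter multiplies a known function of \(t\), the fit reduces to ordinary least squares on a design matrix whose columns are those functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y(t) = a*exp(-t) + b*t + c plus Gaussian noise
# (the "true" values below are purely illustrative).
a_true, b_true, c_true = 2.0, 0.5, 1.0
t = np.linspace(0.0, 5.0, 50)
y = a_true * np.exp(-t) + b_true * t + c_true + rng.normal(0.0, 0.1, t.size)

# The model is linear in (a, b, c): each parameter multiplies a known
# basis function of t, so those bases form the columns of the design matrix.
X = np.column_stack([np.exp(-t), t, np.ones_like(t)])

# Ordinary least squares on the design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of (a, b, c)
```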
Basic ideas
- Derivation from a Bayesian point of view, assuming a flat prior to yield the maximum likelihood solution.
Bayes formulation with data \(\mathcal{D}\) and parameters \(w\):
\[p(w | \mathcal{D}) = \frac{p(\mathcal{D} | w )p(w)}{p(\mathcal{D})} \]
but in the case of a flat prior on the parameters \(p(w)\), and given that \(p(\mathcal{D})\) is just a normalising constant, we can find the maximum of \(p(w | \mathcal{D})\) by maximising \(p(\mathcal{D} | w)\), also known as the likelihood.
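Spelled out, with a flat prior both \(p(w)\) and the constant \(p(\mathcal{D})\) drop out of the maximisation:
\[\hat{w} = \arg\max_w p(w | \mathcal{D}) = \arg\max_w \frac{p(\mathcal{D} | w)\, p(w)}{p(\mathcal{D})} = \arg\max_w p(\mathcal{D} | w)\]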
For iid Gaussian noise, the likelihood becomes
\[\mathcal{L} = \prod_i \Big\{ \exp\Big[-\frac{1}{2}\Big(\frac{y_i - y(x_i)}{\sigma} \Big)^2 \Big]\, \Delta y\Big\}\]
and maximising the likelihood is equivalent to minimising the negative log of the likelihood (numerically better behaved) \[-\log \mathcal{L} = \sum_i \frac{\Big(y_i -y(x_i)\Big)^2}{2\sigma^2} -N \log \Delta y\]
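A small numerical check of this equivalence (a sketch, not from the session; assumes NumPy and SciPy, with made-up straight-line data and a known \(\sigma\)): minimising the negative log-likelihood over the parameters gives the same answer as an ordinary least-squares fit.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Straight-line data with iid Gaussian noise of known sigma
# (slope, intercept and sigma below are illustrative).
sigma = 0.3
x = np.linspace(0.0, 1.0, 40)
y = 1.5 * x + 0.7 + rng.normal(0.0, sigma, x.size)

def neg_log_likelihood(params):
    """-log L up to the constant -N*log(dy): sum of squared residuals / (2 sigma^2)."""
    slope, intercept = params
    resid = y - (slope * x + intercept)
    return np.sum(resid**2) / (2.0 * sigma**2)

# Minimising -log L numerically...
mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x

# ...agrees with the ordinary least-squares fit.
ols = np.polyfit(x, y, deg=1)
print(mle, ols)  # both are (slope, intercept) estimates
```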
Relationship to \(\chi^2\) fitting
If the errors vary with each measurement point (rather than being a fixed, single \(\sigma\)), then these per-point errors can be included in the quantity to be minimised, which leads to the \(\chi^2\) statistic
\[\chi^2 = \sum_i \Big( \frac{y_i-y(x_i; a_1, \dots, a_M )}{\sigma_i} \Big)^2 \]
where \((x_i, y_i)\) are data points with an associated error \(\sigma_i\). For the Gaussian case, the \(\chi^2\) value of a moderately good fit is on the order of the degrees of freedom \(\nu = N-M\) (number of measurement points minus number of parameters).
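A sketch of such a weighted fit (not from the session; assumes NumPy, with made-up per-point errors): dividing each row of the design matrix and each \(y_i\) by \(\sigma_i\) turns minimising \(\chi^2\) into an ordinary least-squares problem, and the resulting \(\chi^2\) should come out on the order of \(\nu = N - M\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Straight-line data where every point has its own error bar sigma_i
# (model and noise levels are illustrative).
N, M = 60, 2
x = np.linspace(0.0, 10.0, N)
sigma = rng.uniform(0.2, 1.0, N)             # per-point measurement errors
y = 0.8 * x + 2.0 + rng.normal(0.0, sigma)   # heteroscedastic Gaussian noise

# Weight each row by 1/sigma_i: chi^2 becomes an ordinary sum of squares.
X = np.column_stack([x, np.ones_like(x)])
Xw = X / sigma[:, None]
yw = y / sigma

beta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

resid = (y - X @ beta_hat) / sigma
chi2 = np.sum(resid**2)
nu = N - M
print(chi2, nu)  # a moderately good fit has chi^2 on the order of nu
```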
Linear algebra picture
Ideas
Consider the data \(\mathbf{y}\) as a vector in some space, and the design matrix \(\mathbf{X}\) with its associated column space.
\(\mathbf{y}\) is usually not in the column space of \(\mathbf{X}\) (e.g. a set of many \(y_i\) values measured at \(x_i\) is unlikely to fall onto a line, which is parameterised by only two values).
But we can find an \(\mathbf{X}\mathbf{\hat{\beta}}\) such that the distance to the data \(\mathbf{y}\) is smallest. This error \(\mathbf{e}\) is orthogonal to the space defined by \(\mathbf{X}\), so the dot product of \(\mathbf{e}\) with each column of \(\mathbf{X}\) must be \(0\).
This leads to: \[ \begin{eqnarray} \mathbf{e}^T\mathbf{X} &=& 0 \\ (\mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}})^T\mathbf{X} &=& 0 \\ \mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}}) &=& 0 \\ \mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} &=& \mathbf{X}^T\mathbf{y} \\ \mathbf{\hat{\beta}} &=& (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\ \end{eqnarray} \]
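A quick numerical check of the final expression (a sketch; assumes NumPy, with a made-up straight-line design matrix). It solves the normal equations \(\mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} = \mathbf{X}^T\mathbf{y}\) directly rather than forming \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly, and verifies that the residual is orthogonal to the columns of \(\mathbf{X}\).

```python
import numpy as np

rng = np.random.default_rng(3)

# A small design matrix (straight-line model) and noisy data
# (sizes and values are illustrative).
x = np.linspace(0.0, 1.0, 30)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.1, x.size)

# Normal equations: X^T X beta_hat = X^T y
# (solve the linear system instead of inverting X^T X).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual e = y - X beta_hat is orthogonal to every column of X.
e = y - X @ beta_hat
print(X.T @ e)   # ~ [0, 0] up to floating-point error

# Same answer from the library least-squares routine.
print(np.linalg.lstsq(X, y, rcond=None)[0], beta_hat)
```

In practice an SVD- or QR-based solver such as np.linalg.lstsq is preferred over the normal equations when \(\mathbf{X}^T\mathbf{X}\) is ill-conditioned.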
See also “The Linear Algebra Behind Linear Regression” (2020) and Geer (2019)