
## Linear regression
Linear regression is a method of data analysis intended to be used with a set of paired observations on two variables on the same set of statistical units. Conventionally, we refer to one of the variables as independent (usually labeled $X$) and the other as dependent (labeled $Y$).
The model assumed is
$$y_i = \alpha + \beta x_i + \varepsilon_i,$$
where $\varepsilon_i$ is a random "error". Historically, in applications to measurements in astronomy, the "error" was actually a random measurement error, but in many applications, $\varepsilon_i$ is merely the amount by which the individual $y$-value differs from the average $y$-value among individuals having the same $x$-value. The average value of the random "error" is zero. Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:
- The random errors have expected value 0.
- The random errors are uncorrelated (this is weaker than an assumption of probabilistic independence).
- The random errors are "homoscedastic", i.e., they all have the same variance.
Sometimes stronger assumptions are relied on:
- The random errors have expected value 0.
- They are independent.
- They are normally distributed.
- They all have the same variance.
It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of $y = \alpha + \beta x$ is a line. But in fact, if the model is, for example,
$$y_i = \alpha + \beta x_i + \gamma x_i^2 + \varepsilon_i,$$
then the problem is still one of linear regression, even though the graph is not a straight line. The rationale for this terminology will be explained below.
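A statistician will usually estimate the unobservable parameters $\alpha$ and $\beta$ by the method of least squares, that is, by finding the values $a$ and $b$ that minimize the sum of squares of the residuals $e_i = y_i - (a + b x_i)$. Notice that, whereas the errors are independent, the residuals cannot be independent because the use of least-squares estimates implies that the sum of the residuals must be 0, and the dot-product of the vector of residuals with the vector of $x$-values must be 0, i.e., we must have
$$e_1 + \cdots + e_n = 0 \quad\text{and}\quad e_1 x_1 + \cdots + e_n x_n = 0.$$
These two linear constraints confine the vector of residuals to an $(n-2)$-dimensional subspace, which is why one speaks of $n - 2$ degrees of freedom. These facts make it possible to use Student's t-distribution with $n - 2$ degrees of freedom (so named in honor of the pseudonymous "Student") to find confidence intervals for $\alpha$ and $\beta$.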
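
The two constraints on the residuals can be checked numerically. The following is a minimal Python sketch, assuming made-up data and a hypothetical helper named `least_squares_line` (neither comes from the original text). It fits a line by least squares and confirms that the residuals sum to zero and have zero dot-product with the x-values, up to floating-point rounding.

```python
def least_squares_line(xs, ys):
    """Return (a, b) minimizing sum((y - (a + b*x))**2). Hypothetical helper."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: centered cross-products divided by centered squares of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = y_bar - b * x_bar
    return a, b

# Illustrative data, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

sum_of_residuals = sum(residuals)
dot_with_x = sum(e * x for e, x in zip(residuals, xs))

print(a, b)
print(sum_of_residuals, dot_with_x)  # both are zero up to rounding error
```

Because of these two constraints, knowing any $n - 2$ of the residuals (together with the x-values) determines the remaining two.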
Denote by capital $Y$ the column vector containing $y_i$ as its $i$th entry, and by capital $X$ the $n \times 2$ matrix whose second column contains $x_i$ as its $i$th entry, and whose first column contains $n$ 1s. Let $\varepsilon$ be the column vector containing the errors $\varepsilon_i$. Let $\delta$ and $d$ be respectively the $2 \times 1$ column vector containing $\alpha$ and $\beta$ and the $2 \times 1$ column vector containing the estimates $a$ and $b$. Then the model can be written as
$$Y = X\delta + \varepsilon,$$
where $\varepsilon$ has expected value 0 (i.e., a column vector of 0s) and variance $\sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix. The matrix $Xd$ (where, remember, $d$ is the vector of estimates) is then the orthogonal projection of $Y$ onto the column space of $X$.
Then it can be shown that
$$d = (X'X)^{-1} X'Y$$
(where $X'$ is the transpose of $X$) and the sum of squares of residuals is
$$Y'\left(I_n - X(X'X)^{-1}X'\right)Y.$$
The fact that the matrix $X(X'X)^{-1}X'$ is a symmetric idempotent matrix is incessantly relied on both in computations and in proofs of theorems. The linearity of $d$ as a function of the vector $Y$, expressed above by saying $d = (X'X)^{-1}X'Y$, is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.
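
The matrix formula $d = (X'X)^{-1}X'Y$ can be sketched in plain Python for the two-column matrix $X$ described above. This is an illustrative sketch only: the data are made up and the function name `solve_normal_equations` is hypothetical. Because $X'X$ is $2 \times 2$ here, its inverse is written out explicitly rather than computed by a library.

```python
def solve_normal_equations(xs, ys):
    """Return d = (X'X)^{-1} X'Y for X with a column of 1s and a column of xs."""
    n = len(xs)
    # Entries of the 2x2 matrix X'X = [[n, sum x], [sum x, sum x^2]].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # Entries of the 2x1 vector X'Y = [sum y, sum x*y].
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Explicit inverse of a 2x2 matrix; det is zero if all xs are equal.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular: the x-values must not all be equal")
    inv = [[sxx / det, -sx / det],
           [-sx / det, n / det]]
    a = inv[0][0] * sy + inv[0][1] * sxy
    b = inv[1][0] * sy + inv[1][1] * sxy
    return [a, b]

# Illustrative data, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [2.0, 4.0, 5.0, 4.0, 5.0]
y2 = [1.0, 0.0, 2.0, 1.0, 3.0]

d1 = solve_normal_equations(xs, y1)
d2 = solve_normal_equations(xs, y2)
d_sum = solve_normal_equations(xs, [u + v for u, v in zip(y1, y2)])

# Linearity in Y: d(y1 + y2) equals d(y1) + d(y2) up to rounding.
linearity_gap = max(abs(d_sum[k] - (d1[k] + d2[k])) for k in range(2))
print(d1, linearity_gap)
```

The last lines illustrate the sense of "linear" used in the text: the estimates are a linear function of the observed vector, whatever the shape of the fitted curve.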
The matrix $I_n - X(X'X)^{-1}X'$ is a symmetric idempotent matrix of rank $n - 2$. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix $M$ can be diagonalized by an orthogonal matrix $G$, i.e., the matrix $G'MG$ is a diagonal matrix. If the matrix $M$ is also idempotent, then the diagonal entries in $G'MG$ must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So $I_n - X(X'X)^{-1}X'$, after diagonalization, has $n - 2$ 1s and two 0s on the diagonal, in agreement with its rank. That is most of the work in showing that the sum of squares of residuals, divided by $\sigma^2$, has a chi-square distribution with $n - 2$ degrees of freedom.
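
The properties claimed for $I_n - X(X'X)^{-1}X'$ can be verified on a small example. The sketch below is an assumption-laden illustration (made-up x-values, hypothetical helper names `hat_matrix` and `matmul`): it builds the matrix, then checks symmetry, idempotency, and that the trace is $n - 2$. For a symmetric idempotent matrix the trace equals the number of 1s on the diagonal after diagonalization.

```python
def hat_matrix(xs):
    """Return H = X (X'X)^{-1} X' for X = [1, x], as a list of rows."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx  # determinant of X'X
    # (X'X)^{-1} = (1/det) * [[sxx, -sx], [-sx, n]], so
    # H[i][j] = [1, x_i] (X'X)^{-1} [1, x_j]'.
    return [[(sxx - sx * (xi + xj) + n * xi * xj) / det for xj in xs]
            for xi in xs]

def matmul(p, q):
    """Multiply two square matrices given as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up x-values
n = len(xs)
H = hat_matrix(xs)
# M = I_n - H, the matrix in the quadratic form for the residual sum of squares.
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]

MM = matmul(M, M)
max_idempotency_error = max(abs(MM[i][j] - M[i][j])
                            for i in range(n) for j in range(n))
max_asymmetry = max(abs(M[i][j] - M[j][i])
                    for i in range(n) for j in range(n))
trace_M = sum(M[i][i] for i in range(n))

print(max_idempotency_error, max_asymmetry)  # both near zero
print(trace_M)  # close to n - 2
```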
*Note: A useful alternative to least-squares linear regression is robust regression, in which (in one common form) the mean absolute error is minimized instead of the mean squared error. Robust regression is computationally much more intensive than least-squares regression and is somewhat more difficult to implement.*
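
To make the contrast in the note concrete, here is a small sketch of fitting a line by minimizing absolute rather than squared error. It is not an efficient or standard implementation: it uses brute force over pairs of points, relying on the fact that some line minimizing the sum of absolute errors passes through at least two data points. The function name `lad_line` and the data are assumptions made for illustration.

```python
from itertools import combinations

def lad_line(xs, ys):
    """Least-absolute-deviations line by brute force over pairs of points.

    Relies on the fact that some optimal line passes through at least two
    data points. Cost is O(n^3), so this suits small samples only.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, not of the form y = a + b*x
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        cost = sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    if best is None:
        raise ValueError("need at least two distinct x-values")
    return best[1], best[2]

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 100.0]  # last point is a deliberate outlier

a_lad, b_lad = lad_line(xs, ys)
print(a_lad, b_lad)
```

In this made-up data set five of the six points lie exactly on one line, and the absolute-error fit follows those five points instead of being pulled toward the outlier. The cubic cost of the brute-force search also hints at why such methods are more expensive than least squares.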
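## Summarizing the data
We sum the observations, their squares, and their cross-products to obtain the following quantities:
$$S_X = x_1 + x_2 + \cdots + x_n$$
and $S_Y$ similarly;
$$S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2$$
and $S_{YY}$ similarly; and
$$S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n.$$
## Estimating beta
We use the summary statistics above to calculate $b$, the estimate of $\beta$:
$$b = \frac{n S_{XY} - S_X S_Y}{n S_{XX} - S_X^2}.$$
## Estimating alpha
We use the estimate of beta and the other statistics to estimate alpha by:
$$a = \frac{S_Y - b\,S_X}{n}.$$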
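
The calculation from summary statistics can be sketched as follows. This is a minimal illustration, assuming the raw sums are named `s_x`, `s_y`, `s_xx`, and `s_xy` (sums of the x-values, the y-values, the squared x-values, and the products); the function name `fit_from_sums` and the data are made up.

```python
def fit_from_sums(xs, ys):
    """Estimate (a, b) from the raw sums of x, y, x*x and x*y."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    # Estimate of beta (slope); the denominator is zero if all xs are equal.
    b = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)
    # Estimate of alpha (intercept), using the slope just computed.
    a = (s_y - b * s_x) / n
    return a, b

# Illustrative data, made up for this example.
a, b = fit_from_sums([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print(a, b)
```

One caution on this design: formulas built from raw sums subtract two large, nearly equal numbers when the x-values are large relative to their spread, which can lose precision in floating point. Centering the data first is a common way to reduce that effect.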
## Displaying the residuals
The first method of displaying the residuals uses a histogram or cumulative distribution to depict the similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.
We plot the residuals, $e_i = y_i - a - b x_i$, against the independent variable, $X$. Some patterns to look for:
- Residuals increase (or decrease) as the independent variable increases -- indicates mistakes in the calculations, since correctly computed least-squares residuals have zero dot-product with the $x$-values -- find the mistakes and correct them.
- Residuals first rise and then fall (or first fall and then rise) -- indicates that the appropriate model is (at least) quadratic. See polynomial regression.
- One residual is much larger than the others and opposite in sign -- suggests that there is one unusual observation which is distorting the fit. Either:
  - Verify its value before publishing, *or*
  - Eliminate it, document your decision to do so, and recalculate the statistics.
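
The last pattern in the list can be looked for numerically as well as by eye. The sketch below is a rough illustration, not a rule from the text: it flags residuals whose absolute value exceeds `k` times the residual standard deviation, where the cutoff `k = 2.0`, the function name `flag_large_residuals`, and the data are all assumptions chosen for the example.

```python
import math

def flag_large_residuals(xs, ys, k=2.0):
    """Return (residuals, indices whose |residual| exceeds k * s).

    s is the residual standard deviation sqrt(SSE / (n - 2)). The cutoff k
    is an arbitrary illustrative choice, not a rule from the text.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    flagged = [i for i, e in enumerate(residuals) if abs(e) > k * s]
    return residuals, flagged

xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[5] += 20.0  # one deliberately unusual observation

residuals, flagged = flag_large_residuals(xs, ys)
print(flagged)
print([round(e, 2) for e in residuals])
```

In this constructed example the unusual observation has a large positive residual while every other residual is small and negative, which is the "much larger than the others and opposite in sign" signature described above. A flagged point is only a candidate for checking, not proof of an error.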
## Ancillary statistics
The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the dependent variable is explained by the independent variable.
The squared correlation coefficient, $r^2$, is frequently interpreted as the fraction of the variability explained by the independent variable, $X$.
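
The partition and the "fraction explained" interpretation can be illustrated with a short sketch. The names `anova_partition`, `ss_total`, `ss_regression`, and `ss_error` are assumptions introduced here, and the data are made up. For a least-squares line with an intercept, the total sum of squares about the mean splits into a part explained by the fitted line and a residual part, and $r^2$ is the explained share.

```python
def anova_partition(xs, ys):
    """Return (ss_total, ss_regression, ss_error, r_squared) for a fitted line.

    Assumes the xs are not all equal and the ys are not all equal.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    # Total dispersion of y about its mean.
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    # Part accounted for by the fitted line.
    ss_regression = sum((f - y_bar) ** 2 for f in fitted)
    # Part left in the residuals.
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    r_squared = ss_regression / ss_total
    return ss_total, ss_regression, ss_error, r_squared

# Illustrative data, made up for this example.
ss_total, ss_regression, ss_error, r_squared = anova_partition(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print(ss_total, ss_regression, ss_error, r_squared)
```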

copyright © 2004 FactsAbout.com