Linear Regression (LR) Models

  • is a type of continuous regression model whose function/estimator is linear with respect to the regression coefficients {𝜃0, ..., 𝜃𝑝}:
  • 𝑦̂ = 𝜃0 + 𝜃1𝑓1(𝒙) + ... + 𝜃𝑝𝑓𝑝(𝒙)
  • models the mean/expected response as a function of the regressors (where 𝑓𝑖(..) are feature functions):
    • 𝐄[𝑌|𝑋1=𝑥1, ..., 𝑋𝑘=𝑥𝑘] = ℎ(𝑥1, ..., 𝑥𝑘) = 𝑦̂ = 𝜃0 + 𝜃1𝑓1(𝑥1, ..., 𝑥𝑘) + ... + 𝜃𝑝𝑓𝑝(𝑥1, ..., 𝑥𝑘)
    • coefficient 𝜃0 represents the 𝑦-intercept, i.e. the predicted response when all feature functions 𝑓𝑖(..) evaluate to 0
    • coefficient 𝜃𝑖 represents the mean change in the dependent variable 𝑦 given a 1-unit change in the feature function 𝑓𝑖(𝑥1, ..., 𝑥𝑘) # for 1≤𝑖≤𝑝
  • is a type of level-level model (or even a level-log model when 𝑓𝑖(..) are log functions)
  • the dependent variable 𝑦 is the combination of the regression model and error
    • 𝑦 = 𝑦̂ + 𝑒
    • dependent variable = (constant + independent variables) + error
    • dependent variable = deterministic + stochastic
    • the deterministic component is the portion of the variation in the dependent variable that the independent variables explain. In other words, the mean of the dependent variable is a function of the independent variables. In a regression model, all of the explanatory power should reside here
    • the error is the difference between the observed value 𝑦 and the expected value 𝑦̂. The gap between the expected and observed values must not be predictable; in other words, no explanatory power should be left in the error. If you can use the error to make predictions about the response, your model has a problem. This is where residual plots play a role (see the sketch after this list)
    • ideally, the deterministic component of a regression model explains the dependent variable so well that only the intrinsically inexplicable portion of your study area is left for the error. If you can identify non-randomness in the error term, your independent variables are not explaining everything they could
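A minimal NumPy sketch of this decomposition, with made-up data: the deterministic component is recovered by a least-squares fit, and the residuals should carry no remaining structure (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate y = 2 + 3*x + e, where e is pure noise (the stochastic component)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)

# fit the deterministic component y_hat = theta0 + theta1*x by least squares
theta1, theta0 = np.polyfit(x, y, deg=1)
y_hat = theta0 + theta1 * x

# the error e = y - y_hat should look like unpredictable noise; any visible
# structure in a residual plot means explanatory power leaked into the error
residuals = y - y_hat
print(f"theta0 ≈ {theta0:.2f}, theta1 ≈ {theta1:.2f}, residual mean ≈ {residuals.mean():.3f}")
```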

LR - Steps

given sample/training data:

  • (𝑦1, 𝑥11, ..., 𝑥1𝑘) # sample 1
  • (𝑦2, 𝑥21, ..., 𝑥2𝑘) # sample 2
  • ...
  • (𝑦𝑛, 𝑥𝑛1, ..., 𝑥𝑛𝑘) # sample 𝑛

the task of Linear Regression:

  • choose line equation form, such as:
    • 𝐄[𝑌|𝑋1=𝑥1] = 𝑦̂ = ℎ(𝑥1) = 𝜃0 + 𝜃1𝑥1 # univariate linear regression
    • 𝐄[𝑌|𝑋1=𝑥1, 𝑋2=𝑥2] = 𝑦̂ = ℎ(𝑥1,𝑥2) = 𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2 # multivariate linear regression
    • 𝐄[𝑌|𝑋1=𝑥1, 𝑋2=𝑥2] = 𝑦̂ = ℎ(𝑥1,𝑥2) = 𝜃0 + 𝜃1𝑥1𝑥2 + 𝜃2𝑥1² + 𝜃3𝑥2 # multiple linear regression with interaction and polynomial terms
  • where:
    • 𝐄[𝑌|..] and 𝑦̂ and ℎ(..) - scalar response/dependent variable or hypothesis function conditional on 𝑥𝑖's
    • 𝑥𝑖 - regressors or explanatory/predictor/covariate/independent variables
    • 𝜃𝑖 - regression coefficients/weights
  • estimate/find the values of the regression coefficients 𝜃𝑖 which best fit the line equation to the data
  • determine whether it's a good fit (see the sketch after this list)
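The three steps above, as a minimal NumPy sketch on made-up data (the chosen form, coefficient values, and the R² check are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# sample/training data: n = 50 rows of (y_i, x_i1, x_i2), made up for illustration
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=50)

# step 1: choose the line equation form: y_hat = theta0 + theta1*x1 + theta2*x2
A = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept column

# step 2: estimate the coefficients theta_i that best fit the data (least squares)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# step 3: determine whether it is a good fit (here via R^2)
y_hat = A @ theta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("theta:", theta.round(2), "R^2:", round(r2, 4))
```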

LR - Types

Univariate Linear Regression

  • model form: 𝐄[𝑌|𝑋1=𝑥1] = ℎ(𝑥1) = 𝑦̂ = 𝜃0 + 𝜃1𝑓1(𝑥1)
  • example models:
    • 𝜃0 + 𝜃1𝑥1
    • 𝜃0 + 𝜃1𝑥1²

Multivariate Linear Regression

  • model form: 𝐄[𝑌|𝑋1=𝑥1, ..., 𝑋𝑘=𝑥𝑘] = ℎ(𝑥1, ..., 𝑥𝑘) = 𝑦̂ =:
    • 𝜃0 + 𝜃1𝑓1(𝑥1) + ... + 𝜃𝑘𝑓𝑘(𝑥𝑘)
    • 𝜃0 + 𝜃1𝑓1(𝑥1, ..., 𝑥𝑘) + ... + 𝜃𝑘𝑓𝑘(𝑥1, ..., 𝑥𝑘)
  • example models (see the sketch after this table):
    • 𝜃0 + 𝜃1𝑥1 + ... + 𝜃𝑘𝑥𝑘
    • 𝜃0 + 𝜃1𝑥1³ + ... + 𝜃𝑘 sin(𝑥𝑘)
    • 𝜃0 + 𝜃1𝑥1𝑥3 + ... + 𝜃𝑘𝑥4⁶𝑥𝑘
    • 𝜃0 + 𝜃1𝑥1𝑥𝑘-2𝑥𝑘 + ... + 𝜃𝑘𝑥𝑘
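What makes all of these *linear* regressions is linearity in the 𝜃𝑖, not in the 𝑥𝑖. A short NumPy sketch of building design matrices for two of the example models above (data is made up for illustration):

```python
import numpy as np

x1 = np.linspace(0, 1, 20)
x2 = np.linspace(1, 2, 20)

# univariate with a polynomial feature: y_hat = theta0 + theta1 * x1^2
A_uni = np.column_stack([np.ones_like(x1), x1 ** 2])

# multivariate with nonlinear feature functions:
# y_hat = theta0 + theta1 * x1^3 + theta2 * sin(x2)
A_multi = np.column_stack([np.ones_like(x1), x1 ** 3, np.sin(x2)])

# however nonlinear the feature functions f_i are, y_hat stays linear in the
# thetas, so ordinary least-squares machinery applies to both matrices
print(A_uni.shape, A_multi.shape)   # (20, 2) (20, 3)
```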

LR - Methods for Estimating Coefficients (𝜃𝑖)

Methods for estimating the unknown coefficients {𝜃0, ..., 𝜃𝑘} of 𝐄[𝑌|𝑋1=𝑥1, ..., 𝑋𝑘=𝑥𝑘] = ℎ(𝑥1, ..., 𝑥𝑘) = 𝑦̂ = 𝜃0 + 𝜃1𝑓1(𝑥1, ..., 𝑥𝑘) + ... + 𝜃𝑘𝑓𝑘(𝑥1, ..., 𝑥𝑘)

Method of Least Squares
(Gradient Descent)

  • idea: minimize squared error via GRADIENT DESCENT (see the sketch after this table)
  • need to choose learning rate 𝛼
  • need many iterations
  • works well even when the number of features is large

Method of Least Squares
(Projection Matrix - Normal Equation)

  • idea: minimize squared error via the NORMAL EQUATION (see the sketch after this table)
  • no need to choose learning rate 𝛼
  • no need to iterate
  • need to compute (𝑋ᵀ𝑋)⁻¹𝑋ᵀ or 𝑉𝐷⁻¹𝑈ᵀ
  • slow if the number of features is large because computing the inverse of the 𝑛×𝑛 matrix 𝑋ᵀ𝑋 is 𝑂(𝑛³)
Maximum Likelihood Estimation

  • idea: maximize the likelihood of the observed data; under Gaussian errors this is equivalent to least squares

MAP (Bayesian Linear Regression)

Newton-Raphson (N-R) Technique

  • idea: TODO
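A minimal NumPy sketch contrasting the two least-squares methods from this table (learning rate, iteration count, and data are illustrative assumptions; both should recover roughly the same 𝜃):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
A = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept column
y = A @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=200)

# --- least squares via gradient descent ---
alpha, iters = 0.1, 2000                      # learning rate and iteration count
theta_gd = np.zeros(A.shape[1])
m = len(y)
for _ in range(iters):
    grad = (A.T @ (A @ theta_gd - y)) / m     # gradient of the mean squared error / 2
    theta_gd -= alpha * grad

# --- least squares via the normal equation ---
# theta = (X^T X)^(-1) X^T y; np.linalg.solve is preferred over an explicit inverse
theta_ne = np.linalg.solve(A.T @ A, A.T @ y)

print("gradient descent:", theta_gd.round(3))
print("normal equation: ", theta_ne.round(3))
```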

LR - Model Types

Linear Regression Models - take an input vector 𝑥∊ℝⁿ and predict the value of a scalar 𝑦∊ℝ as output (the function/estimator is linear wrt the regression coefficients {𝜃0, ..., 𝜃𝑝})

Ordinary Least Squares Regression

  • has several weaknesses, including sensitivity to both outliers and multicollinearity, and it is prone to overfitting

Stepwise Regression
Best Subsets Regression

Robust Regression

Ridge Regression

  • addresses multicollinearity (see the sketch after this table)
  • allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting. This type of model reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates. The procedure trades away much of the variance in exchange for a little bias, which produces more useful coefficient estimates when multicollinearity is present

Lasso Regression
(Least Absolute Shrinkage and Selection Operator)

  • performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression, but its L1 penalty can shrink coefficients exactly to zero, which is what performs the variable selection
Elastic Net Regression
  • combines the Ridge (L2) and LASSO (L1) regularizers
Partial Least Squares (PLS) Regression
  • is useful when you have very few observations compared to the number of independent variables or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Component Analysis. Then, the procedure performs linear regression on these components rather than the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous dependent variables. PLS uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables
Beta Regression
  • models variables within (0, 1) range
Dirichlet Regression
  • models compositional data
Loess Regression
  • smooths time series data
Isotonic Regression
  • for approximation of data that can only increase (typically cumulative data)
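A minimal scikit-learn sketch comparing OLS with the regularized models above on deliberately collinear made-up data (the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + rng.normal(0, 0.01, size=100)     # severe multicollinearity
y = X[:, 0] + 2 * X[:, 2] + rng.normal(0, 0.1, size=100)

models = {
    "OLS": LinearRegression(),                         # unstable under collinearity
    "Ridge": Ridge(alpha=1.0),                         # L2 penalty shrinks coefficients
    "Lasso": Lasso(alpha=0.1),                         # L1 penalty zeroes some of them
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5), # mix of L1 and L2 penalties
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name:>10}: {np.round(model.coef_, 2)}")
```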

LR - Methods for Determining How Well The Fitted Line Describes the Data

LR - Methods for Diagnosing Bias/Variance

LR - Subpages

LR - Resources