Chapter 1 Linear regression


1.1 Introduction

1.1.1 Machine Learning

Machine Learning user view

Implementing a Machine Learning technique in a real-world engineering setting usually (at least) implies:

  1. Realizing the application has uncertainty and cannot be solved by deterministic methods. (Engineering)
  2. Determining if there are enough data to use a data-driven algorithm. (Engineering -> Statistics)
  3. Modelling the application as a standard problem. (Engineering -> Mathematics)
  4. Choosing a particular solution algorithm. (Mathematics -> Computing)
  5. Adjusting the algorithm by setting some hyperparameters. (Mathematics -> Heuristics)
  6. Assessing and testing results. (Computing -> Engineering)

Real-world problems in Machine Learning

Applying Machine Learning techniques is useful for tasks that contain uncertainty:

Standard tasks in Machine Learning

  • Supervised learning
    • Regression
    • Classification
  • Unsupervised learning
    • Clustering
  • Time series prediction

1.1.2 The R language and the RStudio environment

  • These notes are written in R Markdown, which must also be used for the assignment exercises
  • The R language has support for:
    • Vectors
    • Matrices
    • Data frames (a subtype of list)
    • Graphics
    • User-defined functions
    • Control flow structures (if, for, while, …)
  • There are many packages that extend R capabilities for specialized tasks
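  • A minimal sketch illustrating these base constructs (the object names are just for illustration):
# Vectors and matrices
v <- c(1, 2, 3)
M <- matrix(1:6, nrow=2)
# Data frames
df <- data.frame(x=v, y=v^2)
# User-defined functions
square <- function(x) x^2
# Control flow
for (i in v) {
  if (i > 1) print(square(i))
}
# Graphics
plot(df$x, df$y)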

1.1.3 Data Sources

  • R has a lot of built-in example datasets:
data()

1.1.4 Math recall: Probabilities

  • A probability is a function defined on events \(A_1, A_2,\dots\) of a sample space
  • Each event \(A_i\) has a probability \(p_i\): \(P(A_i) = p_i\)
  • Probabilities are between 0 and 1: \(0\le p_i\le 1\)
  • All probabilities add up to 1: \(\sum_i p_i = 1\)
  • A key concept is independence: \(P(A \text{ and } B) = P(A)\, P(B)\)
  • Conditional probability: \(P(A|B)=\frac{P(A \text{ and } B)}{P(B)}\)
  • Independence can also be written as \(P(A|B)=P(A)\) (illustrated by simulation below)
  • Correlation measures dependence, but only linear dependence (even a perfect nonlinear relation gives \(|\rho|<1\)):
data("mtcars")
mtcars$l100km <- 235.215 / mtcars$mpg
with(mtcars, cor(mpg, l100km))
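  • As a quick check of independence by simulation (a sketch with two simulated dice; names are illustrative only), \(P(A)\) and \(P(A|B)\) should be almost equal:
# Simulate two independent dice
set.seed(1)
n <- 100000
die1 <- sample(1:6, n, replace=TRUE)
die2 <- sample(1:6, n, replace=TRUE)
# P(A) with A = "die1 shows 6"
p_a <- mean(die1 == 6)
# P(A | B) with B = "die2 shows 6"
p_a_given_b <- mean(die1[die2 == 6] == 6)
print(c(p_a, p_a_given_b))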

1.1.5 Exploratory Data Analysis and Descriptive Statistics

Before any learning task, we must have an overall idea of our data: this is descriptive statistics

data("cars")
head(cars)
summary(cars)
plot(cars$speed, cars$dist)
?mtcars
# Visualize
plot(mtcars$wt, mtcars$mpg)
hist(mtcars$cyl)
# Transform
mtcars$l100km <- 235.215 / mtcars$mpg
with(mtcars, plot(wt, l100km))  # plot(mtcars$wt, mtcars$l100km)
mtcars[mtcars$cyl == 4, ]  # rows with 4 cylinders (cyl is numeric)
mtcars[, 1:3]
# Exploratory Data Analysis
summary(mtcars)
mean(mtcars$l100km)
with(mtcars, cor(wt, l100km))
with(mtcars, cor(am, l100km))   # With qualitative variables????
any(is.na(mtcars))
# Basic methods are linear
print(mean(mtcars$mpg))
print(235.215 / mean(mtcars$mpg))
print(mean(mtcars$l100km))
with(mtcars, cor(mpg, l100km))

1.1.6 Stochastic simulation

  • We can also generate our own data
  • This can simulate some physical phenomenon, e.g. noise
# number of samples
n = 20
# Voltage (volts)
voltage <- seq(1, length.out=n)
# Resistance (ohms)
resistance <- 50
# "Measure" current
current <- voltage / resistance
# Add noise
noise <- runif(n, min=0.9, 1.1)
current <- current * noise
# Build a data frame
measures = data.frame(voltage, current)
with(measures, plot(voltage, current))

1.1.7 Exercises

  1. Download some of the examples from Kaggle or UCI
  2. Describe each of the variables
  3. Check the existence of categorical (qualitative) variables (factors)
  4. Check the existence of missing values (NA)
  5. Plot some of the numeric values
  6. Plot the histogram of some of the categorical variables
  7. Compute the average and variance of the first 1000 natural numbers

1.2 Linear regression

  • A linear regression is a statistical model
  • The model is a relation between two variables, \(x\), \(y\): \(y=F(x)\)
  • Variables are not symmetric:
    • One variable (\(x\)) is observed: the explanatory variable.
    • The other variable (\(y\)) is unknown at prediction time: the response.
  • The aim is to predict \(y\) from \(x\)
  • Both \(x\) and \(y\) are assumed to come from a probability distribution

1.2.1 Regression models

  1. Define a family of models
  2. Generate a fitted model
data("cars")
modelo <- lm(dist ~ speed, data=cars)
summary(modelo)
  1. The R command lm makes a linear model:

\[\text{dist ~ speed} \Rightarrow \qquad \text{dist}= \theta_0+\theta_1\, \text{speed}\]

  2. Parameters: \(\theta_0,\theta_1\)
  3. Fitting is computing the optimal value of parameters
    • \(\theta_0\) is the intercept
    • \(\theta_1\) is the coefficient of the variable speed
  4. Analyze the residuals
    • Prediction \(\hat{y_i}=\theta_0+\theta_1\, x_i\)
    • Error or residual: \(E_i=y_i-\hat{y_i}\)
data("cars")
modelo <- lm(dist ~ speed, data=cars)
with(cars, plot(speed, dist))
# Prediction: y_hat = modelo$coefficients[1] + modelo$coefficients[2] * cars$speed
y_hat = predict(modelo)
lines(cars$speed, y_hat, col="red")
plot(modelo$residuals)

1.2.2 Math recall: Regression and probabilities

This field is often called Statistical Learning

  • The key point of dependence is that it provides information
  • Regression is based on \(x\) and \(y\) being dependent
  • Data \(x\) is a sample from a probability distribution \(X\) (unknown)
  • Regression estimates the mean of \(P(Y|X=x)\), as illustrated in the simulation below
  • Correlation does not imply causation!
  • The algorithm is not to blame if the data are biased.
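  • A small simulation (a sketch; the coefficients and noise level are made up for illustration) showing that regression estimates the conditional mean \(E[Y|X=x]\):
set.seed(1)
n <- 1000
x <- runif(n, 0, 10)
# True conditional mean: E[Y | X = x] = 2 + 3 x, plus noise
y <- 2 + 3 * x + rnorm(n, sd=2)
modelo <- lm(y ~ x)
# The fitted coefficients should be close to 2 and 3
print(modelo$coefficients)
# The prediction at x = 5 approximates E[Y | X = 5] = 17
print(predict(modelo, newdata=data.frame(x=5)))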

1.2.3 Multilinear regression

The linear regression model is easily extended to several explanatory variables:

  1. The R command lm makes a multilinear model: \(\text{mpg} = \theta \cdot (1,\ \text{disp},\ \text{drat},\ \text{wt})\)

  2. Parameter vector: \(\theta\)

  3. Fitting is computing the optimal value of parameters

    • \(\theta_0\) is the intercept
    • \(\theta_1\dots \theta_d\) are the coefficients of the variables
  4. Analyze the residuals for the input vector \(x\):

    • Prediction \(\hat{y_i}=\theta \cdot x_i\)
    • Error or residual: \(E_i=y_i-\hat{y_i}\)
data("mtcars")
modelo <- lm(mpg ~ disp + drat + wt, data=mtcars)
summary(modelo)
plot(modelo$residuals)
  • The signs of the coefficients indicate how each variable affects the response
  • The values of the coefficients measure the relative importance of the variables, but only if the variables are on comparable scales (see the standardization sketch below)
data("mtcars")
modelo_1v <- lm(mpg ~ disp, data=mtcars)
modelo_3v <- lm(mpg ~ disp + drat + wt, data=mtcars)
print(sum(modelo_1v$residuals^2) / nrow(mtcars))
print(sum(modelo_3v$residuals^2) / nrow(mtcars))
  • We must check whether including more variables improves (decreases) the residual error
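  • Raw coefficients depend on the units of each variable; a sketch of standardizing the predictors (with scale) so that their coefficients become comparable:
data("mtcars")
# Standardize the predictors (zero mean, unit variance)
mtcars_std <- mtcars
mtcars_std[, c("disp", "drat", "wt")] <- scale(mtcars[, c("disp", "drat", "wt")])
modelo_std <- lm(mpg ~ disp + drat + wt, data=mtcars_std)
# Each coefficient now measures the effect of one standard deviation of its variable
print(modelo_std$coefficients)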

1.2.4 Math recall: Regression and optimization

Mathematically, the solution of regression is an optimization problem:

  • Model: \(y=\theta_0+\theta_1\, x\)

  • Parameters or weights: \(\theta_0,\, \theta_1\)

  • Error function, cost function or residual: \[ E_{\{x,y\}}(\theta_0,\theta_1) = \sum_i \left(y_i-\theta_0-\theta_1\, x_i \right)^2 \]

  • Solution: \[\min_{\theta_0,\theta_1} \sum_i \left(y_i-\theta_0-\theta_1\, x_i \right)^2\]
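  • As a sketch, the same cost function can be minimized numerically with a general-purpose optimizer and the result compared with lm (optim is used only for illustration; lm actually solves the problem with a QR decomposition):
data("cars")
# Cost function: sum of squared residuals as a function of theta = (theta0, theta1)
cost <- function(theta) {
  sum((cars$dist - theta[1] - theta[2] * cars$speed)^2)
}
# Numerical minimization starting from (0, 0)
sol <- optim(c(0, 0), cost)
print(sol$par)
# Compare with the solution computed by lm
print(lm(dist ~ speed, data=cars)$coefficients)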

1.2.5 Exercises

  1. Download some of the examples from Kaggle or UCI
  2. Guess, according to the problem logic, which is the response variable \(y\)
  3. Select one of the continuous variables \(x\)
  4. Compute the linear regression y ~ x
  5. Compute the error (sum of squared residuals)
  6. Plot the data and the linear regression
  7. Plot the residuals
  8. Select one or more additional variables and compute multilinear regression
  9. Compute the error (sum of squared residuals)
  10. Discuss the relative influence of variables

1.3 Goodness of fit

1.3.1 Training error

  • The fitted model gives a prediction \(\hat{y_i}=F_\theta(x_i)\)
  • We measure the goodness of fit by the training error: \(\sum_i (y_i-\hat{y_i})^2\)
  • Other names:
    • Sum of Squared Errors (SSE)
    • Sum of Squared Residuals
  • We can use other measures, but we have to be consistent (always use the same), e.g.: Mean Squared Error = SSE / n
  • We can then decrease the training error by using a model with higher capacity.
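  • A minimal sketch computing both measures (SSE and MSE) for a fitted model:
data("cars")
modelo <- lm(dist ~ speed, data=cars)
# Sum of Squared Errors (SSE)
sse <- sum(modelo$residuals^2)
# Mean Squared Error (MSE = SSE / n)
mse <- sse / nrow(cars)
print(c(SSE=sse, MSE=mse))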

1.3.2 Polynomial regression

  • We can make a nonlinear model:

\[ y= \theta_0+\theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p \]

  • This model has higher capacity (more parameters)
  • It is expected that this model produces a smaller (training) error
  • But the model is still linear in the parameters
  • The parameters can be computed in R with lm
data("cars")
modelo_lineal <- lm(dist ~ speed, data=cars)
modelo_cuad <- lm(dist ~ speed + I(speed^2), data=cars)
with(cars, plot(speed, dist))
lines(cars$speed, predict(modelo_lineal), col="red")
lines(cars$speed, predict(modelo_cuad), col="green")
cat("Error lineal = ", sum(modelo_lineal$residuals^2) / nrow(cars))
cat("Error cuadrático = ", sum(modelo_cuad$residuals^2) / nrow(cars))
  • Additional regressors can also be included in multilinear regression
data("mtcars")
modelo_1v <- lm(mpg ~ disp, data=mtcars)
modelo_3v <- lm(mpg ~ disp + drat + wt, data=mtcars)
modelo_3v_poly <- lm(mpg ~ disp + drat + wt + I(disp^2) + I(drat^2) + I(wt^2), data=mtcars)
print(sum(modelo_1v$residuals^2) / nrow(mtcars))
print(sum(modelo_3v$residuals^2) / nrow(mtcars))
print(sum(modelo_3v_poly$residuals^2) / nrow(mtcars))
summary(modelo_3v_poly)
  • This is equivalent to creating new columns: \(x_2=x^2,\ x_3=x^3\), etc.
data("cars")
X <- cars$speed
for (i in 2:5) {
  X <- cbind(X, cars$speed^i)
}
modelo <- lm(dist ~ X, data=cars)
print(sum(modelo$residuals^2) / nrow(cars))
plot(cars$speed, cars$dist)
lines(cars$speed, predict(modelo), col="red")
  • You can also add terms that are different from polynomials
data("mtcars")
# New column
mtcars$l100km <- 235.215 / mtcars$mpg
modelo_1v <- lm(mpg ~ disp, data=mtcars)
modelo_1v_l100km <- lm(l100km ~ disp, data=mtcars)
print(sum(modelo_1v$residuals^2) / nrow(mtcars))
print(sum(modelo_1v_l100km$residuals^2) / nrow(mtcars))
with(mtcars, plot(disp, mpg))
lines(mtcars$disp, predict(modelo_1v))
with(mtcars, plot(disp, l100km))
lines(mtcars$disp, predict(modelo_1v_l100km))
  • If columns are not linearly independent there are numerical instabilities
data("cars")
cars$speedkmh <- 1.60934 * cars$speed
# . = all the remaining columns
modelo <- lm(dist ~ ., data=cars)
with(cars, plot(speed, dist))
summary(modelo)
  • In principle, columns can be added up to arbitrary degrees
# Try different degrees 10, 20, etc.
max_degree = 10
#
data("cars")
X <- cars$speed
for (i in 2:max_degree) {
  X <- cbind(X, cars$speed^i)
}
modelo <- lm(dist ~ X, data=cars)
print(sum(modelo$residuals^2) / nrow(cars))
plot(cars$speed, cars$dist)
lines(cars$speed, predict(modelo), col="red")
summary(modelo)

1.3.3 Math recall: Regression and Linear Algebra

It is easier to use matrix notation, in particular when we have multilinear models. Based on a data matrix \(\tilde{x}=(1,x)\):

  • Model: \(y= \tilde{x} \, \theta\)

  • Parameters or weights: vector \(\theta\)

  • Error function, cost function or residual: \[ E_{\{\tilde{x},y\}}(\theta) = \sum_i \left(y_i-\tilde{x}_i\,\theta \right)^2 = \|Y-X\, \theta\|^2\]

  • Solution: \[\min_{\theta} \sum_i \left(y_i-\tilde{x}_i\,\theta \right)^2 = \min_{\theta} \|Y-X \,\theta\|^2 \]

  • The solution for the linear model \(Y=X\,\theta\) is given by the normal equations: \[\theta^*=(X^\top\,X)^{-1}\,X^\top\,Y\] where \(X^+=(X^\top\,X)^{-1}\,X^\top\) is the Moore-Penrose pseudo-inverse.

  • If the matrix \(X\) has \(m\) columns, the size of \((X^\top\,X)^{-1}\) is \(m\times m\)

  • The effect of adding columns is increasing \(m\)

  • If we add columns up to \(m=n\), then the matrix \(X\):

    • is square
    • is invertible if its columns are linearly independent (full rank)
  • If \(X\) is invertible, the system \(Y=X\,\theta\) has an exact solution and the error is zero

  • If we add columns up to \(m>n\), then

    • \(rank(X)\le n < m\)
    • \(rank(X^\top\,X)=rank(X)< m\)
    • \(X^\top\,X\) is singular (not invertible), so the pseudo-inverse formula above cannot be applied directly
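  • A minimal numerical check of the normal equations on the cars data (a sketch; solve is used directly here, which can be numerically unstable for large \(m\)):
data("cars")
# Data matrix with a column of ones for the intercept
X <- cbind(1, cars$speed)
Y <- cars$dist
# Normal equations: (X^T X) theta = X^T Y
theta <- solve(t(X) %*% X, t(X) %*% Y)
print(theta)
# Same coefficients as lm
print(lm(dist ~ speed, data=cars)$coefficients)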

Example: the matrix \(X\) of the cars database is \(n\times m\)

  • Columns \(m=2\): speed and 1 (\(\theta_0\cdot 1=\theta_0\) is the intercept)
  • Rows \(n=50\)
  • So its rank is…
  • But there are repeated (linearly dependent) rows: \(rank(X)<n\)
  • In practice, we will see numerical instabilities well before reaching \(m=n\)
# Try different degrees 10, 20, etc.
max_degree = 10
#
data("cars")
X <- cars$speed
for (i in 2:max_degree) {
  X <- cbind(X, cars$speed^i)
}
# rankMatrix is provided by the Matrix package
Matrix::rankMatrix(X, method="qr.R")

1.3.4 Exercises

  1. Download some of the examples from Kaggle or UCI
  2. Select several continuous columns to perform multilinear regression
  3. Try at least three combinations of higher order terms
  4. For each combination of regressors, check the training error
  5. By adding too many columns, try to find a combination where singularities appear

1.4 Generalization and overfitting

1.4.1 Training error

  • The quality of a model is measured by the training error: \(\sum_i (y_i -\hat{y_i})^2\)
  • A model with high training error is underfitting
  • In theory, the training error tends to zero by using a basis of functions, e.g. polynomials (Stone-Weierstrass theorem)
  • Statistical framework: data comes from sampling a probability distribution
# Ohm's law is an empirical law
# number of samples
n_samples <- 20
# Voltage (volts) (sort only for plotting)
voltage <- runif(n_samples, min=50, max=100)
voltage <- sort(voltage)
# Resistance (ohms)
resistance <- 50
# "Measure" current
current <- voltage / resistance
# Add noise
noise <- runif(n_samples, min=0.8, 1.2)
current <- current * noise
# Build a data frame
measures <- data.frame(voltage, current)
with(measures, plot(voltage, current))
# Compute a linear model
modelo = lm(current ~ voltage + 0, data=measures)
print(sqrt(sum(modelo$residuals^2) / nrow(measures)))
lines(voltage, predict(modelo), col="red")
# Increase model capacity
degree = 20
for (i in 2:degree) {
    measures <- cbind(measures, voltage^i / max(voltage^i))
    colnames(measures)[i+1] <- paste("voltage", i, sep="")
}
modelo_poly = lm(current ~ ., data=measures)
print(sqrt(sum(modelo_poly$residuals^2) / nrow(measures)))
with(measures, plot(voltage, current))
lines(voltage, predict(modelo_poly), col="red")

1.4.2 Test error

  • Once the model is built, the final aim is prediction
  • We have to check the error on data that has not been used for regression
  • We can simulate these data by a test set
  • If the model gives low training error and high test error, it is overfitting

The key method to check for overfitting is to divide the dataset into training and test sets and to check the test error rather than only the training error.

# Ohm's law is an empirical law
# number of samples
n_samples <- 40
voltage <- runif(n_samples, min=50, max=100)
# Resistance (ohms)
resistance <- 50
# "Measure" current
current <- voltage / resistance
# Add noise
noise <- runif(n_samples, min=0.8, 1.2)
current <- current * noise
# Build a data frame
measures <- data.frame(voltage, current)
# Augment data
degree = 7
for (i in 2:degree) {
    measures <- cbind(measures, voltage^i / max(voltage^i))
    colnames(measures)[i+1] <- paste("voltage", i, sep="")
}
# Separate training and test datasets
indices = caret::createDataPartition(measures$current, list=FALSE)
measures_train = measures[indices,]
measures_test = measures[-indices,]
# Compute linear and polynomial model
modelo = lm(current ~ voltage + 0, data=measures_train)
modelo_poly = lm(current ~ ., data=measures_train)
# Compare errors
cat("Train / lineal:", sqrt(sum(modelo$residuals^2) / nrow(measures_train)), "\n")
cat("Train / poli:", sqrt(sum(modelo_poly$residuals^2) / nrow(measures_train)), "\n")
predict_lineal = predict(modelo, newdata=measures_test)
predict_poly = predict(modelo_poly, newdata=measures_test)
error_test_lineal = sum((measures_test$current - predict_lineal)^2 / nrow(measures_test))
error_test_poly = sum((measures_test$current - predict_poly)^2 / nrow(measures_test))
cat("Test / lineal:", error_test_lineal, "\n")
cat("Test / poli:", error_test_poly, "\n")
  • When data is limited, we should try different partitions and average the results (resampling), as sketched below
  • Setting the proportion of data used for test is heuristic
data("mtcars")
# Partition data
indices = caret::createDataPartition(mtcars$mpg, p=0.7, list=FALSE)
mtcars_train = mtcars[indices,]
mtcars_test = mtcars[-indices,]
# Fit the model
modelo <- lm(mpg ~ disp + drat + wt, data=mtcars_train)
# Check errors
cat("Train:", sqrt(sum(modelo$residuals^2) / nrow(mtcars_train)), "\n")
prediction = predict(modelo, newdata=mtcars_test)
# Use the same measure (RMSE) as for the training error
error_test = sqrt(sum((mtcars_test$mpg - prediction)^2) / nrow(mtcars_test))
cat("Test:", error_test, "\n")

1.5 Assignment

  1. Select a real-world problem
  2. Explain the problem as a regression task and the expected result
  3. Download and describe data. Classify columns as numerical or categorical
  4. Select the features that will be used in the regression
  5. Partition data into training and test datasets
  6. Perform regression by using the lm R command.
  7. Analyze both types of error.
  8. Change the model, e.g. adding polynomial terms and check what happens to both errors.