Chapter 2 Neural networks

2.1 Neural networks as a regression model

2.1.1 Introduction

  • Neural Networks are nonlinear regression models
  • Neural networks have higher capacity than linear models
    • Many (thousands or millions) parameters
    • Avoid underfitting
  • Neural networks handle overfitting better than linear models, but
    • the reasons are not fully understood,
    • this is not always true, and
    • overfitting is still a problem in neural networks
  • Neural networks have a long history
    • 1950s: models based on biological neurons
    • 1960s: multilayer perceptron
    • 1980s: backpropagation. Neural networks are omnipresent in academic research
    • 2010s: deep learning. Neural networks are omnipresent

2.1.2 Artificial neurons

  • Recall the linear regression model: \[ y=b+\sum_i x_i w_i \]
  • It can be viewed as an input-output model:

  • If we post-process the output with a non-linear function \(\varphi\), we obtain an artificial neuron: \(y=\varphi\left(b+\sum_i x_i w_i \right)\)
  • The function \(\varphi\) is called the activation or transfer function
    • Linear function: \(\varphi(z)=k\, z\)
    • Sigmoid (logistic) function: \(\varphi(z)=\frac{1}{1+e^{-z}}\)
    • Softmax function (for multiple outputs)
    • Rectified linear units (ReLU): \(\varphi(z)=\max(0,z)\)
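A minimal sketch of a single artificial neuron in plain R, with made-up inputs, weights, and bias, comparing the activation functions listed above:
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(0.5, -1.2, 3.0)    # inputs (made up)
w <- c(0.4,  0.1, -0.3)   # weights (made up)
b <- 0.2                  # bias
z <- b + sum(x * w)       # weighted sum plus bias
c(linear = z, sigmoid = sigmoid(z), relu = max(0, z))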

2.1.3 Artificial neural networks

  • A neural network is composed of many neurons arranged in layers
    • There is an input layer: \(y^0=x\)
    • The outputs of the neurons \(y^l\) at layer \(l\) (hidden layers) are the inputs of all neurons at the next layer \(l+1\)
    • The output of the network \(\hat{y}\) is the output \(y^L\) of the last layer \(L\) (output layer)
  • Connections are one-directional: there are no loops

  • All neurons include a bias or intercept, which can be thought of as a neuron with output fixed to 1 (not shown in the figure)
  • Computation proceeds in feed-forward direction, layer by layer

\[\begin{aligned} y^0&=x \\ y^1&=\varphi\left(b^{1}+W^1\,y^{0}\right)\\ \dots&=\dots \\ y^l&=\varphi\left(b^{l}+W^l\,y^{l-1}\right) \\ \dots&=\dots \\ \hat{y}&=y^{L} \end{aligned}\]
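A minimal sketch of this feed-forward computation in plain R, with made-up layer sizes and random weights (the sigmoid is applied element-wise):
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
x  <- c(1.0, 2.0)                                   # input, N0 = 2
W1 <- matrix(rnorm(3 * 2), 3, 2); b1 <- rnorm(3)    # hidden layer, N1 = 3
W2 <- matrix(rnorm(1 * 3), 1, 3); b2 <- rnorm(1)    # output layer, N2 = 1
y0 <- x
y1 <- sigmoid(b1 + W1 %*% y0)    # hidden layer activations
y_hat <- b2 + W2 %*% y1          # linear output
y_hat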

2.1.4 Neural Networks in R

  • There are many R packages for neural networks; comparisons are available online
  • We will use nnet (for now)
  • Using nnet is very similar to lm: it uses the same formula syntax y ~ x
  • To use nnet we have to set the size of the hidden layer
library(nnet)
data("cars")
modelo_lineal <- lm(dist ~ speed, data=cars)
modelo_neuronal <- nnet(dist ~ speed, data=cars, size=10)

2.1.5 Design of a neural network

  1. Network design = setting hyper-parameters
    • We have to set the number of layers
    • We have to set the number of neurons at each hidden layer
      • The input layer must match the number of inputs or regressors
      • The output layer has one neuron for regression
    • We have to set the activation functions
      • Usually sigmoid in hidden layers
      • At the output layer, it must match the possible outputs:
        • Sigmoid functions are bounded
        • Use a linear output to allow arbitrarily large outputs
        • Sigmoid functions can be used if data is normalized
  2. Network fitting: the computational cost depends on the network size
    • Computation of layer \(l\): \(y^l=\varphi\left(b^{l}+W^l\,y^{l-1}\right)\)
    • The matrix \(W^l\) has dimensions \(N_l \times N_{l-1}\) where \(N_l\) is the number of neurons of layer \(l\)
    • There are \(N_l \times (N_{l-1}+1)\) parameters to be learned at layer \(l\) (see the sketch after this list)
  3. Evaluate errors
    • The concepts of training error and test error are as important as in linear regression (or more)
    • The error is only computed at the output layer: \[ E_{\{x,y\}}(b^1,W^1,\dots,b^L,W^L) = \|y-\hat{y}\|^2 \]
    • But the error depends on parameters at all layers
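The parameter count \(N_l \times (N_{l-1}+1)\) above is easy to verify in R; a minimal sketch for the 1-10-1 network used in the examples below:
layer_sizes <- c(1, 10, 1)    # e.g. nnet(dist ~ speed, size=10): 1 input, 10 hidden, 1 output
params_per_layer <- layer_sizes[-1] * (head(layer_sizes, -1) + 1)
params_per_layer              # 10*(1+1) = 20 and 1*(10+1) = 11
sum(params_per_layer)         # 31 parameters in total, as reported by nnet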

2.1.6 Examples

  1. Compare graphically the fit produced by a neural network (size=10) and a linear model. We will talk about maxit in the next class; for now, just make sure that nnet finishes with the message converged
data("cars")
# Hyperparameters
# One hidden layer, ten units: size=10
# What happens if we don't use linout=FALSE??
# We will talk about maxit in next class
modelo_neural <- nnet(dist ~ speed, data=cars, size=10, linout=TRUE, maxit=1000)
with(cars, plot(speed, dist))
modelo_lineal <- lm(dist ~ speed, data=cars)
lines(cars$speed, predict(modelo_lineal), col="red")
lines(cars$speed, predict(modelo_neural), col="green")
  2. Analyse the fitting
data("cars")
# Hyperparameters
# One hidden layer, ten units: size=10
modelo_lineal <- lm(dist ~ speed, data=cars)
modelo_neural <- nnet(dist ~ speed, data=cars, size=10, linout=TRUE, maxit=1000)
summary(modelo_neural)
  3. Compute the errors
# WARNING: This is only an example. In practice always partition training/test
data("cars")
# Hyperparameters
# One hidden layer, ten units: size=10
modelo_lineal <- lm(dist ~ speed, data=cars)
modelo_neural <- nnet(dist ~ speed, data=cars, size=10, linout=TRUE, maxit=1000)
cat("Error modelo lineal = ", sum(modelo_lineal$residuals^2) / nrow(cars), "\n")
cat("Error red neuronal = ", sum(modelo_neural$residuals^2) / nrow(cars), "\n")
  4. Results are not deterministic

  5. Try normalization (linout=FALSE)

# WARNING: This is only an example. In practice always partition training/test
data("cars")
cars$norm_dist = cars$dist / max(cars$dist)
# Hyperparameters
# One hidden layer, ten units: size=10
modelo_lineal <- lm(norm_dist ~ speed, data=cars)
modelo_neural <- nnet(norm_dist ~ speed, data=cars, size=10, maxit=1000)
cat("Error modelo lineal = ", sum(modelo_lineal$residuals^2) / nrow(cars), "\n")
cat("Error red neuronal = ", sum(modelo_neural$residuals^2) / nrow(cars), "\n")
  6. Always check training and test errors
data("mtcars")
# Partition data
indices = caret::createDataPartition(mtcars$mpg, p=0.7, list=FALSE)
mtcars_train = mtcars[indices,]
mtcars_test = mtcars[-indices,]
# Fit the models
modelo_1 <- lm(mpg ~ disp + drat + wt, data=mtcars_train)
modelo_2 <- nnet(mpg ~ disp + drat + wt, data=mtcars_train, size=100, linout=TRUE, maxit=1000)
# Check errors
cat("Training error (linear model):", sqrt(sum(modelo_1$residuals^2) / nrow(mtcars_train)), "\n")
prediction_1 = predict(modelo_1, newdata=mtcars_test)
error_test_1 = sum((mtcars_test$mpg - prediction_1)^2) / nrow(mtcars_test)
cat("Test error (linear model):", error_test_1, "\n")
cat("Training error (neural network):", sqrt(sum(modelo_2$residuals^2) / nrow(mtcars_train)), "\n")
prediction_2 = predict(modelo_2, newdata=mtcars_test)
error_test_2 = sum((mtcars_test$mpg - prediction_2)^2) / nrow(mtcars_test)
cat("Test error (neural network):", error_test_2, "\n")
  7. Check different network sizes (this will probably require increasing MaxNWts and adjusting maxit)
# Try changing network size
data("mtcars")
# Partition data
indices = caret::createDataPartition(mtcars$mpg, p=0.7, list=FALSE)
mtcars_train = mtcars[indices,]
mtcars_test = mtcars[-indices,]
# Fit the models
modelo_1 <- lm(mpg ~ disp + drat + wt, data=mtcars_train)
modelo_2 <- nnet(mpg ~ disp + drat + wt, data=mtcars_train, size=1000, linout=TRUE, MaxNWts=10000, maxit=1000)
# Check errors
cat("Training error (linear model):", sqrt(sum(modelo_1$residuals^2) / nrow(mtcars_train)), "\n")
prediction_1 = predict(modelo_1, newdata=mtcars_test)
error_test_1 = sum((mtcars_test$mpg - prediction_1)^2) / nrow(mtcars_test)
cat("Test error (linear model):", error_test_1, "\n")
cat("Training error (neural network):", sqrt(sum(modelo_2$residuals^2) / nrow(mtcars_train)), "\n")
prediction_2 = predict(modelo_2, newdata=mtcars_test)
error_test_2 = sum((mtcars_test$mpg - prediction_2)^2) / nrow(mtcars_test)
cat("Test error (neural network):", error_test_2, "\n")

2.1.7 Exercises

  1. Download some of the examples from Kaggle or UCI
  2. Compare multilinear regression with a neural network
  3. Check training and test errors
  4. Change the network size and check errors again

2.2 Backpropagation learning

2.2.1 Training error in neural networks

  • Recall the general loop of learning techniques:
    1. Define a family of parameterized models
    2. Fit the model by computing the optimal parameters
    3. Check the errors
  • The output of neural networks results from feed-forward computation by layers:

\[\begin{aligned} y^0&=x \\ y^1&=\varphi\left(b^{1}+W^1\,y^{0}\right)\\ \dots&=\dots \\ y^l&=\varphi\left(b^{l}+W^l\,y^{l-1}\right) \\ \dots&=\dots \\ \hat{y}&=y^{L} \end{aligned}\]

  • The training error is a very complex function of all parameters: \(E=\sum_{i=1}^n (y_i-\hat {y_i})^2\)

2.2.2 Error minimization in neural networks

  • The minimization problem \(\min_W E\) has no direct solution: we need numerical algorithms
  • The basic idea of most numerical optimization methods is gradient descent: \(w_{ij}^l(t) = w_{ij}^l(t-1)- \alpha \, \nabla E(w_{ij}^l(t-1))\) (see the sketch after this list)
  • The hyperparameter \(\alpha\) is called learning rate
  • Most numerical methods are iterative algorithms. They require:
    • Initialization
    • Step adjustment
    • Convergence check
  • Historical note: gradient computation in neural networks was called backpropagation
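A minimal sketch of plain gradient descent in R, fitting a one-parameter model \(\hat{y}=w\,x\) by least squares on made-up data (the learning rate is chosen by hand):
set.seed(1)
x <- runif(50)
y <- 3 * x + rnorm(50, sd = 0.1)        # true slope is 3
w <- 0                                  # initialization
alpha <- 0.5                            # learning rate
for (t in 1:200) {
  grad <- -2 * mean((y - w * x) * x)    # dE/dw for E = mean((y - w*x)^2)
  w <- w - alpha * grad                 # gradient descent step
}
w                                       # should be close to 3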

2.2.3 Gradient descent in R

  • The package nnet uses a quasi-Newton optimizer (BFGS), so we do not need to set \(\alpha\)
  • Increase maxit if the network does not converge
  • Check the evolution of the error during execution (trace=TRUE)
  • We can manipulate the stopping conditions (be careful):
    • Stop if \(E < \mathit{abstol}\)
    • Stop if \(E_{n+1} \ge (1-\mathit{reltol})\,E_n\) (the relative improvement is too small)
  • Weight initialization: random in \([-\mathit{rang},\mathit{rang}]\)
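An illustrative call combining these options (the values are arbitrary, and mtcars_train is assumed to be the training partition built in the earlier examples):
library(nnet)
modelo_opts <- nnet(mpg ~ disp + drat + wt, data=mtcars_train,
                    size=10, linout=TRUE,
                    maxit=2000,     # increase if the network does not converge
                    trace=TRUE,     # print the error during execution
                    abstol=1e-4,    # stop if the fit criterion falls below abstol
                    reltol=1e-8,    # stop if the relative improvement is too small
                    rang=0.5)       # initial weights uniform in [-rang, rang]
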
# Try changing network size
data("mtcars")
# Partition data
indices = caret::createDataPartition(mtcars$mpg, p=0.7, list=FALSE)
mtcars_train = mtcars[indices,]
mtcars_test = mtcars[-indices,]
# Fit the models
modelo_1 <- lm(mpg ~ disp + drat + wt, data=mtcars_train)
modelo_2 <- nnet(mpg ~ disp + drat + wt, data=mtcars_train, size=100, linout=TRUE, MaxNWts=10000, maxit=1000)
# Check errors
cat("Training error (linear model):", sqrt(sum(modelo_1$residuals^2) / nrow(mtcars_train)), "\n")
prediction_1 = predict(modelo_1, newdata=mtcars_test)
error_test_1 = sum((mtcars_test$mpg - prediction_1)^2) / nrow(mtcars_test)
cat("Test error (linear model):", error_test_1, "\n")
cat("Training error (neural network):", sqrt(sum(modelo_2$residuals^2) / nrow(mtcars_train)), "\n")
prediction_2 = predict(modelo_2, newdata=mtcars_test)
error_test_2 = sum((mtcars_test$mpg - prediction_2)^2) / nrow(mtcars_test)
cat("Test error (neural network):", error_test_2, "\n")
  • For deterministic experiments, set the initial weights (Wts)
n_pesos <- length(modelo_2$wts)
pesos <- runif(n_pesos, min=-0.7, 0.7)
for (i in (1:10)) {
  modelo_determ <- nnet(mpg ~ disp + drat + wt, data=mtcars_train, size=100, linout=TRUE, MaxNWts=10000, maxit=1000, Wts=pesos, trace=FALSE)
  print(sum(modelo_determ$residuals^2) / nrow(mtcars_train))
}
  • Output can be multidimensional
  • This is more efficient than two separate regressions, because weights in the hidden layers are shared
  • Warning: this is a long execution
# Database loaded externally
# https://www.kaggle.com/datasets/elikplim/eergy-efficiency-dataset
# If importing through the graphical interface, enable the headers row option
datos <- read.csv("ENB2012_data.csv")
# Partition data (alternative without caret)
# indices = caret::createDataPartition(mtcars$mpg, p=0.7, list=FALSE)
indices = sample(1:nrow(datos), 0.7*nrow(datos))
datos_train = datos[indices,]
datos_test = datos[-indices,]
# Fit the models
X = datos_train[, 1:8]
Y = datos_train[, 9:10]
modelo_md <- nnet(X, Y, size=1000, linout=TRUE, MaxNWts=100000, maxit=1000)
# Check errors
cat("Training error (neural network):", sqrt(sum(modelo_md$residuals^2) / nrow(datos_train)), "\n")
X_test = datos_test[, 1:8]
Y_test = datos_test[, 9:10]
prediction = predict(modelo_md, newdata=X_test)
error_test = sum((Y_test - prediction)^2) / nrow(datos_test)
cat("Test error (neural network):", error_test, "\n")
  • Observe how slowly the error decreases
iter 900 value 1609.212529
iter 910 value 1608.937640
iter 920 value 1608.811216
iter 930 value 1608.745116
iter 940 value 1608.636313
iter 950 value 1608.473398
iter 960 value 1608.328241
iter 970 value 1608.193417
iter 980 value 1608.122278
iter 990 value 1608.036335
iter1000 value 1607.998032
final  value 1607.998032 
stopped after 1000 iterations
Training error (neural network): 1.730436 
Test error (neural network): 2.739645 

2.2.4 Exercises

  1. Download some of the examples from Kaggle or UCI
  2. Build neural networks of different sizes
  3. Check training and test errors
  4. Check the number of parameters for each network size

2.3 Regularization

  • Recall the critical concept of capacity:
    • Too few parameters: large training error (underfitting)
    • Too many parameters: large test error (overfitting)
  • Regularization techniques are methods to avoid overfitting without losing (much) capacity

2.3.1 Ridge regression

  • Linear regression can be expanded by including polynomial terms
  • Problem of high-order polynomial models: very large coefficients (or even NA coefficients when terms are collinear)
  • We can penalize large parameters in the optimization: \(E=\|y-\hat{y}\|^2+\lambda\, \|\theta\|^2\)
  • The optimization problem is still LIP (Linear In the Parameters)
  • Including the norm of the parameters in the target function is called:
    • Tikhonov regularization (or L2 regularization)
    • Ridge regression (when applied to linear regression)
  • We can use ridge regression in R with the package glmnet
library(glmnet)
# Ohm's law is an empirical law
# number of samples
n_samples <- 100
# Voltage (volts) (sort only for plotting)
voltage <- runif(n_samples, min=50, max=100)
voltage <- sort(voltage)
# Resistance (ohms)
resistance <- 50
# "Measure" current
current <- voltage / resistance
# Add noise
noise <- runif(n_samples, min=0.8, 1.2)
current <- current * noise
# Build a data frame
measures <- data.frame(voltage, current)
# Compute polynomial terms
degree = 7
for (i in 2:degree) {
    measures <- cbind(measures, voltage^i / max(voltage^i))
    colnames(measures)[i+1] <- paste("voltage", i, sep="")
}
# Standardize data
measures = as.data.frame(scale(measures))
# Fit a polynomial model
modelo_poly <- lm(current ~ ., data=measures)
# Add regularization
measures_x <- as.matrix(measures[, -2])
measures_y <- as.matrix(measures$current)
# Find the best lambda
modelo_cv <- cv.glmnet(measures_x, measures_y, alpha=0)
lambda_best <- modelo_cv$lambda.min
# Fit the final model
modelo_reg <- glmnet(measures_x, measures_y, alpha=0, lambda=lambda_best)
# Print coefficients
cat("Máximo coeficiente:", max(modelo_poly$coefficients), "\n")
cat("Máximo coeficiente (regulariz.):", max(coef(modelo_reg)), "\n")
cat("Error:", sum(modelo_poly$residuals^2) / nrow(measures), "\n")
prediction <- predict(modelo_reg, newx=measures_x)
error_reg <- sum((prediction - measures_y)^2) / nrow(measures)
cat("Error (regulariz.):", error_reg, "\n")

2.3.2 Math recall: algebraic regularization

  • Recall the least squares solution of linear regression: \(\theta^*=(X^\top X)^{-1}\,X^\top\,y\)
  • The matrix \(X^\top X\) is singular (so the inverse does not exist)
    • If there are linearly dependent columns in \(X\)
    • If there are more columns than rows in \(X\)
  • The matrix \(X^\top X\) may be ill-conditioned: \(\det(X^\top X)\approx 0\)
    • If there are many columns in \(X\)
    • If there are few rows in \(X\)
  • The solution of \(\nabla E=0\) with regularization is \(\theta^*=(X^\top X+\lambda\,I)^{-1}\,X^\top\,y\)
  • The matrix \(X^\top X+\lambda\,I\) is positive definite (hence invertible) for any \(\lambda>0\) (see the sketch below)
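A minimal sketch of both formulas in plain R, on a deliberately ill-conditioned design matrix (unlike glmnet, this sketch does not standardize the data and also penalizes the intercept):
set.seed(1)
n  <- 20
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=1e-3)               # nearly collinear column
X  <- cbind(1, x1, x2)                     # include an intercept column
y  <- 1 + 2 * x1 + rnorm(n, sd=0.1)
lambda <- 0.1
theta_ols   <- solve(t(X) %*% X) %*% t(X) %*% y                    # may be large and unstable
theta_ridge <- solve(t(X) %*% X + lambda * diag(3)) %*% t(X) %*% y # well conditioned
cbind(theta_ols, theta_ridge)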

2.3.3 Weight decay

  • Tikhonov regularization can also be applied to neural networks: \(E=\|y-\hat{y}\|^2+\lambda\, \|W\|^2\)
  • The standard gradient descent update: \(w_{t+1} = w_{t} - \alpha \, \dfrac{\partial E}{\partial w}\)
  • Regularization adds a term that shrinks the weights at every step: \(w_{t+1} = w_{t} - \alpha \, \dfrac{\partial E}{\partial w} - \alpha\,\lambda\, w_t\)
  • The amount of regularization is related to the bias-variance tradeoff
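A quick numeric illustration of the decayed update with arbitrary numbers (the extra term shrinks the weight toward zero at every step):
alpha  <- 0.1    # learning rate
lambda <- 0.01   # weight decay
w      <- 2.0    # current weight (arbitrary)
grad_E <- 0.5    # gradient of the data error at w (arbitrary)
w - alpha * grad_E                        # plain gradient step
w - alpha * grad_E - alpha * lambda * w   # the same step with weight decay
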
library("nnet")
# Try changing weight decay
data("mtcars")
# Partition data
indices <- caret::createDataPartition(mtcars$mpg, p=0.7, list=FALSE)
mtcars_train <- mtcars[indices,]
mtcars_test <- mtcars[-indices,]
# Fit the models
modelo <- nnet(mpg ~ disp + drat + wt, data=mtcars_train, size=500, linout=TRUE, MaxNWts=10000, maxit=1000, decay=1000)
# Check error
cat("Training error:", sqrt(sum(modelo$residuals^2) / nrow(mtcars_train)), "\n")
prediction <- predict(modelo, newdata=mtcars_test)
error_test <- sum((mtcars_test$mpg - prediction)^2) / nrow(mtcars_test)
cat("Test error:", error_test, "\n")
cat("Max coefficient:", max(modelo$wts), "\n")
decay=0
Training error: 0.2020726 
Test error: 177.1406 
Max coefficient: 58.74598 

decay=1e-4
Training error: 0.2020858 
Test error: 11.01574 
Max coefficient: 10.88843 

2.3.4 Implicit regularization

  • Cross-validation: check test error and adjust the model hyper-parameters
  • Early stopping: stop gradient descent before convergence
    • Completely optimizing the training error leads to overfitting
    • In nnet this can be done with maxit or reltol

2.3.5 A (big) data approach to regularization

  • Ill-conditioning arises when the data matrix \(X\) has many columns and few rows
  • To avoid overfitting we could:
    • Reduce variables: dimensionality reduction
    • Create data instances: data augmentation
  • We need large amounts of data (which are useless if their quality is low)

2.3.6 Exercises

  1. Download some of the examples from Kaggle or UCI
  2. Build neural networks and choose the size
  3. For the same network size, try different values for the parameter decay
  4. Check training and test errors
  5. Set an extremely high value (e.g. decay=100) and see what happens to the error and the parameters

2.4 Classification

  • In regression tasks, we aim at learning a numeric result: given data \(x\), predict the value \(\hat{y}\)
  • In classification tasks, we aim at learning a decision rule: given data \(x\), decide to which class \(C_i\) it belongs

2.4.1 Qualitative variables

  • Qualitative variables can only take one of a finite number of values: \(x_i\in\{C_1,\dots C_k\}\)
  • This is in contrast to (real) numeric values
  • They can also be called:
    • Categorical
    • Discrete, in contrast to continuous
    • In R, factors
  • Each of the possible values \(C_j\) can be called:
    • Levels
    • Classes
    • Categories
  • Example: dataset mtcars
[, 8]   vs  Engine (0 = V-shaped, 1 = straight)
[, 9]   am  Transmission (0 = automatic, 1 = manual)
[,10]   gear    Number of forward gears
  • Qualitative variables must be numerically coded to be used by computational techniques
  • Numerical codes have an intrinsic order that may not agree with that of the qualitative variable
  • Even if a discrete variable has a numeric meaning, the proportionality may not be respected
  • Coding binary variables as {0,1} can be interpreted as probabilities
  • When there are \(k>2\) levels, we can use one-hot coding: \(k\) binary class indicators (see the example below)
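For example, nnet provides class.ind() to build the \(k\) binary class indicators from a factor; a quick illustration with the cyl variable of mtcars:
library(nnet)
data("mtcars")
cyl_levels <- factor(mtcars$cyl)    # levels 4, 6, 8
head(class.ind(cyl_levels))         # one indicator column per level
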
# We interpret an input variable (cylinders) as a factor
data("mtcars")
mtcars$cyl_factor = as.factor(mtcars$cyl)
# Fit the models
library(nnet)
modelo_num <- nnet(mpg ~ cyl + disp + wt, data=mtcars, size=100, linout=TRUE, MaxNWts=10000, maxit=1000)
modelo_fac <- nnet(mpg ~ cyl_factor + disp + wt, data=mtcars, size=100, linout=TRUE, MaxNWts=10000, maxit=1000)
# Compute number of parameters
cat("Coef names (cyl numeric):", modelo_num$coefnames, "\n")
cat("Parameters (cyl numeric):", length(modelo_num$conn), "\n")
cat("Coef names (cyl factor):", modelo_fac$coefnames, "\n")
cat("Parameters (cyl factor):", length(modelo_fac$conn), "\n")

2.4.2 Methods for classification

  • In classification, the output variable is categorical
  • There are many methods for classification:
    • Regression as a decision boundary
      • Linear models: logistic regression (glm in R)
      • Neural networks
    • Decision trees
    • Random forests
    • Support Vector Machines
    • k-Nearest Neighbours
    • etc.
  • There are many examples in medicine, engineering, economics…

2.4.3 Classification with regression neural networks

  • An output qualitative variable can be numerically coded
  • After coding the qualitative variable, we build a regression model \(\hat{y}=F(x)\)
  • For a qualitative variable, what is the meaning of \(\hat{y}=0.3\)?
  • For binary (two-class) classification, predictions can be interpreted as probabilities, leading to a decision boundary:

\[\hat{y}=\begin{cases} 0 & \text{if } F(x) \le \frac{1}{2} \\ 1 & \text{if } F(x) > \frac{1}{2} \end{cases} \]

  • In R, the class indicators are created automatically by nnet
# Classification with nnet
library(nnet)
data(iris)
# Partition data
indices <- caret::createDataPartition(iris$Species, p=0.7, list=FALSE)
datos_train <- iris[indices,]
datos_test <- iris[-indices,]
# Build and fit the model
modelo = nnet(Species ~ ., data=datos_train, size = 100)
# Check training error
cat("Training error:", sqrt(sum(modelo$residuals^2) / nrow(iris)), "\n")

2.4.4 Class prediction from regression

  • What is the meaning of a prediction \(\hat{y}\notin\{0,1\}\)?
# Prediction in classification
prediction = predict(modelo, datos_test)
print(prediction[1, ])
  • Predictions can be interpreted as probabilities
  • Do not forget to use sigmoid activation functions in the output layer (linout=FALSE)
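A minimal sketch of turning those probabilities into class labels by hand, assuming prediction is the matrix computed in the chunk above (predict with type="class" does this for us, as shown in the next section):
# For each test row, pick the class with the largest predicted output
predicted_class <- colnames(prediction)[max.col(prediction)]
table(predicted_class, datos_test$Species)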

2.4.5 Regression and classification errors

  • The final aim is predicting discrete classes
  • The numeric error is not meaningful
  • It is more illustrative to check the classification error by looking at the confusion matrix
  • In some applications, some errors are more important than others, e.g. false positives vs. false negatives in medicine
# Classification with nnet
library(nnet)
data(iris)
# Partition data
indices <- caret::createDataPartition(iris$Species, p=0.7, list=FALSE)
datos_train <- iris[indices,]
datos_test <- iris[-indices,]
# Build and fit the model
modelo = nnet(Species ~ ., data=datos_train, size = 100)
# Check the classification error
print("Training error")
output = predict(modelo, type="class")
conf_mat_train <- table(datos_train$Species, output)
print(conf_mat_train)
print("Test error")
prediction = predict(modelo, datos_test, type="class")
conf_mat_test <- table(datos_test$Species, prediction)
print(conf_mat_test)

2.4.6 Exercises

  1. Download data for a classification problem, for instance from Kaggle or UCI
  2. Build a neural network and search a suitable size
  3. Show the confusion matrix on test data

2.5 Deep Learning

2.5.1 How deep is DL?

  • In recent years there has been a lot of hype about Deep Learning (DL)
  • Deep Neural Networks are Neural Networks with many hidden layers
  • Several heuristics are often used in DL:
    • Dropout: some connections are randomly ignored during learning (a form of regularization)
    • ReLU units: help avoid vanishing gradients
    • Transfer learning: use weights already trained with different datasets (and maybe fine-tune training with your database)
  • DL includes some novel architectures
    • Convolutional Neural Networks (CNN): images
    • Long Short Term Memory (LSTM): time series
  • Improvements outside Machine Learning theory
    • Hardware: GPUs
    • Software: e.g. tensorflow (often used through the keras interface), H2O, fast.ai, torch, etc.
    • Funding: Netflix, Google, Facebook…

2.5.2 Deep Learning and torch

This tutorial has been summarized from several online torch tutorials.

  1. Prepare data for torch, which works with tensors
library(torch)
data("iris")
# Partition data
indices <- caret::createDataPartition(iris$Species, p=0.7, list=FALSE)
datos_train <- iris[indices,]
datos_test <- iris[-indices,]
# 
x_train = as.matrix(datos_train[, 1:4])
y_train = as.integer(datos_train$Species) 
# y_train = as.matrix(mltools::one_hot(data.table::as.data.table(datos_train$Species)))
x_test = as.matrix(datos_test[, 1:4])
y_test = as.integer(datos_test$Species)
#y_test = as.matrix(mltools::one_hot(data.table::as.data.table(datos_test$Species)))
# Put into torch tensor format
x_train_tensor = torch_tensor(x_train)
y_train_tensor = torch_tensor(y_train, dtype=torch_long())
x_test_tensor = torch_tensor(x_test)
y_test_tensor = torch_tensor(y_test, dtype=torch_long())
  2. Build the model. Right now we are not training.

Observe the different activation functions:

  • Sigmoid: \(nn\_sigmoid(x) = \dfrac{1}{1+e^{-x}}\)
  • Rectified Linear Units (ReLU): \(nn\_relu(x) = \max(0,x)\)
  • Softmax: \(nn\_softmax(x_i) = \dfrac{e^{x_i}}{\sum_j e^{x_j}}\)
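A quick numeric check of the softmax using torch's functional API (nnf_softmax), assuming the torch package loaded above:
z <- torch_tensor(matrix(c(1, 2, 3), nrow=1))
p <- nnf_softmax(z, dim=2)
print(p)          # three probabilities
print(p$sum())    # they add up to 1
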
modelo = nn_sequential(
    # Layer 1
    nn_linear(4, 100),
    # nn_relu(), 
    nn_sigmoid(),
    
    # Layer 2
    # nn_linear(100, 100),
    # nn_relu(),
    # nn_sigmoid(),

    # Layer 3
    nn_linear(100, 3),
    nn_softmax(dim = 2)  # add to 1 over axis 2 (columns)
    #  nn_sigmoid(),
)

Let us check that the architecture has been correctly created

entrada = torch_rand(1, 4)
prediction = modelo$forward(entrada)
print(prediction)
  3. Define cost and optimization method

Different error (loss) functions can be used

  • nn_mse_loss: mean squared error, used in regression problems
  • nn_cross_entropy_loss: for classification networks with more than 2 possible categories, when using nn_softmax at the output
  • See the torch documentation for more options

Different numerical optimization methods can be used. Adam, like all methods based on gradient descent, requires adjusting the learning rate.

# criterion = nn_nll_loss()
criterion = nn_cross_entropy_loss()
optimizer = optim_adam(modelo$parameters, lr = 0.01)

We can check that the error is computed

for (i in 1:3) {
    salida = torch_tensor(i, dtype = torch_long())
    error = criterion(prediction, salida)
    print(error$item())
}
  4. Train model
epochs = 100
if (! exists("started")) {
    started = TRUE
    iteration = 1
    errors = c()
    accs = c()
    inc_loss = 1
    loss_ant = 1e6
    maxit = epochs
} else {
    maxit = maxit + epochs
}
while (iteration < maxit && inc_loss > 1e-6) {
    optimizer$zero_grad()
    prediction = modelo(x_train_tensor)
    loss = criterion(prediction, y_train_tensor)
    loss$backward()
    optimizer$step()
    # Take care of the loop condition
    iteration = iteration + 1
    inc_loss = abs(loss$item() - loss_ant)
    # Compute and save errors
    winners = prediction$argmax(dim=2)
    corrects = (winners == y_train_tensor)
    accuracy = corrects$sum()$item() / y_train_tensor$size(1)
    loss_ant = loss$item()
    errors = c(errors, loss$item())
    accs = c(accs, accuracy)
    # Show error evolution
    # cat("Epoch:", iteration, "Loss: ", loss$item(), " Accuracy:", accuracy,"\n")
}
plot(seq_along(errors), errors, type = "l")
lines(seq_along(accs), accs, col = "red")   # accuracy in red
  5. Check test errors
prediction_test = modelo$forward(x_test_tensor)
error_test = criterion(prediction_test, y_test_tensor)
print(error_test)
winners = as.integer(prediction_test$argmax(dim=2))
conf_mat = table(winners, y_test)
print(conf_mat)

2.6 Assignment

  1. Regression with Neural Networks
    1. Select a real-world problem (it may be the same as in the linear regression assignment)
    2. Build a neural network for regression
    3. Check training and test errors and compare with the linear model
    4. Repeat the learning process changing hyper-parameters
      • Two or three values of size
      • Two or three values of weight decay
    5. Make a two-way table with errors for all the hyper-parameter combinations
  2. Classification with Neural Networks.
    1. Select a real-world problem (or build an artificial example)
    2. Build a neural network for classification
    3. Show the confusion matrix