Ensemble Learning



Ensemble methods are commonly used to improve predictive accuracy by combining the predictions of multiple machine learning models. When these methods were conceived, the idea was to combine so-called "weak" predictors (Breiman, "Bagging Predictors", 1996). A more modern approach, however, is to build an ensemble from a well-chosen collection of strong but diverse models (van der Laan, Polley and Hubbard, 2007).
Building powerful ensemble models has many parallels with creating successful teams of people in business, science, politics and sports. Each team member provides a significant contribution, and individual weaknesses and biases are compensated by the strengths of other members.
The simplest type of ensemble is the unweighted average of the predictions from a series of models: each model receives the same weight when the ensemble is constructed. More generally, one can use a weighted average that gives more weight to the better models. An even better approach is to estimate the weights intelligently, with a second level of estimation carried out by another learning algorithm. This approach is called model stacking.
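As a toy numeric illustration of the difference (the prediction vectors below are made-up values, not the output of a real model):

```r
# Two made-up prediction vectors from models 1 and 2
pred1 <- c(10, 20, 30)
pred2 <- c(12, 18, 33)

# unweighted average: every model gets the same weight 1/L
avg  <- (pred1 + pred2) / 2          # 11.0 19.0 31.5

# weighted average: more weight to the (presumably) better model 1
w    <- c(0.7, 0.3)
wavg <- w[1] * pred1 + w[2] * pred2  # 10.6 19.4 30.9
```

Model stacking replaces the hand-picked weights `c(0.7, 0.3)` with weights estimated from the data.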
Model stacking is an efficient ensemble method in which the predictions generated by several machine learning algorithms are used as inputs to a second-level learning algorithm. This second-level algorithm is trained to combine the model predictions optimally into a new set of predictions. For example, when linear regression is used as the second-level model, it estimates the combination weights by minimizing the squared error. However, second-level modeling is not limited to linear models: the relationship between the first-level predictions and the response can be more complex, which opens the door to other machine learning algorithms.
Overfitting is a particularly important problem in model stacking, because many predictors that all target the same response are combined; collinearity between these predictions is one partial cause. One possible remedy is to generate a diversified set of models with different methods (decision trees, neural networks and/or other methods) or on different types of data, so as to inject into the final model information that is not already present in the previous models.
Applying stacked models to real-world big data problems can produce greater accuracy and robustness of predictions compared to individual models. The model stacking approach shifts the modeling objective from finding the best model to finding a collection of truly valid complementary models. Naturally, this method involves additional computational costs both because it is necessary to train a large number of models and because it is necessary to use cross-validation to avoid overfitting.

Classic examples of this idea are bagging, random forests and boosting, which combine many trees, used as weak learners, into a single prediction.

These methods do, however, increase computation time and reduce interpretability.
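For instance, bagging and random forests are both available through the randomForest package (a minimal sketch, assuming the package is installed; setting mtry to all 13 Boston predictors turns a random forest into bagging):

```r
library(MASS)           # Boston data
library(randomForest)
set.seed(1)
# bagging: a random forest that considers all 13 predictors at every split
bag <- randomForest(medv ~ ., data = Boston, mtry = 13)
# random forest: a random subset of predictors at each split (default p/3)
rf  <- randomForest(medv ~ ., data = Boston)
```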

Model stacking

Stacked regression algorithm

  1. Compute \(\hat{f}^{-i}_l(x_i)\), the prediction for \(x_i\) from the \(l\)-th model fitted on the training data with the \(i\)-th observation \((x_i,y_i)\) removed

  2. Estimate the weights by least squares: \[\hat{w}_1,\ldots,\hat{w}_L = \underset{w_1,\ldots,w_L}{\arg \min} \sum_{i=1}^{n} \left[ y_i - \sum_{l=1}^{L} w_l \hat{f}^{-i}_l(x_i) \right]^2\]

  3. Finally, make predictions on the test set: \[\hat{f}_{\mathrm{stack}}(x^*_i) = \sum_{l=1}^{L} \hat{w}_l \hat{f}_l(x^*_i), \quad i=1,\ldots,m\]
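A small simulated check of step 2 (all names are illustrative; `f1` and `f2` play the role of the leave-one-out predictions of two hypothetical models):

```r
set.seed(1)
n  <- 200
y  <- rnorm(n)
# pretend these are leave-one-out predictions of L = 2 models
f1 <- y + rnorm(n, sd = 0.5)   # a fairly accurate model
f2 <- y + rnorm(n, sd = 2)     # a much noisier model

# step 2: least-squares weights, with no intercept as in the formula above
w <- coef(lm(y ~ f1 + f2 - 1))
w   # the accurate model receives the larger weight
```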

Model stacking algorithm

  1. Partition the training data into \(K\) folds \(\mathcal{F}_1,\ldots,\mathcal{F}_K\)

  2. For each fold \(\mathcal{F}_k\), \(k=1,\ldots,K\), combine the other \(K-1\) folds and use them to train the models; then:
    • for \(l=1,\ldots,L\), train the \(l\)-th model on these data and predict the observations in fold \(\mathcal{F}_k\). Store these predictions \[z_i = (\hat{f}_1^{-\mathcal{F}_k}(x_i),\ldots,\hat{f}_L^{-\mathcal{F}_k}(x_i)), \quad i \in \mathcal{F}_k\]
  3. For \(l=1,\ldots,L\), fit the \(l\)th model to the full training data and make predictions on the test data. Store these predictions \[z^*_i = (\hat{f}_1(x^*_i),\ldots,\hat{f}_L(x^*_i)), \quad i=1,\ldots,m\]

  4. Fit the stacking model \(\hat{f}_{\mathrm{stack}}\) using \[(y_1,z_1),\ldots, (y_n,z_n)\]

  5. Make final predictions \(\hat{y}^{*}_i = \hat{f}_{\mathrm{stack}}(z^*_i)\), \(i=1,\ldots,m\)
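The five steps above can be sketched in base R; this is a minimal illustration on the Boston data with lm and rpart as the \(L=2\) base learners and a linear second-level model (the split and fold assignment here are arbitrary choices for illustration, not the ones used later in the document):

```r
library(MASS)    # Boston data
library(rpart)
set.seed(1)

train_idx <- sample(nrow(Boston), 300)
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# step 1: partition the training data into K folds
K <- 5
fold <- sample(rep(1:K, length.out = nrow(train)))

# step 2: out-of-fold predictions z_i for each base model
z <- matrix(NA, nrow(train), 2, dimnames = list(NULL, c("lm", "rpart")))
for (k in 1:K) {
  in_k <- fold == k
  z[in_k, "lm"]    <- predict(lm(medv ~ ., train[!in_k, ]),    train[in_k, ])
  z[in_k, "rpart"] <- predict(rpart(medv ~ ., train[!in_k, ]), train[in_k, ])
}

# step 3: refit each model on the full training set, predict the test set
z_star <- cbind(lm    = predict(lm(medv ~ ., train), test),
                rpart = predict(rpart(medv ~ ., train), test))

# step 4: second-level model fitted on (y_i, z_i)
stack <- lm(y ~ ., data.frame(y = train$medv, z))

# step 5: final predictions and test MSE
yhat <- predict(stack, data.frame(z_star))
mean((test$medv - yhat)^2)
```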


Boston data

rm(list = ls())
library(MASS)            # Boston data
set.seed(123)
# random 50/50 split into training and test sets
istrain <- rbinom(n = nrow(Boston), size = 1, prob = 0.5) > 0
train  <- Boston[istrain, ]
n      <- nrow(train)
test   <- Boston[!istrain, -14]   # drop the response (column 14, medv)
test.y <- Boston[!istrain, 14]
m      <- nrow(test)

The training and test data are \[(x_1,y_1),\ldots,(x_n,y_n),\quad (x^*_1,y^*_1),\ldots,(x^*_m,y^*_m)\] with \(n=235\) and \(m=271\) for the Boston data set.

The response variable is medv, and the predictor variables are crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat


Stacked regression with two models

fit1 <- lm(medv ~ ., train)      # model 1: linear regression
library(rpart)
fit2 <- rpart(medv ~ ., train)   # model 2: regression tree

  1. \(\hat{f}^{-i}_l(x_i)\) is the prediction for the \(i\)-th training observation, obtained by fitting the \(l\)-th model on the training data with \((x_i,y_i)\) removed.

  2. We calculate the weights: \[\hat{w}_1,\ldots,\hat{w}_L = \underset{w_1,\ldots,w_L}{\arg \min} \sum_{i=1}^{n} \left[ y_i - \sum_{l=1}^{L} w_l \hat{f}^{-i}_l(x_i) \right]^2\]

  3. We make predictions on the test set: \[\hat{f}_{\mathrm{stack}}(x^*_i) = \sum_{l=1}^{L} \hat{w}_l \hat{f}_l(x^*_i), \quad i=1,\ldots,m\]
  4. We calculate the mean squared error on the test set \[\mathrm{MSE}^{\mathrm{stack}}_{\mathrm{Te}} = \frac{1}{m}\sum_{i=1}^{m} (y^*_i - \hat{f}_{\mathrm{stack}}(x^*_i) )^2\] and compare it with \[\mathrm{MSE}^l_{\mathrm{Te}} = \frac{1}{m}\sum_{i=1}^{m} (y^*_i - \hat{f}_{l}(x^*_i) )^2, \quad l=1,\ldots,L.\]


## MSE stack:  21.95002
## MSE lm:  28.83767
## MSE rpart:  24.39206
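One way to obtain numbers like those above is the following leave-one-out loop (a sketch that re-uses train, test, test.y, n, fit1 and fit2 from the chunks above; the exact MSEs depend on details such as whether an intercept or non-negativity constraints are used in the weight estimation):

```r
# leave-one-out predictions of the two base models
loo <- matrix(NA, n, 2, dimnames = list(NULL, c("lm", "rpart")))
for (i in 1:n) {
  loo[i, "lm"]    <- predict(lm(medv ~ ., train[-i, ]),    train[i, ])
  loo[i, "rpart"] <- predict(rpart(medv ~ ., train[-i, ]), train[i, ])
}

# least-squares weights, no intercept (as in the formula above)
w <- coef(lm(train$medv ~ loo - 1))

# stacked predictions and test MSEs
pred.stack <- w[1] * predict(fit1, test) + w[2] * predict(fit2, test)
cat("MSE stack:", mean((test.y - pred.stack)^2), "\n")
cat("MSE lm:   ", mean((test.y - predict(fit1, test))^2), "\n")
cat("MSE rpart:", mean((test.y - predict(fit2, test))^2), "\n")
```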







Useful R code for ensemble

caretEnsemble

# K-fold CV settings for regression
KCV <- trainControl(method = "cv",
                    number = K,
                    savePredictions = "final",
                    index = createResample(train$y, K)
                    )
# list of base models (methodList takes caret method names, e.g. "lm" and "rpart")
List <- caretList(y ~ .,
                  data = train,
                  trControl = KCV,
                  methodList = c("lm", "rpart")
                  )
# ensemble fit
fit.ensemble <- caretEnsemble(List, metric="RMSE")
summary(fit.ensemble)

Boston data

# libraries
rm(list=ls())
library(MASS)
library(caret)
library(caretEnsemble)

# import data
set.seed(123)
split  <- createDataPartition(y = Boston$medv, p = 0.5, list = FALSE)
train  <- Boston[split, ]
test   <- Boston[-split, -14]   # drop the response (column 14, medv)
test.y <- Boston[-split, 14]
nrow(test)

# cross-validation settings
K = 10
my_control <- trainControl(
  method="cv",
  number=K,
  savePredictions="final",
  index=createResample(train$medv, K)
  )

Library of models

model_list <- caretList(
  medv~., data=train,
  methodList=c("lm","ctree"), 
  tuneList=list(
    rf=caretModelSpec(method="rf", tuneLength=3)
  ),
  trControl=my_control
)

xyplot(resamples(model_list))
modelCor(resamples(model_list))

Ensemble

greedy_ensemble <- caretEnsemble(
  model_list, 
  metric="RMSE"
  )
summary(greedy_ensemble)

Test MSE

yhats <- lapply(model_list, predict, newdata=test)
lapply(yhats, function(yhat) mean((yhat - test.y)^2) )

yhat.en <- predict(greedy_ensemble, newdata=test)
mean((yhat.en - test.y)^2)