Boosting

Boosting is a general method for creating a complex prediction model that uses simple construction components called weak learners.
This algorithm starts by fitting a weak learner on the training data.
Subsequently, the weak learner is refitted, but giving more weight (importance) to observations that were poorly fitted/unclassified.
This process is repeated until a stopping rule is reached.

In this chapter, we will analyze three types of algorithms: * Boosting with squared error loss (L2-Boosting) uses regression trees as weak learners

Boosting with exponential loss (AdaBoost.M1) uses classification trees as weak learners
gradient Boosting (General Algorithm)

L2-Boosting algorithm for classification trees

L2-Boosting builds a model with regression trees by repeatedly fitting a regression tree to the residuals. This is different from Random Forest because each tree is trying to correct the errors made by the set of previously grown trees

Initialize \(\hat{f}(x)= \bar{y}\) and \(r_i = y_i - \bar{y}\) for \(i=1,\ldots,n\)
We perform a For loop for \(b=1,2,\ldots, B\) and repeat:
- We fit a tree \(\hat{f}^b\) with \(d\) splits on data \((x_1,r_1),\ldots, (x_n,r_n)\)
- We update \(\hat{f}\) by adding a version of the new tree: \[\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\]
- We update the residuals: \[r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\]
The model output is: \[\hat{f}(x) = \bar{y}+ \sum_{b=1}^{B} \lambda \hat{f}^b(x)\]

Tuning parameters for Boosting:

The number of trees \(B\): Unlike bagging and random forest, Boosting can overfit if \(B\) is too high, although this happens very slowly as \(B\) cresce, ma conviene sempre utilizzare la cross-validation per individuare questo parametro.
The shrinkage parameter \(\lambda\): This parameter is positive and close to zero and controls the rate at which the algorithm learns. Common values are 0.01 or 0.001; the choice depends on the problem, but as a guideline we can generalize (although it is always necessary to analyze problem by problem) that a small \(\lambda\) should correspond to a larger number of trees \(B\) for good performance.
The number of splits \(d\): This parameter controls the complexity of the single tree to be ensembled. Often \(d = 1\) works well as we create trees with only one split (called stumps), but it might be necessary to increase this parameter in some cases where the response variable is complex.

library(rpart)

rm(list=ls())
set.seed(123)
n = 100
x=sort(runif(n)*2*pi)
y=sin(x)+rnorm(n, sd=0.5)
plot(y~x)
curve(sin(x), min(x),max(x), col="red", add=T)

Below is an application that shows how the three parameters influence the development of the algorithm (this application was made with the free version of shinyapp, so I apologize for any inconvenience; I also provide the .Rmd code to facilitate viewing: code.Rmd)

EXAMPLE

rm(list=ls())
# data import
als <- read.table("http://web.stanford.edu/~hastie/CASI_files/DATA/ALS.txt",header=TRUE)
# split into training and test
train = als[als$testset==F,-1]
n = nrow(train)
test = als[als$testset==T,-1]
m = nrow(test)

We use the gbm() function to fit the Boosting model on the ALS data by setting:

\(B = 500\)
\(\lambda = 0.02\)
\(d = 4\)

# library gbm
  library(gbm)
set.seed(123)

# tuning parameters
d = 4
nu = 0.02
B = 500
K = 10

# boosted regression trees fit
fit <- gbm(dFRS ~ ., 
              distribution = "gaussian",
              data = train,
              n.trees = B,
              interaction.depth = d,
              shrinkage = nu,
              bag.fraction = 1, 
              cv.folds= K)

print(fit)

## gbm(formula = dFRS ~ ., distribution = "gaussian", data = train, 
  ##     n.trees = B, interaction.depth = d, shrinkage = nu, bag.fraction = 1, 
  ##     cv.folds = K)
  ## A gradient boosted model with gaussian loss function.
  ## 500 iterations were performed.
  ## The best cross-validation iteration was 292.
  ## There were 369 predictors of which 136 had non-zero influence.

# predictions at each step b
  yhats = sapply(1:B, function(b) predict(fit, newdata = test, n.trees = b))

# test MSE at each step b
MSEs = apply(yhats, 2, function(yhat) mean( (test$dFRS - yhat)^2))
plot(1:B, MSEs, xlab="Number of trees", ylab="test MSE", type="l")

which.min(MSEs)

## [1] 188

# bstop
bstop = gbm.perf(fit, method="cv")

##                                           var   rel.inf
## Onset.Delta                       Onset.Delta 30.443296
## alsfrs.score.slope         alsfrs.score.slope  6.051237
## mean.slope.svc.liters   mean.slope.svc.liters  3.774028
## mean.slope.weight           mean.slope.weight  3.098220
## last.slope.weight           last.slope.weight  3.003234
## last.alsfrs.score           last.alsfrs.score  2.939656
## last.slope.bp.systolic last.slope.bp.systolic  2.691271
## mean.slope.fvc.liters   mean.slope.fvc.liters  2.058366
## mean.speech                       mean.speech  2.033955
## max.dressing                     max.dressing  2.012729

# fitted functions
plot(fit, i.var="Onset.Delta", n.trees=bstop)

plot(fit, i.var="last.slope.weight", n.trees=bstop)

Adaboost

The AdaBoost.M1 algorithm (Freund and Schapire, 1997) was developed for binary classification problems \(Y \in \{-1,1\}\) and uses classification trees as weak learners.

Boosting was originally intended to increase the performance of weak learners in binary classification problems.
The idea was to train the classifier sequentially and at each step modify the training set to give more weight to poorly classified values.
The final classifier is calculated using a weighted average based on the classifiers' votes.

1. Initialize the observation weights as \(w_i=1/n, i=1,\ldots,n\)

Create a For loop for \(b=1,\ldots,B\) and repeat:
- Construct classification trees \(\hat{C}^{b}\) on the training set using weights \(w_i\) (as I would in a parametric bootstrap)
- Calculate the new weights using the errors made in the previous classification \(\hat{C}^{b}\) \[\mathrm{Err}^{b} = \frac{\sum_{i=1}^{n}w_i I\{ y_i \neq \hat{C}^{b}(x_i) \}}{\sum_{i=1}^{n}w_i}\]

Calculate \[\alpha^{b} = \log[ (1- \mathrm{Err}^{b})/\mathrm{Err}^{b}]\]
- Update the weights: \[w_i \leftarrow w_i \cdot \exp(\alpha^{b} \cdot I\{y_i \neq \hat{C}^{b}(x_i) \} ), \quad i=1,\ldots,n\]

The algorithm outputs are \(\hat{C}^{B}(x) = \mathrm{sign}( \sum_{b=1}^B \alpha^{b} \hat{C}^{b}(x) )\)

Gradient Boosting algorithm (gbm)

We start by setting \(\hat{f}^0(x)=0\), and we set \(B\) and the shrinkage parameter \(\lambda>0\) as we did in the previous algorithms
Create a For loop for \(b=1,\ldots,B\) and repeat the following steps:
1. Calcoliamo il gradiente negativo della funzione di perdita (pointwise negative gradient of the loss function) per ogni punti fittati \[r_i = -\frac{\partial L(y_i,f_i)}{\partial f_i}\left|_{f_i=\hat{f}^{b-1}(x_i)}\right., \quad i=1,\ldots,n\]
2. We approximate these values through a tree \(g\) with a depth \(d\): \[(x_1,r_1),\ldots,(x_n,r_n)\rightarrow \hat{g}(x)\]
3. We update the data from the previous step of the loop \[\hat{f}^b(x) = \hat{f}^{b-1}(x) + \lambda \hat{g}(x)\]
The output at the end of the loop is \(\hat{f}^b(x), b=1,\ldots,B\)

Loss functions

Il pacchetto gbm implementa due tipologie di funzioni di perdità \(L\) (which are in fact the most commonly used ones):
- Squared-error: \(L(y,f(x)) = \frac{1}{2} (y - f(x))^2\) with a negative gradient equal to \[- \frac{\partial}{\partial f(x)} \frac{1}{2}[y - f(x)]^2 = y - f(x)\]
- Laplace \(L(y,f(x)) = |y - f(x)|\) with a negative gradient equal to \[- \frac{\partial}{\partial f(x)} \,\, | y - f(x)| = \mathrm{sign}(y - f(x))\]

It can be noted that the Laplace loss function is not differentiable when \(y=f(x)\); the negative gradient values are set equal to 0.

This differs from the Adaboost algorithm for the classification problem with \(\tilde{y}\in \{-1,1\}\) which uses an exponential loss for the logistic regression model (exponential loss) \(L(\tilde{y},f(x)) = \exp(-\tilde{y} f(x))\) with a negative gradient equal to \[- \frac{\partial}{\partial f(x)} \,\, \exp[-\tilde{y} f(x) ] = \tilde{y} \exp[-\tilde{y} f(x) ]\]

Python in Practice

Below we demonstrate boosting methods comparing weak classifiers to boosted ensembles.

1. Weak Learner vs Boosted Ensemble

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=10, n_informative=6, random_state=42)

# Decision stump (weak learner)
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump_score = cross_val_score(stump, X, y, cv=5).mean()

# AdaBoost with 100 stumps
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_score = cross_val_score(ada, X, y, cv=5).mean()

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_score = cross_val_score(gb, X, y, cv=5).mean()

print(f"Stump (weak learner):  {stump_score:.3f}")
print(f"AdaBoost (100 stumps): {ada_score:.3f}")
print(f"Gradient Boosting:     {gb_score:.3f}")

# Output:
# Stump (weak learner):  0.718
# AdaBoost (100 stumps): 0.876
# Gradient Boosting:     0.922

2. Effect of Learning Rate

for lr in [0.01, 0.05, 0.1, 0.3, 1.0]:
    gb_temp = GradientBoostingClassifier(n_estimators=100, learning_rate=lr, max_depth=3, random_state=42)
    score = cross_val_score(gb_temp, X, y, cv=5).mean()
    print(f"  learning_rate={lr:.2f} -> Accuracy: {score:.3f}")

# Output:
#   learning_rate=0.01 -> Accuracy: 0.864
#   learning_rate=0.05 -> Accuracy: 0.908
#   learning_rate=0.10 -> Accuracy: 0.922
#   learning_rate=0.30 -> Accuracy: 0.914
#   learning_rate=1.00 -> Accuracy: 0.894

3. Feature Importance

gb.fit(X, y)
importances = gb.feature_importances_
for i in np.argsort(importances)[::-1][:5]:
    print(f"  Feature {i}: {importances[i]:.4f}")

# Output:
#   Feature 0: 0.2145
#   Feature 3: 0.1876
#   Feature 1: 0.1654
#   Feature 4: 0.1423
#   Feature 2: 0.1012

Boosting

In-depth Articles

Boosting

L2-Boosting algorithm for classification trees

EXAMPLE

Adaboost

Gradient Boosting algorithm (gbm)

Python in Practice

1. Weak Learner vs Boosted Ensemble

2. Effect of Learning Rate

3. Feature Importance

Results