Boosting


In-depth Articles

Boosting

Boosting is a general method for creating a complex prediction model that uses simple construction components called weak learners.
This algorithm starts by fitting a weak learner on the training data.
Subsequently, the weak learner is refitted, but giving more weight (importance) to observations that were poorly fitted/unclassified.
This process is repeated until a stopping rule is reached.

In this chapter, we will analyze three types of algorithms: * Boosting with squared error loss (L2-Boosting) uses regression trees as weak learners


L2-Boosting algorithm for classification trees

L2-Boosting builds a model with regression trees by repeatedly fitting a regression tree to the residuals. This is different from Random Forest because each tree is trying to correct the errors made by the set of previously grown trees


  1. Initialize \(\hat{f}(x)= \bar{y}\) and \(r_i = y_i - \bar{y}\) for \(i=1,\ldots,n\)

  2. We perform a For loop for \(b=1,2,\ldots, B\) and repeat:

    • We fit a tree \(\hat{f}^b\) with \(d\) splits on data \((x_1,r_1),\ldots, (x_n,r_n)\)
    • We update \(\hat{f}\) by adding a version of the new tree: \[\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\]
    • We update the residuals: \[r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\]
  3. The model output is: \[\hat{f}(x) = \bar{y}+ \sum_{b=1}^{B} \lambda \hat{f}^b(x)\]

Tuning parameters for Boosting:

library(rpart)

rm(list=ls())
set.seed(123)
n = 100
x=sort(runif(n)*2*pi)
y=sin(x)+rnorm(n, sd=0.5)
plot(y~x)
curve(sin(x), min(x),max(x), col="red", add=T)

Below is an application that shows how the three parameters influence the development of the algorithm (this application was made with the free version of shinyapp, so I apologize for any inconvenience; I also provide the .Rmd code to facilitate viewing: code.Rmd)

EXAMPLE

rm(list=ls())
# data import
als <- read.table("http://web.stanford.edu/~hastie/CASI_files/DATA/ALS.txt",header=TRUE)
# split into training and test
train = als[als$testset==F,-1]
n = nrow(train)
test = als[als$testset==T,-1]
m = nrow(test)

We use the gbm() function to fit the Boosting model on the ALS data by setting:

# library gbm
  library(gbm)
set.seed(123)

# tuning parameters
d = 4
nu = 0.02
B = 500
K = 10

# boosted regression trees fit
fit <- gbm(dFRS ~ ., 
              distribution = "gaussian",
              data = train,
              n.trees = B,
              interaction.depth = d,
              shrinkage = nu,
              bag.fraction = 1, 
              cv.folds= K)

print(fit)
## gbm(formula = dFRS ~ ., distribution = "gaussian", data = train, 
  ##     n.trees = B, interaction.depth = d, shrinkage = nu, bag.fraction = 1, 
  ##     cv.folds = K)
  ## A gradient boosted model with gaussian loss function.
  ## 500 iterations were performed.
  ## The best cross-validation iteration was 292.
  ## There were 369 predictors of which 136 had non-zero influence.
# predictions at each step b
  yhats = sapply(1:B, function(b) predict(fit, newdata = test, n.trees = b))

# test MSE at each step b
MSEs = apply(yhats, 2, function(yhat) mean( (test$dFRS - yhat)^2))
plot(1:B, MSEs, xlab="Number of trees", ylab="test MSE", type="l")

which.min(MSEs)
## [1] 188
# bstop
bstop = gbm.perf(fit, method="cv")

##                                           var   rel.inf
## Onset.Delta                       Onset.Delta 30.443296
## alsfrs.score.slope         alsfrs.score.slope  6.051237
## mean.slope.svc.liters   mean.slope.svc.liters  3.774028
## mean.slope.weight           mean.slope.weight  3.098220
## last.slope.weight           last.slope.weight  3.003234
## last.alsfrs.score           last.alsfrs.score  2.939656
## last.slope.bp.systolic last.slope.bp.systolic  2.691271
## mean.slope.fvc.liters   mean.slope.fvc.liters  2.058366
## mean.speech                       mean.speech  2.033955
## max.dressing                     max.dressing  2.012729
# fitted functions
plot(fit, i.var="Onset.Delta", n.trees=bstop)

plot(fit, i.var="last.slope.weight", n.trees=bstop)

Adaboost

The AdaBoost.M1 algorithm (Freund and Schapire, 1997) was developed for binary classification problems \(Y \in \{-1,1\}\) e uses classification trees as weak learners.



1. Initialize i pesi delle osservazioni pari a \(w_i=1/n, i=1,\ldots,n\)

  1. Create a For loop for \(b=1,\ldots,B\) and repeat:

    • Construct classification trees \(\hat{C}^{b}\) on the training set using weights \(w_i\) (as I would in a parametric bootstrap)
    • Calculate the new weights using the errors made in the previous classification \(\hat{C}^{b}\) \[\mathrm{Err}^{b} = \frac{\sum_{i=1}^{n}w_i I\{ y_i \neq \hat{C}^{b}(x_i) \}}{\sum_{i=1}^{n}w_i}\]
  1. The algorithm outputs are \(\hat{C}^{B}(x) = \mathrm{sign}( \sum_{b=1}^B \alpha^{b} \hat{C}^{b}(x) )\)

Gradient Boosting algorithm (gbm)

  1. We start by setting \(\hat{f}^0(x)=0\), and we set \(B\) and the shrinkage parameter \(\lambda>0\) as we did in the previous algorithms

  2. Create a For loop for \(b=1,\ldots,B\) and repeat the following steps:

    1. Calcoliamo il gradiente negativo della funzione di perdita (pointwise negative gradient of the loss function) per ogni punti fittati \[r_i = -\frac{\partial L(y_i,f_i)}{\partial f_i}\left|_{f_i=\hat{f}^{b-1}(x_i)}\right., \quad i=1,\ldots,n\]

    2. Approssimiamo qeusti valori attraverso un albero \(g\) con una profondità \(d\): \[(x_1,r_1),\ldots,(x_n,r_n)\rightarrow \hat{g}(x)\]

    3. We update i dati dello step precedente del ciclo \[\hat{f}^b(x) = \hat{f}^{b-1}(x) + \lambda \hat{g}(x)\]

  3. L’output alla fine dei ciclo è \(\hat{f}^b(x), b=1,\ldots,B\)

Loss functions

Si può notare che la funzione di perdità di Laplace non è differenziabile quando \(y=f(x)\); i valori negtivi del gradiente vengono posti uguali a 0.