Boosting is a general method for creating a complex prediction model that uses simple construction components called weak learners.
This algorithm starts by fitting a weak learner on the training data.
Subsequently, the weak learner is refitted, but giving more weight (importance) to observations that were poorly fitted/unclassified.
This process is repeated until a stopping rule is reached.
In this chapter, we will analyze three types of algorithms: * Boosting with squared error loss (L2-Boosting) uses regression trees as weak learners
Boosting with exponential loss (AdaBoost.M1) uses classification trees as weak learners
gradient Boosting (General Algorithm)
L2-Boosting builds a model with regression trees by repeatedly fitting a regression tree to the residuals. This is different from Random Forest because each tree is trying to correct the errors made by the set of previously grown trees
Initialize \(\hat{f}(x)= \bar{y}\) and \(r_i = y_i - \bar{y}\) for \(i=1,\ldots,n\)
We perform a For loop for \(b=1,2,\ldots, B\) and repeat:
The model output is: \[\hat{f}(x) = \bar{y}+ \sum_{b=1}^{B} \lambda \hat{f}^b(x)\]
Tuning parameters for Boosting:
The number of trees \(B\): Unlike bagging and random forest, Boosting can overfit if \(B\) is too high, although this happens very slowly as \(B\) cresce, ma conviene sempre utilizzare la cross-validation per individuare questo parametro.
The shrinkage parameter \(\lambda\): This parameter is positive and close to zero and controls the rate at which the algorithm learns. Common values are 0.01 or 0.001; the choice depends on the problem, but as a guideline we can generalize (although it is always necessary to analyze problem by problem) that a small \(\lambda\) should correspond to a larger number of trees \(B\) for good performance.
The number of splits \(d\): This parameter controls the complexity of the single tree to be ensembled. Often \(d = 1\) works well as we create trees with only one split (called stumps), but it might be necessary to increase this parameter in some cases where the response variable is complex.
library(rpart)
rm(list=ls())
set.seed(123)
n = 100
x=sort(runif(n)*2*pi)
y=sin(x)+rnorm(n, sd=0.5)
plot(y~x)
curve(sin(x), min(x),max(x), col="red", add=T)
Below is an application that shows how the three parameters influence the development of the algorithm (this application was made with the free version of shinyapp, so I apologize for any inconvenience; I also provide the .Rmd code to facilitate viewing: code.Rmd)
rm(list=ls())
# data import
als <- read.table("http://web.stanford.edu/~hastie/CASI_files/DATA/ALS.txt",header=TRUE)
# split into training and test
train = als[als$testset==F,-1]
n = nrow(train)
test = als[als$testset==T,-1]
m = nrow(test)
We use the gbm() function to fit the Boosting model on the ALS data by setting:
# library gbm
library(gbm)
set.seed(123)
# tuning parameters
d = 4
nu = 0.02
B = 500
K = 10
# boosted regression trees fit
fit <- gbm(dFRS ~ .,
distribution = "gaussian",
data = train,
n.trees = B,
interaction.depth = d,
shrinkage = nu,
bag.fraction = 1,
cv.folds= K)
print(fit)
## gbm(formula = dFRS ~ ., distribution = "gaussian", data = train,
## n.trees = B, interaction.depth = d, shrinkage = nu, bag.fraction = 1,
## cv.folds = K)
## A gradient boosted model with gaussian loss function.
## 500 iterations were performed.
## The best cross-validation iteration was 292.
## There were 369 predictors of which 136 had non-zero influence.
# predictions at each step b
yhats = sapply(1:B, function(b) predict(fit, newdata = test, n.trees = b))
# test MSE at each step b
MSEs = apply(yhats, 2, function(yhat) mean( (test$dFRS - yhat)^2))
plot(1:B, MSEs, xlab="Number of trees", ylab="test MSE", type="l")
which.min(MSEs)
## [1] 188
# bstop
bstop = gbm.perf(fit, method="cv")
## var rel.inf
## Onset.Delta Onset.Delta 30.443296
## alsfrs.score.slope alsfrs.score.slope 6.051237
## mean.slope.svc.liters mean.slope.svc.liters 3.774028
## mean.slope.weight mean.slope.weight 3.098220
## last.slope.weight last.slope.weight 3.003234
## last.alsfrs.score last.alsfrs.score 2.939656
## last.slope.bp.systolic last.slope.bp.systolic 2.691271
## mean.slope.fvc.liters mean.slope.fvc.liters 2.058366
## mean.speech mean.speech 2.033955
## max.dressing max.dressing 2.012729
# fitted functions
plot(fit, i.var="Onset.Delta", n.trees=bstop)
plot(fit, i.var="last.slope.weight", n.trees=bstop)
The AdaBoost.M1 algorithm (Freund and Schapire, 1997) was developed for binary classification problems \(Y \in \{-1,1\}\) e uses classification trees as weak learners.
Boosting was originally intended to increase the performance of weak learners in binary classification problems.
The idea was to train the classifier sequentially and at each step modify the training set to give more weight to poorly classified values.
The final classifier is calculated using a weighted average based on the classifiers' votes.
1. Initialize i pesi delle osservazioni pari a \(w_i=1/n, i=1,\ldots,n\)
Create a For loop for \(b=1,\ldots,B\) and repeat:
Calculate \[\alpha^{b} = \log[ (1- \mathrm{Err}^{b})/\mathrm{Err}^{b}]\]
We start by setting \(\hat{f}^0(x)=0\), and we set \(B\) and the shrinkage parameter \(\lambda>0\) as we did in the previous algorithms
Create a For loop for \(b=1,\ldots,B\) and repeat the following steps:
Calcoliamo il gradiente negativo della funzione di perdita (pointwise negative gradient of the loss function) per ogni punti fittati \[r_i = -\frac{\partial L(y_i,f_i)}{\partial f_i}\left|_{f_i=\hat{f}^{b-1}(x_i)}\right., \quad i=1,\ldots,n\]
Approssimiamo qeusti valori attraverso un albero \(g\) con una profondità \(d\): \[(x_1,r_1),\ldots,(x_n,r_n)\rightarrow \hat{g}(x)\]
We update i dati dello step precedente del ciclo \[\hat{f}^b(x) = \hat{f}^{b-1}(x) + \lambda \hat{g}(x)\]
L’output alla fine dei ciclo è \(\hat{f}^b(x), b=1,\ldots,B\)
Loss functions
gbm implementa due tipologie di funzioni di perdità \(L\) (che poi di fatto sono quelle più usate):
Si può notare che la funzione di perdità di Laplace non è differenziabile quando \(y=f(x)\); i valori negtivi del gradiente vengono posti uguali a 0.