Title: Generalized Boosted Regression Models
Description: An implementation of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart). Originally developed by Greg Ridgeway. A newer version is available at github.com/gbm-developers/gbm3.
Authors: Greg Ridgeway [aut, cre], Daniel Edwards [ctb], Brian Kriegler [ctb], Stefan Schroedl [ctb], Harry Southworth [ctb], Brandon Greenwell [ctb], Bradley Boehmke [ctb], Jay Cunningham [ctb], GBM Developers [aut] (https://github.com/gbm-developers)
Maintainer: Greg Ridgeway <[email protected]>
License: GPL (>= 2) | file LICENSE
Version: 2.2.2
Built: 2024-11-12 05:12:16 UTC
Source: https://github.com/gbm-developers/gbm
This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. It includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, multinomial, t-distribution, AdaBoost exponential loss, Learning to Rank, and Huberized hinge loss. The gbm package is no longer under active development; consider https://github.com/gbm-developers/gbm3 for the latest version.
Further information is available in the package vignettes:
browseVignettes(package = "gbm")
Greg Ridgeway [email protected] with contributions by Daniel Edwards, Brian Kriegler, Stefan Schroedl, Harry Southworth, and Brandon Greenwell
Y. Freund and R.E. Schapire (1997) “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1):119-139.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
J.H. Friedman, T. Hastie, R. Tibshirani (2000). “Additive Logistic Regression: a Statistical View of Boosting,” Annals of Statistics 28(2):337-374.
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
The MART website.
Computes the Breslow estimator of the baseline hazard function for a proportional hazard regression model.
basehaz.gbm(t, delta, f.x, t.eval = NULL, smooth = FALSE, cumulative = TRUE)
t: The survival times.
delta: The censoring indicator.
f.x: The predicted values of the regression model on the log hazard scale.
t.eval: Values at which the baseline hazard will be evaluated.
smooth: If TRUE, basehaz.gbm smooths the estimated baseline hazard.
cumulative: If TRUE, the cumulative hazard is computed rather than the hazard itself.
The proportional hazards model assumes h(t|x) = lambda(t)*exp(f(x)). gbm can estimate the f(x) component via partial likelihood. After estimating f(x), basehaz.gbm can compute a nonparametric estimate of lambda(t).
A vector of length equal to the length of t (or of length t.eval if t.eval is not NULL) containing the baseline hazard evaluated at t (or at t.eval if t.eval is not NULL). If cumulative is set to TRUE, then the returned vector evaluates the cumulative hazard function at those values.
Greg Ridgeway [email protected]
N. Breslow (1972). "Discussion of 'Regression Models and Life-Tables' by D.R. Cox," Journal of the Royal Statistical Society, Series B, 34(2):216-217.
N. Breslow (1974). "Covariance analysis of censored survival data," Biometrics 30:89-99.
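The following usage sketch is not part of the original documentation. It assumes the survival package's lung data and an arbitrary choice of predictors, purely to illustrate how basehaz.gbm is typically combined with a coxph gbm fit.

# Hedged illustration (assumes the 'survival' package and its lung data)
library(survival)
lung2 <- lung[complete.cases(lung[, c("time", "status", "age", "sex", "ph.ecog")]), ]

# Estimate f(x) by boosting the Cox partial likelihood
fit <- gbm(Surv(time, status == 2) ~ age + sex + ph.ecog, data = lung2,
           distribution = "coxph", n.trees = 500, shrinkage = 0.01,
           interaction.depth = 2, n.cores = 1)

# Predictions are on the log hazard scale, as required by f.x
f.x <- predict(fit, newdata = lung2, n.trees = 500)

# Breslow estimate of the cumulative baseline hazard at the observed times
H0 <- basehaz.gbm(t = lung2$time, delta = as.numeric(lung2$status == 2),
                  f.x = f.x, t.eval = sort(unique(lung2$time)),
                  cumulative = TRUE)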
An experimental diagnostic tool that plots the fitted values versus the actual average values. Currently only available when distribution = "bernoulli".
calibrate.plot(y, p, distribution = "bernoulli", replace = TRUE,
               line.par = list(col = "black"), shade.col = "lightyellow",
               shade.density = NULL, rug.par = list(side = 1),
               xlab = "Predicted value", ylab = "Observed average",
               xlim = NULL, ylim = NULL, knots = NULL, df = 6, ...)
y: The outcome 0-1 variable.
p: The predictions estimating E(y|x).
distribution: The loss function used in creating p; currently only "bernoulli" is supported.
replace: Determines whether this plot will replace or overlay the current plot.
line.par: Graphics parameters for the line.
shade.col: Color for shading the 2 SE region.
shade.density: The density parameter used when shading the 2 SE region.
rug.par: Graphics parameters passed to rug().
xlab: x-axis label corresponding to the predicted values.
ylab: y-axis label corresponding to the observed average.
xlim, ylim: x- and y-axis limits. If not specified, the function selects the limits.
knots, df: These parameters are passed directly to the natural spline smoother used to estimate the calibration curve.
...: Additional optional arguments passed on to the underlying plotting functions.
Uses natural splines to estimate E(y|p). Well-calibrated predictions imply that E(y|p) = p. The plot also includes a pointwise 95% confidence band.
No return values.
Greg Ridgeway [email protected]
J.F. Yates (1982). "External correspondence: decomposition of the mean probability score," Organisational Behaviour and Human Performance 30:132-156.
D.J. Spiegelhalter (1986). "Probabilistic Prediction in Patient Management and Clinical Trials," Statistics in Medicine 5:421-433.
# Don't want R CMD check to think there is a dependency on rpart
# so comment out the example
#library(rpart)
#data(kyphosis)
#y <- as.numeric(kyphosis$Kyphosis) - 1
#x <- kyphosis$Age
#glm1 <- glm(y ~ poly(x, 2), family = binomial)
#p <- predict(glm1, type = "response")
#calibrate.plot(y, p, xlim = c(0, 0.6), ylim = c(0, 0.6))
Fits generalized boosted regression models. For technical details, see the vignette: utils::browseVignettes("gbm").
gbm(formula = formula(data), distribution = "bernoulli", data = list(),
    weights, var.monotone = NULL, n.trees = 100, interaction.depth = 1,
    n.minobsinnode = 10, shrinkage = 0.1, bag.fraction = 0.5,
    train.fraction = 1, cv.folds = 0, keep.data = TRUE, verbose = FALSE,
    class.stratify.cv = NULL, n.cores = NULL)
formula: A symbolic description of the model to be fit. The formula may include an offset term (e.g., y ~ offset(n) + x). If keep.data = FALSE in the initial call to gbm then it is the user's responsibility to resupply the offset to gbm.more.
distribution: Either a character string specifying the name of the distribution to use or a list with a component name specifying the distribution and any additional parameters needed. Currently available options are "gaussian" (squared error), "laplace" (absolute loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (huberized hinge loss for 0-1 outcomes), "adaboost" (the AdaBoost exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right censored observations), "quantile", "multinomial" (classification when there are more than two classes), and "pairwise" (ranking measures using the LambdaMart algorithm). If quantile regression is specified, distribution must be a list of the form list(name = "quantile", alpha = 0.25), where alpha is the quantile to estimate. If "tdist" is specified, the default degrees of freedom is 4, and this can be controlled via list(name = "tdist", df = DF). If "pairwise" regression is specified, distribution must be a list of the form list(name = "pairwise", group = ..., metric = ..., max.rank = ...), where group is a character vector with the column names of data that jointly indicate the group an instance belongs to, metric is the ranking measure to optimize ("conc", "mrr", "map", or "ndcg"), and max.rank restricts the "ndcg" and "mrr" measures to the top max.rank ranks. Note that splitting of instances into training and validation sets follows group boundaries and therefore only approximates the specified train.fraction ratio (the same applies to cross-validation folds). Weights can be used in conjunction with pairwise metrics; however, it is assumed that they are constant for instances from the same group. For details and background on the algorithm, see e.g. Burges (2010).
data: an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which gbm is called. If keep.data = TRUE in the initial call to gbm then gbm stores a copy with the object.
weights: an optional vector of weights to be used in the fitting process. Must be positive but do not need to be normalized. If keep.data = FALSE in the initial call to gbm then it is the user's responsibility to resupply the weights to gbm.more.
var.monotone: an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.
n.trees: Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. Default is 100.
interaction.depth: Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1.
n.minobsinnode: Integer specifying the minimum number of observations in the terminal nodes of the trees. Note that this is the actual number of observations, not the total weight.
shrinkage: a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.1.
bag.fraction: the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits; use set.seed to ensure the model can be reconstructed. Default is 0.5.
train.fraction: The first train.fraction * nrows(data) observations are used to fit the gbm and the remainder are used for computing out-of-sample estimates of the loss function.
cv.folds: Number of cross-validation folds to perform. If cv.folds > 1 then gbm, in addition to the usual fit, will perform a cross-validation and calculate an estimate of generalization error returned in cv.error.
keep.data: a logical variable indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster at the cost of storing an extra copy of the dataset.
verbose: Logical indicating whether or not to print out progress and performance indicators (TRUE). Default is FALSE.
class.stratify.cv: Logical indicating whether or not the cross-validation should be stratified by class. This is only implemented for the "multinomial" and "bernoulli" distributions; the purpose of stratifying the cross-validation is to help avoid situations in which training sets do not contain all classes.
n.cores: The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds off to different cores. If n.cores is not specified, gbm will guess how many cores are available and use that.
gbm.fit provides the link between R and the C++ gbm engine. gbm is a front-end to gbm.fit that uses the familiar R modeling formulas. However, model.frame is very slow if there are many predictor variables. For power users with many variables, use gbm.fit. For general practice, gbm is preferable.
This package implements the generalized boosted modeling framework. Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001).
In addition to many of the features documented in the Gradient Boosting Machine, gbm offers additional features including the out-of-bag estimator for the optimal number of iterations, the ability to store and manipulate the resulting gbm object, and a variety of other loss functions that had not previously had associated boosting algorithms, including the Cox partial likelihood for censored data, the Poisson likelihood for count outcomes, and a gradient boosting implementation to minimize the AdaBoost exponential loss function. This gbm package is no longer under active development; consider https://github.com/gbm-developers/gbm3 for the latest version.
A gbm.object object.
Greg Ridgeway [email protected]
Quantile regression code developed by Brian Kriegler [email protected]
t-distribution, and multinomial code developed by Harry Southworth and Daniel Edwards
Pairwise code developed by Stefan Schroedl [email protected]
Y. Freund and R.E. Schapire (1997) “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1):119-139.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
J.H. Friedman, T. Hastie, R. Tibshirani (2000). “Additive Logistic Regression: a Statistical View of Boosting,” Annals of Statistics 28(2):337-374.
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
B. Kriegler (2007). Cost-Sensitive Stochastic Gradient Boosting Within a Quantitative Regression Framework. Ph.D. Dissertation. University of California at Los Angeles, Los Angeles, CA, USA. Advisor(s) Richard A. Berk. https://dl.acm.org/doi/book/10.5555/1354603.
C. Burges (2010). “From RankNet to LambdaRank to LambdaMART: An Overview,” Microsoft Research Technical Report MSR-TR-2010-82.
gbm.object, gbm.perf, plot.gbm, predict.gbm, summary.gbm, and pretty.gbm.tree.
#
# A least squares regression example
#

# Simulate data
set.seed(101)  # for reproducibility
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3 * runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
SNR <- 10  # signal-to-noise ratio
Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu
sigma <- sqrt(var(Y) / SNR)
Y <- Y + rnorm(N, 0, sigma)
X1[sample(1:N, size = 500)] <- NA  # introduce some missing values
X4[sample(1:N, size = 300)] <- NA  # introduce some missing values
data <- data.frame(Y, X1, X2, X3, X4, X5, X6)

# Fit a GBM
set.seed(102)  # for reproducibility
gbm1 <- gbm(Y ~ ., data = data, var.monotone = c(0, 0, 0, 0, 0, 0),
            distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
            interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
            n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
            verbose = FALSE, n.cores = 1)

# Check performance using the out-of-bag (OOB) error; the OOB error typically
# underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1, method = "OOB")
print(best.iter)

# Check performance using the 50% heldout test set
best.iter <- gbm.perf(gbm1, method = "test")
print(best.iter)

# Check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1, method = "cv")
print(best.iter)

# Plot relative influence of each variable
par(mfrow = c(1, 2))
summary(gbm1, n.trees = 1)          # using first tree
summary(gbm1, n.trees = best.iter)  # using estimated best number of trees

# Compactly print the first and last trees for curiosity
print(pretty.gbm.tree(gbm1, i.tree = 1))
print(pretty.gbm.tree(gbm1, i.tree = gbm1$n.trees))

# Simulate new data
set.seed(103)  # for reproducibility
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE))
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3 * runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu + rnorm(N, 0, sigma)
data2 <- data.frame(Y, X1, X2, X3, X4, X5, X6)

# Predict on the new data using the "best" number of trees; by default,
# predictions will be on the link scale
Yhat <- predict(gbm1, newdata = data2, n.trees = best.iter, type = "link")

# least squares error
print(sum((data2$Y - Yhat)^2))

# Construct univariate partial dependence plots
plot(gbm1, i.var = 1, n.trees = best.iter)
plot(gbm1, i.var = 2, n.trees = best.iter)
plot(gbm1, i.var = "X3", n.trees = best.iter)  # can use index or name

# Construct bivariate partial dependence plots
plot(gbm1, i.var = 1:2, n.trees = best.iter)
plot(gbm1, i.var = c("X2", "X3"), n.trees = best.iter)
plot(gbm1, i.var = 3:4, n.trees = best.iter)

# Construct trivariate partial dependence plots
plot(gbm1, i.var = c(1, 2, 6), n.trees = best.iter,
     continuous.resolution = 20)
plot(gbm1, i.var = 1:3, n.trees = best.iter)
plot(gbm1, i.var = 2:4, n.trees = best.iter)
plot(gbm1, i.var = 3:5, n.trees = best.iter)

# Add more (i.e., 100) boosting iterations to the ensemble
gbm2 <- gbm.more(gbm1, n.new.trees = 100, verbose = FALSE)
Workhorse function providing the link between R and the C++ gbm engine. gbm is a front-end to gbm.fit that uses the familiar R modeling formulas. However, model.frame is very slow if there are many predictor variables. For power users with many variables, use gbm.fit. For general practice, gbm is preferable.
gbm.fit(x, y, offset = NULL, misc = NULL, distribution = "bernoulli",
        w = NULL, var.monotone = NULL, n.trees = 100, interaction.depth = 1,
        n.minobsinnode = 10, shrinkage = 0.001, bag.fraction = 0.5,
        nTrain = NULL, train.fraction = NULL, keep.data = TRUE,
        verbose = TRUE, var.names = NULL, response.name = "y", group = NULL)
x: A data frame or matrix containing the predictor variables. The number of rows in x must be the same as the length of y.
y: A vector of outcomes. The number of rows in x must be the same as the length of y.
offset: A vector of offset values.
misc: An R object that is simply passed on to the gbm engine. It can be used for additional data for the specific distribution. Currently it is only used for passing the censoring indicator for the Cox proportional hazards model.
distribution: Either a character string specifying the name of the distribution to use or a list with a component name specifying the distribution and any additional parameters needed. See the distribution argument of gbm for the available options and the list forms required for "quantile", "tdist", and "pairwise" (including the group, metric, and max.rank components). As with gbm, splitting of instances into training and validation sets follows group boundaries and therefore only approximates the specified training ratio (the same applies to cross-validation folds), and weights used with pairwise metrics are assumed to be constant for instances from the same group. For details and background on the algorithm, see e.g. Burges (2010).
w: A vector of weights of the same length as y.
var.monotone: an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.
n.trees: the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.
interaction.depth: The maximum depth of variable interactions. A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1.
n.minobsinnode: Integer specifying the minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight.
shrinkage: The shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.001.
bag.fraction: The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits. Default is 0.5.
nTrain: An integer representing the number of cases on which to train. This is the preferred way of specification for gbm.fit; the option train.fraction is deprecated and only maintained for backward compatibility. These two parameters are mutually exclusive. If both are unspecified, all data is used for training.
train.fraction: The first train.fraction * nrows(x) observations are used to fit the gbm and the remainder are used for computing out-of-sample estimates of the loss function.
keep.data: Logical indicating whether or not to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster at the cost of storing an extra copy of the dataset.
verbose: Logical indicating whether or not to print out progress and performance indicators (TRUE). Default is TRUE.
var.names: Vector of strings of length equal to the number of columns of x containing the names of the predictor variables.
response.name: Character string label for the response variable.
group: The group to use when distribution = "pairwise".
This package implements the generalized boosted modeling framework. Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001).
In addition to many of the features documented in the Gradient Boosting Machine, gbm offers additional features including the out-of-bag estimator for the optimal number of iterations, the ability to store and manipulate the resulting gbm object, and a variety of other loss functions that had not previously had associated boosting algorithms, including the Cox partial likelihood for censored data, the Poisson likelihood for count outcomes, and a gradient boosting implementation to minimize the AdaBoost exponential loss function.
A gbm.object object.
Greg Ridgeway [email protected]
Quantile regression code developed by Brian Kriegler [email protected]
t-distribution, and multinomial code developed by Harry Southworth and Daniel Edwards
Pairwise code developed by Stefan Schroedl [email protected]
Y. Freund and R.E. Schapire (1997) “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1):119-139.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
J.H. Friedman, T. Hastie, R. Tibshirani (2000). “Additive Logistic Regression: a Statistical View of Boosting,” Annals of Statistics 28(2):337-374.
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
B. Kriegler (2007). Cost-Sensitive Stochastic Gradient Boosting Within a Quantitative Regression Framework. Ph.D. Dissertation. University of California at Los Angeles, Los Angeles, CA, USA. Advisor(s) Richard A. Berk. https://dl.acm.org/doi/book/10.5555/1354603.
C. Burges (2010). “From RankNet to LambdaRank to LambdaMART: An Overview,” Microsoft Research Technical Report MSR-TR-2010-82.
gbm.object, gbm.perf, plot.gbm, predict.gbm, summary.gbm, and pretty.gbm.tree.
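A brief usage sketch of the matrix interface; it is not part of the original documentation, it reuses the simulated data frame created in the gbm example above, and the column choices and number of training cases are arbitrary.

# Hedged sketch: assumes 'data' from the gbm() example above
X <- data[, c("X1", "X2", "X3", "X4", "X5", "X6")]
fit <- gbm.fit(x = X, y = data$Y, distribution = "gaussian",
               n.trees = 100, interaction.depth = 3, shrinkage = 0.1,
               n.minobsinnode = 10, bag.fraction = 0.5, nTrain = 500,
               verbose = FALSE)

# The remaining 500 cases act as a validation set
best.iter <- gbm.perf(fit, method = "test", plot.it = FALSE)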
Adds additional trees to a gbm.object object.
gbm.more(object, n.new.trees = 100, data = NULL, weights = NULL,
         offset = NULL, verbose = NULL)
object: A gbm.object created from an initial call to gbm.
n.new.trees: Integer specifying the number of additional trees to add to object. Default is 100.
data: An optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which gbm was called. If keep.data = TRUE in the initial call to gbm then gbm stores a copy with the object; if keep.data = FALSE then it is the user's responsibility to resupply the same data to gbm.more.
weights: An optional vector of weights to be used in the fitting process. Must be positive but do not need to be normalized. If keep.data = FALSE in the initial call to gbm then it is the user's responsibility to resupply the weights to gbm.more.
offset: A vector of offset values.
verbose: Logical indicating whether or not to print out progress and performance indicators (TRUE). If this option is left unspecified for gbm.more, then it uses verbose from object.
A gbm.object object.
#
# A least squares regression example
#

# Simulate data
set.seed(101)  # for reproducibility
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3 * runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
SNR <- 10  # signal-to-noise ratio
Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu
sigma <- sqrt(var(Y) / SNR)
Y <- Y + rnorm(N, 0, sigma)
X1[sample(1:N, size = 500)] <- NA  # introduce some missing values
X4[sample(1:N, size = 300)] <- NA  # introduce some missing values
data <- data.frame(Y, X1, X2, X3, X4, X5, X6)

# Fit a GBM
set.seed(102)  # for reproducibility
gbm1 <- gbm(Y ~ ., data = data, var.monotone = c(0, 0, 0, 0, 0, 0),
            distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
            interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
            n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
            verbose = FALSE, n.cores = 1)

# Check performance using the out-of-bag (OOB) error; the OOB error typically
# underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1, method = "OOB")
print(best.iter)

# Check performance using the 50% heldout test set
best.iter <- gbm.perf(gbm1, method = "test")
print(best.iter)

# Check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1, method = "cv")
print(best.iter)

# Plot relative influence of each variable
par(mfrow = c(1, 2))
summary(gbm1, n.trees = 1)          # using first tree
summary(gbm1, n.trees = best.iter)  # using estimated best number of trees

# Compactly print the first and last trees for curiosity
print(pretty.gbm.tree(gbm1, i.tree = 1))
print(pretty.gbm.tree(gbm1, i.tree = gbm1$n.trees))

# Simulate new data
set.seed(103)  # for reproducibility
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE))
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3 * runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu + rnorm(N, 0, sigma)
data2 <- data.frame(Y, X1, X2, X3, X4, X5, X6)

# Predict on the new data using the "best" number of trees; by default,
# predictions will be on the link scale
Yhat <- predict(gbm1, newdata = data2, n.trees = best.iter, type = "link")

# least squares error
print(sum((data2$Y - Yhat)^2))

# Construct univariate partial dependence plots
plot(gbm1, i.var = 1, n.trees = best.iter)
plot(gbm1, i.var = 2, n.trees = best.iter)
plot(gbm1, i.var = "X3", n.trees = best.iter)  # can use index or name

# Construct bivariate partial dependence plots
plot(gbm1, i.var = 1:2, n.trees = best.iter)
plot(gbm1, i.var = c("X2", "X3"), n.trees = best.iter)
plot(gbm1, i.var = 3:4, n.trees = best.iter)

# Construct trivariate partial dependence plots
plot(gbm1, i.var = c(1, 2, 6), n.trees = best.iter,
     continuous.resolution = 20)
plot(gbm1, i.var = 1:3, n.trees = best.iter)
plot(gbm1, i.var = 2:4, n.trees = best.iter)
plot(gbm1, i.var = 3:5, n.trees = best.iter)

# Add more (i.e., 100) boosting iterations to the ensemble
gbm2 <- gbm.more(gbm1, n.new.trees = 100, verbose = FALSE)
These are objects representing fitted gbm models.
initF: The "intercept" term, the initial predicted value to which trees make adjustments.
fit: A vector containing the fitted values on the scale of the regression function (e.g., log-odds scale for bernoulli, log scale for poisson).
train.error: A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the training data.
valid.error: A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the validation data.
cv.error: If cv.folds < 2 this component is NULL. Otherwise, a vector of length equal to the number of fitted trees containing a cross-validated estimate of the loss function for each boosting iteration.
oobag.improve: A vector of length equal to the number of fitted trees containing an out-of-bag estimate of the marginal reduction in the expected value of the loss function. The out-of-bag estimate uses only the training data and is useful for estimating the optimal number of boosting iterations. See gbm.perf.
trees: A list containing the tree structures. The components are best viewed using pretty.gbm.tree.
c.splits: A list of all the categorical splits in the collection of trees. If a tree node describes a categorical split, its splitting value refers to a component of c.splits; that component is a vector with one element per level of the split variable, where -1 indicates the left node, +1 the right node, and 0 a level not present in the training data.
cv.fitted: If cross-validation was performed, the cross-validation predicted values on the scale of the linear predictor. That is, the fitted values from the i-th CV fold, for the model having been trained on the data in all other folds.
The components listed above must be included in a legitimate gbm object.
Greg Ridgeway [email protected]
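As a quick orientation (not part of the original documentation), the components above can be inspected directly from a fitted model; gbm1 here refers to the model fitted in the gbm example above.

# Hedged sketch: assumes gbm1 from the gbm() example above
gbm1$initF                 # the "intercept" / initial predicted value
head(gbm1$fit)             # fitted values on the link scale
head(gbm1$train.error)     # training loss at each boosting iteration
head(gbm1$valid.error)     # validation loss at each boosting iteration
length(gbm1$trees)         # one tree structure per iteration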
Estimates the optimal number of boosting iterations for a gbm object and optionally plots various performance measures.
gbm.perf(object, plot.it = TRUE, oobag.curve = FALSE, overlay = TRUE, method)
object: A gbm.object created from an initial call to gbm.
plot.it: An indicator of whether or not to plot the performance measures. Setting plot.it = TRUE plots the training error (and the validation error, if available) as a function of the iteration number.
oobag.curve: Indicates whether to plot the out-of-bag performance measures in a second plot.
overlay: If TRUE and oobag.curve = TRUE then a right y-axis is added to the training and test error plot and the estimated cumulative improvement in the loss function is plotted versus the iteration number.
method: Indicates the method used to estimate the optimal number of boosting iterations. method = "OOB" computes the out-of-bag estimate, method = "test" uses the test (or validation) dataset to compute an out-of-sample estimate, and method = "cv" extracts the optimal number of iterations using cross-validation if gbm was called with cv.folds > 1.
gbm.perf returns the estimated optimal number of iterations. The method of computation depends on the method argument.
Greg Ridgeway [email protected]
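A short sketch (not part of the original documentation) comparing the three estimates for a model fitted with both a held-out fraction and cross-validation, such as gbm1 from the gbm example above.

# Hedged sketch: assumes gbm1 from the gbm() example above
n.oob  <- gbm.perf(gbm1, method = "OOB",  plot.it = FALSE)
n.test <- gbm.perf(gbm1, method = "test", plot.it = FALSE)
n.cv   <- gbm.perf(gbm1, method = "cv",   plot.it = FALSE)
c(OOB = n.oob, test = n.test, cv = n.cv)  # the OOB estimate is typically the smallest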
Functions to compute Information Retrieval measures for pairwise loss for a single group. The function returns the respective metric, or a negative value if it is undefined for the given group.
gbm.roc.area(obs, pred)

gbm.conc(x)

ir.measure.conc(y.f, max.rank = 0)

ir.measure.auc(y.f, max.rank = 0)

ir.measure.mrr(y.f, max.rank)

ir.measure.map(y.f, max.rank = 0)

ir.measure.ndcg(y.f, max.rank)

perf.pairwise(y, f, group, metric = "ndcg", w = NULL, max.rank = 0)
obs: Observed value.
pred: Predicted value.
x: Numeric vector.
y, y.f, f, w, group, max.rank: Used internally.
metric: What type of performance measure to compute.
For simplicity, we have no special handling for ties; instead, we break ties randomly. This is slightly inaccurate for individual groups, but should have only a small effect on the overall measure.
gbm.conc computes the concordance index: the fraction of all pairs (i,j) with i<j and x[i] != x[j] such that x[j] < x[i]. If obs is binary, then gbm.roc.area(obs, pred) = gbm.conc(obs[order(-pred)]). gbm.conc is more general as it allows non-binary targets, but is significantly slower.
The requested performance measure.
Stefan Schroedl
C. Burges (2010). "From RankNet to LambdaRank to LambdaMART: An Overview", Microsoft Research Technical Report MSR-TR-2010-82.
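A small illustration (not from the original documentation) of gbm.roc.area on simulated binary data; the data here are arbitrary and only meant to show the call pattern.

# Hedged sketch: AUC for a binary outcome
set.seed(1)
obs  <- rbinom(100, 1, 0.5)      # binary outcomes
pred <- obs + rnorm(100)         # noisy scores correlated with the outcome
gbm.roc.area(obs, pred)          # area under the ROC curve
gbm.conc(obs[order(-pred)])      # equivalent concordance computation for binary obs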
Functions for cross-validating gbm. These functions are used internally and are not intended for end-user direct usage.
gbmCrossVal(cv.folds, nTrain, n.cores, class.stratify.cv, data, x, y, offset,
            distribution, w, var.monotone, n.trees, interaction.depth,
            n.minobsinnode, shrinkage, bag.fraction, var.names,
            response.name, group)

gbmCrossValErr(cv.models, cv.folds, cv.group, nTrain, n.trees)

gbmCrossValPredictions(cv.models, cv.folds, cv.group, best.iter.cv,
                       distribution, data, y)

gbmCrossValModelBuild(cv.folds, cv.group, n.cores, i.train, x, y, offset,
                      distribution, w, var.monotone, n.trees,
                      interaction.depth, n.minobsinnode, shrinkage,
                      bag.fraction, var.names, response.name, group)

gbmDoFold(X, i.train, x, y, offset, distribution, w, var.monotone, n.trees,
          interaction.depth, n.minobsinnode, shrinkage, bag.fraction,
          cv.group, var.names, response.name, group, s)
cv.folds: The number of cross-validation folds.
nTrain: The number of training samples.
n.cores: The number of cores to use.
class.stratify.cv: Whether or not stratified cross-validation samples are used.
data: The data.
x: The model matrix.
y: The response variable.
offset: The offset.
distribution: The type of loss function. See gbm.
w: Observation weights.
var.monotone: See gbm.
n.trees: The number of trees to fit.
interaction.depth: The degree of allowed interactions. See gbm.
n.minobsinnode: See gbm.
shrinkage: See gbm.
bag.fraction: See gbm.
var.names: See gbm.
response.name: See gbm.
group: Used when distribution = "pairwise".
cv.models: A list containing the models for each fold.
cv.group: A vector indicating the cross-validation fold for each member of the training set.
best.iter.cv: The iteration with lowest cross-validation error.
i.train: Items in the training set.
X: Index (cross-validation fold) on which to subset.
s: Random seed.
These functions are not intended for end-user direct usage, but are used internally by gbm.
A list containing the cross-validation error and predictions.
Greg Ridgeway [email protected]
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.
L. Breiman (2001). https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf.
Helper functions for preprocessing data prior to building a "gbm" object.
guessDist(y)

getCVgroup(distribution, class.stratify.cv, y, i.train, cv.folds, group)

getStratify(strat, d)

checkMissing(x, y)

checkWeights(w, n)

checkID(id)

checkOffset(o, y)

getVarNames(x)

gbmCluster(n)
y: The response variable.
class.stratify.cv: Whether or not to stratify, if provided by the user.
i.train: Computed internally by gbm.
cv.folds: The number of cross-validation folds.
group: The group, if using distribution = "pairwise".
strat: Whether or not to stratify.
d, distribution: The distribution, either specified by the user or implied.
x: The design matrix.
w: The weights.
n: The number of cores to use in the cluster.
id: The interaction depth.
o: The offset.
These are functions used internally by gbm and not intended for direct use by the user.
Computes Friedman's H-statistic to assess the strength of variable interactions.
interact.gbm(x, data, i.var = 1, n.trees = x$n.trees)
x: A gbm.object fitted using a call to gbm.
data: The dataset used to construct x. If the original dataset is large, a random subsample may be used to accelerate the computation.
i.var: A vector of indices or the names of the variables for which to compute the interaction effect. If using indices, the variables are indexed in the same order that they appear in the initial gbm formula.
n.trees: The number of trees used in the computation. Only the first n.trees trees will be used.
interact.gbm computes Friedman's H-statistic to assess the relative strength of interaction effects in non-linear models. H is on the scale of [0, 1], with higher values indicating larger interaction effects. To connect to a more familiar measure, if x1 and x2 are uncorrelated covariates with mean 0 and variance 1 and the model is of the form

y = beta0 + beta1*x1 + beta2*x2 + beta3*x1*x2

then

H = beta3 / sqrt(beta1^2 + beta2^2 + beta3^2)

Note that if the main effects are weak, the estimated H will be unstable. For example, if (in the case of a two-way interaction) neither main effect is in the selected model (relative influence is zero), the result will be 0/0. Also, with weak main effects, rounding errors can result in values of H > 1 which are not possible.

Returns the value of H, Friedman's H-statistic.
Greg Ridgeway [email protected]
J.H. Friedman and B.E. Popescu (2005). “Predictive Learning via Rule Ensembles.” Section 8.1
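A usage sketch (not from the original documentation), reusing gbm1, data, and best.iter from the gbm example above; the choice of X2 and X3 is arbitrary.

# Hedged sketch: Friedman's H-statistic for the X2/X3 interaction
interact.gbm(gbm1, data = data, i.var = c("X2", "X3"), n.trees = best.iter)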
Plots the marginal effect of the selected variables by "integrating" out the other variables.
## S3 method for class 'gbm'
plot(x, i.var = 1, n.trees = x$n.trees, continuous.resolution = 100,
     return.grid = FALSE, type = c("link", "response"), level.plot = TRUE,
     contour = FALSE, number = 4, overlap = 0.1,
     col.regions = viridis::viridis, ...)
x: A gbm.object that was fit using a call to gbm.
i.var: Vector of indices or the names of the variables to plot. If using indices, the variables are indexed in the same order that they appear in the initial gbm formula.
n.trees: Integer specifying the number of trees to use to generate the plot. Default is to use x$n.trees (i.e., the entire ensemble).
continuous.resolution: Integer specifying the number of equally spaced points at which to evaluate continuous predictors.
return.grid: Logical indicating whether or not to return the grid of evaluation points and their average predictions (TRUE) instead of producing a plot (FALSE). This is useful for customizing the graphics for special variable types, or for higher-dimensional graphs.
type: Character string specifying the type of prediction to plot on the vertical axis. See predict.gbm for details.
level.plot: Logical indicating whether or not to use a false color level plot (TRUE) or a 3-D surface (FALSE) for bivariate displays.
contour: Logical indicating whether or not to add contour lines to the level plot. Only used when level.plot = TRUE.
number: Integer specifying the number of conditional intervals to use for the continuous panel variables in lattice displays.
overlap: The fraction of overlap of the conditioning variables in lattice displays.
col.regions: Color vector to be used if level.plot is TRUE. Defaults to the viridis color palette.
...: Additional optional arguments to be passed on to the underlying plotting functions.
plot.gbm produces low-dimensional projections of the gbm.object by integrating out the variables not included in the i.var argument. The function selects a grid of points and uses the weighted tree traversal method described in Friedman (2001) to do the integration. Based on the variable types included in the projection, plot.gbm selects an appropriate display, choosing amongst line plots, contour plots, and lattice plots. If the default graphics are not sufficient, the user may set return.grid = TRUE, store the result of the function, and develop another graphic display more appropriate to the particular example.
If return.grid = TRUE, a grid of evaluation points and their average predictions. Otherwise, a plot is returned.
More flexible plotting is available using the partial and plotPartial functions.
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.
B. M. Greenwell (2017). "pdp: An R Package for Constructing Partial Dependence Plots," The R Journal 9(1), 421–436. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.
partial, plotPartial, gbm, and gbm.object.
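A brief sketch (not from the original documentation) of return.grid, reusing gbm1 and best.iter from the gbm example above.

# Hedged sketch: obtain the partial dependence grid instead of a plot
pd <- plot(gbm1, i.var = "X2", n.trees = best.iter, return.grid = TRUE)
head(pd)   # a data frame: grid of X2 values and the averaged prediction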
Predicted values based on a generalized boosted model object
## S3 method for class 'gbm'
predict(object, newdata, n.trees, type = "link", single.tree = FALSE, ...)
object: Object of class inheriting from "gbm" (a gbm.object).
newdata: Data frame of observations for which to make predictions.
n.trees: Number of trees used in the prediction; may be a vector, in which case predictions are returned for each element.
type: The scale on which gbm makes the predictions; either "link" (the default) or "response".
single.tree: If single.tree = TRUE then predict.gbm returns only the predictions from tree number n.trees.
...: further arguments passed to or from other methods.
predict.gbm produces predicted values for each observation in newdata using the first n.trees iterations of the boosting sequence. If n.trees is a vector, then the result is a matrix with each column representing the predictions from gbm models with n.trees[1] iterations, n.trees[2] iterations, and so on.

The predictions from gbm do not include the offset term. The user may add the value of the offset to the predicted value if desired.

If object was fit using gbm.fit there will be no Terms component. Therefore, the user has greater responsibility to make sure that newdata is of the same format (order and number of variables) as the one originally used to fit the model.
Returns a vector of predictions. By default, the predictions are on the scale of f(x). For example, for the Bernoulli loss the returned value is on the log-odds scale, for the Poisson loss it is on the log scale, and for coxph it is on the log-hazard scale.

If type = "response", then gbm converts the predictions back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli and expected counts for poisson. For the other distributions, "response" and "link" return the same values.
Greg Ridgeway [email protected]
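A short sketch (not from the original documentation), reusing gbm1, data2, and best.iter from the gbm example above.

# Hedged sketch: assumes gbm1, data2, and best.iter from the gbm() example above
p.link <- predict(gbm1, newdata = data2, n.trees = best.iter)                     # f(x) scale
p.resp <- predict(gbm1, newdata = data2, n.trees = best.iter, type = "response")  # same as link for "gaussian"

# Supplying a vector for n.trees returns a matrix, one column per value
p.mat <- predict(gbm1, newdata = data2, n.trees = c(25, 50, 100))
dim(p.mat)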
gbm
stores the collection of trees used to construct the model in a
compact matrix structure. This function extracts the information from a
single tree and displays it in a slightly more readable form. This function
is mostly for debugging purposes and to satisfy some users' curiosity.
pretty.gbm.tree(object, i.tree = 1)
object: a gbm.object initially fit using gbm.
i.tree: the index of the tree component of object to extract and display.
pretty.gbm.tree returns a data frame. Each row corresponds to a node in the tree. Columns indicate:

SplitVar: index of which variable is used to split. -1 indicates a terminal node.
SplitCodePred: if the split variable is continuous then this component is the split point. If the split variable is categorical then this component contains the index of object$c.splits that describes the categorical split. If the node is a terminal node then this is the prediction.
LeftNode: the index of the row corresponding to the left node.
RightNode: the index of the row corresponding to the right node.
ErrorReduction: the reduction in the loss function as a result of splitting this node.
Weight: the total weight of observations in the node. If weights are all equal to 1 then this is the number of observations in the node.
Greg Ridgeway [email protected]
Display basic information about a gbm object.
## S3 method for class 'gbm'
print(x, ...)

show.gbm(x, ...)
x: an object of class gbm.
...: further arguments passed to or from other methods.
Prints some information about the model object. In particular, this method
prints the call to gbm()
, the type of loss function that was used,
and the total number of iterations.
If cross-validation was performed, the 'best' number of trees as estimated by cross-validation error is displayed. If a test set was used, the 'best' number of trees as estimated by the test set error is displayed.
The number of available predictors, and the number of those having non-zero influence on predictions is given (which might be interesting in data mining applications).
If multinomial, bernoulli or adaboost was used, the confusion matrix and prediction accuracy are printed (objects being allocated to the class with highest probability for multinomial and bernoulli). These classifications are performed on the entire training data using the model with the 'best' number of trees as described above, or the maximum number of trees if the 'best' cannot be computed.
If the 'distribution' was specified as gaussian, laplace, quantile or t-distribution, a summary of the residuals is displayed. The residuals are for the training data with the model at the 'best' number of trees, as described above, or the maximum number of trees if the 'best' cannot be computed.
Harry Southworth, Daniel Edwards
data(iris)
iris.mod <- gbm(Species ~ ., distribution = "multinomial", data = iris,
                n.trees = 2000, shrinkage = 0.01, cv.folds = 5,
                verbose = FALSE, n.cores = 1)
iris.mod
#data(lung)
#lung.mod <- gbm(Surv(time, status) ~ ., distribution = "coxph", data = lung,
#                n.trees = 2000, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
#lung.mod
Marks the quantiles on the axes of the current plot.
quantile.rug(x, prob = 0:10/10, ...)
x: A numeric vector.
prob: The quantiles of x to mark on the x-axis.
...: Additional optional arguments to be passed on to rug().
No return values.
Greg Ridgeway [email protected].
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
quantile.rug(x)
Helper function to reconstitute the data for plots and summaries. This function is not intended for the user to call directly.
reconstructGBMdata(x)
x: a gbm.object initially fit using gbm.
Returns the data used to fit the gbm model in a format that can subsequently be used for plots and summaries.
Harry Southworth
Helper functions for computing the relative influence of each variable in the gbm object.
relative.influence(object, n.trees, scale. = FALSE, sort. = FALSE)

permutation.test.gbm(object, n.trees)

gbm.loss(y, f, w, offset, dist, baseline, group = NULL, max.rank = NULL)
object: a gbm object created from an initial call to gbm.
n.trees: the number of trees to use for computations. If not provided, the function will guess: if a test set was used in fitting, the number of trees resulting in lowest test set error will be used; otherwise, if cross-validation was performed, the number of trees resulting in lowest cross-validation error will be used; otherwise, all trees will be used.
scale.: whether or not the result should be scaled. Defaults to FALSE.
sort.: whether or not the results should be (reverse) sorted. Defaults to FALSE.
y, f, w, offset, dist, baseline: For gbm.loss: the outcome, predicted value, observation weight, offset, distribution, and baseline loss value, respectively.
group, max.rank: Used internally when distribution = "pairwise".
This is not intended for end-user use. These functions offer the different methods for computing the relative influence in summary.gbm. gbm.loss is a helper function for permutation.test.gbm.
By default, returns an unprocessed vector of estimated relative influences. If the scale. and sort. arguments are used, returns a processed version of the same.
Greg Ridgeway [email protected]
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.
L. Breiman (2001). https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf.
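A usage sketch (not from the original documentation) of the two relative-influence methods, reusing gbm1 and best.iter from the gbm example above.

# Hedged sketch: assumes gbm1 and best.iter from the gbm() example above
relative.influence(gbm1, n.trees = best.iter, scale. = TRUE, sort. = TRUE)
permutation.test.gbm(gbm1, n.trees = best.iter)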
Computes the relative influence of each variable in the gbm object.
## S3 method for class 'gbm'
summary(object, cBars = length(object$var.names), n.trees = object$n.trees,
        plotit = TRUE, order = TRUE, method = relative.influence,
        normalize = TRUE, ...)
object: a gbm object created from an initial call to gbm.
cBars: the number of bars to plot. If order = TRUE only the variables with the cBars largest relative influence will appear in the barplot; if order = FALSE then the first cBars variables will appear in the plot. In either case, the function returns the relative influence of all of the variables.
n.trees: the number of trees used to generate the plot. Only the first n.trees trees will be used.
plotit: an indicator as to whether the plot is generated.
order: an indicator as to whether the plotted and/or returned relative influences are sorted.
method: The function used to compute the relative influence. relative.influence is the default; the other current (and experimental) choice is permutation.test.gbm, which permutes each predictor variable in turn and computes the associated reduction in predictive performance.
normalize: if TRUE, the relative influences are normalized to sum to 100.
...: other arguments passed to the plot function.
For distribution="gaussian"
this returns exactly the reduction of
squared error attributable to each variable. For other loss functions this
returns the reduction attributable to each variable in sum of squared error
in predicting the gradient on each iteration. It describes the relative
influence of each variable in reducing the loss function. See the references
below for exact details on the computation.
Returns a data frame where the first component is the variable name and the second is the computed relative influence, normalized to sum to 100.
Greg Ridgeway [email protected]
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.
L. Breiman (2001). https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf.
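A brief sketch (not from the original documentation), reusing gbm1 and best.iter from the gbm example above.

# Hedged sketch: assumes gbm1 and best.iter from the gbm() example above
summary(gbm1, n.trees = best.iter, plotit = FALSE)  # tabulate relative influence without plotting
summary(gbm1, n.trees = best.iter, plotit = FALSE,
        method = permutation.test.gbm)              # experimental permutation-based alternative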
Test the gbm package. Run tests on gbm functions to perform logical checks and reproducibility checks.
test.gbm()
The function uses functionality in the RUnit package. A fairly small validation suite is executed that checks that relative influence identifies sensible variables from simulated data, and that predictions from GBMs with Gaussian, Cox or binomial distributions are sensible.

An object of class RUnitTestData. See the help for RUnit for details.

The test suite is not comprehensive.
Harry Southworth
# Uncomment the following lines to run - commented out to make CRAN happy
#library(RUnit)
#val <- test.gbm()
#printHTMLProtocol(val, "gbmReport.html")