`lightgbm::lgb.train()` creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.

## Details

For this engine, there are multiple modes: regression and classification.

### Tuning Parameters

This model has 6 tuning parameters:

`tree_depth`

: Tree Depth (type: integer, default: -1)

`trees`

: # Trees (type: integer, default: 100)

`learn_rate`

: Learning Rate (type: double, default: 0.1)

`mtry`

: # Randomly Selected Predictors (type: integer, default: see below)

`min_n`

: Minimal Node Size (type: integer, default: 20)

`loss_reduction`

: Minimum Loss Reduction (type: double, default: 0)

The `mtry` parameter gives the *number* of predictors that will be
randomly sampled at each split. The default is to use all predictors.

Rather than as a number, `lightgbm::lgb.train()`’s
`feature_fraction_bynode` argument encodes `mtry` as the *proportion* of
predictors that will be randomly sampled at each split. parsnip
translates `mtry`, supplied as the *number* of predictors, to a
proportion under the hood. That is, the user should still supply the
argument as `mtry` to `boost_tree()`, and do so in its sense as a number
rather than a proportion; before passing `mtry` to
`lightgbm::lgb.train()`, parsnip will convert the `mtry` value to a
proportion.

Note that parsnip’s translation can be overridden via the `counts`
argument, supplied to `set_engine()`. By default, `counts` is set to
`TRUE`, but supplying the argument `counts = FALSE` allows the user to
supply `mtry` as a proportion rather than a number.
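As a rough illustration of the count-to-proportion translation described above, the following base R sketch divides the requested count by the number of predictors. The helper `mtry_to_prop()` is hypothetical and is not part of parsnip or bonsai; the cap at 1 is an assumption made here for illustration, and the packages' internal conversion may differ in detail.

```r
# Hypothetical helper (not a parsnip or bonsai function): sketches how a
# count-valued `mtry` could be turned into a proportion of predictors.
mtry_to_prop <- function(mtry, n_predictors) {
  stopifnot(mtry >= 1, n_predictors >= 1)
  # Assumption: cap at 1 so that requesting more predictors than are
  # available means "sample all of them".
  min(mtry / n_predictors, 1)
}

# With 20 predictors, mtry = 5 corresponds to a quarter of the columns.
prop_small <- mtry_to_prop(5, 20)

# Requesting more predictors than exist is capped at the full set.
prop_capped <- mtry_to_prop(30, 20)
```

Under these assumptions, `prop_small` is 0.25 and `prop_capped` is 1.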

### Translation from parsnip to the original package (regression)

The **bonsai** extension package is required to fit this model.

```
boost_tree(
  mtry = integer(), trees = integer(), tree_depth = integer(),
  learn_rate = numeric(), min_n = integer(), loss_reduction = numeric()
) %>%
  set_engine("lightgbm") %>%
  set_mode("regression") %>%
  translate()
```

```
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
##   mtry = integer()
##   trees = integer()
##   min_n = integer()
##   tree_depth = integer()
##   learn_rate = numeric()
##   loss_reduction = numeric()
##
## Computational engine: lightgbm
##
## Model fit template:
## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(),
##     feature_fraction_bynode = integer(), num_iterations = integer(),
##     min_data_in_leaf = integer(), max_depth = integer(), learning_rate = numeric(),
##     min_gain_to_split = numeric(), verbose = -1, num_threads = 0,
##     seed = sample.int(10^5, 1), deterministic = TRUE)
```

### Translation from parsnip to the original package (classification)

The **bonsai** extension package is required to fit this model.

```
boost_tree(
  mtry = integer(), trees = integer(), tree_depth = integer(),
  learn_rate = numeric(), min_n = integer(), loss_reduction = numeric()
) %>%
  set_engine("lightgbm") %>%
  set_mode("classification") %>%
  translate()
```

```
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
##   mtry = integer()
##   trees = integer()
##   min_n = integer()
##   tree_depth = integer()
##   learn_rate = numeric()
##   loss_reduction = numeric()
##
## Computational engine: lightgbm
##
## Model fit template:
## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(),
##     feature_fraction_bynode = integer(), num_iterations = integer(),
##     min_data_in_leaf = integer(), max_depth = integer(), learning_rate = numeric(),
##     min_gain_to_split = numeric(), verbose = -1, num_threads = 0,
##     seed = sample.int(10^5, 1), deterministic = TRUE)
```

`bonsai::train_lightgbm()` is a wrapper around `lightgbm::lgb.train()`
(and other functions) that makes it easier to run this model.

### Other details

#### Preprocessing

This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. `{a, c}` vs `{b, d}`) when splitting at a node. Dummy variables
are not required for this model.

Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.

#### Interpreting `mtry`

The `mtry` argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret
their analogue to the `mtry` argument as the *proportion* of predictors
that will be randomly sampled at each split rather than the *count*. In
some settings, such as when tuning over preprocessors that influence the
number of predictors, this parameterization is quite helpful:
interpreting `mtry` as a proportion means that `[0, 1]` is always a
valid range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
`counts` argument: a logical indicating whether `mtry` should be
interpreted as the number of predictors that will be randomly sampled at
each split. `TRUE` indicates that `mtry` will be interpreted in its
sense as a count; `FALSE` indicates that the argument will be
interpreted in its sense as a proportion.

`mtry` is a main model argument for `boost_tree()` and `rand_forest()`,
and thus should not have an engine-specific interface. So, regardless of
engine, `counts` defaults to `TRUE`. For engines that support the
proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules
package, and `"lightgbm"` via the bonsai package), the user can pass the
`counts = FALSE` argument to `set_engine()` to supply `mtry` values
within `[0, 1]`.
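The two interpretations can be compared side by side in a pair of model specifications. This is a minimal sketch that assumes the parsnip and bonsai packages are installed; the object names `spec_count` and `spec_prop` and the particular `mtry` values are illustrative only.

```r
library(parsnip)
library(bonsai)  # provides the "lightgbm" engine for boost_tree()

# Default behaviour (counts = TRUE): mtry is a number of predictors.
spec_count <- boost_tree(mtry = 3) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

# With counts = FALSE, mtry is supplied directly as a proportion in [0, 1].
spec_prop <- boost_tree(mtry = 0.5) %>%
  set_engine("lightgbm", counts = FALSE) %>%
  set_mode("regression")
```

The proportion form can be convenient when a preprocessor changes the number of predictors, since the same `mtry` value stays valid regardless of how many columns reach the model.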

#### Saving fitted model objects

Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.

#### Bagging

The `sample_size` argument is translated to the `bagging_fraction`
parameter in the `params` argument of `lgb.train`. The argument is
interpreted by lightgbm as a *proportion* rather than a count, so bonsai
internally reparameterizes the `sample_size` argument with
`dials::sample_prop()` during tuning.

To effectively enable bagging, the user would also need to set the
`bagging_freq` argument to lightgbm. `bagging_freq` defaults to 0, which
means bagging is disabled, and a `bagging_freq` argument of `k` means
that the booster will perform bagging at every `k`th boosting iteration.
Thus, by default, the `sample_size` argument would be ignored without
setting this argument manually. Other boosting libraries, like xgboost,
do not have an analogous argument to `bagging_freq` and use `k = 1` when
the analogue to `bagging_fraction` is in $(0, 1)$. *bonsai will thus
automatically set* `bagging_freq = 1` *in* `set_engine("lightgbm", ...)`
if `sample_size` (i.e. `bagging_fraction`) is not equal to 1 and no
`bagging_freq` value is supplied. This default can be overridden by
setting the `bagging_freq` argument to `set_engine()` manually.
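As a sketch of the two cases just described, the specifications below first rely on the automatic `bagging_freq = 1` and then override it. The example assumes parsnip and bonsai are installed; the values `0.8` and `5` and the object names are arbitrary choices for illustration.

```r
library(parsnip)
library(bonsai)  # provides the "lightgbm" engine for boost_tree()

# sample_size < 1 and no bagging_freq supplied: per the text above, bonsai
# is expected to fall back to bagging_freq = 1 when the model is fitted.
spec_bag_default <- boost_tree(sample_size = 0.8) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

# Supplying bagging_freq as an engine argument overrides that default, so
# bagging is performed at every 5th boosting iteration.
spec_bag_every_5 <- boost_tree(sample_size = 0.8) %>%
  set_engine("lightgbm", bagging_freq = 5) %>%
  set_mode("regression")
```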

### Examples

The “Introduction to bonsai” article contains examples of
`boost_tree()` with the `"lightgbm"` engine.
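For a quick orientation, a minimal fit might look like the following. This sketch assumes that parsnip, bonsai, and lightgbm are installed; it uses the built-in `mtcars` data purely for illustration, and `min_n` is lowered from its default of 20 because the data set has only 32 rows.

```r
library(parsnip)
library(bonsai)  # provides the "lightgbm" engine; lightgbm must be installed

spec <- boost_tree(trees = 50, min_n = 5) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

# Fit on a small built-in data set and predict on the same rows.
lgb_fit <- fit(spec, mpg ~ ., data = mtcars)
preds <- predict(lgb_fit, new_data = mtcars)
head(preds)
```

`predict()` returns a tibble with one row per row of `new_data` and, for regression, a `.pred` column.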

### References

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Kuhn, M, and K Johnson. 2013. *Applied Predictive Modeling*. Springer.