How PerpetualBooster works
This blog post dives into the inner workings of PerpetualBooster, a gradient boosting algorithm that bypasses the need for hyperparameter optimization. It achieves this with two techniques: step size control and generalization control. The following sections explore how these methods are implemented within the algorithm itself.
Step size control
Step size control is similar to backtracking line search, with three key differences.
In backtracking line search, the algorithm starts with a large step size and repeatedly shrinks it until the Armijo–Goldstein condition is fulfilled. In other words, it tries to find a large step size that achieves a sufficient decrease of the objective function by decreasing the step size iteratively.
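For reference, here is a minimal sketch of classic backtracking line search on a one-dimensional objective. The function and parameter names are illustrative and not part of PerpetualBooster.

```rust
// Minimal backtracking line search (Armijo–Goldstein condition) on a 1-D objective.
// Purely illustrative; not taken from the PerpetualBooster source.
fn backtracking_line_search(
    f: impl Fn(f64) -> f64, // objective function
    x: f64,                 // current point
    grad: f64,              // gradient of f at x
    direction: f64,         // descent direction, e.g. -grad
    c: f64,                 // control parameter in (0, 1)
    shrink: f64,            // step size shrink factor, e.g. 0.5
) -> f64 {
    let mut alpha = 1.0; // start with a large step size
    // Shrink alpha until f(x + alpha * d) <= f(x) + c * alpha * grad * d holds.
    while f(x + alpha * direction) > f(x) + c * alpha * grad * direction {
        alpha *= shrink;
    }
    alpha
}

fn main() {
    // Example: minimize f(x) = (x - 3)^2 starting from x = 0.
    let f = |x: f64| (x - 3.0).powi(2);
    let x0 = 0.0;
    let grad = 2.0 * (x0 - 3.0); // f'(0) = -6
    let alpha = backtracking_line_search(f, x0, grad, -grad, 1e-4, 0.5);
    println!("accepted step size: {alpha}");
}
```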
The first key difference is that we are trying to find the smallest step (instead of the largest) that satisfies the condition. We grow the tree and check the loss decrease after every split. If the loss decrease exceeds a certain threshold, we stop growing the tree and continue with the next boosting round. In other words, we are trying to take the smallest step that achieves a target loss decrease at each boosting round. This can be called forward tracking tree search, where alpha is calculated from the user-defined budget parameter.
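The sketch below illustrates this forward tracking idea under simplifying assumptions: a stream of candidate splits with known loss decreases and a cumulative target threshold. The names and the threshold logic are illustrative, not PerpetualBooster's actual internals.

```rust
// Illustrative "forward tracking": grow a tree split by split and stop as soon as
// the accumulated loss decrease reaches a target. The types and the cumulative
// threshold are assumptions for illustration only.
struct CandidateSplit {
    loss_decrease: f64, // loss reduction contributed by this split
}

fn grow_tree(splits: impl IntoIterator<Item = CandidateSplit>, target_decrease: f64) -> usize {
    let mut total_decrease = 0.0;
    let mut n_splits = 0;
    for split in splits {
        total_decrease += split.loss_decrease;
        n_splits += 1;
        // Smallest step: stop growing once the target loss decrease is reached,
        // then move on to the next boosting round.
        if total_decrease >= target_decrease {
            break;
        }
    }
    n_splits
}

fn main() {
    let splits = vec![
        CandidateSplit { loss_decrease: 0.6 },
        CandidateSplit { loss_decrease: 0.3 },
        CandidateSplit { loss_decrease: 0.2 },
    ];
    // Stops after the second split (0.6 + 0.3 >= 0.8).
    println!("splits used: {}", grow_tree(splits, 0.8));
}
```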
The second key difference is that the m parameter is kept constant instead of being updated at every step (at every boosting round). m is calculated once, before the fitting process, using the base score (the initial constant prediction).
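As a concrete illustration of the base score, here is the standard choice for squared-error loss (the mean of the targets). How PerpetualBooster derives m from the base score is not reproduced here.

```rust
// The base score is the initial constant prediction made before any trees are fit.
// For squared-error loss the standard choice is the mean of the targets; this is
// shown for illustration and is not a reproduction of PerpetualBooster's m formula.
fn base_score_squared_error(targets: &[f64]) -> f64 {
    targets.iter().sum::<f64>() / targets.len() as f64
}

fn main() {
    let targets = [1.0, 2.0, 4.0, 7.0];
    println!("base score: {}", base_score_squared_error(&targets)); // 3.5
}
```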
The third key difference is that the control parameter c is calculated from the budget parameter.
You can have a look at the booster.rs and tree.rs files to see the implementation.
Generalization control
The second technique used to eliminate hyperparameter optimization, alongside step size control, is generalization control. Before each node split, the data is divided into training and validation sets. We calculate not only the training loss but also the validation loss, using the separate validation set.
In these loss calculations, G is the sum of the gradient values, H is the sum of the Hessian values, and w is the weight term of a node.
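For context, the usual second-order node objective in gradient boosting, consistent with these definitions, can be written as follows; the exact expression used by PerpetualBooster may differ.

```latex
% Standard second-order node objective (shown for context; assumed, not taken
% from the PerpetualBooster source).
L(w) = G\,w + \tfrac{1}{2}\,H\,w^{2},
\qquad
w^{*} = -\frac{G}{H},
\qquad
L(w^{*}) = -\frac{G^{2}}{2H}
```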
Using training and validation losses, we calculate the generalization term.
We let the node split if the generalization term is greater than 1 and stop splitting if it is less than 1. In other words, we check whether the validation loss would decrease compared to the parent's validation loss if we made the split.
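Here is a minimal sketch of this decision rule, assuming the generalization term is the ratio of the parent's validation loss to the children's combined validation loss; the actual definition lives in splitter.rs and may differ.

```rust
// Illustrative generalization check for a candidate split. The ratio used here
// (parent validation loss over combined child validation loss) is an assumption
// consistent with the description above, not the exact formula from splitter.rs.
struct SplitEvaluation {
    parent_val_loss: f64,
    left_val_loss: f64,
    right_val_loss: f64,
}

fn should_split(eval: &SplitEvaluation) -> bool {
    let child_val_loss = eval.left_val_loss + eval.right_val_loss;
    // Generalization > 1 means the split reduces the validation loss.
    let generalization = eval.parent_val_loss / child_val_loss;
    generalization > 1.0
}

fn main() {
    let good = SplitEvaluation { parent_val_loss: 10.0, left_val_loss: 4.0, right_val_loss: 3.0 };
    let bad = SplitEvaluation { parent_val_loss: 10.0, left_val_loss: 7.0, right_val_loss: 6.0 };
    println!("split accepted: {}", should_split(&good)); // true
    println!("split accepted: {}", should_split(&bad));  // false
}
```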
This should give a rough idea of how generalization is checked at each split. Check the splitter.rs file for the detailed implementation.
Conclusion
Finding the right balance between fitting the data and avoiding overfitting is crucial for decision tree learning. PerpetualBooster achieves this through two control mechanisms: step size control and generalization control. Step size control shines early on: since the initial splits significantly reduce the loss (error), it is most impactful in the first boosting rounds. In contrast, generalization control becomes more important later. As boosting progresses, trees tend to become shallower because there is less left to learn from the data. The algorithm also has a built-in stopping mechanism: if it encounters a very simple tree (a single split) with poor generalization (less than 1) three times, it halts the boosting process.