Jul 19, 2024

How PerpetualBooster works

This blog post dives into the inner workings of PerpetualBooster, a system that bypasses the need for hyperparameter optimization. It achieves this with two techniques: step size control and generalization control. The following sections explore how these methods are implemented within the algorithm itself.

Step size control

Step size control is similar to backtracking line search, with three key differences.

In backtracking line search, the algorithm starts with a very large step size and repeatedly shrinks it until the Armijo–Goldstein condition is fulfilled. In other words, it tries to find a large step size that yields a sufficient decrease of the objective function by decreasing the step size iteratively.

f(x + \alpha p) \leq f(x)+\alpha c m
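As a point of comparison, classic backtracking line search can be sketched in a few lines of Rust. This is a generic illustration on a toy objective, not code from PerpetualBooster; the function names and constants are chosen for the example.

```rust
// Backtracking line search: start with a large step size and shrink it
// until the Armijo–Goldstein condition f(x + a*p) <= f(x) + a*c*m holds.
// Toy objective f(x) = x^2, so f'(x) = 2x.

fn f(x: f64) -> f64 {
    x * x
}

fn backtracking(x: f64, p: f64, c: f64, shrink: f64) -> f64 {
    let m = 2.0 * x * p; // directional derivative p * f'(x); negative for a descent direction
    let mut alpha = 1.0; // start with a large step
    while f(x + alpha * p) > f(x) + alpha * c * m {
        alpha *= shrink; // repeatedly shrink the step size
    }
    alpha
}

fn main() {
    let x = 1.0;
    let p = -2.0 * x; // steepest descent direction at x
    let alpha = backtracking(x, p, 1e-4, 0.5);
    println!("accepted step size: {alpha}");
}
```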

The first key difference is that we are trying to find the smallest step (instead of the largest) that satisfies the condition. We grow the tree and check the loss decrease after every split. If the loss decrease exceeds a certain threshold, we stop growing the tree and continue with the next boosting round. In other words, we are trying to take the smallest step that achieves a target loss decrease at each boosting round. This can be called forward tracking tree search, where alpha is calculated from the user-defined budget parameter.

\alpha = 10^{-budget}
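The forward tracking idea can be sketched as follows. This is a simplified illustration (the gain values and helper names are assumptions for the example, not the actual PerpetualBooster code): grow the tree split by split and stop at the smallest tree whose cumulative loss decrease reaches the target.

```rust
// alpha = 10^(-budget), per the formula above.
fn target_step(budget: f64) -> f64 {
    10f64.powf(-budget)
}

/// Forward tracking sketch: returns the number of splits taken before the
/// target loss decrease is reached (smallest tree that satisfies the target).
fn forward_track(split_gains: &[f64], threshold: f64) -> usize {
    let mut decrease = 0.0;
    for (i, gain) in split_gains.iter().enumerate() {
        decrease += gain;
        if decrease >= threshold {
            return i + 1; // stop growing and move to the next boosting round
        }
    }
    split_gains.len() // target never reached: the tree is fully grown
}

fn main() {
    let alpha = target_step(1.0); // budget = 1.0 -> alpha = 0.1
    println!("alpha = {alpha}");
    let splits = forward_track(&[0.04, 0.05, 0.03, 0.02], alpha);
    println!("splits grown: {splits}");
}
```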

The second key difference is that the m parameter is kept constant instead of being updated at every step (at every boosting round). m is calculated before the fitting process using the base score (the initial constant prediction):

m = {\cfrac{1}{n}}\sum_{i=1}^{n} loss(y_i, \hat{y})
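A minimal sketch of computing m, assuming squared-error loss with the mean of the targets as the base score (the loss function and base score choice are assumptions for illustration):

```rust
// Base score: a constant initial prediction, here the mean of the targets.
fn base_score(y: &[f64]) -> f64 {
    y.iter().sum::<f64>() / y.len() as f64
}

// m = (1/n) * sum of loss(y_i, y_hat), shown for squared-error loss.
fn m_term(y: &[f64]) -> f64 {
    let y_hat = base_score(y);
    y.iter().map(|yi| (yi - y_hat).powi(2)).sum::<f64>() / y.len() as f64
}

fn main() {
    let y = [1.0, 2.0, 3.0, 4.0];
    println!("m = {}", m_term(&y));
}
```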

The third key difference is that the control parameter c is calculated with the budget parameter using the following formula:

r = {\cfrac{10}{budget}}
\text{reciprocals of powers} = {\cfrac{r}{r - 1}}
\text{truncated series sum} = \text{reciprocals of powers} - \left(1 + {\cfrac{1}{r}}\right)
c = {\cfrac{1}{r}} - \text{truncated series sum}
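The formula above transcribes directly into code. This is a standalone sketch of the calculation, not the library's implementation:

```rust
// c is derived from the budget via a geometric-series argument:
// r/(r-1) is the sum of 1/r^k for k >= 0, and the truncated series sum
// is what remains after removing the first two terms (1 and 1/r).
fn control_c(budget: f64) -> f64 {
    let r = 10.0 / budget;
    let reciprocals_of_powers = r / (r - 1.0);
    let truncated_series_sum = reciprocals_of_powers - (1.0 + 1.0 / r);
    1.0 / r - truncated_series_sum
}

fn main() {
    // budget = 1.0: r = 10, c = 1/10 - 1/90 = 4/45
    println!("c = {}", control_c(1.0));
}
```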

You can have a look at the booster.rs and tree.rs files to check the implementation.

Generalization control

The second technique used to eliminate hyperparameter optimization, alongside step size control, is generalization control. Before each node split, the data is divided into training and validation sets. We calculate not only the training loss but also the validation loss, using the held-out validation set.

loss_{train} = G_{train} w + {\cfrac{1}{2}} H_{train} w^2
w = - {\cfrac {G_{train}} {H_{train}}}
loss_{valid} = G_{valid} w + {\cfrac{1}{2}} H_{valid} w^2

where G is the sum of the gradient values, H is the sum of the Hessian values, and w is the leaf weight term.
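The three equations above can be evaluated directly. This sketch uses made-up G and H values purely to illustrate the relationship between w and the two losses:

```rust
// w minimizes the second-order training loss G*w + (1/2)*H*w^2.
fn node_weight(g_train: f64, h_train: f64) -> f64 {
    -g_train / h_train
}

// The same quadratic form, evaluated with either training or validation sums.
fn node_loss(g: f64, h: f64, w: f64) -> f64 {
    g * w + 0.5 * h * w * w
}

fn main() {
    let (g_train, h_train) = (-4.0, 2.0); // example gradient/Hessian sums
    let w = node_weight(g_train, h_train); // w = 2.0
    let train = node_loss(g_train, h_train, w); // -4*2 + 0.5*2*4 = -4.0
    let valid = node_loss(-3.0, 2.0, w); // validation sums differ from training
    println!("w = {w}, train loss = {train}, valid loss = {valid}");
}
```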

Using training and validation losses, we calculate the generalization term.

generalization = {\cfrac{loss_{parent} - loss_{train}} {loss_{parent} - loss_{valid}}}

We let the node split if the generalization term is greater than 1 and stop splitting if it is less than 1. In other words, we check whether the validation loss would decrease relative to the parent loss if we made the split. The following charts depict this graphically:

This should give a very rough idea of how we check generalization at each split. Check the splitter.rs file for the detailed implementation.
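The split decision itself reduces to one ratio test. This is a hedged sketch of that rule (the parent/train/valid loss values are invented for the example; see splitter.rs for the real logic):

```rust
// generalization = (loss_parent - loss_train) / (loss_parent - loss_valid)
fn generalization(loss_parent: f64, loss_train: f64, loss_valid: f64) -> f64 {
    (loss_parent - loss_train) / (loss_parent - loss_valid)
}

// Split only when the generalization term exceeds 1.
fn should_split(loss_parent: f64, loss_train: f64, loss_valid: f64) -> bool {
    generalization(loss_parent, loss_train, loss_valid) > 1.0
}

fn main() {
    // generalization = 4/2 = 2 > 1 -> split
    println!("{}", should_split(0.0, -4.0, -2.0));
    // generalization = 2/4 = 0.5 < 1 -> stop splitting
    println!("{}", should_split(0.0, -2.0, -4.0));
}
```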

Conclusion

Finding the right balance between fitting the data and avoiding overfitting is crucial for decision tree learning. PerpetualBooster achieves this through two control mechanisms: step size control and generalization control. Step size control matters most early on: since the initial splits significantly reduce the loss (error), it is most impactful in the first boosting rounds. In contrast, generalization control becomes more important later. As boosting progresses, trees tend to become shallower because there is less left to learn from the data. The algorithm also has a built-in stopping mechanism: if it encounters a very simple tree (one split) with poor generalization (less than 1) three times, it halts the boosting process.
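The stopping rule can be sketched as a simple counter over boosting rounds. This is a simplified illustration (the tree representation and round data are assumptions for the example, not the library's internal state):

```rust
/// Each tree is summarized as (number of splits, generalization of its split).
/// Boosting halts after the third single-split tree with generalization < 1;
/// returns the number of rounds completed.
fn boosting_rounds(trees: &[(usize, f64)]) -> usize {
    let mut bad_simple_trees = 0;
    for (round, &(splits, generalization)) in trees.iter().enumerate() {
        if splits == 1 && generalization < 1.0 {
            bad_simple_trees += 1;
            if bad_simple_trees == 3 {
                return round + 1; // built-in stop: halt the boosting process
            }
        }
    }
    trees.len() // stopping condition never triggered
}

fn main() {
    let trees = [(5, 1.4), (1, 0.9), (3, 1.2), (1, 0.8), (1, 0.7), (4, 1.1)];
    println!("stopped after {} rounds", boosting_rounds(&trees));
}
```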
