Theoretical Description¶
The mapie.regression.MapieRegressor
class uses various
resampling methods based on the jackknife strategy
recently introduced by Foygel-Barber et al. (2020) [1].
They allow the user to estimate robust prediction intervals with any kind of
machine learning model for regression purposes on single-output data.
We give here a brief theoretical description of the methods included in the module.
Before describing the methods, let’s briefly present the mathematical setting. For a regression problem in a standard independent and identically distributed (i.i.d) case, our training data has an unknown distribution . We can assume that where is the model function we want to determine and is the noise. Given some target quantile or associated target coverage level , we aim at constructing a prediction interval for a new feature vector such that
1. The “Naive” method¶
The so-called naive method computes the residuals of the training data to estimate the typical error obtained on a new test data point. The prediction interval is therefore given by the prediction obtained by the model trained on the entire training set the quantiles of the residuals of the same training set:
or
where is the quantile of the distribution.
Since this method estimates the residuals only on the training set, it tends to be too optimistic and under-estimates the width of prediction intervals because of a potential overfit. As a result, the probability that a new point lies in the interval given by the naive method would be lower than the target level .
The figure below illustrates the Naive method.
2. The jackknife method¶
The standard jackknife method is based on the construction of a set of leave-one-out models. Estimating the prediction intervals is carried out in three main steps:
For each instance i = 1, …, n of the training set, we fit the regression function on the entire training set with the point removed, resulting in n leave-one-out models.
The corresponding leave-one-out residual is computed for each point .
We fit the regression function on the entire training set and we compute the prediction interval using the computed leave-one-out residuals:
The resulting confidence interval can therefore be summarized as follows
where
is the leave-one-out residual.
This method avoids the overfitting problem but can lose its predictive cover when becomes unstable, for example when the sample size is closed to the number of features (as seen in the “Reproducing the simulations from Foygel-Barber et al. (2020)” example).
3. The jackknife+ method¶
Unlike the standard jackknife method which estimates a prediction interval centered around the prediction of the model trained on the entire dataset, the so-called jackknife+ method uses each leave-one-out prediction on the new test point to take the variability of the regression function into account. The resulting confidence interval can therefore be summarized as follows
As described in [1], this method garantees a higher stability with a coverage level of for a target coverage level of , without any a priori assumption on the distribution of the data nor on the predictive model.
4. The jackknife-minmax method¶
The jackknife-minmax method offers a slightly more conservative alternative since it uses the minimal and maximal values of the leave-one-out predictions to compute the prediction intervals. The estimated prediction intervals can be defined as follows
As justified by [1], this method garantees a coverage level of for a target coverage level of .
The figure below, adapted from Fig. 1 of [1], illustrates the three jackknife methods and emphasizes their main differences.
However, the jackknife, jackknife+ and jackknife-minmax methods are computationally heavy since they require to run as many simulations as the number of training points, which is prohibitive for a typical data science use case.
5. The CV+ method¶
In order to reduce the computational time, one can adopt a cross-validation approach instead of a leave-one-out approach, called the CV+ method.
By analogy with the jackknife+ method, estimating the prediction intervals with CV+ is performed in four main steps:
We split the training set into K disjoint subsets of equal size.
K regression functions are fitted on the training set with the corresponding fold removed.
The corresponding out-of-fold residual is computed for each point where k(i) is the fold containing i.
Similar to the jackknife+, the regression functions are used to estimate the prediction intervals.
As for jackknife+, this method garantees a coverage level higher than for a target coverage level of , without any a priori assumption on the distribution of the data. As noted by [1], the jackknife+ can be viewed as a special case of the CV+ in which . In practice, this method results in slightly wider prediction intervals and is therefore more conservative, but gives a reasonable compromise for large datasets when the Jacknife+ method is unfeasible.
6. The CV and CV-minmax methods¶
By analogy with the standard jackknife and jackknife-minmax methods, the CV and CV-minmax approaches are also included in MAPIE. As for the CV+ method, they rely on out-of-fold regression models that are used to compute the prediction intervals but using the equations given in the jackknife and jackknife-minmax sections.
The figure below, adapted from Fig. 1 of [1], illustrates the three CV methods and emphasizes their main differences.
7. The jackknife+-after-bootstrap method¶
In order to reduce the computational time, and get more robust predictions, one can adopt a bootstrap approach instead of a leave-one-out approach, called the jackknife+-after-bootstrap method, offered by Kim and al. [2].
By analogy with the CV+ method, estimating the prediction intervals with jackknife+-after-bootstrap is performed in four main steps:
We resample the training set with replacement (boostrap) times, and thus we get the (non disjoint) bootstraps of equal size.
regressions functions are then fitted on the bootstraps , and the predictions on the complementary sets are computed.
These predictions are aggregated according to a given aggregation function , typically or , and the residuals are computed for each (with the boostraps not containing ).
The sets (where indexes the training set) are used to estimate the prediction intervals.
As for jackknife+, this method guarantees a coverage level higher than for a target coverage level of , without any a priori assumption on the distribution of the data. In practice, this method results in wider prediction intervals, when the uncertainty is higher, than , because the models’ prediction spread is then higher.
Key takeaways¶
The jackknife+ method introduced by [1] allows the user to easily obtain theoretically guaranteed prediction intervals for any kind of sklearn-compatible Machine Learning regressor.
Since the typical coverage levels estimated by jackknife+ follow very closely the target coverage levels, this method should be used when accurate and robust prediction intervals are required.
For practical applications where is large and/or the computational time of each leave-one-out simulation is high, it is advised to adopt the CV+ method, based on out-of-fold simulations, or the jackknife+-after-bootstrap method, instead. Indeed, the methods based on the jackknife resampling approach are very cumbersome because they require to run a high number of simulations, equal to the number of training samples .
Although the CV+ method results in prediction intervals that are slightly larger than for the jackknife+ method, it offers a good compromise between computational time and accurate predictions.
The jackknife+-after-bootstrap method results in the same computational efficiency, and offers a higher sensitivity to epistemic uncertainty.
The jackknife-minmax and CV-minmax methods are more conservative since they result in higher theoretical and practical coverages due to the larger widths of the prediction intervals. It is therefore advised to use them when conservative estimates are needed.
The table below summarizes the key features of each method by focusing on the obtained coverages and the computational cost. , , and are the number of training samples, test samples, and cross-validated folds, respectively.
Method |
Theoretical coverage |
Typical coverage |
Training cost |
Evaluation cost |
---|---|---|---|---|
Naïve |
No guarantee |
1 |
||
Jackknife |
No guarantee |
|||
or if unstable |
||||
Jackknife+ |
||||
Jackknife-minmax |
||||
CV |
No guarantee |
|||
or if unstable |
||||
CV+ |
||||
CV-minmax |
||||
Jackknife-aB+ |
||||
Jackknife-aB-minmax |
- *
Here, the training and evaluation costs correspond to the computational time of the MAPIE
.fit()
and.predict()
methods.
References¶
[1] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. “Predictive inference with the jackknife+.” Ann. Statist., 49(1):486–507, February 2021.
[2] Byol Kim, Chen Xu, and Rina Foygel Barber. “Predictive Inference Is Free with the Jackknife+-after-Bootstrap.” 34th Conference on Neural Information Processing Systems (NeurIPS 2020).