Conditional Conformal Prediction — Theoretical Description¶

The Conditional Conformal Prediction (CCP) method ¹ is a model agnostic conformal prediction method which can create adaptative prediction intervals.

In MAPIE, this method has a lot of advantages:

It is model agnostic (it doesn't depend on the model but only on the predictions, unlike CQR).
It can create very adaptative intervals (with a varying width which truly reflects the model uncertainty).
While providing coverage guarantee on all sub-groups of interest (avoiding biases).
With the possibility to inject prior knowledge about the data or the model.

However, we will also see its disadvantages:

The adaptativity depends on the calibrator we use: it can be difficult to choose the correct calibrator, with the best parameters.
The calibration and even more the inference are much longer than for the other methods. We can reduce the inference time using unsafe_approximation=True, but we lose the strong theoretical guarantees and risk a small miscoverage (even if, most of the time, the coverage is achieved).

To conclude, it can create more adaptative intervals than the other methods, but it can be difficult to find the best settings (calibrator type and parameters) and can have a big computational time.

How does it work?¶

Method's intuition¶

We recall that the standard split method estimates the absolute residuals by a constant \(\hat{q}_{n, \alpha}^+\) (which is the quantile of \(\{|Y_i-\hat{\mu}(X_i)|\}_{1 \leq i \leq n}\)). Then, the prediction interval is:

\[ \hat{C}_{n, \alpha}^{\textrm{split}}(X_{n+1}) = \hat{\mu}(X_{n+1}) \pm \hat{q}_{n, \alpha}^+ \]

The idea of the CCP method is to learn, not a constant, but a function \(q(X)\), to have a different interval width depending on the \(X\) value. Then, we would have:

\[ \hat{C}_{n, \alpha}^{\textrm{CCP}}(X_{n+1}) = \hat{\mu}(X_{n+1}) \pm \hat{q}(X_{n+1}) \]

To be able to find the best function, while having some coverage guarantees, we should select this function inside some defined class of functions \(\mathcal{F}\).

This method is motivated by the following equivalence:

\[ \begin{array}{c} \mathbb{P}(Y_{n+1} \in \hat{C} \; | \; X_{n+1}=x) = 1 - \alpha, \quad \text{for all x} \\ \Longleftrightarrow \\ \mathbb{E} \left[ f(X_{n+1}) \mathbb{I} \left\{ Y_{n+1} \in \hat{C}(X_{n+1}) \right\} \right] = 0, \quad \text{for all measurable f} \\ \end{array} \]

This is the equation corresponding to the perfect conditional coverage, which is theoretically impossible to obtain. Then, relaxing this objective by replacing "all measurable f" with "all f belonging to some class \(\mathcal{F}\)" seems a way to get close to the perfect conditional coverage.

The method follows 3 steps¶

Choose a class of functions. The simple approach is to choose a class of finite dimension \(d \in \mathbb{N}\), using, for any \(\Phi \; : \; \mathbb{R}^d \to \mathbb{R}\) (chosen by the user):

\[ \mathcal{F} = \left\{ \Phi (\cdot)^T \beta : \beta \in \mathbb{R}^d \right\} \]
Find the best function of this class by solving the following optimization problem:

Note

It is actually a quantile regression between the transformation \(\Phi (X)\) and the conformity scores \(S\).

Considering an upper bound \(M\) of the conformity scores, such as \(S_{n+1} < M\):

\[ \hat{g}_M^{n+1} := \arg\min_{g \in \mathcal{F}} \; \frac{1}{n+1} \sum_{i=1}^n{l_{\alpha} (g(X_i), S_i)} \; + \frac{1}{n+1}l_{\alpha} (g(X_{n+1}), M) \]

Warning

In the API, we use by default \(M=\max(\{S_i\}_{i\leq n})\), the maximum conformity score of the calibration set, but you can specify it yourself if a bound is known, considering your data, model and conformity score.

Moreover, it means that there are still small computations which are done for each test point \(X_{n+1}\). If you want to avoid that, you can use unsafe_approximation=True, which only considers:

\[ \hat{g} := \arg\min_{g \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^n{l_{\alpha} (g(X_i), S_i)} \]

However, it may result in a small miscoverage. It is recommended to empirically check the resulting coverage on the test set.
We use this optimized function \(\hat{g}_M^{n+1}\) to compute the prediction intervals:

\[ \hat{C}_M^{n+1}(X_{n+1}) = \{ y : S(X_{n+1}, \: y) \leq \hat{g}_M^{n+1}(X_{n+1}) \} \]

Note

The formulas are generic and work with all conformity scores. But in the case of the absolute residuals, we get:

\[ \hat{C}(X_{n+1}) = \hat{\mu}(X_{n+1}) \pm \hat{g}_M^{n+1}(X_{n+1}) \]

Coverage guarantees¶

Warning

The following guarantees assume that the approximation described above is not used, and that the chosen bound \(M\) is indeed such as \(\forall \text{ test index } i, \; S_i < M\).

Following these steps, we have the coverage guarantee, \(\forall f \in \mathcal{F}\):

\[ \mathbb{P}_f(Y_{n+1} \in \hat{C}_M^{n+1}(X_{n+1})) \geq 1 - \alpha \]

\[ \text{and} \quad \left | \mathbb{E} \left[ f(X_{n+1}) \left(\mathbb{I} \left\{ Y_{n+1} \in \hat{C}_M^{n+1}(X_{n+1}) \right\} - (1 - \alpha) \right) \right] \right | \leq \frac{d}{n+1} \mathbb{E} \left[ \max_{1 \leq i \leq n+1} \left|f(X_i)\right| \right] \]

Note

If we want to have a homogeneous coverage on some given groups in \(\mathcal{G}\), we can use \(\mathcal{F} = \{ x \mapsto \sum_{G \in \mathcal{G}} \; \beta_G \mathbb{I} \{ x \in G \} : \beta_G \in \mathbb{R} \}\), then we have \(\forall G \in \mathcal{G}\):

\[ \begin{aligned} 1 - \alpha &\leq \mathbb{P} \left( Y_{n+1} \in \hat{C}_M^{n+1}(X_{n+1}) \; | \; X_{n+1} \in G \right) \\ &\leq 1- \alpha + \frac{|\mathcal{G}|}{(n+1) \mathbb{P}(X_{n+1} \in G)} \\ &= 1- \alpha + \frac{\text{number of groups in } \mathcal{G}}{\text{number of samples of } \{X_i\} \text{ in } G} \end{aligned} \]

How to use it in practice?¶

Creating a class of functions adapted to our needs¶

The following will provide some tips on how to use the method. For practical examples, see the regression and classification examples using ConditionalSplitConformalRegressor and ConditionalSplitConformalClassifier.

The class of functions is defined with feature_map, passed directly to the conditional estimator. This function returns the \(\Phi(X)\) matrix used by the method.
If you want to avoid bias on sub-groups and ensure a homogeneous coverage on those, you can add indicator functions corresponding to those groups in feature_map.
You can inject prior knowledge in the method through feature_map, if you have information about the conformity scores distribution (domains with different behavior, expected model uncertainty depending on a given feature, etc.).
Empirically test the obtained coverage on a test set, to make sure that the expected coverage is achieved.

Avoid miscoverage¶

To guarantee marginal coverage, you need to have an intercept term in the \(\Phi\) function (meaning, a feature equal to \(1\) for all \(X_i\)).
Keep the number of dimensions \(d\) reasonable compared with the conformalization set size.

References¶

Isaac Gibbs, John J. Cherian, and Emmanuel J. Candès (2023). Conformal Prediction With Conditional Guarantees. arXiv:2305.12616. ↩