Theoretical Description

Two methods for multi-label uncertainty quantification have been implemented in MAPIE so far: Risk-Controlling Prediction Sets (RCPS) [1] and Conformal Risk Control (CRC) [2]. These methods differ in how the underlying threshold is calibrated and in the type of guarantee they provide.

For a multi-label classification problem in a standard independent and identically distributed (i.i.d.) case, our training data (X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\} has an unknown distribution P_{X, Y}.

For any risk level \alpha between 0 and 1, the methods implemented in MAPIE allow the user to construct a prediction set \hat{C}_{n, \alpha}(X_{n+1}) for a new observation \left( X_{n+1},Y_{n+1} \right) with a guarantee on the recall. RCPS and CRC give two slightly different guarantees:

  • RCPS:

\mathbb{P}(R(\mathcal{T}_{\hat{\lambda}}) \leq \alpha ) \geq 1 - \delta

  • CRC:

\mathbb{E}\left[L_{n+1}(\hat{\lambda})\right] \leq \alpha
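
To make the two methods concrete before detailing them, here is a minimal usage sketch with MAPIE's multi-label API. The exact names used below (MapieMultiLabelClassifier, the method argument, and the alpha/delta arguments of predict) are assumptions that may differ between MAPIE versions; the API reference should be checked before use.

    # Hedged sketch: class and argument names are assumptions and may differ
    # between MAPIE versions.
    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier
    from mapie.multi_label_classification import MapieMultiLabelClassifier

    X, y = make_multilabel_classification(n_samples=1000, n_classes=5, random_state=0)
    X_fit, y_fit = X[:600], y[:600]          # data used to fit the base model
    X_cal, y_cal = X[600:900], y[600:900]    # calibration data used to choose lambda_hat
    X_test = X[900:]

    clf = MultiOutputClassifier(LogisticRegression()).fit(X_fit, y_fit)

    # RCPS: recall controlled at level 1 - alpha with probability at least 1 - delta.
    mapie = MapieMultiLabelClassifier(estimator=clf, method="rcps")
    mapie.fit(X_cal, y_cal)
    _, y_pred_set = mapie.predict(X_test, alpha=0.1, delta=0.1)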

1. Risk-Controlling Prediction Sets

1.1. General settings

Let us first introduce the setting and the notations of the method:

  • Let \mathcal{T}_{\lambda}: X \longrightarrow Y' be a set-valued function (a tolerance region) that maps a feature vector to a set-valued prediction. This function is built from a model previously fitted on the training data. It is indexed by a one-dimensional parameter \lambda taking values in \Lambda \subset \mathbb{R} \cup \{ \pm \infty \} such that:

\lambda_1 < \lambda_2 \Rightarrow \mathcal{T}_{\lambda_1}(x) \subset \mathcal{T}_{\lambda_2}(x)

  • Let L: Y\times Y' \longrightarrow \mathbb{R}^+ be a loss function on a prediction set with the following nesting property:

S_1 \subset S_2 \Rightarrow L(y, S_1) \geq L(y, S_2)

  • Let R be the risk associated with a set-valued predictor:

R(\mathcal{T}_{\lambda}) = \mathbb{E}[L(Y, \mathcal{T}_{\lambda}(X))]

The goal of the method is to compute an Upper Confidence Bound (UCB) \hat{R}^+(\lambda) of R(\lambda) and then to find \hat{\lambda} as follows:

\hat{\lambda} = \inf\{\lambda \in \Lambda: \hat{R}^+(\lambda ') < \alpha, \forall \lambda ' \geq \lambda \}

The figure below illustrates this procedure:

[Figure: the upper confidence bound \hat{R}^+(\lambda) and the resulting choice of \hat{\lambda} (r_hat_plus.png)]

Under these settings, the RCPS method gives the following guarantee on the recall:

\mathbb{P}(R(\mathcal{T}_{\hat{\lambda}}) \leq \alpha ) \geq 1 - \delta
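
As an illustration of this selection rule (a sketch, not MAPIE's internal implementation), assume we are given a helper ucb that returns \hat{R}^+(\lambda) for any candidate \lambda (for instance one of the bounds of the next subsection) and an increasing grid of candidate values; \hat{\lambda} is then the smallest candidate whose UCB, and that of every larger candidate, stays below \alpha:

    import numpy as np

    def find_lambda_hat(ucb, lambdas, alpha):
        """Smallest lambda such that R_hat^+(lambda') < alpha for all lambda' >= lambda.

        `ucb` maps a candidate lambda to its upper confidence bound and
        `lambdas` is an increasing grid of candidates (both assumed given).
        """
        ucbs = np.array([ucb(lam) for lam in lambdas])
        # valid[i] is True when the UCB is below alpha at lambdas[i]
        # and at every larger candidate.
        valid = np.flip(np.cumprod(np.flip(ucbs < alpha))).astype(bool)
        return lambdas[np.argmax(valid)] if valid.any() else None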

1.2. Bounds calculation

In this section, we only consider bounded losses (for now, only the 1-recall loss is implemented). We present three different Upper Confidence Bounds (UCB) of R(\lambda) (Hoeffding, Bernstein and Waudby-Smith–Ramdas), all based on the empirical risk, which is defined as follows:

\hat{R}(\lambda) = \frac{1}{n}\sum_{i=1}^n L(Y_i, T_{\lambda}(X_i))
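
For instance, with the 1-recall loss and a score-thresholding predictor T_{\lambda}(x) = \{ k : f(x)_k \geq 1 - \lambda \} (the same rule as in the CRC section below), the empirical risk can be computed as in this sketch:

    import numpy as np

    def recall_loss(y_true, y_set):
        """1 - recall of each prediction set (y_true and y_set are binary arrays)."""
        n_relevant = np.maximum(y_true.sum(axis=1), 1)   # guard against empty label sets
        n_retrieved = (y_true * y_set).sum(axis=1)
        return 1 - n_retrieved / n_relevant

    def empirical_risk(y_true, y_scores, lam):
        """R_hat(lambda) on a calibration set for the score-thresholding predictor."""
        y_set = (y_scores >= 1 - lam).astype(int)
        return recall_loss(y_true, y_set).mean()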

1.2.1. Hoeffding Bound

Suppose the loss is bounded above by one. Then, by Hoeffding's inequality:

\mathbb{P}\left(\hat{R}(\lambda) - R(\lambda) \leq -x\right) \leq \exp(-2nx^2)

which implies the following UCB:

\hat{R}_{Hoeffding}^+(\lambda) = \hat{R}(\lambda) + \sqrt{\frac{1}{2n}\log\frac{1}{\delta}}
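
For completeness, a one-line sketch of this bound, given the empirical risk r_hat, the calibration size n and the error level delta:

    import numpy as np

    def hoeffding_ucb(r_hat, n, delta):
        """Hoeffding upper confidence bound, valid for a loss bounded by 1."""
        return r_hat + np.sqrt(np.log(1 / delta) / (2 * n))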

1.2.2. Bernstein Bound

Contrary to the Hoeffding bound, which can be overly conservative, the Bernstein UCB takes the empirical variance into account and yields smaller prediction sets:

\hat{R}_{Bernstein}^+(\lambda) = \hat{R}(\lambda) + \hat{\sigma}(\lambda)\sqrt{\frac{2\log(2/\delta)}{n}} + \frac{7\log (2/\delta)}{3(n-1)}

Where:

\hat{\sigma}^2(\lambda) = \frac{1}{n-1}\sum_{i=1}^n(L(Y_i, T_{\lambda}(X_i)) - \hat{R}(\lambda))^2
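
A corresponding sketch, taking as input the vector of calibration losses L(Y_i, T_{\lambda}(X_i)):

    import numpy as np

    def bernstein_ucb(losses, delta):
        """Empirical Bernstein upper confidence bound for losses in [0, 1]."""
        n = len(losses)
        r_hat = losses.mean()
        var_hat = losses.var(ddof=1)   # (1 / (n - 1)) * sum_i (L_i - R_hat)^2
        return (
            r_hat
            + np.sqrt(2 * var_hat * np.log(2 / delta) / n)
            + 7 * np.log(2 / delta) / (3 * (n - 1))
        )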

1.2.3. Waudby-Smith–Ramdas

This last UCB is the one recommended by the authors of [1] when working with a bounded loss, as it yields the smallest prediction sets while providing the same risk guarantee. It is defined as follows:

Let L_i (\lambda) = L(Y_i, T_{\lambda}(X_i)) and

\hat{\mu}_i (\lambda) = \frac{1/2 + \sum_{j=1}^i L_j (\lambda)}{1 + i},
\hat{\sigma}_i^2 (\lambda) = \frac{1/4 + \sum_{j=1}^i (L_j (\lambda) - \hat{\mu}_j (\lambda))^2}{1 + i},
\nu_i (\lambda) = \min \left\{ 1, \sqrt{\frac{2\log (1/\delta)}{n \hat{\sigma}_{i-1}^2 (\lambda)}}\right\}

Further let:

K_i(R, \lambda) = \prod_{j=1}^i\{1 - \nu_j(\lambda)(L_j (\lambda) - R)\}

Then:

\hat{R}_{WSR}^+(\lambda) = \inf \{ R \geq 0 : \max_{i=1,\ldots,n} K_i(R, \lambda) > \frac{1}{\delta}\}
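
Putting the formulas above together, here is a sketch of this bound; the final infimum over R is approximated with a simple grid search, which is an illustrative simplification (MAPIE may use a different root-finding strategy):

    import numpy as np

    def wsr_ucb(losses, delta, n_grid=1000):
        """Waudby-Smith-Ramdas upper confidence bound for losses in [0, 1]."""
        n = len(losses)
        i = np.arange(1, n + 1)
        # Running mean and variance estimates mu_hat_i and sigma_hat_i^2.
        mu_hat = (0.5 + np.cumsum(losses)) / (1 + i)
        sigma2_hat = (0.25 + np.cumsum((losses - mu_hat) ** 2)) / (1 + i)
        sigma2_prev = np.concatenate(([0.25], sigma2_hat[:-1]))   # sigma_hat_{i-1}^2
        nu = np.minimum(1.0, np.sqrt(2 * np.log(1 / delta) / (n * sigma2_prev)))
        # K_i(R, lambda) is non-decreasing in R, so the infimum is the first grid
        # value at which the capital process max_i K_i(R, lambda) exceeds 1 / delta.
        for r in np.linspace(0, 1, n_grid):
            k_max = np.max(np.cumprod(1 - nu * (losses - r)))
            if k_max > 1 / delta:
                return r
        return 1.0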

2. Conformal Risk Control

The goal of this method is to control any monotone and bounded loss. The result of this method can be expressed as follows:

\mathbb{E}\left[L_{n+1}(\hat{\lambda})\right] \leq \alpha

Where L_{i}(\lambda) = l(C_{\lambda}(X_{i}), Y_{i})

In the case of multi-label classification, the prediction sets are built by thresholding the predicted scores: C_{\lambda}(X) = \{ k : f(X)_k \geq 1 - \lambda \}, where f(X)_k is the score predicted for label k.

To find the optimal value of \lambda, the following algorithm is applied:

\hat{\lambda} = \inf \{ \lambda: \frac{n}{n + 1}\hat{R}_n (\lambda) + \frac{B}{n + 1} \leq \alpha \}

Where B is an upper bound on the loss (B = 1 for the 1-recall loss) and \hat{R}_n is the empirical risk on the calibration data:

\hat{R}_n (\lambda) = (L_{1}(\lambda) + ... + L_{n}(\lambda)) / n
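
A sketch of this search for the multi-label case above (1-recall loss, hence B = 1); the grid of candidate \lambda values and the scan over it are illustrative simplifications:

    import numpy as np

    def crc_lambda_hat(y_true, y_scores, alpha, B=1.0, n_grid=1001):
        """Smallest lambda with n/(n+1) * R_hat_n(lambda) + B/(n+1) <= alpha."""
        n = len(y_true)
        n_relevant = np.maximum(y_true.sum(axis=1), 1)   # guard against empty label sets
        for lam in np.linspace(0, 1, n_grid):
            pred_set = (y_scores >= 1 - lam).astype(int)               # C_lambda(X_i)
            losses = 1 - (y_true * pred_set).sum(axis=1) / n_relevant  # 1 - recall
            if n / (n + 1) * losses.mean() + B / (n + 1) <= alpha:
                return lam
        return None   # alpha too small to be achievable on this calibration set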

3. References

[1] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan. Distribution-free, risk-controlling prediction sets. CoRR, abs/2101.02703, 2021. URL https://arxiv.org/abs/2101.02703.

[2] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. 2022.