Theoretical Description

Two methods for multi-label uncertainty quantification have been implemented in MAPIE so far: Risk-Controlling Prediction Sets (RCPS) [1] and Conformal Risk Control (CRC) [2]. These methods differ in how the underlying threshold is calibrated and in the type of guarantee they provide.

For a multi-label classification problem in a standard independent and identically distributed (i.i.d.) case, our training data (X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\} has an unknown distribution P_{X, Y}.

For any risk level \alpha between 0 and 1, the methods implemented in MAPIE allow the user to construct a prediction set \hat{C}_{n, \alpha}(X_{n+1}) for a new observation \left( X_{n+1},Y_{n+1} \right) with a guarantee on the recall. RCPS and CRC give two slightly different guarantees:

  • RCPS:

\mathbb{P}(R(\mathcal{T}_{\hat{\lambda}}) \leq \alpha ) \geq 1 - \delta

  • CRC:

\mathbb{E}\left[L_{n+1}(\hat{\lambda})\right] \leq \alpha
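
To make the two methods concrete before detailing them, here is a minimal usage sketch with MAPIE's multi-label API. The exact names used below (MapieMultiLabelClassifier, the method argument, and the alpha/delta arguments of predict) are assumptions that may differ between MAPIE versions; the API reference should be checked before use.

    # Hedged sketch: class and argument names are assumptions and may differ
    # between MAPIE versions.
    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier
    from mapie.multi_label_classification import MapieMultiLabelClassifier

    X, y = make_multilabel_classification(n_samples=1000, n_classes=5, random_state=0)
    X_fit, y_fit = X[:600], y[:600]          # data used to fit the base model
    X_cal, y_cal = X[600:900], y[600:900]    # calibration data used to choose lambda_hat
    X_test = X[900:]

    clf = MultiOutputClassifier(LogisticRegression()).fit(X_fit, y_fit)

    # RCPS: recall controlled at level 1 - alpha with probability at least 1 - delta.
    mapie = MapieMultiLabelClassifier(estimator=clf, method="rcps")
    mapie.fit(X_cal, y_cal)
    _, y_pred_set = mapie.predict(X_test, alpha=0.1, delta=0.1)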

1. Risk-Controlling Prediction Sets

1.1. General settings

Let us first introduce the setting and the notations of the method:

  • Let \mathcal{T}_{\lambda}: X \longrightarrow Y' be a set-valued function (a tolerance region) that maps a feature vector to a set-valued prediction. This function is built from a model previously fitted on the training data. It is indexed by a one-dimensional parameter \lambda taking values in \Lambda \subset \mathbb{R} \cup \{ \pm \infty \} such that:

\lambda_1 < \lambda_2 \Rightarrow \mathcal{T}_{\lambda_1}(x) \subset \mathcal{T}_{\lambda_2}(x)

  • Let L: Y\times Y' \longrightarrow \mathbb{R}^+ be a loss function on a prediction set with the following nesting property:

S_1 \subset S_2 \Rightarrow L(y, S_1) \geq L(y, S_2)

  • Let R be the risk associated with a set-valued predictor:

R(\mathcal{T}_{\lambda}) = \mathbb{E}[L(Y, \mathcal{T}_{\lambda}(X))]

The goal of the method is to compute an Upper Confidence Bound (UCB) \hat{R}^+(\lambda) of R(\lambda) and then to find \hat{\lambda} as follows:

\hat{\lambda} = \inf\{\lambda \in \Lambda: \hat{R}^+(\lambda ') < \alpha, \forall \lambda ' \geq \lambda \}

The figure below illustrates this procedure:

[Figure: the upper confidence bound \hat{R}^+(\lambda) and the resulting choice of \hat{\lambda} (r_hat_plus.png)]

Under these settings, the RCPS method gives the following guarantee on the recall:

\mathbb{P}(R(\mathcal{T}_{\hat{\lambda}}) \leq \alpha ) \geq 1 - \delta
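
As an illustration of this selection rule (a sketch, not MAPIE's internal implementation), assume we are given a helper ucb that returns \hat{R}^+(\lambda) for any candidate \lambda (for instance one of the bounds of the next subsection) and an increasing grid of candidate values; \hat{\lambda} is then the smallest candidate whose UCB, and that of every larger candidate, stays below \alpha:

    import numpy as np

    def find_lambda_hat(ucb, lambdas, alpha):
        """Smallest lambda such that R_hat^+(lambda') < alpha for all lambda' >= lambda.

        `ucb` maps a candidate lambda to its upper confidence bound and
        `lambdas` is an increasing grid of candidates (both assumed given).
        """
        ucbs = np.array([ucb(lam) for lam in lambdas])
        # valid[i] is True when the UCB is below alpha at lambdas[i]
        # and at every larger candidate.
        valid = np.flip(np.cumprod(np.flip(ucbs < alpha))).astype(bool)
        return lambdas[np.argmax(valid)] if valid.any() else None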

1.2. Bounds calculation

In this section, we only consider bounded losses (for now, only the 1-recall loss is implemented). We present three different Upper Confidence Bounds (UCB) of R(\lambda) (Hoeffding, Bernstein and Waudby-Smith–Ramdas), all based on the empirical risk, which is defined as follows:

\hat{R}(\lambda) = \frac{1}{n}\sum_{i=1}^n L(Y_i, T_{\lambda}(X_i))
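
For instance, with the 1-recall loss and a score-thresholding predictor T_{\lambda}(x) = \{ k : f(x)_k \geq 1 - \lambda \} (the same rule as in the CRC section below), the empirical risk can be computed as in this sketch:

    import numpy as np

    def recall_loss(y_true, y_set):
        """1 - recall of each prediction set (y_true and y_set are binary arrays)."""
        n_relevant = np.maximum(y_true.sum(axis=1), 1)   # guard against empty label sets
        n_retrieved = (y_true * y_set).sum(axis=1)
        return 1 - n_retrieved / n_relevant

    def empirical_risk(y_true, y_scores, lam):
        """R_hat(lambda) on a calibration set for the score-thresholding predictor."""
        y_set = (y_scores >= 1 - lam).astype(int)
        return recall_loss(y_true, y_set).mean()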

1.2.1. Hoeffding Bound

Suppose the loss is bounded above by one. Then, by Hoeffding's inequality:

\mathbb{P}\left(\hat{R}(\lambda) - R(\lambda) \leq -x\right) \leq \exp(-2nx^2)

which implies the following UCB:

\hat{R}_{Hoeffding}^+(\lambda) = \hat{R}(\lambda) + \sqrt{\frac{1}{2n}\log\frac{1}{\delta}}
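
For completeness, a one-line sketch of this bound, given the empirical risk r_hat, the calibration size n and the error level delta:

    import numpy as np

    def hoeffding_ucb(r_hat, n, delta):
        """Hoeffding upper confidence bound, valid for a loss bounded by 1."""
        return r_hat + np.sqrt(np.log(1 / delta) / (2 * n))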

1.2.2. Bernstein Bound

Contrary to the Hoeffding bound, which can be overly conservative, the Bernstein UCB takes the empirical variance into account and yields smaller prediction sets:

\hat{R}_{Bernstein}^+(\lambda) = \hat{R}(\lambda) + \hat{\sigma}(\lambda)\sqrt{\frac{2\log(2/\delta)}{n}} + \frac{7\log (2/\delta)}{3(n-1)}

Where:

\hat{\sigma}^2(\lambda) = \frac{1}{n-1}\sum_{i=1}^n(L(Y_i, T_{\lambda}(X_i)) - \hat{R}(\lambda))^2
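
A corresponding sketch, taking as input the vector of calibration losses L(Y_i, T_{\lambda}(X_i)):

    import numpy as np

    def bernstein_ucb(losses, delta):
        """Empirical Bernstein upper confidence bound for losses in [0, 1]."""
        n = len(losses)
        r_hat = losses.mean()
        var_hat = losses.var(ddof=1)   # (1 / (n - 1)) * sum_i (L_i - R_hat)^2
        return (
            r_hat
            + np.sqrt(2 * var_hat * np.log(2 / delta) / n)
            + 7 * np.log(2 / delta) / (3 * (n - 1))
        )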

1.2.3. Waudby-Smith–Ramdas

This last UCB is the one recommended by the authors of [1] when working with a bounded loss, as it yields the smallest prediction sets while providing the same risk guarantee. It is defined as follows:

Let L_i (\lambda) = L(Y_i, T_{\lambda}(X_i)) and

\hat{\mu}_i (\lambda) = \frac{1/2 + \sum_{j=1}^i L_j (\lambda)}{1 + i},
\hat{\sigma}_i^2 (\lambda) = \frac{1/4 + \sum_{j=1}^i (L_j (\lambda) - \hat{\mu}_j (\lambda))^2}{1 + i},
\nu_i (\lambda) = \min \left\{ 1, \sqrt{\frac{2\log (1/\delta)}{n \hat{\sigma}_{i-1}^2 (\lambda)}}\right\}

Further let:

K_i(R, \lambda) = \prod_{j=1}^i\{1 - \nu_j(\lambda)(L_j (\lambda) - R)\}

Then:

\hat{R}_{WSR}^+(\lambda) = \inf \{ R \geq 0 : \max_{i=1,\ldots,n} K_i(R, \lambda) > \frac{1}{\delta}\}
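
Putting the formulas above together, here is a sketch of this bound; the final infimum over R is approximated with a simple grid search, which is an illustrative simplification (MAPIE may use a different root-finding strategy):

    import numpy as np

    def wsr_ucb(losses, delta, n_grid=1000):
        """Waudby-Smith-Ramdas upper confidence bound for losses in [0, 1]."""
        n = len(losses)
        i = np.arange(1, n + 1)
        # Running mean and variance estimates mu_hat_i and sigma_hat_i^2.
        mu_hat = (0.5 + np.cumsum(losses)) / (1 + i)
        sigma2_hat = (0.25 + np.cumsum((losses - mu_hat) ** 2)) / (1 + i)
        sigma2_prev = np.concatenate(([0.25], sigma2_hat[:-1]))   # sigma_hat_{i-1}^2
        nu = np.minimum(1.0, np.sqrt(2 * np.log(1 / delta) / (n * sigma2_prev)))
        # K_i(R, lambda) is non-decreasing in R, so the infimum is the first grid
        # value at which the capital process max_i K_i(R, lambda) exceeds 1 / delta.
        for r in np.linspace(0, 1, n_grid):
            k_max = np.max(np.cumprod(1 - nu * (losses - r)))
            if k_max > 1 / delta:
                return r
        return 1.0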

2. Conformal Risk Control

The goal of this method is to control any monotone and bounded loss. The result of this method can be expressed as follows:

\mathbb{E}\left[L_{n+1}(\hat{\lambda})\right] \leq \alpha

Where L_{i}(\lambda) = l(C_{\lambda}(X_{i}), Y_{i})

In the case of multi-label classification, the prediction sets are built by thresholding the predicted scores: C_{\lambda}(X) = \{ k : f(X)_k \geq 1 - \lambda \}, where f(X)_k is the score predicted for label k.

To find the optimal value of \lambda, the following algorithm is applied:

\hat{\lambda} = \inf \{ \lambda: \frac{n}{n + 1}\hat{R}_n (\lambda) + \frac{B}{n + 1} \leq \alpha \}

Where B is an upper bound on the loss (B = 1 for the 1-recall loss) and \hat{R}_n is the empirical risk on the calibration data:

\hat{R}_n (\lambda) = (L_{1}(\lambda) + ... + L_{n}(\lambda)) / n
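
A sketch of this search for the multi-label case above (1-recall loss, hence B = 1); the grid of candidate \lambda values and the scan over it are illustrative simplifications:

    import numpy as np

    def crc_lambda_hat(y_true, y_scores, alpha, B=1.0, n_grid=1001):
        """Smallest lambda with n/(n+1) * R_hat_n(lambda) + B/(n+1) <= alpha."""
        n = len(y_true)
        n_relevant = np.maximum(y_true.sum(axis=1), 1)   # guard against empty label sets
        for lam in np.linspace(0, 1, n_grid):
            pred_set = (y_scores >= 1 - lam).astype(int)               # C_lambda(X_i)
            losses = 1 - (y_true * pred_set).sum(axis=1) / n_relevant  # 1 - recall
            if n / (n + 1) * losses.mean() + B / (n + 1) <= alpha:
                return lam
        return None   # alpha too small to be achievable on this calibration set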

3. References

[1] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan. Distribution-free, risk-controlling prediction sets. CoRR, abs/2101.02703, 2021. URL https://arxiv.org/abs/2101.02703.

[2] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. 2022.