Tutorial for classification

In this tutorial, we compare the prediction sets estimated by the conformal methods implemented in MAPIE on a toy two-dimensional dataset.

Throughout this tutorial, we will answer the following questions:

  • How does the number of classes in the prediction sets vary according to the significance level ?

  • Is the chosen conformal method well calibrated ?

  • What are the pros and cons of the conformal methods included in MAPIE ?

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

from mapie.classification import MapieClassifier
from mapie.metrics import (classification_coverage_score,

1. Conformal Prediction method using the softmax score of the true label

We will use MAPIE to estimate a prediction set of several classes such that the probability that the true label of a new test point is included in the prediction set is always higher than the target confidence level : P(Y_{n+1} \in \hat{C}_{n, \alpha}(X_{n+1}) \geq 1 - \alpha. We start by using the softmax score output by the base classifier as the conformity score on a toy two-dimensional dataset.

We estimate the prediction sets as follows :

  • Generate a dataset with train, calibration and test, the model is fitted on the training set.

  • Set the conformal score S_i = \hat{f}(X_{i})_{y_i} the softmax output of the true class for each sample in the calibration set.

  • Define \hat{q} as being the (n + 1) (\alpha) / n previous quantile of S_{1}, ..., S_{n} (this is essentially the quantile \alpha, but with a small sample correction).

  • Finally, for a new test data point (where X_{n + 1} is known but Y_{n + 1} is not), create a prediction set C(X_{n+1}) = \{y: \hat{f}(X_{n+1})_{y} > \hat{q}\} which includes all the classes with a sufficiently high softmax output.

# We use a two-dimensional toy dataset with three labels. The distribution of
# the data is a bivariate normal with diagonal covariance matrices for each
# label.

centers = [(0, 3.5), (-2, 0), (2, 0)]
covs = [np.eye(2), np.eye(2)*2, np.diag([5, 1])]
x_min, x_max, y_min, y_max, step = -6, 8, -6, 8, 0.1
n_samples = 1000
n_classes = 3
X = np.vstack([
    np.random.multivariate_normal(center, cov, n_samples)
    for center, cov in zip(centers, covs)
y = np.hstack([np.full(n_samples, i) for i in range(n_classes)])
X_train_cal, X_test, y_train_cal, y_test = train_test_split(
    X, y, test_size=0.2
X_train, X_cal, y_train, y_cal = train_test_split(
    X_train_cal, y_train_cal, test_size=0.25

xx, yy = np.meshgrid(
    np.arange(x_min, x_max, step), np.arange(x_min, x_max, step)
X_test_mesh = np.stack([xx.ravel(), yy.ravel()], axis=1)

Let’s see our training data.

colors = {0: "#1f77b4", 1: "#ff7f0e", 2:  "#2ca02c", 3: "#d62728"}
y_train_col = list(map(colors.get, y_train))
fig = plt.figure()
    X_train[:, 0],
    X_train[:, 1],
plot main tutorial classification

We fit our training data with a Gaussian Naive Base estimator. And then we apply MAPIE in the calibration data with the method score to the estimator indicating that it has already been fitted with cv=”prefit”. We then estimate the prediction sets with differents alpha values with a fit and predict process.

  • y_pred_score: represents the prediction in the test set by the base estimator.

  • y_ps_score: reprensents the prediction sets estimated by MAPIE with the “lac” method.

def plot_scores(n, alphas, scores, quantiles):
    colors = {0: "#1f77b4", 1: "#ff7f0e", 2: "#2ca02c"}
    plt.figure(figsize=(7, 5))
    plt.hist(scores, bins="auto")
    for i, quantile in enumerate(quantiles):
            label=f"alpha = {alphas[i]}"
    plt.title("Distribution of scores")

Let’s see the distribution of the scores with the calculated quantiles.

Distribution of scores

The estimated quantile increases with alpha. A high value of alpha can potentially lead to a high quantile which would not necessarily be reached by any class in uncertain areas, resulting in null regions.

We will now visualize the differences between the prediction sets of the different values of alpha.

def plot_results(alphas, X, y_pred, y_ps):
    tab10 = plt.cm.get_cmap('Purples', 4)
    colors = {0: "#1f77b4", 1: "#ff7f0e", 2:  "#2ca02c", 3: "#d62728"}
    y_pred_col = list(map(colors.get, y_pred))
    fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(10, 10))
    axs = {0: ax1, 1: ax2, 2:  ax3, 3: ax4}
        X[:, 0],
        X[:, 1],
    axs[0].set_title("Predicted labels")
    for i, alpha in enumerate(alphas):
        y_pi_sums = y_ps[:, :, i].sum(axis=1)
        num_labels = axs[i+1].scatter(
            X[:, 0],
            X[:, 1],
        plt.colorbar(num_labels, ax=axs[i+1])
        axs[i+1].set_title(f"Number of labels for alpha={alpha}")

plot_results(alpha, X_test_mesh, y_pred_score, y_ps_score)
Predicted labels, Number of labels for alpha=0.2, Number of labels for alpha=0.1, Number of labels for alpha=0.05

When the class coverage is not large enough, the prediction sets can be empty when the model is uncertain at the border between two classes. The null region disappears for larger class coverages but ambiguous classification regions arise with several labels included in the prediction sets highlighting the uncertain behaviour of the base classifier.

Let’s now study the effective coverage and the mean prediction set widths as function of the 1-\alpha target coverage. To this aim, we use once again the predict method of MAPIE to estimate predictions sets on a large number of \alpha values.

alpha2 = np.arange(0.02, 0.98, 0.02)
_, y_ps_score2 = mapie_score.predict(X_test, alpha=alpha2)
coverages_score = [
    classification_coverage_score(y_test, y_ps_score2[:, :, i])
    for i, _ in enumerate(alpha2)
widths_score = [
    classification_mean_width_score(y_ps_score2[:, :, i])
    for i, _ in enumerate(alpha2)

def plot_coverages_widths(alpha, coverage, width, method):
    fig, axs = plt.subplots(1, 2, figsize=(12, 5))
    axs[0].scatter(1 - alpha, coverage, label=method)
    axs[0].set_xlabel("1 - alpha")
    axs[0].set_ylabel("Coverage score")
    axs[0].plot([0, 1], [0, 1], label="x=y", color="black")
    axs[1].scatter(1 - alpha, width, label=method)
    axs[1].set_xlabel("1 - alpha")
    axs[1].set_ylabel("Average size of prediction sets")

plot_coverages_widths(alpha2, coverages_score, widths_score, "lac")
plot main tutorial classification

2. Conformal Prediction method using the cumulative softmax score

We saw in the previous section that the “lac” method is well calibrated by providing accurate coverage levels. However, it tends to give null prediction sets for uncertain regions, especially when the \alpha value is high. MAPIE includes another method, called Adaptive Prediction Set (APS), whose conformity score is the cumulated score of the softmax output until the true label is reached (see the theoretical description for more details). We will see in this Section that this method no longer estimates null prediction sets but by giving slightly bigger prediction sets.

Let’s visualize the prediction sets obtained with the APS method on the test set after fitting MAPIE on the calibration set.

mapie_aps = MapieClassifier(
    estimator=clf, cv="prefit", method="aps"
mapie_aps.fit(X_cal, y_cal)
alpha = [0.2, 0.1, 0.05]
y_pred_aps, y_ps_aps = mapie_aps.predict(
    X_test_mesh, alpha=alpha, include_last_label=True

plot_results(alpha, X_test_mesh, y_pred_aps, y_ps_aps)
Predicted labels, Number of labels for alpha=0.2, Number of labels for alpha=0.1, Number of labels for alpha=0.05

One can notice that the uncertain regions are emphasized by wider boundaries, but without null prediction sets with respect to the first “lac” method.

_, y_ps_aps2 = mapie_aps.predict(
    X_test, alpha=alpha2, include_last_label="randomized"
coverages_aps = [
    classification_coverage_score(y_test, y_ps_aps2[:, :, i])
    for i, _ in enumerate(alpha2)
widths_aps = [
    classification_mean_width_score(y_ps_aps2[:, :, i])
    for i, _ in enumerate(alpha2)

plot_coverages_widths(alpha2, coverages_aps, widths_aps, "lac")
plot main tutorial classification

This method also gives accurate calibration plots, meaning that the effective coverage level is always very close to the target coverage, sometimes at the expense of slightly bigger prediction sets.

Total running time of the script: ( 0 minutes 3.354 seconds)

Gallery generated by Sphinx-Gallery