Tutorial for tabular regression with Mondrian

In this tutorial, we use MAPIE to compare prediction intervals obtained with classical (split) conformal prediction against those obtained with Mondrian conformal prediction on a simple one-dimensional problem. The ground truth is a sinusoidal function with added noise, and the data is divided into 10 groups. The goal is to estimate prediction intervals for new data points and to compare their coverage group by group. Throughout this tutorial, we will answer the following questions:

  • How to use MAPIE to estimate prediction intervals for a regression problem?

  • How to use Mondrian conformal prediction intervals for regression?

  • How to compare the coverage of the prediction intervals by groups?
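For reference, the two coverage notions compared in this tutorial can be written as follows. The notation is introduced here for illustration only: $\alpha$ is the target miscoverage rate (0.1 for the 90% line drawn in the final plot), $\hat{C}(X)$ is the prediction interval, and $G$ is the group a point belongs to. Classical conformal prediction targets the first (marginal) statement, while Mondrian conformal prediction targets the second (one statement per group).

```latex
% Marginal coverage: holds on average over all groups
\mathbb{P}\left( Y \in \hat{C}(X) \right) \geq 1 - \alpha

% Group-conditional coverage: holds within each group g
\mathbb{P}\left( Y \in \hat{C}(X) \mid G = g \right) \geq 1 - \alpha
\quad \text{for every group } g
```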

import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from mapie.metrics import regression_coverage_score_v2
from mapie.mondrian import MondrianCP
from mapie.regression import MapieRegressor

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
warnings.filterwarnings("ignore")

1. Create the noisy dataset

We create a dataset with 10 groups, each of those groups having a different level of noise.
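The data generation code is not shown here, so the following is a minimal sketch of one way to build such a dataset. The group size, the noise schedule, and the use of `np.sin` as the ground truth are assumptions chosen for illustration; only the names `X`, `y`, and `partition` are taken from the rest of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

n_groups = 10
group_size = 1_000              # hypothetical size, chosen for illustration
n_points = n_groups * group_size

# One-dimensional inputs, shaped (n_points, 1) as scikit-learn expects.
X = np.linspace(0, 10, n_points).reshape(-1, 1)

# Group label of each point: the first 1000 points form group 0, and so on.
partition = np.repeat(np.arange(n_groups), group_size)

# Hypothetical noise schedule: the standard deviation grows with the group index.
noise_std = np.linspace(0.1, 1.0, n_groups)

# Sinusoidal ground truth plus Gaussian noise whose scale depends on the group.
y = np.sin(X[:, 0]) + rng.normal(0.0, noise_std[partition])
```

Because the groups are contiguous along the x axis in this sketch, coloring a scatter plot by `partition` makes the change in noise level directly visible.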

We plot the dataset with the partition as colors.

[Figure: the dataset, colored by partition]

2. Split the dataset into a training set, a calibration set, and a test set.

# Hold out 20% of the data as the test set. The partition labels are split
# in the same call so that they stay aligned with X and y.
(
    X_train_temp, X_test,
    y_train_temp, y_test,
    partition_train_temp, partition_test,
) = train_test_split(X, y, partition, test_size=0.2, random_state=0)

# Split the remainder in half: one calibration set and one training set.
(
    X_cal, X_train,
    y_cal, y_train,
    partition_cal, partition_train,
) = train_test_split(
    X_train_temp, y_train_temp, partition_train_temp,
    test_size=0.5, random_state=0
)

We plot the training set, the calibration set, and the test set.

f, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].scatter(X_train, y_train, c=partition_train)
ax[0].set_title("Train set")
ax[1].scatter(X_cal, y_cal, c=partition_cal)
ax[1].set_title("Calibration set")
ax[2].scatter(X_test, y_test, c=partition_test)
ax[2].set_title("Test set")
plt.show()
[Figure: scatter plots of the train set, calibration set, and test set]

3. Fit a random forest regressor on the training set.

RandomForestRegressor()

4. Fit a MapieRegressor and a MondrianCP on the calibration set.

MondrianCP(mapie_estimator=MapieRegressor(cv='prefit',
                                          estimator=RandomForestRegressor()))

5. Predict the prediction intervals on the test set with both methods.
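To make the difference between the two calibrations concrete, here is a minimal NumPy-only sketch of the underlying idea. It is not MAPIE's implementation and does not use its API: `np.sin` stands in for an already fitted model, the three noise levels are hypothetical, and all names are illustrative. Split conformal prediction computes one quantile of the absolute calibration residuals for the whole dataset, whereas the Mondrian variant computes one quantile per group.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic
    if k > n:                                # too few calibration points
        return np.inf
    return np.sort(scores)[k - 1]

rng = np.random.default_rng(0)
alpha = 0.1
n_groups = 3
noise_std = np.array([0.1, 0.5, 1.0])  # hypothetical per-group noise levels

def make_data(n_per_group):
    """Draw points from a noisy sinusoid, with noise depending on the group."""
    groups = np.repeat(np.arange(n_groups), n_per_group)
    x = rng.uniform(0, 10, groups.size)
    y = np.sin(x) + rng.normal(0.0, noise_std[groups])
    return x, y, groups

x_cal, y_cal, g_cal = make_data(2000)
x_test, y_test, g_test = make_data(2000)

predict = np.sin  # stand-in for a prefit regressor

# Conformity scores: absolute residuals on the calibration set.
scores_cal = np.abs(y_cal - predict(x_cal))

# Split conformal: a single quantile shared by every group.
q_split = conformal_quantile(scores_cal, alpha)

# Mondrian conformal: one quantile per group, from that group's scores only.
q_group = np.array(
    [conformal_quantile(scores_cal[g_cal == g], alpha) for g in range(n_groups)]
)

# Intervals are prediction +/- quantile; columns are (lower, upper).
pred_test = predict(x_test)
intervals_split = np.stack([pred_test - q_split, pred_test + q_split], axis=1)
half_width = q_group[g_test]  # each test point uses its own group's quantile
intervals_mondrian = np.stack(
    [pred_test - half_width, pred_test + half_width], axis=1
)

def coverage(y, intervals):
    """Fraction of targets that fall inside their interval."""
    return np.mean((intervals[:, 0] <= y) & (y <= intervals[:, 1]))

for g in range(n_groups):
    mask = g_test == g
    print(
        f"group {g}: "
        f"split={coverage(y_test[mask], intervals_split[mask]):.3f}  "
        f"mondrian={coverage(y_test[mask], intervals_mondrian[mask]):.3f}"
    )
```

In this sketch, the shared quantile is too wide for the low-noise group and too narrow for the high-noise group, while the per-group quantiles keep each group close to the 90% target.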

6. Compare the coverage by partition, plot both methods side by side.

coverages = {}
for group in np.unique(partition_test):
    coverages[group] = {}
    coverages[group]["split"] = regression_coverage_score_v2(
        y_test[partition_test == group], y_pss_split[partition_test == group]
    )
    coverages[group]["mondrian"] = regression_coverage_score_v2(
        y_test[partition_test == group],
        y_pss_mondrian[partition_test == group]
    )


# Plot the coverage by groups, plot both methods side by side
plt.figure(figsize=(10, 5))
plt.bar(
    np.arange(len(coverages)) * 2,
    [float(coverages[group]["split"]) for group in coverages],
    label="Split"
)
plt.bar(
    np.arange(len(coverages)) * 2 + 1,
    [float(coverages[group]["mondrian"]) for group in coverages],
    label="Mondrian"
)
plt.xticks(
    np.arange(len(coverages)) * 2 + .5,
    [f"Group {group}" for group in coverages],
    rotation=45
)
plt.hlines(0.9, -1, 21, label="90% coverage", color="black", linestyle="--")
plt.ylabel("Coverage")
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.show()
[Figure: coverage by group for the split and Mondrian methods]
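The per-group coverage compared above is simply the fraction of targets in each group that fall inside their prediction interval. The following NumPy sketch spells out that computation on a small hand-made example; the helper `coverage_by_group` and the toy arrays are hypothetical and are not part of MAPIE.

```python
import numpy as np

def coverage_by_group(y_true, lower, upper, groups):
    """Return {group: fraction of targets inside [lower, upper]}."""
    covered = (lower <= y_true) & (y_true <= upper)
    return {
        int(g): float(covered[groups == g].mean()) for g in np.unique(groups)
    }

# Toy data: six targets, their interval bounds, and their group labels.
y_true = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
lower = np.array([-1.0, 0.5, 2.5, 2.0, 3.0, 4.5])
upper = np.array([1.0, 1.5, 3.5, 4.0, 5.0, 5.5])
groups = np.array([0, 0, 0, 1, 1, 1])

result = coverage_by_group(y_true, lower, upper, groups)
print(result)
```

Here the third target of group 0 lies below its interval, so group 0 has two of three targets covered, while all three targets of group 1 are covered.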
