.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples_calibration/1-quickstart/plot_calibration_hypothesis_testing.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_calibration_1-quickstart_plot_calibration_hypothesis_testing.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_calibration_1-quickstart_plot_calibration_hypothesis_testing.py:


=========================================================
Testing for calibration in binary classification settings
=========================================================
This example uses :func:`~mapie.metrics.calibration.kolmogorov_smirnov_p_value`
to test for calibration of scores output by binary classifiers.
Other alternatives are :func:`~mapie.metrics.calibration.kuiper_p_value` and
:func:`~mapie.metrics.calibration.spiegelhalter_p_value`.

These statistical tests are based on the following references:

[1] Arrieta-Ibarra I, Gujral P, Tannen J, Tygert M, Xu C.
Metrics of calibration for probabilistic predictions.
The Journal of Machine Learning Research.
2022 Jan 1;23(1):15886-940.

[2] Tygert M.
Calibration of P-values for calibration and for deviation
of a subpopulation from the full population.
arXiv preprint arXiv:2202.00100.
2022 Jan 31.

[3] Darling DA, Siegert AJF.
The First Passage Problem for a Continuous Markov Process.
Ann. Math. Statist. 24 (4) 624 - 639, December, 1953.

.. GENERATED FROM PYTHON SOURCE LINES 28-40

.. code-block:: Python


    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.utils import check_random_state
    from numpy.typing import NDArray

    from mapie.metrics.calibration import (
        cumulative_differences,
        kolmogorov_smirnov_p_value,
        length_scale,
    )


.. GENERATED FROM PYTHON SOURCE LINES 41-47

1. Create 1-dimensional dataset and scores to test for calibration
------------------------------------------------------------------

We start by simulating a 1-dimensional binary classification problem.
We assume that the ground truth probability is driven by a sigmoid function,
and we generate labels according to this probability distribution.

.. GENERATED FROM PYTHON SOURCE LINES 47-65

.. code-block:: Python


    def sigmoid(x: NDArray):
        y = 1 / (1 + np.exp(-x))
        return y


    def generate_y_true_calibrated(y_prob: NDArray, random_state: int = 1) -> NDArray:
        # Draw each label as a Bernoulli variable with success probability y_prob.
        generator = check_random_state(random_state)
        uniform = generator.uniform(size=len(y_prob))
        y_true = (uniform <= y_prob).astype(float)
        return y_true


    X = np.linspace(-5, 5, 2000)
    y_prob = sigmoid(X)
    y_true = generate_y_true_calibrated(y_prob)


.. GENERATED FROM PYTHON SOURCE LINES 66-67

Next we provide two additional scores that are miscalibrated on purpose.

.. GENERATED FROM PYTHON SOURCE LINES 67-71

.. code-block:: Python


    y = {"y_prob": y_prob, "y_pred_1": sigmoid(1.3 * X), "y_pred_2": sigmoid(0.7 * X)}


.. GENERATED FROM PYTHON SOURCE LINES 72-74

This is how the two miscalibrated curves compare with the ground truth.

.. GENERATED FROM PYTHON SOURCE LINES 74-85

.. code-block:: Python


    for name, y_score in y.items():
        plt.plot(X, y_score, label=name)
    plt.title("Probability curves")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.grid()
    plt.legend()
    plt.show()


.. image-sg:: /examples_calibration/1-quickstart/images/sphx_glr_plot_calibration_hypothesis_testing_001.png
   :alt: Probability curves
   :srcset: /examples_calibration/1-quickstart/images/sphx_glr_plot_calibration_hypothesis_testing_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 86-88

Alternatively, the amount of miscalibration is easy to see in this view,
where we plot the scores against the ground truth probability.

.. GENERATED FROM PYTHON SOURCE LINES 88-98

.. code-block:: Python


    for name, y_score in y.items():
        plt.plot(y_prob, y_score, label=name)
    plt.title("Probability curves")
    plt.xlabel("True probability")
    plt.ylabel("Estimated probability")
    plt.grid()
    plt.legend()
    plt.show()


.. image-sg:: /examples_calibration/1-quickstart/images/sphx_glr_plot_calibration_hypothesis_testing_002.png
   :alt: Probability curves
   :srcset: /examples_calibration/1-quickstart/images/sphx_glr_plot_calibration_hypothesis_testing_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 99-113

2. Visualizing and testing for miscalibration
------------------------------------------------------------------

We leverage the Kolmogorov-Smirnov statistical test
:func:`~mapie.metrics.calibration.kolmogorov_smirnov_p_value`. It is based on
the cumulative differences between sorted scores and labels.
If the null hypothesis holds (i.e., the scores are well calibrated),
the curve of the cumulative differences shares some useful properties with
the standard Brownian motion, in particular its range and maximum
absolute value [1, 2].

Let's have a look.

First we compute the cumulative differences.

.. GENERATED FROM PYTHON SOURCE LINES 113-119

.. code-block:: Python


    cum_diffs = {
        name: cumulative_differences(y_true, y_score) for name, y_score in y.items()
    }


.. GENERATED FROM PYTHON SOURCE LINES 120-121

We want to plot them against the proportion of scores taken into account.

.. GENERATED FROM PYTHON SOURCE LINES 121-125

.. code-block:: Python


    k = np.arange(len(y_true)) / len(y_true)


.. GENERATED FROM PYTHON SOURCE LINES 126-128

We also want to compare the extent of the curve to that of a typical
Brownian motion.

.. GENERATED FROM PYTHON SOURCE LINES 128-132

.. code-block:: Python


    sigma = length_scale(y_prob)


.. GENERATED FROM PYTHON SOURCE LINES 133-134

Finally, we compute the p-value according to the Kolmogorov-Smirnov test [2, 3].

.. GENERATED FROM PYTHON SOURCE LINES 134-140

.. code-block:: Python


    p_values = {
        name: kolmogorov_smirnov_p_value(y_true, y_score) for name, y_score in y.items()
    }


.. GENERATED FROM PYTHON SOURCE LINES 141-152

The graph below shows the cumulative differences for each series of scores.
The horizontal bars are the typical length scales expected if the null
hypothesis holds (standard Brownian motion). The two miscalibrated scores
overshoot these limits, and their p-values are accordingly very small.
In contrast, the well-calibrated ground truth stays within the expected
bounds, with a p-value close to 1.

Both the visual and the statistical evidence therefore lead us to reject
the null hypothesis for the two miscalibrated scores.

.. GENERATED FROM PYTHON SOURCE LINES 152-164

.. code-block:: Python


    for name, cum_diff in cum_diffs.items():
        # Interpolate the series name so each curve is labelled correctly.
        plt.plot(k, cum_diff, label=f"{name} (p-value = {p_values[name]:.5f})")
    # Dashed lines mark +/- 2 length scales expected under the null hypothesis.
    plt.axhline(y=2 * sigma, color="r", linestyle="--")
    plt.axhline(y=-2 * sigma, color="r", linestyle="--")
    plt.title("Probability curves")
    plt.xlabel("Proportion of scores considered")
    plt.ylabel("Cumulative differences with the ground truth")
    plt.grid()
    plt.legend()
    plt.show()


.. image-sg:: /examples_calibration/1-quickstart/images/sphx_glr_plot_calibration_hypothesis_testing_003.png
   :alt: Probability curves
   :srcset: /examples_calibration/1-quickstart/images/sphx_glr_plot_calibration_hypothesis_testing_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.222 seconds)


.. _sphx_glr_download_examples_calibration_1-quickstart_plot_calibration_hypothesis_testing.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_calibration_hypothesis_testing.ipynb <plot_calibration_hypothesis_testing.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_calibration_hypothesis_testing.py <plot_calibration_hypothesis_testing.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_calibration_hypothesis_testing.zip <plot_calibration_hypothesis_testing.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_