.. title:: Theoretical Description Binary Classification : contents

.. _theoretical_description_binay_classification:

#######################
Theoretical Description
#######################

Note: in theoretical parts of the documentation, we use the following terms employed in the scientific literature:

- `alpha` is equivalent to `1 - confidence_level`. It can be seen as a *risk level*
- *calibrate* and *calibration*, are equivalent to *conformalize* and *conformalization*.

—

There are mainly three different ways to handle uncertainty quantification in binary classification:
calibration (see :doc:`theoretical_description_calibration`), confidence interval (CI) for the probability
:math:`P(Y \vert \hat{\mu}(X))` and prediction sets (see :doc:`theoretical_description_classification`).
These 3 notions are tightly related for score-based classifier, as it is shown in [1]. 

Prediction sets can be computed in the same way for multiclass and binary classification with
:class:`~mapie.classification.SplitConformalClassifier` or :class:`~mapie.classification.CrossConformalClassifier`,
and there are the same theoretical guarantees. Nevertheless, prediction sets are often much less informative in the
binary case than in the multiclass case.

From Gupta et al [1]:

    PSs and CIs are only ‘informative’ if the sets or intervals produced by them are small. To quantify
    this, we measure CIs using their width (denoted as :math:`|C(.)|)`, and PSs using their diameter (defined as
    the width of the convex hull of the PS). For example, in the case of binary classification, the diameter
    of a PS is :math:`1` if the prediction set is :math:`\{0,1\}`, and :math:`0` otherwise (since :math:`Y\in\{0,1\}`
    always holds, the set :math:`\{0,1\}` is ‘uninformative’). A short CI such as :math:`[0.39, 0.41]`
    is more informative than a wider one such as :math:`[0.3, 0.5]`.

In a few words, what you need to remember about these concepts :

* *Calibration* is useful for transforming a score (typically given by an ML model)
  into the probability of making a good prediction.
* *Set Prediction* gives the set of likely predictions with a probabilistic guarantee that the true label is in this set.
* *Probabilistic Prediction* gives a confidence interval for the predictive distribution.


1. Set Prediction
-----------------

Definition 1 (Prediction Set (PS) w.r.t :math:`f`) [1].
    Fix a predictor :math:`\hat{\mu}:\mathcal{X} \to [0, 1]` and let :math:`(\mathcal{X}, \mathcal{Y}) \sim P`.
    Define the set of all subsets of :math:`\mathcal{Y}`, :math:`L = \{\{0\}, \{1\}, \{0, 1\}, \emptyset\}`.
    A function :math:`S:[0,1]\to\mathcal{L}` is said to be :math:`(1-\alpha)`-PS with respect to :math:`\hat{\mu}` if:

.. math:: 
    P(Y\in S(\hat{\mu}(X))) \geq 1 - \alpha

PSs are typically studied for larger output sets, such as :math:`\mathcal{Y}_{regression}=\mathbb{R}` or
:math:`\mathcal{Y}_{multiclass}=\{1, 2, ..., L > 2\}`.

See :class:`~mapie.classification.SplitConformalClassifier` and :class:`~mapie.classification.CrossConformalClassifier`
to use a set predictor.


2. Probabilistic Prediction
---------------------------

Definition 2 (Confidence Interval (CI) w.r.t :math:`\hat{\mu}`) [1].
    Fix a predictor :math:`\hat{\mu}:\mathcal{X} \to [0, 1]` and let :math:`(\mathcal{X}, \mathcal{Y}) \sim P`.
    Let :math:`I` denote the set of all subintervals of :math:`[0,1]`.
    A function :math:`C:[0,1]\to\mathcal{I}` is said to be :math:`(1-\alpha)`-CI with respect to :math:`\hat{\mu}` if:

.. math:: 
    P(\mathbb{E}[Y|\hat{\mu}(X)]\in C(\hat{\mu}(X))) \geq 1 - \alpha


3. Calibration
--------------

Usually, calibration is understood as perfect calibration meaning (see :doc:`theoretical_description_calibration`).
In practice, it is more reasonable to consider approximate calibration.

Definition 3 (Approximate calibration) [1].
    Fix a predictor :math:`\hat{\mu}:\mathcal{X} \to [0, 1]` and let :math:`(\mathcal{X}, \mathcal{Y}) \sim P`.
    The predictor :math:`\hat{\mu}:\mathcal{X} \to [0, 1]` is :math:`(\epsilon,\alpha)`-calibrated
    for some :math:`\epsilon,\alpha\in[0, 1]` if with probability at least :math:`1-\alpha`:

.. math:: 
    |\mathbb{E}[Y|\hat{\mu}(X)] - \hat{\mu}(X)| \leq \epsilon

See :class:`~sklearn.calibration.CalibratedClassifierCV` or :class:`~mapie.calibration.TopLabelCalibrator`
to use a calibrator.

In the CP framework, it is worth noting that Venn predictors produce probability-type predictions
for the labels of test objects which are guaranteed to be well-calibrated under the standard assumption
that the observations are generated independently from the same distribution [2].


References
----------

[1] Gupta, Chirag, Aleksandr Podkopaev, and Aaditya Ramdas.
"Distribution-free binary classification: prediction sets, confidence intervals, and calibration."
Advances in Neural Information Processing Systems 33 (2020): 3711-3723.

[2] Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer.
"Algorithmic Learning in a Random World."
Springer Nature, 2022.