How to control LLM risks?

Risk control techniques for binary classification can enhance LLM trustworthiness by implementing response guardrails, such as censoring undesired content. For an example, see Risk Control for LLM as a Judge with Abstention.
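To make the guardrail idea concrete, here is a minimal sketch of picking an abstention threshold on a judge's confidence score so that the rate of "answered and wrong" responses stays below a target level. It is not MAPIE's API and not necessarily the procedure used in the referenced example: the function name `select_abstention_threshold`, the synthetic data, and the choice of a Hoeffding bound with fixed-sequence testing are all assumptions made for illustration.

```python
import numpy as np


def select_abstention_threshold(confidence, is_wrong, alpha, delta, thresholds):
    """Return the lowest threshold certified to keep risk below alpha.

    Risk is the fraction of all samples that are answered (confidence >= t)
    AND wrong. Thresholds are tested from most to least conservative
    (fixed-sequence testing) with a Hoeffding p-value; the search stops at
    the first threshold that cannot be certified at level delta.
    """
    n = len(confidence)
    chosen = None
    for t in sorted(thresholds, reverse=True):
        answered = confidence >= t
        risk_hat = np.mean(answered & is_wrong)
        # Hoeffding p-value for the null hypothesis "true risk > alpha".
        gap = max(alpha - risk_hat, 0.0)
        p_value = np.exp(-2.0 * n * gap**2)
        if p_value > delta:
            break
        chosen = t
    return chosen


# Synthetic calibration data (hypothetical): higher confidence, fewer errors.
rng = np.random.default_rng(0)
n = 2000
confidence = rng.uniform(size=n)
is_wrong = rng.uniform(size=n) > confidence

alpha, delta = 0.1, 0.1
threshold = select_abstention_threshold(
    confidence, is_wrong, alpha, delta, thresholds=np.linspace(0.05, 0.95, 19)
)
print("abstain when confidence is below", threshold)
```

At inference time, the guardrail would return the LLM's response only when its confidence reaches the selected threshold and abstain otherwise.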

Furthermore, conformal prediction methods can be applied to LLM-based classifiers. We suggest the method presented in Benchmarking LLMs via Uncertainty Quantification, which reduces a commonsense reasoning task (the CosmosQA dataset) to a classification problem, making conformal prediction applicable. The idea is to extract only the logits corresponding to the possible answers and apply a softmax to them, so that the LLM can be used as a simple classifier.
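The logit-extraction step and the conformal step can be sketched with NumPy alone. This is an illustrative sketch, not the paper's code or MAPIE's API: the token ids in `option_token_ids`, the helper names, and the randomly generated logits are assumptions standing in for a real LLM's output, and the conformity score used here (one minus the probability of the true answer) is one common choice among several.

```python
import numpy as np


def option_probabilities(logits, option_token_ids):
    """Softmax restricted to the logits of the answer-option tokens."""
    option_logits = logits[:, option_token_ids]
    shifted = option_logits - option_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold on the score 1 - p(true answer)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: keep every option
    return np.sort(scores)[k - 1]


def prediction_sets(probs, qhat):
    """Boolean mask of shape (n_samples, n_options): True = option kept."""
    return (1.0 - probs) <= qhat


# Hypothetical setup: a 50-token vocabulary where tokens 10..13 stand for
# the answer letters A, B, C, D. Real ids depend on the tokenizer.
rng = np.random.default_rng(0)
option_token_ids = np.array([10, 11, 12, 13])


def fake_llm_logits(n_samples):
    """Random next-token logits with a boost on the correct option."""
    labels = rng.integers(0, 4, size=n_samples)
    logits = rng.normal(size=(n_samples, 50))
    logits[np.arange(n_samples), option_token_ids[labels]] += 2.0
    return logits, labels


alpha = 0.1
cal_logits, cal_labels = fake_llm_logits(200)
test_logits, test_labels = fake_llm_logits(100)

cal_probs = option_probabilities(cal_logits, option_token_ids)
test_probs = option_probabilities(test_logits, option_token_ids)

qhat = conformal_threshold(cal_probs, cal_labels, alpha)
sets = prediction_sets(test_probs, qhat)
coverage = sets[np.arange(len(test_labels)), test_labels].mean()
print("average set size:", sets.sum(axis=1).mean(), "test coverage:", coverage)
```

With a real model, `cal_logits` and `test_logits` would be the next-token logits produced after a prompt that lists the answer options, and the size of each prediction set serves as an uncertainty measure for that question.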

The following repository (not maintained by the MAPIE team) implements part of this paper for educational purposes in the MAPIE_for_cosmosqa notebook.

Additionally, we invite you to read our blog article, where we dive deeper into the topic.