I. Definition and Classification of Hallucinations
Definition:
Hallucination in large language models usually refers to the model generating content that is unfaithful, fabricated, inconsistent, or nonsensical. This article focuses on the narrower problem of outputs that are fabricated, i.e. not grounded in either the provided context or world knowledge.
Classification:
Contextual Hallucinations: The model output conflicts with, or is unsupported by, the source content provided in the context; the output should be consistent with that source.
External Hallucinations: The model output is not grounded in the pre-training dataset. Since the pre-training corpus stands in for world knowledge, the practical requirement is that the output be factual and verifiable against external world knowledge, and that the model acknowledge when it does not know a fact. This article mainly focuses on external hallucinations.
II. Causes of Hallucinations
Pre-training Data Issues:
The pre-training corpus is huge and typically consists of data crawled from the internet, so it inevitably contains outdated, missing, or incorrect information. The model may memorize this flawed information and reproduce it at generation time, leading to factual errors.
Fine-tuning New Knowledge:
Supervised fine-tuning and reinforcement learning from human feedback (RLHF) are common techniques for adapting pre-trained language models. Introducing new knowledge at the fine-tuning stage is hard to avoid, yet fine-tuning consumes far less compute than pre-training, so whether a model can reliably learn new knowledge through small-scale fine-tuning remains debatable. Studies have found that language models learn fine-tuning examples containing new knowledge more slowly than examples consistent with their existing knowledge, and that once these new-knowledge examples are eventually learned, they increase the model's tendency to hallucinate.
III. Hallucination Detection Methods
Retrieval-Enhanced Evaluation:
FactualityPrompt Benchmark: Introduced by Lee et al., it consists of factual and non-factual prompts and uses Wikipedia documents or sentences as the ground-truth factual evidence. Hallucination is measured via named-entity errors (entities in the generation that do not appear in the evidence) and entailment ratios (the fraction of generated sentences entailed by the evidence); larger models perform better on this benchmark.
FActScore: Breaks a long generation down into multiple atomic facts and verifies each against a knowledge base (such as Wikipedia), measuring the proportion of atomic facts in the model's generation that are supported by the knowledge base.
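As a rough illustration of this metric, the sketch below scores one generation as the fraction of its atomic facts that a verifier judges supported; split_into_atomic_facts and is_supported are hypothetical stand-ins for an LLM-based fact splitter and a Wikipedia-backed checker, not the authors' released code.

```python
# Hypothetical sketch of the FActScore computation, not the authors' code.
from typing import Callable, List

def factscore(generation: str,
              split_into_atomic_facts: Callable[[str], List[str]],
              is_supported: Callable[[str], bool]) -> float:
    """Score one generation as the fraction of its atomic facts that the
    knowledge source (e.g. Wikipedia) supports."""
    facts = split_into_atomic_facts(generation)   # e.g. via an LLM prompt
    if not facts:
        return 0.0
    return sum(is_supported(fact) for fact in facts) / len(facts)
```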
SAFE: A method proposed by Wei et al. for evaluating the factuality of long-form generations: a language model acts as an agent that iteratively issues Google Search queries and reasons about whether each fact is supported by the search results.
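A loose sketch of that loop is shown below; llm and web_search are hypothetical callables standing in for the agent model and the search API, and the prompts are simplified placeholders rather than the method's actual prompts.

```python
# Loose sketch of a SAFE-style loop: for one atomic fact, an LLM agent
# repeatedly proposes search queries, collects evidence, and then judges
# whether the fact is supported. All names and prompts are assumptions.
from typing import Callable, List

def safe_check(fact: str,
               llm: Callable[[str], str],
               web_search: Callable[[str], List[str]],
               max_queries: int = 3) -> bool:
    evidence: List[str] = []
    for _ in range(max_queries):
        query = llm(
            f"Fact: {fact}\nEvidence so far: {evidence}\n"
            "Propose the next search query to verify this fact."
        )
        evidence.extend(web_search(query))
    verdict = llm(
        f"Fact: {fact}\nEvidence: {evidence}\n"
        "Is the fact supported? Answer 'supported' or 'not supported'."
    )
    return verdict.strip().lower().startswith("supported")
```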
FacTool: Follows the standard fact-checking workflow and can detect factual errors in various tasks, including knowledge Q&A, code generation, mathematical problem-solving, and scientific literature review.
Sampling-Based Detection:
SelfCheckGPT: Detects hallucinations by checking factual consistency across multiple samples drawn from a black-box language model. The method uses several different metrics to measure how consistent the model's response is with other stochastically sampled responses, and among these, prompting-based detection performs best.
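The sketch below illustrates this sampling-and-comparison idea; sample_model and inconsistent are assumed stand-ins for the sampling call and for one of the paper's scorers (e.g. BERTScore, NLI, or an LLM prompt).

```python
# Rough sketch of SelfCheckGPT-style consistency scoring; names are assumptions.
from typing import Callable, List

def selfcheck_scores(prompt: str,
                     main_response_sentences: List[str],
                     sample_model: Callable[[str], str],
                     inconsistent: Callable[[str, str], float],
                     n_samples: int = 5) -> List[float]:
    """Return one hallucination score per sentence: the average inconsistency
    of that sentence with independently drawn samples (higher = more likely
    hallucinated)."""
    samples = [sample_model(prompt) for _ in range(n_samples)]
    return [
        sum(inconsistent(sentence, sample) for sample in samples) / len(samples)
        for sentence in main_response_sentences
    ]
```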
Unknown Knowledge Calibration:
TruthfulQA and SelfAware Benchmarks: Used to measure a model's ability to respond truthfully when faced with unanswerable or unknown questions, where the model should refuse to answer or provide appropriately qualified information. Studies have found that larger models are less truthful on TruthfulQA, but perform better on SelfAware's binary classification of answerable versus unanswerable questions.
Output Uncertainty Measurement: Assesses whether the model is aware of what it does not know by comparing output uncertainty on known versus unknown questions. Studies show that language models produce well-calibrated probability estimates of answer correctness on certain multiple-choice formats, that RLHF fine-tuning worsens this calibration, and that a higher sampling temperature can partially recover it.
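One generic way to quantify how well such correctness probabilities are calibrated is expected calibration error; the sketch below is an illustration of that general idea, not the evaluation protocol of any particular paper.

```python
# Generic expected-calibration-error sketch for answer-correctness probabilities.
from typing import List, Tuple

def expected_calibration_error(preds: List[Tuple[float, bool]],
                               n_bins: int = 10) -> float:
    """preds: (model's stated probability that its answer is correct, whether
    the answer actually was correct). Bins predictions by confidence and
    averages |confidence - accuracy| weighted by bin size."""
    if not preds:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for p, correct in preds:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(c for _, c in bucket) / len(bucket)
        ece += (len(bucket) / len(preds)) * abs(conf - acc)
    return ece
```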
Indirect Queries:
Agrawal et al. studied hallucinated references in language model generations, experimenting with both direct and indirect queries. An indirect query checks for hallucination by asking about auxiliary details of a generated reference (such as its authors) and testing whether the answers remain consistent. Experiments show that indirect queries work better, and that larger models are both more capable and less prone to hallucinating references.
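The two query styles might look roughly like the sketch below; llm is a hypothetical sampling callable, and the agreement heuristic is a simplification of the paper's actual setup.

```python
# Illustrative direct vs. indirect queries for a generated reference.
from collections import Counter
from typing import Callable

def direct_query(llm: Callable[[str], str], reference: str) -> str:
    # Ask the model outright whether the reference exists.
    return llm(f"Does the following paper exist? Answer yes or no.\n{reference}")

def indirect_query_agreement(llm: Callable[[str], str],
                             reference: str,
                             n_samples: int = 5) -> float:
    # Ask for an auxiliary detail (the authors) several times; fabricated
    # references tend to produce inconsistent answers across samples.
    answers = [llm(f"Who are the authors of this paper?\n{reference}").strip()
               for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples
```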
IV. Anti-Hallucination Methods
RAG (Retrieval-Augmented Generation) Related Methods:
RARR Framework: Retroactively enables language models to attribute their outputs to external evidence through research and revision. The research stage gathers evidence for the claims made in the output, and the revision stage edits the output to correct content not supported by that evidence while preserving as much of the original text as possible. When evaluating revised text, both attribution and preservation of the original text matter, and RARR scores particularly well on preservation.
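A high-level sketch of such a research-then-revise loop is given below; generate_queries, retrieve, is_attributed, and revise are hypothetical stand-ins for the framework's prompted components.

```python
# High-level sketch of a research-then-revise loop in the spirit of RARR.
from typing import Callable, List

def research_and_revise(text: str,
                        generate_queries: Callable[[str], List[str]],
                        retrieve: Callable[[str], str],
                        is_attributed: Callable[[str, List[str]], bool],
                        revise: Callable[[str, List[str]], str]) -> str:
    # Research stage: gather evidence for the claims made in the text.
    evidence = [retrieve(query) for query in generate_queries(text)]
    # Revision stage: only edit when the text is not supported by the evidence,
    # so as much of the original output as possible is preserved.
    if is_attributed(text, evidence):
        return text
    return revise(text, evidence)
```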
FAVA Model: Retrieves relevant documents and then edits the model output to remove hallucination errors. It consists of a retriever and an editor; the editor is fine-tuned on synthetic training data created by inserting random errors into model generations.
Rethinking with Retrieval (RR) Method: Retrieves relevant external knowledge based on decomposed chain-of-thought (CoT) prompting, and then selects the answer that is most consistent with the retrieved knowledge.
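The retrieve-then-select idea might be sketched as below; sample_cot, retrieve, and consistency are assumed stand-ins for the paper's CoT sampler, retriever, and faithfulness scorer.

```python
# Simplified sketch of retrieve-then-select in the spirit of RR.
from typing import Callable, List, Tuple

def rethinking_with_retrieval(question: str,
                              sample_cot: Callable[[str], Tuple[List[str], str]],
                              retrieve: Callable[[str], str],
                              consistency: Callable[[str, str], float],
                              n_paths: int = 5) -> str:
    best_answer, best_score = "", float("-inf")
    for _ in range(n_paths):
        steps, answer = sample_cot(question)
        # Score each decomposed reasoning step against retrieved knowledge and
        # keep the answer whose chain is best supported overall.
        score = sum(consistency(step, retrieve(step)) for step in steps)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```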
Self-RAG: Trains a language model end-to-end to reflect on its own generation by outputting both the task output and intermittent special reflection tokens. The supervised training data for these reflection tokens is created by prompting GPT-4 and then distilled into an in-house model to reduce inference cost.
Chain-of-Actions Methods:
Chain-of-Verification (CoVe) Method: Plans and executes verification through a sequence of actions, with four core steps: draft a baseline response, plan verification questions, execute the verifications independently, and generate the final, verified output. Studies found that decomposing verification into short questions answered separately yields more factual results than long-form generation, and variants such as factored and 2-step CoVe further improve performance.
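A compressed sketch of those four steps with a generic llm callable follows; the prompts are simplified placeholders, and this version answers each verification question separately, in the spirit of the factored variant.

```python
# Sketch of the four CoVe steps using a generic `llm` callable.
from typing import Callable, List

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    # 1. Baseline response.
    baseline = llm(question)
    # 2. Plan verification: draft questions that fact-check the baseline.
    plan = llm(f"Draft verification questions, one per line, for this answer:\n"
               f"Q: {question}\nA: {baseline}")
    verification_questions: List[str] = [q for q in plan.splitlines() if q.strip()]
    # 3. Execute verification: answer each question independently of the baseline.
    verifications = [f"{q}\n{llm(q)}" for q in verification_questions]
    # 4. Final output: regenerate the answer conditioned on the verifications.
    return llm(f"Question: {question}\nOriginal answer: {baseline}\n"
               "Verification results:\n" + "\n".join(verifications) +
               "\nWrite a final, corrected answer.")
```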
RECITE Method: Uses recitation as an intermediate step to improve the factual correctness of model generations and reduce hallucinations: via few-shot in-context prompting, the model is taught to first recite relevant information from memory and then generate the answer conditioned on that recitation.
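A minimal recite-then-answer sketch is shown below; llm is a generic completion callable and few_shot_recitations is a placeholder for the in-context examples of question-to-passage recitation.

```python
# Minimal recite-then-answer sketch in the spirit of RECITE.
from typing import Callable, List

def recite_then_answer(question: str,
                       llm: Callable[[str], str],
                       few_shot_recitations: List[str]) -> str:
    demos = "\n\n".join(few_shot_recitations)
    # Step 1: recite a relevant passage from the model's own memory.
    recitation = llm(f"{demos}\n\nQuestion: {question}\n"
                     "Recite a relevant passage from memory:")
    # Step 2: answer conditioned on the recited passage.
    return llm(f"Passage: {recitation}\nQuestion: {question}\nAnswer:")
```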
Sampling Methods:
Factual-Nucleus Sampling: Lee et al. found that nucleus (top-p) sampling performs worse than greedy sampling on the FactualityPrompt benchmark, although it gives better diversity and less repetition. Based on the hypothesis that sampling randomness harms factuality more in the latter part of a sentence than at its beginning, factual-nucleus sampling dynamically decays the nucleus probability within each sentence, reducing hallucination errors while maintaining good diversity and low repetition.
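A sketch of that decaying schedule, as it is usually described, appears below: the top-p threshold shrinks for later tokens of a sentence and resets at the next sentence. The hyperparameter names and default values here are illustrative assumptions.

```python
# Sketch of the decaying nucleus probability used by factual-nucleus sampling:
# within each sentence the top-p threshold decays from p toward a floor, and
# resets when a new sentence starts. Default values are placeholders.
def nucleus_p_schedule(token_index_in_sentence: int,
                       p: float = 0.9,
                       decay: float = 0.9,
                       floor: float = 0.3) -> float:
    """Return the top-p value for the t-th token of the current sentence
    (t starts at 1); later tokens get a smaller p, i.e. less randomness."""
    return max(floor, p * decay ** (token_index_in_sentence - 1))
```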
Inference-Time Intervention (ITI): Fits a linear probe on the activations of each attention head in every layer to discriminate truthful from untruthful outputs, identifies the attention heads most related to factuality, and during inference shifts the activations of those heads along the "truthful" direction.
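A toy sketch of the activation shift is given below; the shapes and scaling scheme are simplified assumptions rather than the paper's exact procedure.

```python
# Toy sketch of the inference-time intervention idea: for attention heads whose
# probes separate truthful from untruthful activations well, shift the head's
# activation along the probe's "truthful" direction at every decoding step.
import numpy as np

def shift_head_activation(head_activation: np.ndarray,
                          truthful_direction: np.ndarray,
                          alpha: float = 5.0) -> np.ndarray:
    """head_activation: (head_dim,) activation of one selected attention head.
    truthful_direction: direction learned by the linear probe for that head."""
    direction = truthful_direction / np.linalg.norm(truthful_direction)
    return head_activation + alpha * direction
```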
Fine-Tuning Methods:
Factuality-Enhanced Training: Lee et al. proposed prepending topic prefixes and using a sentence-completion loss as the training objective to enhance factuality during training. Lin et al. proposed factuality-focused SFT + RLHF alignment training, consisting of a factuality-aware SFT stage and a factuality-aware DPO stage, each generating its training data in a different way. Tian and Mitchell et al. likewise tune models for factuality via fine-tuning: they extract atomic claims, estimate truthfulness scores for them, construct a preference dataset from those scores, and fine-tune with DPO.
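The sketch below shows how such a factuality preference dataset for DPO might be assembled; sample_model and truthfulness_score are assumed stand-ins for the generator and for a claim-level truthfulness estimator.

```python
# Sketch of assembling a factuality preference dataset for DPO: sample several
# responses per prompt, estimate each one's truthfulness, and pair the best
# against the worst. Names are assumptions, not the papers' released code.
from typing import Callable, List, Tuple

def build_dpo_pairs(prompts: List[str],
                    sample_model: Callable[[str], str],
                    truthfulness_score: Callable[[str], float],
                    n_samples: int = 4) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen_response, rejected_response) triples for DPO."""
    pairs: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        responses = [sample_model(prompt) for _ in range(n_samples)]
        scored = sorted((truthfulness_score(r), r) for r in responses)
        (low_score, low_resp), (high_score, high_resp) = scored[0], scored[-1]
        # Only keep prompts where there is an actual factuality gap to learn from.
        if high_score > low_score:
            pairs.append((prompt, high_resp, low_resp))
    return pairs
```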
Attribution Fine-Tuning: Training the model to attribute its outputs, e.g. by citing sources, is a good way to reduce hallucinations. WebGPT combines web search with a fine-tuned GPT model and cites web pages to help judge factual correctness. GopherCite is similar to WebGPT but generates demonstrations via few-shot prompting and scores them with a reward model. Models can also be configured to answer "I don't know" when uncertain, avoiding low-quality responses.