Reflection (artificial intelligence)

Reflection in artificial intelligence, also referred to as large reasoning models (LRMs), is the capability of large language models (LLMs) to examine, evaluate, and improve their own outputs. This process involves self-assessment and internal deliberation, aiming to enhance reasoning accuracy, minimize errors (like hallucinations), and increase interpretability. Reflection is a form of "test-time compute," where additional computational resources are used during inference.

Introduction

Traditional neural networks process inputs in a feedforward manner, generating outputs in a single pass. However, their limitations in handling complex reasoning tasks have led to the development of methods that simulate internal deliberation. Techniques such as chain-of-thought prompting encourage models to generate intermediate reasoning steps, thereby providing a form of self-reflection that can improve performance on tasks including arithmetic, commonsense reasoning, and more.

This internal process of "thinking" about the steps leading to an answer is analogous to human metacognition or "thinking about thinking". It helps AI systems approach tasks that require multi-step reasoning, planning, and logical thought. The feedback can take place either after a full network pass and decoding to tokens, or continuously in latent space (the last layer can be fed back to the first layer).^[1]^[2] In LLMs, special tokens can mark the beginning and end of reflection before producing a final response (e.g., <thinking>).

Techniques

Increasing the length of the Chain-of-Thought reasoning process, by passing the output of the model back to its input and doing multiple network passes, increases inference-time scaling.^[3] Reinforcement learning frameworks have also been used to steer the Chain-of-Thought. One example is Group Relative Policy Optimization (GRPO), used in DeepSeek-R1,^[4] a variant of policy gradient methods that eliminates the need for a separate "critic" model by normalizing rewards within a group of generated outputs, reducing computational cost. Architectural features like the Mixture-of-Experts (MoE) design and Multi-head Latent Attention, used in models like DeepSeek-V3, also contribute to efficiency, particularly for long contexts. Simple techniques like "budget forcing" (forcing the model to continue generating reasoning steps) have also proven effective in improving performance.^[5]

Types of reflection

Post-hoc reflection

Analyzes and critiques an initial output separately, often involving prompting the model to identify errors or suggest improvements after generating a response. The Reflexion framework follows this approach.^[6]^[7]

Iterative reflection

Revises earlier parts of a response dynamically during generation. Self-monitoring mechanisms allow the model to adjust reasoning as it progresses. Methods like Tree-of-Thoughts exemplify this, enabling backtracking and alternative exploration.

Intrinsic reflection

Integrates self-monitoring directly into the model architecture rather than relying solely on external prompts, enabling models with inherent awareness of their reasoning limitations and uncertainties. This has been used by Google DeepMind in a technique called Self-Correction via Reinforcement Learning (SCoRe) which rewards the model for improving its responses.^[8]

Process reward models and limitations

Early research explored PRMs to provide feedback on each reasoning step, unlike traditional reinforcement learning which rewards only the final outcome. However, PRMs have faced challenges, including computational cost and reward hacking. DeepSeek-R1's developers found them to be not beneficial.^[9]^[10]

Benchmarks

Reflective models generally outperform non-reflective models in most benchmarks, especially on tasks requiring multi-step reasoning.

However, some benchmarks exclude reflective models due to longer response times.

Humanity's Last Exam

The HLE, a rigorous benchmark designed to assess expert-level reasoning across mathematics, humanities, and the natural sciences, reveals substantial performance gaps among models. State-of-the-art reasoning models have demonstrated low accuracy on HLE, highlighting significant room for improvement. In particular, the full reasoning model o3 achieved an accuracy of 26.6%,^[11] while its lighter counterpart, o3‑mini-high (evaluated on text‑only questions), reached 13%.^[12]

AIME

The American Invitational Mathematics Examination (AIME) benchmark, a challenging mathematics competition, demonstrates significant performance differences between model types. Non-reasoning models typically solve less than 30% of AIME. In contrast, models employing reasoning techniques score between 50% and 80%.^[13] While OpenAI's o1 maintained or slightly improved its accuracy from reported 2024Template:Source? metrics to 2025 AIME results, o3-mini (high) achieved a higher accuracy (80%) at a significantly lower cost (approximately 12 times cheaper).

o3-mini performance

According to OpenAI's January 2025 report on o3-mini, adjustable "reasoning effort" significantly affects performance, particularly in STEM. Increasing reasoning effort from low to high boosts accuracy on benchmarks like AIME 2024, GPQA Diamond, and Codeforces, providing performance gains typically in the range of 10-30%. With high reasoning effort, o3-mini (high) achieved 87.3% in AIME (different from the MathArena AIME benchmark results), 79.7% in GPQA Diamond, 2130 Elo in Codeforces, and 49.3 in SWE-bench Verified.^[14]

Integration with search capabilities

In December 2024, Google introduced Deep Research in Gemini,^[15] a feature in Gemini that conducts multi-step research tasks.

On January 25, 2025, DeepSeek launched a feature in their DeepSeek R1 model, enabling the simultaneous use of search and reasoning capabilities, which allows for more efficient integration of data retrieval with reflective reasoning processes.

Subsequently, OpenAI's o3-mini model gained the ability to combine search and reasoning in a unified process.

On February 2, 2025, OpenAI released deep research,^[16] a tool that integrates reasoning and web search in a unified workflow, allowing users to perform complex research tasks that require multi-step reasoning and data synthesis from multiple sources. It is based on o3 and can take from 5 to 30 minutes to generate comprehensive reports.^[17]

History

2024

o1-preview, an LLM with enhanced reasoning, was released in September 2024.^[18] The full version, o1, followed in December 2024. OpenAI also began sharing results on its successor, o3.^[19]

The development of reasoning LLMs has illustrated what Rich Sutton termed the "bitter lesson": that general methods leveraging computation often outperform those relying on specific human insights.^[20] For instance, some research groups, such as the Generative AI Research Lab (GAIR), initially explored complex techniques like tree search and reinforcement learning in attempts to replicate o1's capabilities. However, they found, as documented in their "o1 Replication Journey" papers, that knowledge distillation — training a smaller model to mimic o1's outputs – was surprisingly effective. This highlighted the power of distillation in this context.

Alibaba also released reasoning versions of its Qwen LLMs.

2025

In January 2025, DeepSeek released R1, a model competitive with o1 at lower cost, highlighting the effectiveness of GRPO.^[21] OpenAI subsequently released o3-mini, followed by Deep Research which is based on o3.^[22] The power of distillation was further demonstrated by s1-32B, achieving strong performance with budget forcing and scaling techniques.^[23]

Applications

Mathematical and logical reasoning

Reflection enables LLMs to solve multi-step problems, demonstrated on benchmarks like FrontierMath,^[24] GSM8K (mathematical word problems), GPQA Diamond (PhD-level Science Questions) and Big-Bench Hard (challenging reasoning tasks). A model might initially produce an incorrect solution but, through self-reflection, identify the flawed step and generate a corrected answer.

Vision-language tasks

Frameworks like R3V allow vision-language models to iteratively refine reasoning on complex multimodal tasks. In visual question answering, the model might first generate a plausible but incorrect answer based on a superficial understanding. Through reflection, it could identify inconsistencies between its answer and image details, leading to a revised, more accurate response.^[25]

General problem solving

Enhanced reflection leads to improved coherence, long-term planning, and reduced hallucinations. This is valuable in tasks requiring planning, sequential decision-making, or creative problem-solving, like writing code, composing stories, or designing experiments.

Models

OpenAI

o3 and o3-mini

o1-preview and o1

Gemini

2.0 Flash Thinking Experimental

DeepSeek

R1 (based on V3)

R1-Lite-Preview (test version based on V2.5)

Qwen

QvQ-72B-Preview — an experimental visual reasoning model launched on December 24, 2024, which integrates image understanding with verbal chain-of-thought reasoning.

QwQ-32B-Preview — an experimental text-based reasoning model released in late November 2024 that emphasizes complex, step-by-step analysis.

Anthropic

Claude Sonnet 3.7 has an adjustable amount of 'thinking' tokens.

xAI

Grok 3

Experiments

Llama 3B scaling test-time compute

On December 16, 2024, an experiment using a Llama 3B model demonstrated that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This result highlighted that improved inference strategies can unlock latent reasoning capabilities even in compact models.^[26]

Criticism and challenges

Computational cost

Reflective models require significantly more test-time compute than non-reasoning models. On the AIME benchmark, reasoning models were 10 to 74 times more expensive^[13] than non-reasoning counterparts. The cheapest model, Gemini 2.0-Flash, cost just $0.06 per benchmark.

Latency issues

Reflective reasoning significantly increases response times, with current models taking anywhere from three seconds to several minutes to generate an answer. As reasoning depth improves, future models may require even longer processing times.

References

↑ 1 Scaling by Thinking in Continuous Space. Retrieved 2025-02-14 from arxiv.org
↑ Training Large Language Models to Reason in a Continuous Latent Space. Retrieved 2025-02-14 from arxiv.org
↑ DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Retrieved 2025-02-23 from arxiv.org
↑ DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Retrieved 2025-02-23 from arxiv.org
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ DeepMind’s SCoRe shows LLMs can use their internal knowledge to correct their mistakes. (1 October 2024) Retrieved 20 February 2025 from VentureBeat
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ OpenAI’s deep research can complete 26% of Humanity’s Last Exam. Greg McKenna. Retrieved 2025-03-16 from Fortune
↑ OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake. John-Anthony Disotto published. (2025-02-04) Retrieved 2025-03-16 from TechRadar
↑ ^13.0 ^13.1 MathArena. (2025-02-10) Retrieved 2025-02-10 from web.archive.org
↑ OpenAI o3-mini. (2025-01-31) Retrieved 2025-02-09 from OpenAI
↑ Try Deep Research and our new experimental model in Gemini, your AI assistant. (2024-12-11) Retrieved 2025-02-05 from Google
↑ Introducing deep research. (2025-02-02) Retrieved 2025-02-05 from OpenAI
↑ OpenAI unveils a new ChatGPT agent for 'deep research'. Anthony Ha. (2025-02-03) Retrieved 2025-02-06 from TechCrunch
↑ OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini. Benj Edwards. (2024-09-12) Retrieved 2025-02-06 from Ars Technica
↑ OpenAI confirms new frontier models o3 and o3-mini. (2024-12-20) Retrieved 2025-02-06 from VentureBeat
↑ The Bitter Lesson. Richard S. Sutton. Retrieved 2025-02-27 from Incomplete Ideas
↑ How does DeepSeek R1 really fare against OpenAI's best reasoning models?. Kyle Orland. (2025-01-28) Retrieved 2025-02-06 from Ars Technica
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. Tamay Besiroglu. (2024-11-08) Retrieved 2025-02-08 from Epoch AI
↑ Lua error: bad argument #1 to "get" (not a valid title).
↑ Scaling test-time compute - a Hugging Face Space by HuggingFaceH4. Retrieved 2025-02-05 from huggingface.co

[1] 1 Scaling by Thinking in Continuous Space. Retrieved 2025-02-14 from arxiv.org

[2] Training Large Language Models to Reason in a Continuous Latent Space. Retrieved 2025-02-14 from arxiv.org

[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Retrieved 2025-02-23 from arxiv.org

[4] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Retrieved 2025-02-23 from arxiv.org

[5] Lua error: bad argument #1 to "get" (not a valid title).

[6] Lua error: bad argument #1 to "get" (not a valid title).

[7] Lua error: bad argument #1 to "get" (not a valid title).

[8] DeepMind’s SCoRe shows LLMs can use their internal knowledge to correct their mistakes. (1 October 2024) Retrieved 20 February 2025 from VentureBeat

[9] Lua error: bad argument #1 to "get" (not a valid title).

[10] Lua error: bad argument #1 to "get" (not a valid title).

[11] OpenAI’s deep research can complete 26% of Humanity’s Last Exam. Greg McKenna. Retrieved 2025-03-16 from Fortune

[12] OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake. John-Anthony Disotto published. (2025-02-04) Retrieved 2025-03-16 from TechRadar

[:1-13] 13.0 ^13.1 MathArena. (2025-02-10) Retrieved 2025-02-10 from web.archive.org

[14] OpenAI o3-mini. (2025-01-31) Retrieved 2025-02-09 from OpenAI

[15] Try Deep Research and our new experimental model in Gemini, your AI assistant. (2024-12-11) Retrieved 2025-02-05 from Google

[16] Introducing deep research. (2025-02-02) Retrieved 2025-02-05 from OpenAI

[:0-17] OpenAI unveils a new ChatGPT agent for 'deep research'. Anthony Ha. (2025-02-03) Retrieved 2025-02-06 from TechCrunch

[18] OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini. Benj Edwards. (2024-09-12) Retrieved 2025-02-06 from Ars Technica

[19] OpenAI confirms new frontier models o3 and o3-mini. (2024-12-20) Retrieved 2025-02-06 from VentureBeat

[20] The Bitter Lesson. Richard S. Sutton. Retrieved 2025-02-27 from Incomplete Ideas

[21] How does DeepSeek R1 really fare against OpenAI's best reasoning models?. Kyle Orland. (2025-01-28) Retrieved 2025-02-06 from Ars Technica

[22] Lua error: bad argument #1 to "get" (not a valid title).

[23] Lua error: bad argument #1 to "get" (not a valid title).

[24] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. Tamay Besiroglu. (2024-11-08) Retrieved 2025-02-08 from Epoch AI

[25] Lua error: bad argument #1 to "get" (not a valid title).

[26] Scaling test-time compute - a Hugging Face Space by HuggingFaceH4. Retrieved 2025-02-05 from huggingface.co

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

Reflection (artificial intelligence)

Contents

Introduction

Techniques

Types of reflection

Post-hoc reflection

Iterative reflection

Intrinsic reflection

Process reward models and limitations

Benchmarks

Humanity's Last Exam

AIME

o3-mini performance

Integration with search capabilities

History

2024

2025

Applications

Mathematical and logical reasoning

Vision-language tasks

General problem solving

Models

OpenAI

Gemini

DeepSeek

Qwen

Anthropic

xAI

Experiments

Llama 3B scaling test-time compute

Criticism and challenges

Computational cost

Latency issues

See also

References

Navigation menu

Reflection (artificial intelligence)

Introduction

Techniques

Types of reflection

Post-hoc reflection

Iterative reflection

Intrinsic reflection

Process reward models and limitations

Benchmarks

Humanity's Last Exam

AIME

o3-mini performance

Integration with search capabilities

History

2024

2025

Applications

Mathematical and logical reasoning

Vision-language tasks

General problem solving

Models

OpenAI

Gemini

DeepSeek

Qwen

Anthropic

xAI

Experiments

Llama 3B scaling test-time compute

Criticism and challenges

Computational cost

Latency issues

See also

References

Navigation menu

Search