8/10/2025

R-Zero: Autonomous Self-Evolving Reasoning LLM Without Human Data

10 tweets
2 min read
Thrummarise (@summarizer)

R-Zero introduces a framework for training reasoning Large Language Models (LLMs) that self-evolve from zero external data. It initializes two models from the same base LLM, a Challenger that generates tasks and a Solver that solves them, and co-evolves them through reinforcement learning without human labels.
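
At a high level the training alternates between the two roles. A minimal structural sketch in Python, where every sub-step (train_challenger, generate_questions, build_filtered_dataset, train_solver) is a placeholder callable rather than the authors' code:

```python
def co_evolve(challenger, solver, train_challenger, generate_questions,
              build_filtered_dataset, train_solver, num_iterations=3):
    """Alternating R-Zero-style loop; all sub-steps are injected as callables
    because this is only a sketch of the control flow, not an implementation."""
    for _ in range(num_iterations):
        challenger = train_challenger(challenger, solver)    # rewarded for questions near the Solver's frontier
        questions = generate_questions(challenger)           # fresh self-generated task pool
        dataset = build_filtered_dataset(solver, questions)  # majority-vote pseudo-labels plus difficulty filter
        solver = train_solver(solver, dataset)               # Solver trained on the curated set
    return challenger, solver
```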

The Challenger is rewarded for creating tasks at the edge of the Solver’s ability, measured by how uncertain the Solver is in its answers. The Solver is rewarded for solving these increasingly challenging tasks. This creates a self-improving curriculum without any pre-existing datasets.
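
A minimal sketch of that uncertainty signal, assuming (as the thread implies) that the Solver's reliability on a question is estimated by sampling several answers and measuring agreement with the majority vote; the exact formulation and weighting are in the paper:

```python
from collections import Counter

def uncertainty_reward(solver_answers):
    """Score a generated question by how unsure the Solver is about it.

    solver_answers: final answers sampled from the Solver for one question.
    Agreement with the majority-vote answer serves as a proxy for accuracy,
    and the reward 1 - 2*|p - 0.5| peaks when that proxy sits at 50%.
    """
    _, majority_count = Counter(solver_answers).most_common(1)[0]
    p = majority_count / len(solver_answers)
    return 1.0 - 2.0 * abs(p - 0.5)
```

The score is 0 when the Solver's sampled answers all agree and 1 when only half agree with the majority, which is what pushes the Challenger toward the frontier of the Solver's ability.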

R-Zero uses Group Relative Policy Optimization (GRPO) to train both models. The Challenger’s reward combines an uncertainty term, maximized when the Solver’s accuracy is near 50%, with a repetition penalty that keeps the generated questions diverse. Training focuses primarily on math, where answers are objectively verifiable.
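
GRPO scores each sampled output relative to the other samples in its group, so no learned value function is needed. Below is a short sketch of the group-relative advantage plus an illustrative combined Challenger reward; the exact form of the repetition penalty (e.g. a similarity measure over the batch of generated questions) is an assumption here, not the paper's definition:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO's critic-free advantage: each output's reward is normalized
    against the mean and standard deviation of its sampling group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def challenger_reward(uncertainty, repetition_penalty, alpha=1.0):
    """Illustrative combination: high Solver uncertainty is rewarded,
    near-duplicate questions are penalized. alpha and the penalty's
    definition are placeholders, not the paper's exact terms."""
    return uncertainty - alpha * repetition_penalty
```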

The Solver is then fine-tuned on the Challenger’s filtered questions, with pseudo-labels obtained by majority voting over the Solver’s own sampled answers. This iterative co-evolution yields steady improvements in reasoning, demonstrated on math benchmarks such as AMC, Minerva, and GSM8K.
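
A sketch of that labeling-and-filtering step, with an assumed solver.sample_answer interface and illustrative thresholds (the paper's exact filtering band may differ):

```python
from collections import Counter

def pseudo_label_and_filter(solver, questions, num_samples=10, low=0.3, high=0.7):
    """Turn Challenger questions into a Solver training set.

    For each question, sample several Solver answers, take the majority vote as
    the pseudo-label, and keep the question only if agreement with that label is
    neither too high (trivial question) nor too low (unreliable label).
    """
    dataset = []
    for q in questions:
        answers = [solver.sample_answer(q) for _ in range(num_samples)]
        label, count = Counter(answers).most_common(1)[0]
        agreement = count / num_samples
        if low <= agreement <= high:
            dataset.append({"question": q, "answer": label})
    return dataset
```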

Experiments show R-Zero boosts math reasoning scores significantly, e.g., +6.49 points on Qwen3-4B-Base after three iterations. Remarkably, these math-trained models also improve on general-reasoning benchmarks such as MMLU-Pro and SuperGPQA, showing that the learned skills transfer across domains.

Ablation studies confirm the importance of the Challenger’s RL training, the repetition penalty, and difficulty filtering: removing any one component degrades performance, highlighting how these modules work together to generate a high-quality, targeted curriculum.

Analysis reveals that as iterations progress, the Challenger generates harder questions, pushing the Solver’s accuracy on these tasks toward 50%, maximizing learning potential. However, pseudo-label accuracy declines with difficulty, indicating a trade-off between challenge and label reliability.

R-Zero also enhances supervised fine-tuning. Models pre-trained with R-Zero achieve higher performance when later fine-tuned on human-labeled data, demonstrating that self-generated curricula can serve as a powerful initialization for further learning.

Unlike prior methods relying on human-curated tasks or external verifiers, R-Zero operates fully autonomously, making it scalable and cost-effective. Its design suits domains with objective correctness, like math, but extending to subjective tasks remains an open challenge.

In summary, R-Zero represents a significant step toward truly self-evolving LLMs by enabling models to generate, solve, and learn from their own tasks without external supervision, advancing AI reasoning capabilities beyond traditional data-dependent approaches.
