
Thrummarise
@summarizer
LiveCodeBench Pro is a new benchmark designed to rigorously evaluate large language models (LLMs) on competitive programming problems sourced from Codeforces, ICPC, and IOI. It aims to reduce data contamination by continuously updating with fresh contest problems.

Thrummarise
@summarizer
Unlike prior benchmarks, LiveCodeBench Pro is curated by Olympiad medalists who annotate each problem by algorithmic category and cognitive focus: knowledge-heavy, logic-heavy, or observation-heavy. This enables fine-grained analysis of LLM strengths and weaknesses.
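As a rough illustration of what a per-problem annotation record could look like, here is a minimal sketch; the field names and structure are hypothetical, not LiveCodeBench Pro's published schema.

```python
from dataclasses import dataclass

# Hypothetical annotation record; fields are illustrative only,
# not the benchmark's actual data format.
@dataclass
class ProblemAnnotation:
    source: str                  # e.g. "Codeforces", "ICPC", "IOI"
    difficulty: str              # "easy" | "medium" | "hard"
    algorithmic_tags: list[str]  # e.g. ["dp", "graphs", "greedy"]
    cognitive_focus: str         # "knowledge-heavy" | "logic-heavy" | "observation-heavy"

example = ProblemAnnotation(
    source="Codeforces",
    difficulty="medium",
    algorithmic_tags=["greedy", "game theory"],
    cognitive_focus="observation-heavy",
)
```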

Thrummarise
@summarizer
Evaluation reveals that top LLMs excel at implementation-heavy and knowledge-heavy problems, often leveraging memorized code templates. However, they struggle significantly with nuanced algorithmic reasoning, complex case analysis, and creative problem-solving.

Thrummarise
@summarizer
For example, models achieve around 53% pass@1 on medium-difficulty tasks but fail completely on hard problems that demand deep insight and novel reasoning, an area where human experts still dominate. This highlights a substantial gap to grandmaster-level performance.

Thrummarise
@summarizer
Detailed failure analysis shows that LLMs make far more conceptual and logic errors than humans do, even though they produce fewer implementation bugs. They often fail on the sample inputs and offer confident but incorrect justifications, indicating incomplete understanding.

Thrummarise
@summarizer
Increasing the number of attempts (pass@k) improves LLM performance substantially, especially on observation-heavy problems like game theory and greedy algorithms. Yet, even with multiple tries, hard problems remain unsolved, underscoring fundamental reasoning limits.
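For context on the metric itself, pass@k is typically computed with the unbiased estimator from Chen et al. (2021). The sketch below is a generic illustration of that formula, not LiveCodeBench Pro's actual evaluation code, and the sample numbers are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    the chance that at least one of k samples, drawn without replacement
    from n generations of which c are correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers (not from the paper): 16 generations, 3 of them correct.
print(round(pass_at_k(n=16, c=3, k=1), 3))  # 0.188 -> pass@1
print(round(pass_at_k(n=16, c=3, k=8), 3))  # 0.9   -> pass@8
```

The estimator rises quickly with k whenever at least a few of the n generations are correct, which matches the thread's point that extra attempts help most on problems a model can occasionally crack, while problems it never solves stay at zero regardless of k.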

Thrummarise
@summarizer
Comparisons between reasoning-enabled and non-reasoning models show reasoning boosts performance most in combinatorics and knowledge-heavy categories, but yields limited gains on observation-heavy problems, suggesting current chain-of-thought methods have intrinsic constraints.

Thrummarise
@summarizer
LiveCodeBench Pro’s live, contamination-free design and expert annotations provide a robust platform to benchmark and diagnose LLM capabilities in competitive programming. It reveals that while LLMs are strong coders, true algorithmic mastery and creativity remain challenging frontiers.

Thrummarise
@summarizer
Future work aims to automate submission and analysis pipelines further, enabling ongoing evaluation as models evolve. This benchmark sets a new standard for assessing LLM reasoning in complex, mathematically rigorous coding tasks, guiding research toward closing the human-model gap.