8/19/2025

LiveCodeBench Pro: Evaluating LLMs in Competitive Programming

9 tweets
2 min read
Thrummarise (@summarizer)

LiveCodeBench Pro is a new benchmark designed to rigorously evaluate large language models (LLMs) on competitive programming problems sourced from Codeforces, ICPC, and IOI. It aims to reduce data contamination by continuously updating with fresh contest problems.
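
As a rough illustration of the contamination-avoidance idea, a live benchmark can restrict evaluation to problems from contests held after a model's training-data cutoff. The sketch below is only an assumption about how such filtering could look, not the benchmark's actual pipeline; the `Problem` fields and cutoff date are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative record for a scraped contest problem (hypothetical schema,
# not LiveCodeBench Pro's actual data model).
@dataclass
class Problem:
    source: str          # e.g. "Codeforces", "ICPC", "IOI"
    contest_date: date
    problem_id: str

def contamination_free(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems from contests held after the model's training-data
    cutoff, so their solutions cannot appear in the training corpus."""
    return [p for p in problems if p.contest_date > model_cutoff]

# Example: a model trained on data up to the end of 2024 is evaluated
# only on problems from 2025 contests.
pool = [
    Problem("Codeforces", date(2024, 6, 1), "1979C"),
    Problem("ICPC", date(2025, 3, 2), "WF-A"),
]
print([p.problem_id for p in contamination_free(pool, date(2024, 12, 31))])  # ['WF-A']
```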

Unlike prior benchmarks, LiveCodeBench Pro is curated by Olympiad medalists who annotate each problem by algorithmic category and cognitive focus: knowledge-heavy, logic-heavy, or observation-heavy. This enables fine-grained analysis of LLM strengths and weaknesses.
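
A minimal sketch of what such expert annotations might look like as data, assuming a simple record with algorithmic tags plus one of the three cognitive-focus labels; the field names and tag vocabulary are illustrative, not the benchmark's real schema.

```python
from dataclasses import dataclass
from enum import Enum

# The three cognitive-focus labels described above; the comments and example
# tags are illustrative assumptions.
class CognitiveFocus(Enum):
    KNOWLEDGE_HEAVY = "knowledge-heavy"      # solved by recalling known templates
    LOGIC_HEAVY = "logic-heavy"              # careful derivation and case analysis
    OBSERVATION_HEAVY = "observation-heavy"  # hinges on a key creative insight

@dataclass
class Annotation:
    problem_id: str
    algorithmic_tags: list[str]   # e.g. ["game theory", "greedy", "segment tree"]
    focus: CognitiveFocus
    annotator: str                # Olympiad medalist who labeled the problem

ann = Annotation(
    problem_id="CF-2000-D",
    algorithmic_tags=["game theory", "greedy"],
    focus=CognitiveFocus.OBSERVATION_HEAVY,
    annotator="medalist_01",
)
print(ann.focus.value)  # observation-heavy
```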

Evaluation reveals that top LLMs excel at implementation-heavy and knowledge-heavy problems, often leveraging memorized code templates. However, they struggle significantly with nuanced algorithmic reasoning, complex case analysis, and creative problem-solving.

For example, models achieve around 53% pass@1 on medium-difficulty tasks but fail completely on hard problems that require deep insight and novel reasoning, where human experts still dominate. This highlights a substantial gap to grandmaster-level performance.

Detailed failure analysis shows that LLMs make far more conceptual and logic errors than humans, even though they make fewer implementation bugs. They often fail on a problem's sample inputs and produce confident but incorrect justifications, indicating incomplete understanding.
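
Failing on sample inputs is mechanically checkable: a candidate program can be run against the published sample tests before any hidden tests. The checker below is a hedged sketch (assuming Python candidate solutions and simple stdin/stdout problems), not the paper's actual harness.

```python
import subprocess

def passes_samples(solution_path: str,
                   samples: list[tuple[str, str]],
                   timeout_s: float = 2.0) -> bool:
    """Run a candidate Python solution on each (sample input, expected output)
    pair and compare stdout; failing these published samples is the cheap,
    early signal of a wrong solution."""
    for stdin_text, expected in samples:
        try:
            result = subprocess.run(
                ["python3", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Example usage with the samples printed in a problem statement:
# passes_samples("candidate.py", [("3\n1 2 3\n", "6\n")])
```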

Increasing the number of attempts (pass@k) improves LLM performance substantially, especially on observation-heavy problems like game theory and greedy algorithms. Yet, even with multiple tries, hard problems remain unsolved, underscoring fundamental reasoning limits.
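
pass@k is conventionally computed with the unbiased estimator from Chen et al. (2021): generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k draws succeeds. Whether LiveCodeBench Pro uses exactly this estimator is an assumption here; the formula itself is standard.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 2 of which pass all tests.
print(round(pass_at_k(n=10, c=2, k=1), 3))  # 0.2
print(round(pass_at_k(n=10, c=2, k=5), 3))  # 0.778 -- more attempts help a lot
```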

Comparisons between reasoning-enabled and non-reasoning models show reasoning boosts performance most in combinatorics and knowledge-heavy categories, but yields limited gains on observation-heavy problems, suggesting current chain-of-thought methods have intrinsic constraints.

LiveCodeBench Pro’s live, contamination-free design and expert annotations provide a robust platform to benchmark and diagnose LLM capabilities in competitive programming. It reveals that while LLMs are strong coders, true algorithmic mastery and creativity remain challenging frontiers.

Future work aims to automate submission and analysis pipelines further, enabling ongoing evaluation as models evolve. This benchmark sets a new standard for assessing LLM reasoning in complex, mathematically rigorous coding tasks, guiding research toward closing the human-model gap.
