# CCMSP


| # | Model | Total Score | Cost / Tokens | Easy | Medium | Hard |
|---|-------|-------------|---------------|------|--------|------|
| 1 | DeepSeek V3.2 Speciale (pass@1) | 683/1000 (68.3%) | $21.30 / 43,814,442 tokens | 94.0% | 63.8% | 5.1% |
| 2 | Gemini 3 Pro Preview (pass@1) | 596/1000 (59.6%) | $280.30 / N/A | 89.4% | 47.4% | 0.4% |
| 3 | GPT-5 High (pass@1) | 582/1000 (58.2%) | - / - | 84.0% | 51.0% | 0.0% |
| 4 | GPT-5.1 High (pass@1) | 562/1000 (56.2%) | $306.83 / 29,306,398 tokens | 83.3% | 46.4% | 0.0% |
| 5 | Grok 4.1 High (pass@1) | 548/1000 (54.8%) | - / - | 84.1% | 41.2% | 0.0% |
| 6 | GLM-4.6 Reasoning (pass@1) | 521/1000 (52.1%) | - / - | 77.8% | 42.3% | 0.0% |
| 7 | Claude 4.5 Sonnet High (pass@1) | 307/1000 (30.7%) | - / - | 57.3% | 9.0% | 0.0% |

The metrics below measure performance efficiency. Higher values indicate more score achieved per unit of resource: in short, how much mathematical reasoning capability you get per dollar and per output token.

# Token Efficiency (Score per 1M Output Tokens)

| # | Model | Score | Tokens | Ratio |
|---|-------|-------|--------|-------|
| 1 | GPT-5.1 High | 43.2% | 29.3M | 1.48 |
| 2 | DeepSeek V3.2 Speciale | 54.3% | 43.8M | 1.24 |

# Cost Efficiency (Score per Dollar)

| # | Model | Score | Cost | Ratio |
|---|-------|-------|------|-------|
| 1 | DeepSeek V3.2 Speciale | 54.3% | $21.30 | 2.549 |
| 2 | Gemini 3 Pro Preview | 45.7% | $280.30 | 0.163 |
| 3 | GPT-5.1 High | 43.2% | $306.83 | 0.141 |
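The two ratios above are simple divisions: score percentage over millions of output tokens, and score percentage over dollars spent. A minimal sketch of that arithmetic, using DeepSeek V3.2 Speciale's figures from the tables (function names are illustrative):

```python
def token_efficiency(score_pct: float, output_tokens: int) -> float:
    """Score points per 1M output tokens."""
    return score_pct / (output_tokens / 1_000_000)

def cost_efficiency(score_pct: float, cost_usd: float) -> float:
    """Score points per dollar spent."""
    return score_pct / cost_usd

# DeepSeek V3.2 Speciale, from the tables above
print(round(token_efficiency(54.3, 43_814_442), 2))  # 1.24
print(round(cost_efficiency(54.3, 21.30), 3))        # 2.549
```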

# About CCMSP

CCMSP is a proprietary benchmark designed to avoid contamination and probe the limits of Large Language Models' mathematical reasoning.

Comprising 1000 problems personally solved by the author, this dataset aggregates challenges from CTF Cryptography, Competitive Programming, and MathMO.

Each problem was chosen because it exhibits properties that are particularly challenging for LLM-based systems, and each has been completely reformulated to lower the odds of data contamination, though this cannot be guaranteed.

# Composition

Easy (48%) · Medium (35%) · Hard (17%)

# Methodology

  • Competitive Programming: Evaluated against a judge system running test cases on proposed solutions.
  • CTF Challenges: Verified via automated judging and LLM-as-a-Judge.
  • MathMO: Most problems are scored by Gemini 3 Pro acting as an LLM-as-a-Judge.
  • Hard Problems: LLM-as-a-Judge, with proofs additionally verified manually by the author.
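The Competitive Programming judge described above can be sketched as a loop that runs the candidate program on each test case's stdin and diffs its stdout against the expected output. This is a hypothetical illustration, not the benchmark's actual harness; the command format, timeout, and verdict names are assumptions.

```python
import subprocess
import sys

def judge_cp_solution(run_cmd, test_cases, timeout_s=5):
    """Return 'ACCEPTED' iff the program matches every expected output.

    run_cmd    -- argv list that executes the candidate solution
    test_cases -- list of (stdin_text, expected_stdout) pairs
    """
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                run_cmd, input=stdin_text,
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "TIME_LIMIT_EXCEEDED"
        if result.returncode != 0:
            return "RUNTIME_ERROR"
        # Whitespace-normalized comparison, as most judges do
        if result.stdout.strip() != expected.strip():
            return "WRONG_ANSWER"
    return "ACCEPTED"

# Example: a toy "double the input" task
cases = [("3", "6"), ("21", "42")]
solution = [sys.executable, "-c", "print(int(input()) * 2)"]
print(judge_cp_solution(solution, cases))  # ACCEPTED
```

A real harness would add sandboxing and memory limits, but the verdict logic is the same shape.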