# CCMSP
A benchmark spanning MathMO, CTF Cryptography, and Competitive Programming.
| # | Model | Total Score | Cost / Tokens | Easy | Medium | Hard |
|---|---|---|---|---|---|---|
| 1 | DeepSeek V3.2 Speciale pass@1 | 683/1000 (68.3%) | $21.30 / 43,814,442 | 94.0% | 63.8% | 5.1% |
| 2 | Gemini 3 Pro Preview pass@1 | 596/1000 (59.6%) | $280.30 / N/A | 89.4% | 47.4% | 0.4% |
| 3 | GPT-5 High pass@1 | 582/1000 (58.2%) | - / - | 84.0% | 51.0% | 0.0% |
| 4 | GPT-5.1 High pass@1 | 562/1000 (56.2%) | $306.83 / 29,306,398 | 83.3% | 46.4% | 0.0% |
| 5 | Grok 4.1 High pass@1 | 548/1000 (54.8%) | - / - | 84.1% | 41.2% | 0.0% |
| 6 | GLM-4.6 Reasoning pass@1 | 521/1000 (52.1%) | - / - | 77.8% | 42.3% | 0.0% |
| 7 | Claude 4.5 Sonnet High pass@1 | 307/1000 (30.7%) | - / - | 57.3% | 9.0% | 0.0% |
The efficiency metrics below measure score achieved per unit of resource: how much mathematical reasoning capability a model delivers per dollar spent and per output token generated. Higher values are better.
# Token Efficiency (Score per 1M Output Tokens)
| # | Model | Score | Tokens | Ratio |
|---|---|---|---|---|
| 1 | GPT-5.1 High | 43.2% | 29.3M | 1.48 |
| 2 | DeepSeek V3.2 Speciale | 54.3% | 43.8M | 1.24 |
# Cost Efficiency (Score per Dollar)
| # | Model | Score | Cost | Ratio |
|---|---|---|---|---|
| 1 | DeepSeek V3.2 Speciale | 54.3% | $21.30 | 2.549 |
| 2 | Gemini 3 Pro Preview | 45.7% | $280.30 | 0.163 |
| 3 | GPT-5.1 High | 43.2% | $306.83 | 0.141 |
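Both efficiency ratios are simple quotients of the table values. A minimal sketch of the computation (function names are illustrative, not from the benchmark's tooling):

```python
def token_efficiency(score_pct: float, output_tokens: int) -> float:
    """Score percentage points per 1M output tokens."""
    return score_pct / (output_tokens / 1_000_000)

def cost_efficiency(score_pct: float, cost_usd: float) -> float:
    """Score percentage points per dollar spent."""
    return score_pct / cost_usd

# Values taken from the tables above (DeepSeek V3.2 Speciale row).
print(round(token_efficiency(54.3, 43_814_442), 2))  # 1.24
print(round(cost_efficiency(54.3, 21.30), 3))        # 2.549
```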
# About CCMSP
CCMSP is a proprietary benchmark designed to avoid contamination and probe the limits of Large Language Models' mathematical reasoning.
Comprising 1000 problems personally solved by the author, the dataset aggregates challenges from CTF Cryptography, Competitive Programming, and MathMO. Each problem was chosen because it exhibits properties that are particularly challenging for LLM-based systems, and every problem has been completely reformulated to lower the odds of data contamination, though contamination can never be fully ruled out.
## Composition
Easy (48%) · Medium (35%) · Hard (17%)
## Methodology
- Competitive Programming: Evaluated against a judge system running test cases on proposed solutions.
- CTF Challenges: Verified via automated judging and LLM-as-a-Judge.
- MathMO: Most problems scored by Gemini 3 Pro acting as an LLM-as-a-Judge.
- Hard Problems: LLM-as-a-Judge, with proofs additionally verified manually by the author.
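The author's judge system is not public; the sketch below shows the kind of test-case judge the Competitive Programming track implies, assuming candidate solutions are standalone programs that read stdin and write stdout. The runner command, time limit, and whitespace-normalized comparison are assumptions, not the benchmark's actual configuration:

```python
import subprocess
import sys

def judge(solution_cmd: list[str],
          test_cases: list[tuple[str, str]],
          time_limit: float = 2.0) -> bool:
    """Run a candidate solution on each (input, expected_output) pair.

    Returns True only if every case passes within the time limit.
    """
    for stdin_data, expected in test_cases:
        try:
            result = subprocess.run(
                solution_cmd, input=stdin_data,
                capture_output=True, text=True, timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return False  # time limit exceeded
        if result.returncode != 0:
            return False  # runtime error
        if result.stdout.split() != expected.split():
            return False  # wrong answer (whitespace-insensitive compare)
    return True

# Illustrative check: a one-line "solution" that doubles an integer.
cases = [("3\n", "6\n"), ("10\n", "20\n")]
print(judge([sys.executable, "-c", "print(2 * int(input()))"], cases))  # True
```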