# CCMSP


| # | Model | Total Score | Cost / Tokens | Easy | Medium | Hard |
|---|-------|-------------|---------------|------|--------|------|
| 1 | DeepSeek V3.2 Speciale (pass@1) | 683/1000 (68.3%) | $21.30 / 43,814,442 tokens | 94.0% | 63.8% | 5.1% |
| 2 | Gemini 3 Pro Preview (pass@1) | 596/1000 (59.6%) | $280.30 / N/A | 89.4% | 47.4% | 0.4% |
| 3 | GPT-5 High (pass@1) | 582/1000 (58.2%) | - / - | 84.0% | 51.0% | 0.0% |
| 4 | GPT-5.1 High (pass@1) | 562/1000 (56.2%) | $306.83 / 29,306,398 tokens | 83.3% | 46.4% | 0.0% |
| 5 | Grok 4.1 High (pass@1) | 548/1000 (54.8%) | - / - | 84.1% | 41.2% | 0.0% |
| 6 | GLM-4.6 Reasoning (pass@1) | 521/1000 (52.1%) | - / - | 77.8% | 42.3% | 0.0% |
| 7 | Claude 4.5 Sonnet High (pass@1) | 307/1000 (30.7%) | - / - | 57.3% | 9.0% | 0.0% |

The metrics below measure performance efficiency. Higher values indicate more score achieved per unit of resource: in short, how much mathematical reasoning capability you get per dollar and per output token.

# Token Efficiency (Score per 1M Output Tokens)

| # | Model | Score | Tokens | Ratio |
|---|-------|-------|--------|-------|
| 1 | GPT-5.1 High | 43.2% | 29.3M | 1.48 |
| 2 | DeepSeek V3.2 Speciale | 54.3% | 43.8M | 1.24 |

# Cost Efficiency (Score per Dollar)

| # | Model | Score | Cost | Ratio |
|---|-------|-------|------|-------|
| 1 | DeepSeek V3.2 Speciale | 54.3% | $21.30 | 2.549 |
| 2 | Gemini 3 Pro Preview | 45.7% | $280.30 | 0.163 |
| 3 | GPT-5.1 High | 43.2% | $306.83 | 0.141 |
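The two ratios above are simple divisions: score percentage over millions of output tokens, and score percentage over dollars spent. A minimal sketch of that arithmetic, using DeepSeek V3.2 Speciale's figures from the tables (function names are illustrative):

```python
def token_efficiency(score_pct: float, output_tokens: int) -> float:
    """Score points per 1M output tokens."""
    return score_pct / (output_tokens / 1_000_000)

def cost_efficiency(score_pct: float, cost_usd: float) -> float:
    """Score points per dollar spent."""
    return score_pct / cost_usd

# DeepSeek V3.2 Speciale, from the tables above
print(round(token_efficiency(54.3, 43_814_442), 2))  # 1.24
print(round(cost_efficiency(54.3, 21.30), 3))        # 2.549
```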

# About CCMSP

CCMSP is a proprietary benchmark designed to avoid contamination and probe the limits of Large Language Models' mathematical reasoning.

Comprising 1000 problems personally solved by the author, this dataset aggregates challenges from CTF Cryptography, Competitive Programming, and MathMO.

Each problem was chosen because it exhibits properties that are particularly challenging for LLM-based systems, and each has been completely reformulated to lower the odds of data contamination, though this cannot be guaranteed.

# Composition

Easy (48%) · Medium (35%) · Hard (17%)

# Methodology

  • Competitive Programming: Evaluated against a judge system running test cases on proposed solutions.
  • CTF Challenges: Verified via automated judging and LLM-as-a-Judge.
  • MathMO: Most problems are scored by Gemini 3 Pro acting as an LLM-as-a-Judge.
  • Hard Problems: LLM-as-a-Judge, with proofs additionally verified manually by the author.
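The Competitive Programming judge described above can be sketched as a loop that runs the candidate program on each test case's stdin and diffs its stdout against the expected output. This is a hypothetical illustration, not the benchmark's actual harness; the command format, timeout, and verdict names are assumptions.

```python
import subprocess
import sys

def judge_cp_solution(run_cmd, test_cases, timeout_s=5):
    """Return 'ACCEPTED' iff the program matches every expected output.

    run_cmd    -- argv list that executes the candidate solution
    test_cases -- list of (stdin_text, expected_stdout) pairs
    """
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                run_cmd, input=stdin_text,
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "TIME_LIMIT_EXCEEDED"
        if result.returncode != 0:
            return "RUNTIME_ERROR"
        # Whitespace-normalized comparison, as most judges do
        if result.stdout.strip() != expected.strip():
            return "WRONG_ANSWER"
    return "ACCEPTED"

# Example: a toy "double the input" task
cases = [("3", "6"), ("21", "42")]
solution = [sys.executable, "-c", "print(int(input()) * 2)"]
print(judge_cp_solution(solution, cases))  # ACCEPTED
```

A real harness would add sandboxing and memory limits, but the verdict logic is the same shape.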