
Leaderboards

SEAL LLM Leaderboards evaluate frontier LLM capabilities. These leaderboards provide insight into models through robust datasets and precise criteria to benchmark the latest AI advancements.
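Each leaderboard below reports a mean score with an uncertainty bound (for example, 62.30±1.76). This page does not state how those intervals are computed; the sketch below shows one common way such a ± value can be derived, a nonparametric bootstrap over per-task results, purely as an illustration. The function name and the synthetic outcomes are placeholders, not Scale's actual methodology.

    import random
    import statistics

    def score_with_bootstrap_ci(per_task_scores, n_resamples=10_000, seed=0):
        # Mean score plus a symmetric 95% uncertainty half-width via bootstrap.
        # per_task_scores: per-task results, e.g. 0/100 pass/fail or partial credit.
        rng = random.Random(seed)
        mean = statistics.fmean(per_task_scores)
        resampled = sorted(
            statistics.fmean(rng.choices(per_task_scores, k=len(per_task_scores)))
            for _ in range(n_resamples)
        )
        # Half-width of the central 95% range of the bootstrap distribution.
        lo = resampled[int(0.025 * n_resamples)]
        hi = resampled[int(0.975 * n_resamples)]
        return mean, (hi - lo) / 2

    # Example with 200 synthetic pass/fail outcomes scored 0 or 100.
    outcomes = [100.0 if i % 5 < 3 else 0.0 for i in range(200)]
    mean, pm = score_with_bootstrap_ci(outcomes)
    print(f"{mean:.2f}±{pm:.2f}")  # roughly 60.00±6.8 for this synthetic data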

MCP Atlas
Evaluating real-world tool use through the Model Context Protocol (MCP)
1. claude-opus-4-5-20251101: 62.30±1.76
1. gpt-5.2-2025-12-11 (NEW): 60.57±1.62
3. gemini-3-flash-preview (NEW): 57.40±1.48

SWE-Bench Pro (Public Dataset)
Evaluating long-horizon software engineering tasks in public open source repositories
1. claude-opus-4-5-20251101: 45.89±3.60
1. claude-4-5-Sonnet: 43.60±3.60
1. gemini-3-pro-preview: 43.30±3.60

SWE-Bench Pro (Private Dataset)
Evaluating long-horizon software engineering tasks in commercial-grade private repositories
1. gpt-5.2-2025-12-11 (NEW): 23.81±5.09
1. claude-opus-4-5-20251101 (NEW): 23.44±5.07
1. gemini-3-pro-preview (NEW): 17.95±4.78

SciPredict
Forecasting scientific experiment outcomes
1. gemini-3-pro-preview: 25.27±1.92
1. claude-opus-4-5-20251101: 23.05±0.51
1. claude-opus-4-1-20250805: 22.22±1.48

Humanity's Last Exam
Challenging LLMs at the frontier of human knowledge
1. 37.52±1.90
2. 31.64±1.82
3. 27.80±1.76

Humanity's Last Exam (Text Only)
Challenging LLMs at the frontier of human knowledge
1. 37.72±2.04
2. 33.32±1.99
3. 28.50±1.90

AudioMultiChallenge
Evaluating spoken dialogue systems in multi-turn interaction
1. gemini-3-pro-preview*: 54.65±4.57
1. gemini-2.5-pro*: 46.90±4.58
2. gemini-2.5-flash (Thinking)*: 40.04±4.50

Professional Reasoning Benchmark - Finance
Evaluating Professional Reasoning in Finance
1. gpt-5: 51.32±0.17
1. gpt-5-pro: 51.06±0.59
3. o3-pro: 49.08±0.79

Professional Reasoning Benchmark - Legal
Evaluating Professional Reasoning in Legal Practice
1. gpt-5-pro: 49.89±0.36
1. o3-pro: 49.67±0.50
1. gpt-5.1-thinking: 49.33±0.38

Remote Labor Index (RLI)
Evaluating AI agents' ability to perform real-world, economically valuable remote work
1. 3.75±0.00
2. 2.50±0.00
2. 2.50±0.00

PropensityBench
Simulating real-world pressure to choose between safe and harmful behavior (lower scores rank higher)
1. o3-2025-04-16: 10.50±0.60
2. claude-sonnet-4-20250514: 12.20±0.20
3. o4-mini-2025-04-16: 15.80±0.40

VisualToolBench (VTB)
Evaluating how LLMs can dynamically interact with and reason about visual information
1. gemini-3-pro-preview (NEW): 26.85±0.54
1. gpt-5-2025-08-07-thinking: 18.68±0.25
2. gpt-5-2025-08-07: 16.96±0.06

MultiNRC
Multilingual Native Reasoning Evaluation Benchmark for LLMs
1. 65.20±1.24
2. 58.96±2.97
3. 52.13±3.01

MultiChallenge
Assessing models across diverse, interdisciplinary challenges
1. 63.77±1.53
2. 59.15±1.39
2. 59.10±2.02

Fortress
Frontier Risk Evaluation for National Security and Public Safety (lower scores rank higher)
1. 8.24±1.93
1. 9.63±2.11
2. 12.80±2.36

MASK
Evaluating model honesty when pressured to lie
1. 96.13±0.57
1. Claude Sonnet 4 (Thinking): 95.33±2.29
1. 94.20±1.79

EnigmaEval
Evaluating model performance on complex, multi-step reasoning tasks
1. 18.75±2.22
1. 18.24±2.20
3. 13.09±1.92

VISTA
Vision-Language Understanding benchmark for multimodal models
1. Gemini 2.5 Pro Experimental (March 2025): 54.65±1.46
1. gemini-2.5-pro-preview-06-05: 54.63±0.55
2. gpt-5-pro-2025-10-06: 52.39±1.07

TutorBench
Evaluating model performance on common tutoring tasks for high school and AP-level subjects
1. gemini-2.5-pro-preview-06-05: 55.65±1.11
1. gpt-5-2025-08-07: 55.33±1.02
1. o3-pro-2025-06-10: 54.62±1.02


Frontier AI Evaluations

We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, while continuously evaluating the latest frontier models.

Scaling with Human Expertise

Humans design complex evaluations and define precise criteria to assess models, while LLMs scale evaluations—ensuring efficiency and alignment with human judgment.
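As a rough illustration of that division of labor (human-written rubric, LLM-scaled grading), the sketch below scores a model response against expert-authored criteria using a judge callable. The Criterion structure, the rubric contents, and the judge interface are hypothetical placeholders, not Scale's actual evaluation pipeline.

    from dataclasses import dataclass

    @dataclass
    class Criterion:
        # One human-authored grading criterion with a point weight.
        description: str
        points: float

    # Hypothetical rubric a human expert might write for a single task.
    RUBRIC = [
        Criterion("Identifies the correct root cause of the bug", 2.0),
        Criterion("Proposes a fix consistent with the repository's conventions", 2.0),
        Criterion("Explains the trade-offs of the chosen approach", 1.0),
    ]

    def grade_response(task_prompt, model_response, judge):
        # judge is any callable that takes a grading prompt and returns "YES"/"NO";
        # in practice it would wrap an LLM API call.
        # Returns the fraction of rubric points earned, scaled to 0-100.
        total = sum(c.points for c in RUBRIC)
        earned = 0.0
        for criterion in RUBRIC:
            grading_prompt = (
                f"Task:\n{task_prompt}\n\n"
                f"Candidate response:\n{model_response}\n\n"
                f"Criterion: {criterion.description}\n"
                "Answer YES if the response satisfies the criterion, otherwise NO."
            )
            if judge(grading_prompt).strip().upper().startswith("YES"):
                earned += criterion.points
        return 100.0 * earned / total

    # Call shape with a stub judge that always answers YES (prints 100.0).
    print(grade_response("Fix the off-by-one error in paginate().",
                         "The loop bound should be len(items) - 1 ...",
                         judge=lambda prompt: "YES"))

Binary per-criterion judgments keep the judge model's task narrow enough to stay aligned with the human-defined rubric while letting grading run automatically at scale.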

Robust Datasets

Our leaderboards are built on carefully curated evaluation sets, combining private datasets to prevent overfitting and open-source datasets for broad benchmarking and comparability.

Run evaluations on frontier AI capabilities

If you'd like to add your model to this leaderboard or a future version, please contact leaderboards@scale.com. To ensure leaderboard integrity, a model can only be featured the first time its organization encounters the prompts.