Open-LLM-Leaderboard:

From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Aidar Myrzakhan*, Sondos Mahmoud Bsharat*, Zhiqiang Shen*

*joint first author & equal contribution

VILA Lab, Mohamed bin Zayed University of AI (MBZUAI)

Benchmark

The Open-style Question Benchmark (OSQ-bench) refines how large language models (LLMs) are evaluated. Moving away from traditional multiple-choice questions (MCQs), OSQ-bench poses open-style questions, removing the selection bias and random guessing that the MCQ format permits and giving a more accurate picture of a model's true comprehension and reasoning. This section details the benchmark's design, the range and quality of the questions it covers, and the advantages the open-style format offers for future LLM evaluation.
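To make the shift from MCQ to open-style concrete, here is a minimal Python sketch of one plausible conversion: the answer options are dropped so the model must generate the answer rather than pick it, while the gold option text is kept only as a reference for grading the free-form response. The schema (`MCQItem`, `OpenStyleItem`, `to_open_style`) is an illustrative assumption, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCQItem:
    """A conventional multiple-choice item (hypothetical schema)."""
    question: str
    options: List[str]
    answer_index: int  # index of the correct option


@dataclass
class OpenStyleItem:
    """The same item rephrased as an open-style question (hypothetical schema)."""
    question: str
    reference_answer: str  # gold text kept only for grading the free-form reply


def to_open_style(item: MCQItem) -> OpenStyleItem:
    """Drop the answer options so the model must produce, not pick, the answer."""
    return OpenStyleItem(
        question=item.question,
        reference_answer=item.options[item.answer_index],
    )


if __name__ == "__main__":
    mcq = MCQItem(
        question="Which gas do plants primarily absorb during photosynthesis?",
        options=["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
        answer_index=1,
    )
    print(to_open_style(mcq))
```

Grading the generated answer against `reference_answer` (for example with an LLM judge or string matching) is a separate step and is not shown here.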

Statistics and Distributions

Table 2: Statistics on open-style questions across different datasets.
Benchmarks       #Evaluated    #Open-Style    Average Question Length
MMLU                 14,042          7,879                       36.6
ARC                   3,548          3,241                       21.1
MedMCQA               4,183          2,336                       14.1
Race                  4,934          3,528                       10.0
OpenbookQA            1,000            494                       10.3
WinoGrande            1,267          1,267                       19.1
HellaSwag            10,042          3,945                       40.1
PiQA                  1,838            700                        7.1
Overall              42,075         24,104                      19.05
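As a side note on how a summary row such as "Overall" can be derived from per-dataset figures, the Python sketch below sums the counts and computes an average question length weighted by the number of open-style questions. It is illustrative only; the aggregation behind Table 2's Overall row is not spelled out here, so the sketch is not claimed to reproduce those exact values.

```python
def summarize(stats):
    """Aggregate per-dataset counts and a weighted average question length.

    `stats` maps dataset name -> (evaluated, open_style, avg_question_length).
    The overall average is weighted by the number of open-style questions;
    this is an illustrative choice, not necessarily how Table 2's Overall
    row was computed.
    """
    total_evaluated = sum(e for e, _, _ in stats.values())
    total_open = sum(o for _, o, _ in stats.values())
    weighted_len = sum(o * l for _, o, l in stats.values()) / total_open
    return total_evaluated, total_open, round(weighted_len, 2)


# Hypothetical usage with a subset of the datasets listed above.
example = {
    "ARC": (3_548, 3_241, 21.1),
    "PiQA": (1_838, 700, 7.1),
}
print(summarize(example))
```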

Diversity

Quality

Properties and Advantages