Mobile-MMLU

A Mobile Intelligence Language Understanding Benchmark

Sondos Mahmoud Bsharat*, Mukul Ranjan*, Aidar Myrzakhan*, Jiacheng Liu, Bowei Guo, Shengkun Tang,
Zhuang Liu, Yuanzhi Li, Zhiqiang Shen

*Co-first Author and Equal Contribution;  Corresponding Author and Project Lead

VILA Lab, Mohamed bin Zayed University of AI (MBZUAI), Princeton University, Apple
Leaderboard | GitHub | Contact

Welcome to Mobile-MMLU

Questions asked on mobile devices often differ from those on computers. For instance, users might ask how to cook a dish with a phone in a kitchen, whereas queries about web programming are much more likely to be asked on computers.

Across 80 diverse fields (Health, Technology, etc.) and with more than 16,000 questions, Mobile-MMLU evaluates LLMs' capabilities in practical mobile scenarios, from basic life skills to complex problem-solving.

Explore Mobile-MMLU Benchmark

Motivation

Why Are Apple Intelligence and Mobile Intelligence Important for LLMs on Mobile Devices?

  1. On-Device Efficiency:

    Mobile devices, such as smartphones, tablets, and wearables, have limited computational resources compared to high-end servers. Apple Intelligence and other mobile-focused AI frameworks prioritize optimization techniques to run small-scale large language models (LLMs) directly and efficiently on these constrained environments. Techniques like model compression, quantization, and efficient architectures (e.g., linear complexity models) are crucial for enabling LLMs to perform effectively on mobile devices. However, these models have limited capabilities.

  2. Personalization and Privacy:

    On-device LLMs can process user data locally without sending sensitive information to cloud servers. This ensures better privacy and security, a feature that aligns with Apple’s emphasis on user-centric privacy through its Apple Intelligence ecosystem. Mobile Intelligence allows models to adapt to user preferences, habits, and behaviors in real-time while safeguarding data.

  3. Accessibility and Ubiquity:

    Mobile devices are used globally and form a significant part of everyday life. Integrating LLMs into mobile platforms broadens their accessibility, enabling users in low-bandwidth regions or offline environments to experience the benefits of AI seamlessly. Apple Intelligence exemplifies how optimized LLMs can democratize advanced capabilities for mainstream users.

Why Is a Mobile LLM Benchmark Necessary to Identify Good Mobile LLMs?

  1. Performance Evaluation in Mobile-Centric Settings:

    A dedicated mobile LLM benchmark is essential to evaluate models under the real-world limitations of mobile devices, such as limited memory, energy constraints, and computational power. Traditional benchmarks often target cloud-based models without accounting for these restrictions.

  2. Comparing Mobile Models:

    A mobile-centric benchmark allows researchers and developers to compare different mobile LLM architectures, optimizations, and compression techniques fairly. Without such benchmarks, it is challenging to assess which models are suitable for mobile deployment.

  3. Identifying Mobile Use-Specific Solutions:

    Mobile devices are integral to daily life, and integrating LLMs allows for personalized, real-time experiences such as smarter virtual assistants, contextual recommendations, and language understanding for apps. Apple Intelligence, for example, enables LLMs to adapt to user behaviors, provide tailored responses, and improve applications like Siri, search, and messaging.

  4. Improving End-User Experience:

    Ultimately, a mobile LLM benchmark helps identify models that deliver the best user experience on mobile devices. This includes fluid interactions, responsiveness, and minimal battery drain, making LLM-powered features practical and usable on a large scale.

Benchmark Overview

Our dataset encompasses 80 fields, featuring a total of 16,186 questions, including scenario-based questions. Below is a sample of fields showcasing the diversity of question types.

Ergonomics

Question: What is the best way to hold a smartphone to reduce strain?

Cooking and Recipes

Question: How do I plan meals for someone with gluten intolerance?

First Aid

Question: Someone is bleeding profusely from their leg after an accident; I applied pressure, but it's not stopping. Should I use a tourniquet, and how?

Basic Life Skills

Question: What is the best way to store spices?

Nutrition and Diet

Question: How can I reduce my intake of processed foods?

Culture

Question: What is the importance of the Māori facial tattoo, known as Tā Moko?

Mental Health

Question: How can I manage anticipatory anxiety?
(1) Participating in cognitive behavioral therapy and utilizing visualization techniques are useful strategies to manage anticipatory anxiety, offering long-term benefits.
(2) Engaging in deep breathing exercises and regular yoga sessions can effectively manage anticipatory anxiety by promoting a sense of calm.
(3) Developing a routine of physical exercise and adopting a balanced diet can help manage anticipatory anxiety by improving overall mental well-being.
(4) Practicing mindfulness and relaxation techniques can help manage anticipatory anxiety.
Which of the statements given above are correct?

Home Maintenance

Question: What should I check if my washing machine is making loud banging noises during the spin cycle?
(1) Ensure the washing machine is on a stable and flat surface, make sure the load is distributed evenly inside the drum, inspect the drum for any objects that might have been left behind, and check for worn or damaged drum bearings or suspension springs.
(2) Check if the washing machine is balanced and level, ensure the load is evenly distributed, inspect the drum for foreign objects, and check for worn or damaged shock absorbers or suspension rods.
(3) Check if the washing machine is correctly balanced and perfectly aligned, ensure the laundry load is properly distributed across the drum, inspect the inner drum for any foreign materials, and check for worn or damaged springs or suspension belts.
(4) Verify that the washing machine is level and not tilted, confirm that the laundry is evenly spread within the drum, thoroughly inspect the drum for any foreign objects, and check for worn or damaged vibration dampers or suspension springs.
Which of the statements given above are correct?

Travel Planning

Question: I'm planning a trip from New York to Tokyo, departing on December 20th and returning on January 5th. Considering the time difference and international date line, what dates and times will my flights actually be, and how can I minimize jet lag?
(1) When traveling from New York to Tokyo, you should consider the time difference and crossing the International Date Line. New York is typically 16 hours behind Tokyo. If your flight from New York leaves on December 20th, you will likely arrive in Tokyo on December 20th, because the flight duration is approximately 13 hours non-stop. On your return trip, if you leave Tokyo on January 5th, you'll arrive back in New York on January 4th, because you gain a day crossing the International Date Line eastward. To minimize jet lag, adjust your sleep schedule a few days before departure to match Tokyo time, stay hydrated during the flight, avoid alcohol and caffeine, and get plenty of rest upon arrival. Adapt gradually to the local time zone once you arrive.
(2) When traveling from New York to Tokyo, you must consider the time difference and the crossing of the International Date Line. New York is generally 18 hours behind Tokyo. If your flight from New York departs on December 20th, you will likely arrive in Tokyo on December 23rd, due to the time difference and a flight duration of approximately 14 hours non-stop. On your return journey, if you depart Tokyo on January 5th, you'll arrive back in New York on January 7th, because you lose a day crossing the International Date Line westward. To minimize jet lag, adjust your sleep schedule several days before departure to match Tokyo time, stay hydrated during the flight, avoid alcohol and caffeine, and get plenty of sunlight upon arrival. Rest and gradually adapt to the local time zone once you arrive.
(3) When traveling from New York to Tokyo, you need to consider the time difference and the crossing of the International Date Line. New York is generally 12 hours behind Tokyo. If your flight from New York departs on December 20th, you will likely arrive in Tokyo on December 22nd, due to the time difference and flight duration, which is approximately 16 hours non-stop. On your return trip, if you depart Tokyo on January 5th, you'll arrive back in New York on January 6th, because you lose a day crossing the International Date Line westward. To minimize jet lag, start adjusting your sleep schedule a week before departure to match Tokyo time, stay hydrated during the flight, avoid heavy meals and caffeine, and get plenty of sunlight upon arrival. Rest and gradually adapt to the local time zone once you arrive.
(4) When traveling from New York to Tokyo, you need to consider the time difference and the crossing of the International Date Line. New York is typically 14 hours behind Tokyo. If your flight from New York departs on December 20th, you will likely arrive in Tokyo on December 21st, due to the time difference and flight duration, which is approximately 14 hours non-stop. On your return trip, if you depart Tokyo on January 5th, you'll arrive back in New York on the same day, January 5th, because you gain a day crossing the International Date Line eastward. To minimize jet lag, adjust your sleep schedule a few days before departure to match Tokyo time, stay hydrated during the flight, avoid alcohol and caffeine, and get plenty of sunlight upon arrival. Rest and gradually adapt to the local time zone once you arrive.

Which of the statements given above are correct?

Benchmark Statistics

Explore our comprehensive dataset covering 80 fields, from technical disciplines to creative domains. Gain insight into the diversity and depth of the data that drives our benchmark.

    Dataset Coverage and Structure

    • Our dataset encompasses 80 fields, featuring a total of 16,186 questions, carefully curated to evaluate mobile-compatible language models. Each field includes multiple-choice questions designed to test both fundamental knowledge and real-world applications. Below is a comprehensive visualization showing the distribution of questions across all fields, demonstrating the breadth and depth of our benchmark's coverage.

      The visualization uses a three-layer sunburst chart where the innermost ring represents main categories (like "Academic & Learning", "Business & Career"), the middle ring shows subcategories (such as "Health & Wellness" including Mental Health and Physical Fitness), and the outermost ring displays all 80 fields with their question distribution. Hover over any segment to see detailed statistics, with segment sizes proportional to question counts. The consistent color coding across layers helps track relationships between categories and their subfields, highlighting the systematic coverage of mobile-centric knowledge domains.

    Topic Distribution Analysis

    • This visualization demonstrates the topic distribution across the Mobile-MMLU, MMLU, and MMLU-Pro benchmarks. From the scatter plot, we can observe that Mobile-MMLU topics occupy a distinct semantic space compared to those of the MMLU and MMLU-Pro benchmarks. This clear separation in the topic distribution highlights Mobile-MMLU's unique focus on practical, mobile-relevant scenarios, complementing existing benchmarks rather than overlapping with them. The distinct clustering pattern validates Mobile-MMLU's contribution as a specialized benchmark tailored for evaluating mobile-oriented language models.

      Topic Semantic Space Distribution

    Comparative Dataset Analysis

    • This comparison highlights the differences in basic statistics between Mobile-MMLU and the MMLU and MMLU-Pro benchmarks. Our benchmark features a greater number of questions and topics and is more diverse, offering broader coverage and depth.

      Total Questions Analysis
      Total Questions Distribution
      Total Topics Analysis
      Total Topics Distribution
    • The figures below provide a comprehensive visualization of the dataset characteristics. For Mobile-MMLU and MMLU, we showcase the top 40 categories by both question count and word distribution, highlighting the depth and breadth of coverage in each domain. For MMLU-Pro, which focuses on specialized professional knowledge, we present the top 14 categories. These distributions reveal distinct patterns: Mobile-MMLU demonstrates a balanced distribution across practical, everyday topics, MMLU shows concentration in academic and professional fields, while MMLU-Pro exhibits focused coverage of specialized professional domains.

    Dataset Categories and Question Distribution

    • The following tables present a comprehensive statistical breakdown comparing Mobile-MMLU with MMLU and MMLU-Pro benchmarks. Each dataset table showcases its unique hierarchical structure - from broad categories to specific topics - along with detailed question counts. Mobile-MMLU offers extensive coverage with 80 fields and 16,186 questions, emphasizing practical, everyday knowledge areas. In contrast, MMLU contains 57 subjects with 15,573 questions focusing on academic disciplines, while MMLU-Pro features 14 specialized professional fields with 12,102 questions. This side-by-side comparison highlights how Mobile-MMLU complements existing benchmarks by introducing new categories specifically relevant to mobile use cases, while maintaining comprehensive coverage in terms of both breadth and depth.

      Mobile-MMLU
      Total Questions: 16,186
      Daily Life Skills
      Basic Life Skills 209
      Time Management 211
      Conflict Resolution 152
      Event Planning 201
      Food and Cooking
      Cooking And Recipes 274
      Food Safety 219
      Nutrition And Diet 151
      Digital Literacy
      Digital Literacy 198
      Technical Help 254
      Mobile Customization 230
      Social Media
      Social Media 217
      Digital Detox 196
      Privacy and Security
      Cybersecurity 208
      Online Privacy 219
      Health and Wellness
      Mental Health 130
      Physical Fitness 190
      Medical And Health Knowledge 183
      Ergonomics 204
      Personal Growth
      Creativity 210
      Emotional Intelligence 133
      Personal Branding 186
      Career Development 166
      Home and Living
      Home Safety 189
      Pet Care 208
      Waste Management 207
      Home Maintenance 261
      Communication
      Communication Skills 134
      Social Etiquette 200
      Public Speaking 158
      Education
      Education Techniques 146
      Reading And Literature 248
      Writing Skills 203
      Linguistics 223
      Personal Business
      Personal Finance 223
      E Commerce 186
      Shopping 241
      Accounting 205
      Business Studies
      Project Management 176
      Human Resources 144
      Business Management 167
      Marketing And Sales Strategies 162
      Entertainment
      Entertainment 207
      Movie And Tv Show 230
      Podcasting 211
      Hobbies 208
      Photography Basics 214
      Everyday Safety
      First Aid 221
      Outdoor Survival Skills 277
      Automotive Care 275
      Family
      Parenting 144
      Relationships 173
      Teens And Youth 161
      Lifestyle
      Fashion And Style 200
      Travel Planning 214
      Sports 188
      Gardening And Horticulture 212
      Environmental
      Sustainable Living 178
      Legal
      Legal Rights 219
      Law 206
      Ethics
      Ethical Living 171
      Ethics 172
      Arts and Design
      Art Techniques And Architecture 175
      Interior Design 200
      Weather
      Weather Forecasting 205
      Culture and Religion
      Cultural Awareness 243
      Religious Studies 215
      Holidays And Traditions 236
      Critical Thinking
      Formal Logic 210
      Logical Fallacies 293
      Basic Mathematics
      Elementary Mathematics 254
      High School Mathematics 200
      Basic Statistics 219
      Basic Sciences
      Conceptual Physics 194
      Science Fundamentals 191
      Social Sciences
      Social Sciences 207
      Political Systems 193
      World History 206
      Geography 211
      Miscellaneous
      Global Facts 228
      News And Information 203
      MMLU
      Total Questions: 15,573
      Basic Mathematics
      Elementary Mathematics 419
      High School Mathematics 299
      High School Statistics 239
      Basic Sciences
      High School Physics 168
      Conceptual Physics 261
      High School Chemistry 225
      High School Biology 342
      Advanced Mathematics
      Abstract Algebra 111
      College Mathematics 111
      Advanced Sciences
      College Physics 113
      College Chemistry 108
      Astronomy 168
      College Biology 160
      Medical Genetics 111
      Virology 184
      Computer Science
      College Computer Science 111
      High School Computer Science 109
      Machine Learning 123
      Security and Privacy
      Computer Security 111
      Security Studies 272
      Engineering
      Electrical Engineering 161
      Food and Cooking
      Nutrition 339
      Religion and Culture
      World Religions 190
      Medical Sciences
      Clinical Knowledge 294
      College Medicine 195
      Professional Medicine 303
      Human Aging 246
      Human Sexuality 143
      Anatomy 149
      Personal Business
      Management 114
      Marketing 259
      Business Studies
      Professional Accounting 313
      Econometrics 126
      Business Ethics 111
      High School Macroeconomics 433
      High School Microeconomics 264
      Morality
      Moral Disputes 384
      Moral Scenarios 995
      Professional Law
      Professional Law 1704
      International Law 134
      Jurisprudence 119
      Psychology
      High School Psychology 605
      Professional Psychology 681
      Social Sciences
      Sociology 223
      Philosophy 345
      High School Geography 220
      High School Government And Politics 214
      Public Relations 122
      Us Foreign Policy 111
      History
      High School European History 183
      High School Us History 226
      High School World History 263
      Prehistory 359
      Critical Thinking
      Formal Logic 140
      Logical Fallacies 181
      Miscellaneous
      Global Facts 110
      Miscellaneous 869
      MMLU-Pro
      Total Questions: 12,102
      Math 1,356
      Physics 1,304
      Chemistry 1,137
      Law 1,106
      Engineering 974
      Other 929
      Economics 849
      Health 823
      Psychology 803
      Business 794
      Biology 722
      Philosophy 504
      Computer Science 415
      History 386

Our Methodology

Our methodology reflects a rigorous and systematic approach, designed to ensure not only quality but also practicality for mobile-based applications. By leveraging detailed planning and execution strategies, we aim to provide benchmarks that resonate with real-world use cases. Every step has been designed and reviewed to uphold the highest standards of relevance, reliability, and scalability.

Methodology Overview

Our methodology is a detailed, multi-step process designed to ensure comprehensive and reliable benchmarks:

  1. Field Selection: We began by conducting an in-depth search to identify fields that people frequently need or use in daily life, work, shopping, gaming, travel, and other scenarios. These fields were chosen to align with mobile searches and user queries, and were gathered from diverse sources, including Wikipedia, various websites, and large language models, to ensure inclusivity and relevance.
  2. Question Structuring and Human Annotation:
    • The questions included standard questions to evaluate general knowledge and understanding, and challenging scenario-based questions crafted to simulate real-world situations and test critical thinking skills.
    • The questions underwent multiple rounds of human annotation. This included generating the ground truth answers first and then creating multiple-choice questions (MCQs) based on the ground truth. The MCQs were crafted with the following principles:
      • Options were highly similar to the ground truth, differing only in specific keywords or subtle details to make them incorrect.
      • On average, MCQs were longer than the ground truth answers to test model precision.
      • Some questions included multiple correct answers for added complexity.
  3. Quality Assurance: The generated questions were thoroughly reviewed for similarity and uniqueness. Any redundant or overly similar questions were removed. Additionally, each batch of questions underwent sampling and human verification to ensure accuracy and relevance.
  4. Evaluation on LLMs: The curated dataset was used to evaluate various large language models across different scales, focusing particularly on those optimized for mobile usage. Evaluation metrics included latency, accuracy, and energy efficiency to ensure the benchmarks were practical for mobile environments.

This meticulous process ensures that our benchmarks are not only comprehensive but also reflective of real-world mobile usage scenarios.
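As an illustration of how the multiple-choice format described above translates into evaluation, below is a minimal sketch of formatting a question into a lettered prompt and scoring letter predictions for accuracy. The function names (`format_mcq`, `score`), the prompt layout, and the model-calling step are hypothetical placeholders, not the official Mobile-MMLU harness.

```python
# Hypothetical MCQ evaluation sketch: format a question with lettered
# options, then score letter predictions against an answer key.
# Names and prompt layout are illustrative, not the official harness.

def format_mcq(question: str, options: list[str]) -> str:
    """Render a question and its options as a single lettered prompt."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(predictions: list[str], answers: list[str]) -> float:
    """Accuracy: fraction of predictions matching the answer key."""
    correct = sum(p.strip().upper() == a for p, a in zip(predictions, answers))
    return correct / len(answers)

prompt = format_mcq(
    "What is the best way to store spices?",
    ["In direct sunlight", "In a cool, dark cupboard",
     "Near the stove", "In the refrigerator door"],
)
# Two hypothetical model outputs vs. the key: one right, one wrong.
print(score(["b", "A"], ["B", "B"]))  # 0.5
```

In practice the predictions would come from the model under test; the accuracy over all 16,186 questions is the headline metric reported in the result tables.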

Benchmark Results

Discover the latest results from our interactive visualizations, comparing LLMs on performance, accuracy, and efficiency. Dive deep into the metrics and make informed decisions about the future of mobile intelligence.

  • The results reveal several interesting observations. First, there is higher variance in performance across models on the Mobile-MMLU benchmark than on MMLU and MMLU-Pro. This increased variance is particularly valuable because it allows for a clearer distinction between model capabilities, especially for smaller-scale models (1-3B parameters), which are the primary focus of this benchmark. For example, Qwen2.5-3B, Phi-3.5-mini, and Llama-3.2-3B, all of roughly the same size, exhibit significant differences in their results. Another notable point is that strong performance on MMLU or MMLU-Pro does not guarantee comparable results on Mobile-MMLU. For instance, Phi-3.5-mini performs impressively on MMLU and MMLU-Pro but falls short on Mobile-MMLU. Conversely, Qwen2.5-3B posts relatively modest results on MMLU and MMLU-Pro but excels on Mobile-MMLU, even surpassing some 8B models on this benchmark.

    Dataset Comparison Overview
    Table: Performance comparison of models on 3 different benchmarks
  • We conduct hardware tests using the llama.cpp test framework, specifically modifying the official SwiftUI example from llama.cpp. These tests are performed on iOS to evaluate performance, with all LLMs converted to the GGUF format and quantized using the Q4_K_M method. This quantization technique is recognized for its efficiency in mobile-optimized models. All tests use Apple's Metal API as the backend, ensuring a consistent runtime environment across devices. We also record the on-device size of the GGUF files and parameter counts. It is worth noting that Q4_K_M-quantized models include additional quantization parameters, such as scaling factors, which result in a higher parameter count compared to the original models.

    We conduct two types of inference tests: the Prefilling 512 test and the Text Generation 128 test. The Prefilling 512 test measures performance during the initial phase where 512 tokens are processed. The Text Generation 128 test evaluates performance during the text generation phase, where 128 tokens are generated. Each benchmark records performance metrics such as token throughput (measured in tokens per second) and peak memory usage (maximum RAM utilization).
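The throughput metric in both tests reduces to tokens divided by elapsed seconds. A minimal sketch of that arithmetic, using placeholder timings (the numbers below are illustrative, not measured results from the iPhone tables):

```python
# Token throughput for the two test phases: tokens / elapsed seconds.
# The timing values are illustrative placeholders, not measurements.

def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second for one inference phase."""
    return num_tokens / elapsed_s

# Prefilling 512 test: 512 prompt tokens processed in a hypothetical 2.0 s.
prefill_tps = tokens_per_second(512, 2.0)

# Text Generation 128 test: 128 tokens generated in a hypothetical 8.0 s.
gen_tps = tokens_per_second(128, 8.0)

print(prefill_tps, gen_tps)  # 256.0 16.0
```

Prefill throughput is typically far higher than generation throughput because prompt tokens are processed in parallel, while generated tokens are produced one at a time.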

    iPhone results
    Table: Performance on the Apple iPhone 14
  • Below is a detailed heatmap visualization showing the performance of each model across different categories. The color intensity represents the accuracy percentage, with darker red indicating lower performance and darker blue indicating higher performance.


    Dataset Comparison Overview

    Table: Performance comparison of different models across various categories of our benchmark

Citation: Our full paper will be released soon. Please cite this work if you find it helpful.

                    @misc{mobilemmlu2024,
                        title={Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark},
                        author={Sondos Mahmoud Bsharat and Mukul Ranjan and Aidar Myrzakhan and Jiacheng Liu and Bowei Guo and Shengkun Tang and Zhuang Liu and Yuanzhi Li and Zhiqiang Shen},
                        url={https://github.com/VILA-Lab/Mobile-MMLU},
                        note={Also available at \url{https://huggingface.co/spaces/MBZUAI-LLM/Mobile-MMLU}},
                        year={2024}
                    }