Mobile-MMLU

A Mobile Intelligence Language Understanding Benchmark

Sondos Mahmoud Bsharat*, Mukul Ranjan*, Aidar Myrzakhan*, Jiacheng Liu, Bowei Guo, Shengkun Tang,
Zhuang Liu, Yuanzhi Li, Zhiqiang Shen

*Co-first Author and Equal Contribution;  Corresponding Author and Project Lead

VILA Lab, Mohamed bin Zayed University of AI (MBZUAI), Princeton University, Apple
Leaderboard | GitHub | Contact

Welcome to Mobile-MMLU

Questions asked on mobile devices often differ from those on computers. For instance, users might ask how to cook a dish with a phone in a kitchen, whereas queries about web programming are much more likely to be asked on computers.

Across 80 diverse fields (Health, Technology, etc.) and with more than 16,000 questions, Mobile-MMLU evaluates LLMs' capabilities in practical mobile scenarios, from basic life skills to complex problem-solving.

Explore Mobile-MMLU Benchmark

Motivation

Why Are Apple Intelligence and Mobile Intelligence Important for LLMs on Mobile Devices?

  1. On-Device Efficiency:

    Mobile devices, such as smartphones, tablets, and wearables, have limited computational resources compared to high-end servers. Apple Intelligence and other mobile-focused AI frameworks prioritize optimization techniques to run small-scale large language models (LLMs) directly and efficiently on these constrained environments. Techniques like model compression, quantization, and efficient architectures (e.g., linear complexity models) are crucial for enabling LLMs to perform effectively on mobile devices. However, these models have limited capabilities.

  2. Personalization and Privacy:

    On-device LLMs can process user data locally without sending sensitive information to cloud servers. This ensures better privacy and security, a feature that aligns with Apple’s emphasis on user-centric privacy through its Apple Intelligence ecosystem. Mobile Intelligence allows models to adapt to user preferences, habits, and behaviors in real-time while safeguarding data.

  3. Accessibility and Ubiquity:

    Mobile devices are used globally and form a significant part of everyday life. Integrating LLMs into mobile platforms broadens their accessibility, enabling users in low-bandwidth regions or offline environments to experience the benefits of AI seamlessly. Apple Intelligence exemplifies how optimized LLMs can democratize advanced capabilities for mainstream users.

Why Is a Mobile LLM Benchmark Necessary to Identify Good Mobile LLMs?

  1. Performance Evaluation in Mobile-Centric Settings:

    A dedicated mobile LLM benchmark is essential to evaluate models under the real-world limitations of mobile devices, such as limited memory, energy constraints, and computational power. Traditional benchmarks often target cloud-based models without accounting for these restrictions.

  2. Comparing Mobile Models:

    A mobile-centric benchmark allows researchers and developers to compare different mobile LLM architectures, optimizations, and compression techniques fairly. Without such benchmarks, it is challenging to assess which models are suitable for mobile deployment.

  3. Identifying Mobile Use-Specific Solutions:

    Mobile devices are integral to daily life, and integrating LLMs allows for personalized, real-time experiences such as smarter virtual assistants, contextual recommendations, and language understanding for apps. Apple Intelligence, for example, enables LLMs to adapt to user behaviors, provide tailored responses, and improve applications like Siri, search, and messaging.

  4. Improving End-User Experience:

    Ultimately, a mobile LLM benchmark helps identify models that deliver the best user experience on mobile devices. This includes fluid interactions, responsiveness, and minimal battery drain, making LLM-powered features practical and usable on a large scale.

Benchmark Overview

Our dataset encompasses 80 fields, featuring a total of 16,186 questions, including scenario-based questions. Below is a sample of fields showcasing the diversity of question types.

Ergonomics

Question: What is the best way to hold a smartphone to reduce strain?

Cooking and Recipes

Question: How do I plan meals for someone with gluten intolerance?

First Aid

Question: Someone is bleeding profusely from their leg after an accident; I applied pressure, but it's not stopping. Should I use a tourniquet, and how?

Basic Life Skills

Question: What is the best way to store spices?

Nutrition and Diet

Question: How can I reduce my intake of processed foods?

Culture

Question: What is the importance of the Māori facial tattoo, known as Tā Moko?

Mental Health

Question: How can I manage anticipatory anxiety?
(1) Participating in cognitive behavioral therapy and utilizing visualization techniques are useful strategies to manage anticipatory anxiety, offering long-term benefits.
(2) Engaging in deep breathing exercises and regular yoga sessions can effectively manage anticipatory anxiety by promoting a sense of calm.
(3) Developing a routine of physical exercise and adopting a balanced diet can help manage anticipatory anxiety by improving overall mental well-being.
(4) Practicing mindfulness and relaxation techniques can help manage anticipatory anxiety.
Which of the statements given above are correct?

Home Maintenance

Question: What should I check if my washing machine is making loud banging noises during the spin cycle?
(1) Ensure the washing machine is on a stable and flat surface, make sure the load is distributed evenly inside the drum, inspect the drum for any objects that might have been left behind, and check for worn or damaged drum bearings or suspension springs.
(2) Check if the washing machine is balanced and level, ensure the load is evenly distributed, inspect the drum for foreign objects, and check for worn or damaged shock absorbers or suspension rods.
(3) Check if the washing machine is correctly balanced and perfectly aligned, ensure the laundry load is properly distributed across the drum, inspect the inner drum for any foreign materials, and check for worn or damaged springs or suspension belts.
(4) Verify that the washing machine is level and not tilted, confirm that the laundry is evenly spread within the drum, thoroughly inspect the drum for any foreign objects, and check for worn or damaged vibration dampers or suspension springs.
Which of the statements given above are correct?

Travel Planning

Question: I'm planning a trip from New York to Tokyo, departing on December 20th and returning on January 5th. Considering the time difference and international date line, what dates and times will my flights actually be, and how can I minimize jet lag?
(1) When traveling from New York to Tokyo, you should consider the time difference and crossing the International Date Line. New York is typically 16 hours behind Tokyo. If your flight from New York leaves on December 20th, you will likely arrive in Tokyo on December 20th, because the flight duration is approximately 13 hours non-stop. On your return trip, if you leave Tokyo on January 5th, you'll arrive back in New York on January 4th, because you gain a day crossing the International Date Line eastward. To minimize jet lag, adjust your sleep schedule a few days before departure to match Tokyo time, stay hydrated during the flight, avoid alcohol and caffeine, and get plenty of rest upon arrival. Adapt gradually to the local time zone once you arrive.
(2) When traveling from New York to Tokyo, you must consider the time difference and the crossing of the International Date Line. New York is generally 18 hours behind Tokyo. If your flight from New York departs on December 20th, you will likely arrive in Tokyo on December 23rd, due to the time difference and a flight duration of approximately 14 hours non-stop. On your return journey, if you depart Tokyo on January 5th, you'll arrive back in New York on January 7th, because you lose a day crossing the International Date Line westward. To minimize jet lag, adjust your sleep schedule several days before departure to match Tokyo time, stay hydrated during the flight, avoid alcohol and caffeine, and get plenty of sunlight upon arrival. Rest and gradually adapt to the local time zone once you arrive.
(3) When traveling from New York to Tokyo, you need to consider the time difference and the crossing of the International Date Line. New York is generally 12 hours behind Tokyo. If your flight from New York departs on December 20th, you will likely arrive in Tokyo on December 22nd, due to the time difference and flight duration, which is approximately 16 hours non-stop. On your return trip, if you depart Tokyo on January 5th, you'll arrive back in New York on January 6th, because you lose a day crossing the International Date Line westward. To minimize jet lag, start adjusting your sleep schedule a week before departure to match Tokyo time, stay hydrated during the flight, avoid heavy meals and caffeine, and get plenty of sunlight upon arrival. Rest and gradually adapt to the local time zone once you arrive.
(4) When traveling from New York to Tokyo, you need to consider the time difference and the crossing of the International Date Line. New York is typically 14 hours behind Tokyo. If your flight from New York departs on December 20th, you will likely arrive in Tokyo on December 21st, due to the time difference and flight duration, which is approximately 14 hours non-stop. On your return trip, if you depart Tokyo on January 5th, you'll arrive back in New York on the same day, January 5th, because you gain a day crossing the International Date Line eastward. To minimize jet lag, adjust your sleep schedule a few days before departure to match Tokyo time, stay hydrated during the flight, avoid alcohol and caffeine, and get plenty of sunlight upon arrival. Rest and gradually adapt to the local time zone once you arrive.

Which of the statements given above are correct?

Benchmark Statistics

Explore our comprehensive dataset covering 80 fields, from technical disciplines to creative domains. Gain insight into the diversity and depth of the data that drives our benchmark.

    Dataset Coverage and Structure

    • Our dataset encompasses 80 fields, featuring a total of 16,186 questions, carefully curated to evaluate mobile-compatible language models. Each field includes multiple-choice questions designed to test both fundamental knowledge and real-world applications. Below is a comprehensive visualization showing the distribution of questions across all fields, demonstrating the breadth and depth of our benchmark's coverage.

      The visualization uses a three-layer sunburst chart where the innermost ring represents main categories (like "Academic & Learning", "Business & Career"), the middle ring shows subcategories (such as "Health & Wellness" including Mental Health and Physical Fitness), and the outermost ring displays all 80 fields with their question distribution. Hover over any segment to see detailed statistics, with segment sizes proportional to question counts. The consistent color coding across layers helps track relationships between categories and their subfields, highlighting the systematic coverage of mobile-centric knowledge domains.

    Topic Distribution Analysis

    • This visualization demonstrates the topic distribution across the Mobile-MMLU, MMLU, and MMLU-Pro benchmarks. From the scatter plot, we can observe that Mobile-MMLU topics occupy a distinct semantic space compared to those of the MMLU and MMLU-Pro benchmarks. This clear separation in the topic distribution highlights Mobile-MMLU's unique focus on practical, mobile-relevant scenarios, complementing existing benchmarks rather than overlapping with them. The distinct clustering pattern validates Mobile-MMLU's contribution as a specialized benchmark tailored for evaluating mobile-oriented language models.

      Topic Semantic Space Distribution

    Comparative Dataset Analysis

    • This comparison highlights the differences in basic statistics between Mobile-MMLU and the MMLU and MMLU-Pro benchmarks. Our benchmark features a greater number of questions and topics and is more diverse, offering broader coverage and depth.

      Total Questions Analysis
      Total Questions Distribution
      Total Topics Analysis
      Total Topics Distribution
    • The figures below provide a comprehensive visualization of the dataset characteristics. For Mobile-MMLU and MMLU, we showcase the top 40 categories by both question count and word distribution, highlighting the depth and breadth of coverage in each domain. For MMLU-Pro, which focuses on specialized professional knowledge, we present the top 14 categories. These distributions reveal distinct patterns: Mobile-MMLU demonstrates a balanced distribution across practical, everyday topics, MMLU shows concentration in academic and professional fields, while MMLU-Pro exhibits focused coverage of specialized professional domains.

    Dataset Categories and Question Distribution

    • The following tables present a comprehensive statistical breakdown comparing Mobile-MMLU with MMLU and MMLU-Pro benchmarks. Each dataset table showcases its unique hierarchical structure - from broad categories to specific topics - along with detailed question counts. Mobile-MMLU offers extensive coverage with 80 fields and 16,186 questions, emphasizing practical, everyday knowledge areas. In contrast, MMLU contains 57 subjects with 15,573 questions focusing on academic disciplines, while MMLU-Pro features 14 specialized professional fields with 12,102 questions. This side-by-side comparison highlights how Mobile-MMLU complements existing benchmarks by introducing new categories specifically relevant to mobile use cases, while maintaining comprehensive coverage in terms of both breadth and depth.

      Mobile-MMLU
      Total Questions: 16,186
      Daily Life Skills
      Basic Life Skills 209
      Time Management 211
      Conflict Resolution 152
      Event Planning 201
      Food and Cooking
      Cooking And Recipes 274
      Food Safety 219
      Nutrition And Diet 151
      Digital Literacy
      Digital Literacy 198
      Technical Help 254
      Mobile Customization 230
      Social Media
      Social Media 217
      Digital Detox 196
      Privacy and Security
      Cybersecurity 208
      Online Privacy 219
      Health and Wellness
      Mental Health 130
      Physical Fitness 190
      Medical And Health Knowledge 183
      Ergonomics 204
      Personal Growth
      Creativity 210
      Emotional Intelligence 133
      Personal Branding 186
      Career Development 166
      Home and Living
      Home Safety 189
      Pet Care 208
      Waste Management 207
      Home Maintenance 261
      Communication
      Communication Skills 134
      Social Etiquette 200
      Public Speaking 158
      Education
      Education Techniques 146
      Reading And Literature 248
      Writing Skills 203
      Linguistics 223
      Personal Business
      Personal Finance 223
      E Commerce 186
      Shopping 241
      Accounting 205
      Business Studies
      Project Management 176
      Human Resources 144
      Business Management 167
      Marketing And Sales Strategies 162
      Entertainment
      Entertainment 207
      Movie And Tv Show 230
      Podcasting 211
      Hobbies 208
      Photography Basics 214
      Everyday Safety
      First Aid 221
      Outdoor Survival Skills 277
      Automotive Care 275
      Family
      Parenting 144
      Relationships 173
      Teens And Youth 161
      Lifestyle
      Fashion And Style 200
      Travel Planning 214
      Sports 188
      Gardening And Horticulture 212
      Environmental
      Sustainable Living 178
      Legal
      Legal Rights 219
      Law 206
      Ethics
      Ethical Living 171
      Ethics 172
      Arts and Design
      Art Techniques And Architecture 175
      Interior Design 200
      Weather
      Weather Forecasting 205
      Culture and Religion
      Cultural Awareness 243
      Religious Studies 215
      Holidays And Traditions 236
      Critical Thinking
      Formal Logic 210
      Logical Fallacies 293
      Basic Mathematics
      Elementary Mathematics 254
      High School Mathematics 200
      Basic Statistics 219
      Basic Sciences
      Conceptual Physics 194
      Science Fundamentals 191
      Social Sciences
      Social Sciences 207
      Political Systems 193
      World History 206
      Geography 211
      Miscellaneous
      Global Facts 228
      News And Information 203
      MMLU
      Total Questions: 15,573
      Basic Mathematics
      Elementary Mathematics 419
      High School Mathematics 299
      High School Statistics 239
      Basic Sciences
      High School Physics 168
      Conceptual Physics 261
      High School Chemistry 225
      High School Biology 342
      Advanced Mathematics
      Abstract Algebra 111
      College Mathematics 111
      Advanced Sciences
      College Physics 113
      College Chemistry 108
      Astronomy 168
      College Biology 160
      Medical Genetics 111
      Virology 184
      Computer Science
      College Computer Science 111
      High School Computer Science 109
      Machine Learning 123
      Security and Privacy
      Computer Security 111
      Security Studies 272
      Engineering
      Electrical Engineering 161
      Food and Cooking
      Nutrition 339
      Religion and Culture
      World Religions 190
      Medical Sciences
      Clinical Knowledge 294
      College Medicine 195
      Professional Medicine 303
      Human Aging 246
      Human Sexuality 143
      Anatomy 149
      Personal Business
      Management 114
      Marketing 259
      Business Studies
      Professional Accounting 313
      Econometrics 126
      Business Ethics 111
      High School Macroeconomics 433
      High School Microeconomics 264
      Morality
      Moral Disputes 384
      Moral Scenarios 995
      Professional Law
      Professional Law 1704
      International Law 134
      Jurisprudence 119
      Psychology
      High School Psychology 605
      Professional Psychology 681
      Social Sciences
      Sociology 223
      Philosophy 345
      High School Geography 220
      High School Government And Politics 214
      Public Relations 122
      Us Foreign Policy 111
      History
      High School European History 183
      High School Us History 226
      High School World History 263
      Prehistory 359
      Critical Thinking
      Formal Logic 140
      Logical Fallacies 181
      Miscellaneous
      Global Facts 110
      Miscellaneous 869
      MMLU-Pro
      Total Questions: 12,102
      Math 1,356
      Physics 1,304
      Chemistry 1,137
      Law 1,106
      Engineering 974
      Other 929
      Economics 849
      Health 823
      Psychology 803
      Business 794
      Biology 722
      Philosophy 504
      Computer Science 415
      History 386

Our Methodology

Our methodology reflects a rigorous and systematic approach, designed to ensure not only quality but also practicality for mobile-based applications. By leveraging detailed planning and execution strategies, we aim to provide benchmarks that resonate with real-world use cases. Every step has been designed and reviewed to uphold the highest standards of relevance, reliability, and scalability.

Methodology Overview

Our methodology is a detailed, multi-step process designed to ensure comprehensive and reliable benchmarks:

  1. Field Selection: We began by conducting an in-depth search to identify fields that people frequently need or use in daily life, work, shopping, gaming, travel, and other scenarios. These fields were chosen to align with mobile searches and user queries, and were gathered from diverse sources, including Wikipedia, various websites, and large language models, to ensure inclusivity and relevance.
  2. Question Structuring and Human Annotation:
    • The questions included standard questions to evaluate general knowledge and understanding, and challenging scenario-based questions crafted to simulate real-world situations and test critical thinking skills.
    • The questions underwent multiple rounds of human annotation. This included generating the ground truth answers first and then creating multiple-choice questions (MCQs) based on the ground truth. The MCQs were crafted with the following principles:
      • Options were highly similar to the ground truth, differing only in specific keywords or subtle details to make them incorrect.
      • On average, MCQs were longer than the ground truth answers to test model precision.
      • Some questions included multiple correct answers for added complexity.
  3. Quality Assurance: The generated questions were thoroughly reviewed for similarity and uniqueness. Any redundant or overly similar questions were removed. Additionally, each batch of questions underwent sampling and human verification to ensure accuracy and relevance.
  4. Evaluation on LLMs: The curated dataset was used to evaluate various large language models across different scales, focusing particularly on those optimized for mobile usage. Evaluation metrics included latency, accuracy, and energy efficiency to ensure the benchmarks were practical for mobile environments.

This meticulous process ensures that our benchmarks are not only comprehensive but also reflective of real-world mobile usage scenarios.
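As an illustration of how the multiple-choice format described above translates into evaluation, below is a minimal sketch of formatting a question into a lettered prompt and scoring letter predictions for accuracy. The function names (`format_mcq`, `score`), the prompt layout, and the model-calling step are hypothetical placeholders, not the official Mobile-MMLU harness.

```python
# Hypothetical MCQ evaluation sketch: format a question with lettered
# options, then score letter predictions against an answer key.
# Names and prompt layout are illustrative, not the official harness.

def format_mcq(question: str, options: list[str]) -> str:
    """Render a question and its options as a single lettered prompt."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(predictions: list[str], answers: list[str]) -> float:
    """Accuracy: fraction of predictions matching the answer key."""
    correct = sum(p.strip().upper() == a for p, a in zip(predictions, answers))
    return correct / len(answers)

prompt = format_mcq(
    "What is the best way to store spices?",
    ["In direct sunlight", "In a cool, dark cupboard",
     "Near the stove", "In the refrigerator door"],
)
# Two hypothetical model outputs vs. the key: one right, one wrong.
print(score(["b", "A"], ["B", "B"]))  # 0.5
```

In practice the predictions would come from the model under test; the accuracy over all 16,186 questions is the headline metric reported in the result tables.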

Benchmark Results

Discover the latest results from our interactive visualizations, comparing LLMs on performance, accuracy, and efficiency. Dive deep into the metrics and make informed decisions about the future of mobile intelligence.

  • The results reveal several interesting observations. First, there is higher variance in performance across models on the Mobile-MMLU benchmark than on MMLU and MMLU-Pro. This increased variance is particularly valuable because it allows for a clearer distinction between model capabilities, especially for smaller-scale models (1-3B parameters), which are the primary focus of this benchmark. For example, Qwen2.5-3B, Phi-3.5-mini, and Llama-3.2-3B, all of roughly the same size, exhibit significant differences in their results. Another notable point is that strong performance on MMLU or MMLU-Pro does not guarantee comparable results on Mobile-MMLU. For instance, Phi-3.5-mini performs impressively on MMLU and MMLU-Pro but falls short on Mobile-MMLU. Conversely, Qwen2.5-3B posts relatively modest results on MMLU and MMLU-Pro but excels on Mobile-MMLU, even surpassing some 8B models on this benchmark.

    Dataset Comparison Overview
    Table: Performance comparison of models on 3 different benchmarks
  • We conduct hardware tests using the llama.cpp test framework, specifically modifying the official SwiftUI example from llama.cpp. These tests are performed on iOS to evaluate performance, with all LLMs converted to the GGUF format and quantized using the Q4_K_M method. This quantization technique is recognized for its efficiency in mobile-optimized models. All tests use Apple's Metal API as the backend, ensuring a consistent runtime environment across devices. We also record the on-device size of the GGUF files and parameter counts. It is worth noting that Q4_K_M-quantized models include additional quantization parameters, such as scaling factors, which result in a higher parameter count compared to the original models.

    We conduct two types of inference tests: the Prefilling 512 test and the Text Generation 128 test. The Prefilling 512 test measures performance during the initial phase where 512 tokens are processed. The Text Generation 128 test evaluates performance during the text generation phase, where 128 tokens are generated. Each benchmark records performance metrics such as token throughput (measured in tokens per second) and peak memory usage (maximum RAM utilization).
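The throughput metric in both tests reduces to tokens divided by elapsed seconds. A minimal sketch of that arithmetic, using placeholder timings (the numbers below are illustrative, not measured results from the iPhone tables):

```python
# Token throughput for the two test phases: tokens / elapsed seconds.
# The timing values are illustrative placeholders, not measurements.

def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second for one inference phase."""
    return num_tokens / elapsed_s

# Prefilling 512 test: 512 prompt tokens processed in a hypothetical 2.0 s.
prefill_tps = tokens_per_second(512, 2.0)

# Text Generation 128 test: 128 tokens generated in a hypothetical 8.0 s.
gen_tps = tokens_per_second(128, 8.0)

print(prefill_tps, gen_tps)  # 256.0 16.0
```

Prefill throughput is typically far higher than generation throughput because prompt tokens are processed in parallel, while generated tokens are produced one at a time.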

    iPhone results
    Table: Performance on the Apple iPhone 14
  • Below is a detailed heatmap visualization showing the performance of each model across different categories. The color intensity represents the accuracy percentage, with darker red indicating lower performance and darker blue indicating higher performance.


    Dataset Comparison Overview

    Table: Performance comparison of different models across various categories of our benchmark

Citation: Our full paper will be released soon. Please cite this work if you find it helpful.

                    @misc{mobilemmlu2024,
                        title={Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark},
                        author={Sondos Mahmoud Bsharat and Mukul Ranjan and Aidar Myrzakhan and Jiacheng Liu and Bowei Guo and Shengkun Tang and Zhuang Liu and Yuanzhi Li and Zhiqiang Shen},
                        url={https://github.com/VILA-Lab/Mobile-MMLU},
                        note={Also available at \url{https://huggingface.co/spaces/MBZUAI-LLM/Mobile-MMLU}},
                        year={2024}
                    }