Google’s Gemini, a multimodal large language model (LLM), has garnered significant attention since its introduction through various channels. Bard is now powered by Gemini Pro, and Pixel 8 Pro users will experience new features with the integration of Gemini Nano. The release of Gemini Ultra is scheduled for next year. Starting from December 13th, developers and enterprise customers can access Gemini Pro through Google AI Studio or Vertex AI in Google Cloud.
However, a closer examination of the benchmarks presented on the official website reveals discrepancies and misleading comparisons. This blog will delve into one such benchmark, explain the underlying differences between the metrics used, and emphasize the need for responsible and transparent benchmark reporting.
The Gemini website boasts of its superiority to other models, citing benchmarks like MMLU. At a glance, these numbers seem to paint a picture of clear dominance. But when we delve deeper into the technical report, a different story emerges. The website cherry-picks specific data points and omits details that significantly alter the picture. Let’s examine the specifics of the MMLU benchmark.
What is the MMLU Benchmark?
The MMLU benchmark, short for Massive Multitask Language Understanding, evaluates text models’ multitask accuracy in zero-shot and few-shot settings. It covers 57 tasks, including elementary math, history, computer science, and law, and so requires both broad knowledge and problem-solving ability. Leading models such as OpenAI GPT-4, Google Gemini, and Anthropic Claude 2 are routinely compared on it, which has made MMLU a standard benchmark for evaluating language models.
As a standard for assessing generalization, MMLU helps researchers and developers make informed model choices for specific applications. Its value lies in the granularity and breadth of its dataset, which make it possible to evaluate a model’s performance across diverse contexts. However, the results depend on prompt engineering: a prompt, which may include instructions, questions, context, inputs, or examples, tells the model what to do, and a better prompt generally yields better outcomes.
Apples and Oranges: Understanding the 5-Shot and CoT Prompting Methods
Comparing 5-shot (few-shot) results with CoT results is akin to comparing apples and oranges. Both are ways of prompting a model rather than metrics in themselves, and they differ fundamentally in what they ask the model to do:
Example of FewShot (5-Shot) Prompt
5-Shot (Few Shot, k=5)
In this setup, the model answers each question after seeing five worked examples (question-and-answer demonstrations) in the prompt. It is a specific case of the few-shot prompting technique with k=5. Few-shot prompting enables in-context learning: we provide demonstrations in the prompt to steer the model toward better performance. The demonstrations serve as conditioning for the subsequent example, for which we would like the model to generate a response.
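To make the idea concrete, here is a minimal sketch of how a 5-shot multiple-choice prompt could be assembled. The helper names (format_question, build_few_shot_prompt) and the toy demonstrations are assumptions made for illustration; they are not taken from the MMLU dataset or from any official evaluation harness.

```python
from typing import List, Tuple

# Hypothetical demonstration type: (question, options, correct answer letter).
Demo = Tuple[str, List[str], str]

LETTERS = "ABCD"


def format_question(question: str, options: List[str]) -> str:
    """Render one multiple-choice question, ending with an open 'Answer:' slot."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(demos: List[Demo], question: str, options: List[str]) -> str:
    """Prepend k solved demonstrations, then append the unsolved target question."""
    blocks = [f"{format_question(q, opts)} {ans}" for q, opts, ans in demos]
    blocks.append(format_question(question, options))
    return "\n\n".join(blocks)


# Five toy demonstrations (illustrative only, not MMLU items).
demos: List[Demo] = [
    ("What is 2 + 3?", ["4", "5", "6", "7"], "B"),
    ("Which planet is closest to the Sun?", ["Mercury", "Venus", "Earth", "Mars"], "A"),
    ("What is the chemical symbol for water?", ["CO2", "O2", "H2O", "NaCl"], "C"),
    ("How many sides does a triangle have?", ["2", "3", "4", "5"], "B"),
    ("Which data structure is first-in, first-out?", ["Stack", "Tree", "Graph", "Queue"], "D"),
]

prompt = build_few_shot_prompt(
    demos, "What is 7 * 6?", ["36", "40", "42", "48"]
)
print(prompt)
```

The model sees five solved questions and is expected to complete the final "Answer:" with a single letter, with no room to reason step by step.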
Example for Chain-of-Thought Prompt
CoT (Chain of Thought)
In chain-of-thought prompting, the input question is followed by a series of intermediate natural language reasoning steps that lead to the final answer. Think of this as breaking down a complicated task into bite-sized, logical chunks. This approach has been found to significantly enhance the ability of LLMs to tackle complex arithmetic and commonsense reasoning tasks.
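A chain-of-thought prompt can be sketched in the same spirit. The difference is that each demonstration spells out its reasoning before the answer, which nudges the model to do the same. The function name build_cot_prompt and the arithmetic example are assumptions chosen for illustration, not a prompt used in any published evaluation.

```python
from typing import List, Tuple

# Hypothetical demonstration type: (question, reasoning steps, final answer).
CotDemo = Tuple[str, str, str]


def build_cot_prompt(demos: List[CotDemo], question: str) -> str:
    """Show worked reasoning for each demonstration, then pose the new question."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in demos
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


cot_demos: List[CotDemo] = [
    (
        "A shelf holds 3 boxes with 4 books each. 2 books are removed. "
        "How many books remain?",
        "3 boxes of 4 books is 12 books. Removing 2 leaves 10.",
        "10",
    ),
]

cot_prompt = build_cot_prompt(
    cot_demos,
    "A garden has 5 rows of 6 plants. 4 plants die. How many plants remain?",
)
print(cot_prompt)
```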
Unveiling the Discrepancies
The Gemini website initially presented a seemingly impressive comparison between Gemini Ultra and GPT-4 on the MMLU benchmark:
Gemini Ultra: 90.0% (CoT@32) vs. GPT-4: 86.4% (5-shot)
However, the technical report tells a fuller story when both models are scored with the same method:
Gemini Ultra: 90.04% CoT@32 vs. GPT-4: 87.29% CoT@32
Gemini Ultra: 83.7% 5-shot vs. GPT-4: 86.4% 5-shot
Oh, what a surprise! GPT-4 outshines Gemini Ultra under the 5-shot method. But wait, there’s more! The website also neglects to mention the 87.29% that GPT-4 achieves with the same CoT@32 method. How unexpected!
These inconsistencies call into question the validity and transparency of the presented information.
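The arithmetic behind these comparisons is simple enough to check directly. The short sketch below uses only the scores quoted above; the dictionary layout and the margin helper are illustrative choices.

```python
# MMLU scores (percent) as quoted from the website and the technical report.
scores = {
    "CoT@32": {"Gemini Ultra": 90.04, "GPT-4": 87.29},
    "5-shot": {"Gemini Ultra": 83.7, "GPT-4": 86.4},
}


def margin(method: str) -> float:
    """Gemini Ultra's lead over GPT-4 in percentage points for one method."""
    row = scores[method]
    return round(row["Gemini Ultra"] - row["GPT-4"], 2)


# The website pairs Gemini Ultra's CoT@32 score with GPT-4's 5-shot score.
website_margin = round(90.0 - 86.4, 2)

print("Website (mixed methods):", website_margin)
for method in scores:
    print(f"{method} (same method):", margin(method))
```

On a like-for-like basis, Gemini Ultra's lead shrinks from 3.6 points to 2.75 points with CoT@32, and turns into a 2.7-point deficit with 5-shot.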
Impact of the Misleading Comparison
Presenting CoT@32 for Gemini Ultra while showcasing 5-shot for GPT-4 creates an unfair and misleading comparison. CoT@32 lets the model work through intermediate reasoning steps and draws on 32 sampled answers per question, which tends to produce a higher score than a single 5-shot answer, so pairing it against GPT-4’s 5-shot figure overstates Gemini Ultra’s lead. Compared on the same method, the picture is mixed: Gemini Ultra leads by 2.75 points with CoT@32 (90.04% vs. 87.29%), while GPT-4 leads by 2.7 points with 5-shot (86.4% vs. 83.7%).
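For readers wondering what the "@32" adds: the Gemini technical report describes CoT@32 as an uncertainty-routed scheme in which 32 chain-of-thought samples are drawn, the majority answer is used when the samples agree strongly enough, and a single greedy answer is used otherwise. The sketch below illustrates that routing logic only. The function name and the 0.6 threshold are assumptions for illustration; the report tunes its threshold on validation data, and this is not Google's implementation.

```python
from collections import Counter
from typing import List


def uncertainty_routed_answer(sampled: List[str], greedy: str, threshold: float) -> str:
    """Return the majority answer if consensus is high enough, else the greedy answer."""
    if not sampled:
        return greedy
    top_answer, count = Counter(sampled).most_common(1)[0]
    if count / len(sampled) >= threshold:
        return top_answer
    return greedy


# Strong consensus: 20 of 32 samples say "B" (62.5%), so the vote wins.
confident = uncertainty_routed_answer(["B"] * 20 + ["C"] * 12, greedy="C", threshold=0.6)

# Weak consensus: the top answer has only 12 of 32 votes, so fall back to greedy.
uncertain = uncertainty_routed_answer(
    ["A"] * 10 + ["B"] * 10 + ["C"] * 12, greedy="A", threshold=0.6
)

print(confident, uncertain)
```

Aggregating many reasoning samples in this way gives the model far more opportunities to land on the right answer than a single 5-shot completion does, which is why scores from the two setups should not be placed side by side.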
The Need for Transparency and Responsible Practices
Adding to this, Google aimed to showcase the capabilities of its multimodal LLM through a hands-on video featuring Gemini’s purported real-time responses to voice prompts. Despite the initial impressive demonstration, viewers eventually came across a disclaimer, revealing that latency had been reduced, and Gemini’s outputs were shortened for brevity.
These inconsistencies on the Gemini website raise serious concerns about Google DeepMind’s commitment to transparency and responsible AI development. Such practices can mislead users and create an inaccurate picture of the true capabilities of LLMs.
Conclusion: Building Trust Through Transparency
While the potential of the Gemini family of LLMs is undeniable, the discrepancies in its benchmark reporting raise questions about the accuracy and transparency of the information presented. Moving forward, prioritizing transparency and responsible benchmarking practices is essential to building trust and ensuring responsible development of AI technology. Only through open and honest communication can the LLM community ensure that its advancements benefit all of humanity.