
AI4Bharat Releases ‘FBI’ Framework to Evaluate LLM Benchmarks

LLMs are increasingly used to evaluate the outputs of other LLMs, and those judgements can influence leaderboards. 'Finding Blind Spots' sets out to evaluate these evaluators.


FBI is Here to Evaluate LLMs and Benchmarks

A recent research paper, “Finding Blind Spots in Evaluator LLMs with Interpretable Checklists”, was released by the Indian Institute of Technology (IIT) Madras and AI4Bharat, an initiative spearheading AI research in India. The paper reveals significant flaws in the current practice of using LLMs to evaluate text generation tasks.

Authored by researchers Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, and Mitesh M Khapra, the paper introduces FBI, a novel framework designed to assess how well Evaluator LLMs can gauge four critical abilities in other LLMs: factual accuracy, adherence to instructions, coherence in long-form writing, and reasoning proficiency.

The study introduced targeted perturbations into LLM-generated answers, each affecting one of these key capabilities, to determine whether Evaluator LLMs could detect the resulting drops in quality. A total of 2,400 perturbed answers spanning 22 perturbation categories were created, and different evaluation strategies were applied to five prominent Evaluator LLMs frequently referenced in the literature.
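To make the perturb-and-check idea concrete, here is a minimal, hypothetical Python sketch. The function names, the toy perturbation, and the scoring interface are assumptions for illustration; they are not taken from the paper's code.

```python
# Illustrative sketch of FBI-style perturbation testing: apply a known
# quality-degrading edit to an answer and check whether a judge scores
# the perturbed version lower. All names here are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PerturbedPair:
    question: str
    original: str
    perturbed: str
    category: str  # e.g. "spelling", "factual", "reasoning"


def add_spelling_error(answer: str) -> str:
    """Toy perturbation: swap the first two characters of the last word."""
    words = answer.split()
    last = words[-1]
    if len(last) > 1:
        words[-1] = last[1] + last[0] + last[2:]
    return " ".join(words)


def detects_quality_drop(score: Callable[[str, str], float],
                         pair: PerturbedPair) -> bool:
    """The evaluator passes only if it rates the perturbed answer lower."""
    return score(pair.question, pair.perturbed) < score(pair.question, pair.original)


if __name__ == "__main__":
    q = "Where is the Eiffel Tower located?"
    a = "The Eiffel Tower is located in Paris."
    pair = PerturbedPair(q, a, add_spelling_error(a), category="spelling")

    # Stand-in judge: a real study would call an Evaluator LLM here.
    naive_judge = lambda question, answer: float(len(answer))

    print(pair.perturbed)                           # "... located in aPris."
    print(detects_quality_drop(naive_judge, pair))  # False: the drop goes unnoticed
```

In the actual study, the stand-in judge would be replaced by calls to the Evaluator LLMs under test, and the pass rate across the 2,400 perturbed answers would quantify how often each evaluator catches the injected errors.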


The findings revealed significant deficiencies in current Evaluator LLMs, which failed to identify declines in quality in over 50% of cases on average. Single-answer and pairwise evaluations exhibited notable limitations, while reference-guided evaluations performed relatively better.
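For context, the three judging set-ups differ mainly in how much information the judge sees. The prompt shapes below are a rough illustration of each strategy; the wording is an assumption, not the paper's prompts.

```python
# Rough prompt shapes for the three evaluation strategies mentioned above.

def single_answer_prompt(question: str, answer: str) -> str:
    """Judge scores one answer in isolation."""
    return (
        "Rate the answer to the question on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )


def pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Judge compares two answers (e.g. original vs. perturbed) head to head."""
    return (
        "Decide which answer is better, A or B.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\nVerdict:"
    )


def reference_guided_prompt(question: str, answer: str, reference: str) -> str:
    """Judge scores the answer against a gold reference, the set-up that
    fared relatively better in the study."""
    return (
        "Compare the answer with the reference and rate it on a scale of 1-5.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}\nScore:"
    )
```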

The study underscores the unreliable nature of current Evaluator LLMs and emphasises the need for caution when using them to evaluate text generation capabilities. Notably, Evaluator LLMs consistently missed even basic errors, such as spelling and grammar mistakes.

Way Forward

For systems that support high-stakes decision-making, the reliability of their evaluations must be scrutinised. The study underscores the need for improved evaluation strategies and the risks of over-relying on current LLM evaluators.

The FBI framework offers a path forward by providing a more interpretable and comprehensive method for testing evaluator capabilities. By revealing the prevalent failure modes and blind spots of existing models, this framework can guide the development of more robust and reliable AI evaluators.



Vandana Nair

With a rare blend of engineering, MBA, and journalism degrees, Vandana Nair brings a unique combination of technical know-how, business acumen, and storytelling skills to the table. Her insatiable curiosity for all things startups, business, and AI technologies ensures that there's always a fresh and insightful perspective in her reporting.