Public vs Private benchmarks for LLMs

by Mike Riess

One of the things that separate LLMs from previous types of AI is the notion of the foundation model, which can be adapted through its input alone, without having to re-train it. This is known as in-context learning, and it is arguably one of the most powerful capabilities of LLMs.
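To make the idea concrete, here is a minimal sketch of in-context learning in Python: the model is adapted purely through a few labelled examples placed in the prompt, with no re-training or fine-tuning. The classification task, the example messages and the call_llm helper are illustrative placeholders, not any specific product or API.

```python
# Minimal sketch of in-context learning: the model is "adapted" only through
# the prompt (a few labelled examples), never through weight updates.

FEW_SHOT_EXAMPLES = [
    ("The invoice amount was wrong again.", "billing"),
    ("My 5G coverage drops every evening.", "network"),
]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whichever LLM client or gateway you use."""
    raise NotImplementedError("plug in your provider's SDK here")

def build_prompt(examples: list[tuple[str, str]], new_message: str) -> str:
    """Turn labelled examples plus a new input into a single few-shot prompt."""
    lines = ["Classify each customer message into a support category."]
    for text, label in examples:
        lines.append(f"Message: {text}\nCategory: {label}")
    lines.append(f"Message: {new_message}\nCategory:")
    return "\n\n".join(lines)

def classify(new_message: str) -> str:
    """Classify a new message using only the examples given in the prompt."""
    prompt = build_prompt(FEW_SHOT_EXAMPLES, new_message)
    return call_llm(prompt).strip()
```

The same base model could be pointed at a completely different task tomorrow simply by swapping the examples in the prompt, which is exactly what makes foundation models so flexible.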
However, performance still differs across models, and models keep getting better over time. Companies therefore need to keep track of this development and proactively select the model that performs best on their relevant KPIs.
Until recently, the standard approach has been to follow public benchmark leaderboards such as Chatbot Arena, EuroEval (formerly ScandEval) and many others. These leaderboards are great for tracking general language, knowledge or even academic capabilities, yet the tasks, and hence the test datasets, do not necessarily represent the tasks or data that are relevant to the company.
In industry, it thereby seems we are taking a cautious step back towards MLOps and all the lessons learned in this area before Generative AI: problems like data leakage, poor data quality and management, lack of model maintenance, and (what some of us forgot with Generative AI) model evaluation on a representative distribution.
Of course, most benchmarks do represent the task they evaluate very well. But a telco operator, for instance, is not interested in how engaging the LLM is to chat with, how well it summarizes news articles, handles general Q&A or understands math. They want to know how well it performs in process XYZ, and what it costs to run it at scale.
Public benchmarks are great for the academic community, which requires maximal transparency, yet in industry we cannot always share our data for business, legal or ethical reasons. The only way forward is therefore to create our own private benchmarks that accurately represent the business KPIs we want to optimize in our processes.
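As a rough illustration of what such a private benchmark could look like in practice, the sketch below scores candidate models on an internal, labelled dataset and reports accuracy together with a crude cost proxy. The file name, the model names and the call_model helper are hypothetical; a real setup would plug in the company's own data, KPIs and model gateway.

```python
# Minimal sketch of a private benchmark, assuming an internal labelled dataset
# (e.g. intent labels for customer messages) that never leaves the company.
import json
from statistics import mean

def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for the provider SDK or internal gateway."""
    raise NotImplementedError("plug in your own model access here")

def evaluate(model_name: str, dataset: list[dict]) -> dict:
    """Score one model on the private test set: accuracy plus a cost proxy."""
    correct, prompt_chars = [], []
    for example in dataset:
        prediction = call_model(model_name, example["prompt"]).strip().lower()
        correct.append(prediction == example["label"].lower())
        prompt_chars.append(len(example["prompt"]))
    return {
        "model": model_name,
        "accuracy": mean(correct),
        "avg_prompt_chars": mean(prompt_chars),  # rough proxy for token cost
    }

if __name__ == "__main__":
    # Internal evaluation data in JSONL format; it is never published.
    with open("private_benchmark.jsonl") as f:
        dataset = [json.loads(line) for line in f]
    # Candidate models to track over time as new releases appear.
    for model in ["model-a", "model-b"]:
        print(evaluate(model, dataset))
```

Because the same script can be re-run whenever a new model is released, it turns "which model should we use?" into a routine measurement rather than a guess based on public leaderboards.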
A potential downside with public benchmarks is the risk of data leakage: 1) when the benchmark creators do not realize that the benchmark data already resides in one of the large open-source text corpora, so the same data unintentionally ends up in both the training and the test set, and 2) when model developers carelessly use every open dataset they can get their hands on to pre-train or fine-tune their model.
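One simple way to probe for the first kind of leakage, assuming you have access to (a sample of) the relevant text corpus, is an n-gram overlap check between benchmark test items and corpus documents. The sketch below is illustrative only; the n-gram size, tokenization and threshold are assumptions, not a standard recipe.

```python
# Minimal contamination check: flag test items whose n-grams overlap heavily
# with documents in a reference corpus, suggesting possible data leakage.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word n-grams of a text, lowercased, as a set for fast intersection."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the test item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

def flag_leaked_items(test_set: list[str], corpus: list[str],
                      threshold: float = 0.5) -> list[int]:
    """Indices of test items that overlap heavily with any corpus document."""
    return [
        i for i, item in enumerate(test_set)
        if any(overlap_ratio(item, doc) >= threshold for doc in corpus)
    ]
```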
On the other hand, private benchmarks can be criticized for closedness and an unwillingness to contribute to the open-source and/or scientific communities. This is a nontrivial problem to solve if the scientific community is to work, for the common good, on problems that are actually experienced in industry. Fortunately, this area has gained traction, and this recent research from Microsoft presents the TRUCE framework, which attempts to solve the issue.
In Telenor GCTO R&I this is also one of our focus areas, and we look forward to sharing our work with you in the future.