Facebook parent Meta’s AI research team is developing what it calls a Self-Taught Evaluator for large language models (LLMs), an approach that could help enterprises cut the time and human effort needed to build custom LLMs.
Earlier this month, the social media giant’s AI research team, dubbed Meta FAIR, published a paper on the technique, claiming that these evaluators can let an LLM create its own synthetic training data for evaluation purposes.
Typically, models used as evaluators, known as LLM-as-a-Judge, are trained on large amounts of human-annotated data, which is costly to collect and becomes stale as models improve, the researchers explained in the paper.
Human annotation is still required, or at least preferred, over LLM-generated responses because models cannot yet reliably solve challenging tasks such as coding or mathematics problems, the researchers said, adding that this dependence on human-generated data poses significant challenges for scaling to new tasks or evaluation criteria.
By contrast, the researchers trained their evaluator using only synthetic data generated by an LLM in an iterative process, with no human labeling of instructions.
“Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions,” the researchers wrote.
Explaining further, the researchers said they started with a seed model and used prompt engineering to generate contrasting synthetic preference pairs for a given input, such that one response is designed to be inferior to the other.
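That pair-generation step might look roughly like the following sketch. The `seed_model` function here is a stub standing in for a real instruction-tuned LLM call, and the prompt wording, function names, and canned answers are our own illustration of the idea, not the paper's actual prompts:

```python
# Sketch of generating a contrasting synthetic preference pair: answer the
# original instruction normally, then answer a deliberately perturbed version
# of it, so the second response is inferior by construction.

def seed_model(prompt: str) -> str:
    """Stub LLM: returns a canned answer keyed on the prompt style.
    A real implementation would call an instruction-tuned model here."""
    if "slightly different" in prompt:
        return "Paris is the capital of Germany."  # degraded by design
    return "Paris is the capital of France."

def make_preference_pair(instruction: str) -> tuple[str, str]:
    """Return (chosen, rejected) for one instruction."""
    chosen = seed_model(f"Answer the question: {instruction}")
    # Ask the model to answer a *modified* instruction; its output serves
    # as the lower-quality "rejected" response for the original instruction.
    rejected = seed_model(
        f"Answer a slightly different question than: {instruction}"
    )
    return chosen, rejected

pair = make_preference_pair("What is the capital of France?")
```

Because the rejected response answers a perturbed instruction, no human ever has to label which response is better.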
After that, the researchers used the model as an LLM-as-a-Judge to generate reasoning traces and judgments for these pairs; because each pair was constructed so that one response is better by design, those judgments can be labeled as correct or incorrect automatically.
“After training on this labeled data, we obtain a superior LLM-as-a-Judge, from which we can then iterate the whole process in order for it to self-improve,” they wrote.
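The judge-and-filter step behind that loop can be sketched as follows. The length-biased `judge_stub` and the sample data are invented for illustration; in the paper the judge is the LLM itself, emitting a reasoning trace before its verdict, and the kept examples would fine-tune the judge for the next iteration:

```python
# Sketch of filtering judge outputs against the known (designed) preference.
# Only judgments that pick the designed-better response are kept as training
# data; incorrect judgments are discarded.

def judge_stub(response_a: str, response_b: str) -> str:
    """Pretend judge: prefers the longer response (a common LLM bias)."""
    return "A" if len(response_a) >= len(response_b) else "B"

def collect_training_examples(pairs):
    """Each pair is (chosen, rejected) by construction. Judge each pair in
    both orders and keep only verdicts that agree with the design; the kept
    (response_a, response_b, verdict) records become fine-tuning data."""
    kept = []
    for chosen, rejected in pairs:
        if judge_stub(chosen, rejected) == "A":
            kept.append((chosen, rejected, "A"))
        if judge_stub(rejected, chosen) == "B":
            kept.append((rejected, chosen, "B"))
    return kept

pairs = [
    ("The capital of France is Paris.", "Berlin."),    # judged correctly
    ("42", "The answer is forty-two, probably."),      # length bias fools it
]
data = collect_training_examples(pairs)  # only the first pair survives
```

Training on the surviving records yields a better judge, which then re-filters fresh pairs in the next iteration, which is the self-improvement loop the quote describes.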
As part of their experiments, the researchers at Meta claimed in the paper that without any labeled preference data, the Self-Taught Evaluator improved Llama3-70B-Instruct’s score on the RewardBench benchmarking tool from 75.4 to 88.3.
That score, they said, outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
However, the researchers also pointed out some limitations: they did not test the approach on smaller models (their test models had 70 billion parameters), and they evaluated only accuracy, not computational cost.
“Generative LLM-as-a-Judge models usually have longer outputs and thus higher inference cost than reward models that simply output a score, as LLM-as-a-Judge typically first generates a reasoning chain,” they wrote.
Additionally, because they used a seed model to generate the first synthetic preferences in their iterative training scheme, the approach assumes that the seed model is already capable of generating reasonable evaluations.
“Thus, our approach is limited by having a capable instruction fine-tuned model which is already reasonably aligned to human (or legal/policy) preferences,” they explained.