Google Cloud has introduced a new service inside its AI platform Vertex AI to help enterprise users ascertain whether a large language model (LLM) is suitable for a particular use case.

The service, christened the generative AI evaluation service, can also help curb hallucinations, the company wrote in a blog post.

Hallucinations are faulty responses or outputs that an LLM is prone to generating when an input grows in complexity and the model is not grounded in the data it is being asked about.

Retrieval augmented generation (RAG), fine-tuning, and prompt engineering are a few ways to address hallucinations. RAG, for example, grounds an LLM by feeding the model facts from an external knowledge source or repository to improve the response to a particular query.

How does the gen AI evaluation service help enterprises?

The gen AI evaluation service, according to Nenshad Bardoliwalla, director of product management at Vertex AI, provides two key sets of functionalities for enterprises working on gen AI use cases — “Pointwise and Pairwise.”

“The Pointwise evaluation helps users to understand how well the model works for their specific use case,” Bardoliwalla said, adding that enterprise users can either provide a ground truth dataset that reflects the ideal outputs of the LLM, or they can use Gemini models to judge the quality of the outputs.
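To make the ground-truth flavor of Pointwise evaluation concrete, here is a minimal sketch in plain Python that scores model outputs against ideal reference answers. The dataset fields and the token-overlap F1 metric are illustrative assumptions, not the metrics the Vertex AI service actually exposes.

```python
# Illustrative sketch of ground-truth-based pointwise evaluation.
# The dataset fields and token-level F1 metric are assumptions for
# demonstration; the Vertex AI service defines its own metrics.

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model output and the ideal output."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A tiny ground-truth dataset: ideal outputs the enterprise expects.
eval_set = [
    {"prompt": "Summarize our refund policy.",
     "reference": "Refunds are issued within 30 days of purchase.",
     "model_output": "We issue refunds within 30 days of purchase."},
]

scores = [token_f1(row["model_output"], row["reference"]) for row in eval_set]
print(f"Mean pointwise score: {sum(scores) / len(scores):.2f}")
```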

As part of the Pointwise evaluation functionality, Google Cloud offers a rapid as well as a pipeline mode option.

The rapid mode, according to the company, is meant to allow users to hone the quality of their prompts through a real-time, interactive workflow. It lets enterprise users change their prompts and get a sense of the effect of those changes, it added.
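That interactive loop can be pictured, under heavy simplification, as trying prompt variants and scoring each one. In the sketch below, the `call_model` stub and the keyword-based quality check are hypothetical placeholders, not part of Google's workflow.

```python
# Sketch of rapid, interactive prompt iteration: try prompt variants,
# score each one, compare. `call_model` is a hypothetical stand-in for
# a real LLM endpoint; replace it with your provider's client call.

def call_model(prompt: str) -> str:
    return "stubbed model response"  # placeholder output

def quality_score(output: str, must_mention: list[str]) -> float:
    """Toy check: fraction of required terms the output actually mentions."""
    hits = sum(term.lower() in output.lower() for term in must_mention)
    return hits / len(must_mention)

prompt_variants = [
    "Summarize the refund policy.",
    "Summarize the refund policy in one sentence, citing the 30-day window.",
]
required_terms = ["refund", "30 days"]

for prompt in prompt_variants:
    output = call_model(prompt)
    print(f"{quality_score(output, required_terms):.2f}  <-  {prompt!r}")
```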

The pipeline mode, by contrast, allows users to run a more thorough evaluation using a larger ground truth dataset, or to ask the autorater to look at many more examples, according to Bardoliwalla.

The autorater is a proprietary LLM, such as the latest Gemini or PaLM models, the company said.

Explaining autoraters further, Eric Johnson, director of the technology practice at West Monroe, said that the LLMs used in autoraters can assess models without needing ground truth, and that these evaluations include confidence scores and explanations, making the evaluation process more helpful for enterprises.

The operating rationale of an autorater, according to Bob Sutor, practice lead of emerging technologies at The Futurum Group, is that it attempts to mimic human evaluation of LLM results.

“In turn, Google uses people to tune the autorater to produce realistic results,” Sutor explained.
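The general LLM-as-judge pattern an autorater relies on can be sketched as follows; the grading rubric, JSON schema, and `call_judge_model` stub are illustrative assumptions rather than Google's actual implementation.

```python
import json

# Sketch of the LLM-as-judge pattern behind an autorater: a grading model
# is asked to score another model's output against a rubric and return an
# explanation and confidence. The rubric, JSON schema, and stubbed judge
# call are illustrative assumptions, not Google's design.

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent) for accuracy and helpfulness.
Respond with JSON: {{"score": <int>, "confidence": <0-1 float>, "explanation": "<why>"}}"""

def call_judge_model(prompt: str) -> str:
    # Placeholder for a real call to Gemini, PaLM, or another judge model.
    return '{"score": 4, "confidence": 0.8, "explanation": "Mostly accurate."}'

def autorate(question: str, answer: str) -> dict:
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)

verdict = autorate("What is our refund window?", "Refunds within 30 days.")
print(verdict["score"], verdict["confidence"], verdict["explanation"])
```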

The pipeline mode can be used for foundation model selection, as a step in prompt engineering or fine-tuning workflows, or as a final check before deploying an updated prompt, according to the product manager.

Comparing two models via the service’s Pairwise Evaluation

Pairwise Evaluation, on the other hand, is meant to help users compare two models against each other, Google wrote in the service’s technical documentation.

“We offer both an autorater-based approach and a ground-truth-based approach. Similar to Pointwise Evaluation, here we also offer a rapid and pipeline option to support different use cases. The pipeline mode inside Pairwise Evaluation has been branded as Auto SxS,” Bardoliwalla said.
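Conceptually, a side-by-side (pairwise) comparison boils down to a rater picking the better of two outputs for each prompt, with the results aggregated into win rates. The sketch below illustrates that tally; the `prefer` function is a toy stand-in for an autorater or human judgment.

```python
from collections import Counter

# Sketch of a side-by-side (pairwise) comparison: for each prompt, a rater
# picks the better of two model outputs, and results are aggregated into
# win rates. `prefer` is a stub for an autorater or human rating.

def prefer(output_a: str, output_b: str) -> str:
    # Toy preference: favor the more concise answer. A real autorater
    # would use an LLM judge or a human rating instead.
    return "A" if len(output_a) <= len(output_b) else "B"

comparisons = [
    ("Refunds within 30 days.", "You can get a refund if it is within 30 days of buying."),
    ("Ships in 2-3 business days.", "Orders ship in two to three business days."),
]

tally = Counter(prefer(a, b) for a, b in comparisons)
total = sum(tally.values())
print(f"Model A win rate: {tally['A'] / total:.0%}, Model B win rate: {tally['B'] / total:.0%}")
```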

In addition, the product manager said that for rapid evaluation, across both functionalities, enterprises can use Gemini 1.5 Pro as an autorater to evaluate all Google models as well as their tuned versions.

For pipeline mode, across both functionalities, the company supports both PaLM and Gemini as autoraters.

“Fine-tuned versions of Gemini and PaLM stored in the Vertex AI Model Registry can be compared as well,” the product manager added.  

These evaluations, according to analysts, can help enterprises avoid costly business errors.

One example of an untested or unevaluated model creating issues for an enterprise, according to Sutor, is Chevrolet’s gen AI chatbot that recommended a Ford pickup truck to one of its users.

Further, the company said that the long-term objective of the gen AI evaluation service is to support evaluation through the entire gen AI development lifecycle, ranging from foundation model selection to customization (prompt engineering, tuning, and distillation) and CI/CD.

However, Bardoliwalla pointed out that the gen AI evaluation service should not be confused with the model evaluation service within Vertex AI.

The model evaluation service, according to the company, is targeted at helping enterprise users evaluate predictive AI models.

“This service provides a simple way for users to take their custom-trained predictive models (i.e. Classifiers, Regressions, etc.) and calculate quality metrics against a user-provided ‘ground truth’ dataset,” Bardoliwalla said.
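For contrast, the kind of ground-truth quality metrics a predictive model evaluation computes can be illustrated with scikit-learn; this generic example is not the Vertex AI model evaluation service itself, just the type of calculation it automates.

```python
# Generic illustration of evaluating predictive models against a
# ground-truth dataset, in the spirit of classifier/regression metrics;
# this uses scikit-learn and is not the Vertex AI service itself.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classifier: predicted labels vs. ground-truth labels.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print("accuracy:", accuracy_score(y_true_cls, y_pred_cls))
print("f1:", f1_score(y_true_cls, y_pred_cls))

# Regression: predicted values vs. ground-truth values.
y_true_reg = [3.0, 5.5, 2.1]
y_pred_reg = [2.8, 5.0, 2.4]
print("mse:", mean_squared_error(y_true_reg, y_pred_reg))
```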

A hot market for gen AI evaluation services

Technology giants are rapidly building tools to offer gen AI evaluation services. Google's rivals, particularly fellow hyperscalers such as AWS and Microsoft, offer similar model evaluation tools as part of their gen AI and machine learning services, such as Amazon Bedrock and Azure AI Studio.

In April this year, AWS made the model evaluation capability inside Amazon Bedrock generally available.

The model evaluation capability allows enterprises to choose an automatic or human method to check metrics such as accuracy, robustness, and toxicity, as well as custom criteria such as adherence to brand voice.

While the automatic process is completed by the foundation models available inside the service, the human evaluation can either be conducted by an internal enterprise team or an AWS-managed team, the cloud service provider wrote in a blog post.

Enterprises also have the option to create and manage model evaluation jobs programmatically.

As part of its machine learning service, Amazon SageMaker, AWS offers two capabilities — SageMaker Model Monitor and SageMaker Clarify, which can be used for model and data drift monitoring and machine learning bias detection respectively.

As part of Clarify, AWS offers enterprises a feature, dubbed FMEval, an open-source LLM evaluation library that helps data scientists and ML engineers evaluate LLMs before deciding to use one for a specific use case.

“FMEval provides the ability to perform evaluations for both LLM model endpoints or the endpoint for a generative AI service as a whole. FMEval helps in measuring evaluation dimensions such as accuracy, robustness, bias, toxicity, and factual knowledge for any LLM,” the cloud service provider wrote in a blog post.

Enterprises can use FMEval to evaluate LLMs hosted on either AWS or third-party platforms, such as ChatGPT, HuggingFace, and LangChain, it added.
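The general pattern such a library follows — wrap a model endpoint behind a small runner and score its outputs along several dimensions — can be sketched generically. The class and function names below are illustrative assumptions and do not reproduce FMEval's actual API.

```python
# Generic sketch of the evaluation pattern FMEval embodies: wrap any model
# endpoint behind a small runner interface and score its outputs along
# several dimensions. Names here are illustrative assumptions and do not
# reproduce FMEval's actual API.

class EndpointRunner:
    """Stand-in for a call to an LLM endpoint on AWS or a third-party platform."""
    def predict(self, prompt: str) -> str:
        return "stubbed endpoint response"

def score_accuracy(output: str, reference: str) -> float:
    return 1.0 if reference.lower() in output.lower() else 0.0

def score_toxicity(output: str) -> float:
    # Toy keyword check; real toxicity scoring uses a trained classifier.
    return 1.0 if any(w in output.lower() for w in ("hate", "stupid")) else 0.0

runner = EndpointRunner()
dataset = [{"prompt": "What is the capital of France?", "reference": "Paris"}]

for row in dataset:
    output = runner.predict(row["prompt"])
    print({"accuracy": score_accuracy(output, row["reference"]),
           "toxicity": score_toxicity(output)})
```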

Microsoft, too, offers similar capabilities as part of its Azure AI Studio service. The service, according to the company, offers model benchmarking as a feature that allows enterprise users to test a model against metrics such as accuracy before using it for a specific use case.