Large language models (LLMs) like the OpenAI models used by Azure are general-purpose tools for building many different types of generative AI-powered applications, from chatbots to agent-powered workflows. Much of the work needed to get the most out of these off-the-shelf tools requires prompt engineering, or crafting the prompts used to structure responses.

However, there are limits to prompt engineering. You need to deliver the same prompt every time, along with user requests and all their associated data. This can be a significant issue if you’re pushing the maximum size of a model’s context window, and it can cost you more each time, as services like Azure OpenAI are paid for on a per-token basis. At the same time, complex requests can add latency to operations.

What’s needed is a way to focus the underlying model and tune it to work with your data. This is where Azure AI Foundry comes in. Azure AI Foundry offers a framework that helps you fine-tune big, complex models with Low-rank Adaptation (LoRA). By using this technique to adjust model parameters, when you run AI inference workloads, higher quality results can be delivered using fewer tokens, with less risk of prompt overruns and other issues that can cause incorrect or incoherent results.

Fine-tuning AI in Azure

Microsoft recently announced Azure AI Foundry as a way to manage and consume Azure-hosted AI models. As well as providing tools for testing and experimenting with models, it’s where you can tune models ready for use in your code. It offers two ways to work with models. First, the Hub/Project view works with models from across its library of providers, including Meta’s Llama and various Hugging Face models. Alternatively, you can use the Azure OpenAI tools, which add alternate tuning mechanisms but only work with OpenAI models.

Getting started with the Azure OpenAI fine-tuning requires a supported Azure OpenAI model in a region that allows fine-tuning. You will need an account with the Cognitive Services OpenAI Contributor role to upload training data. Many of the Azure OpenAI models support fine-tuning, including GPT 3.5 and GPT 4o. Region support is limited; your best chance is North Central US or Sweden Central. It’s important to be aware of the maximum number of tokens available for fine-tuning, as this can differ between models.

The basic workflow of a fine-tuning session is easy to understand. You’ll first need to source and prepare both your training and validation data before using the Create custom model wizard in Azure AI Foundry. The wizard will walk you through the basic steps of fine-tuning, uploading data, and setting task parameters before running a training session.

Formatting your training data

Getting your training data in the right formats is likely to be your biggest issue. Different models require different types of training data. For a GPT 3.5 or 4o model, your data needs to be JSONL formatted for the Azure OpenAI Chat Completions API.

If you’ve not come across JSONL before, it’s a variant of JSON where data is delimited by the new-line character. That makes it possible to build sample JSONL documents containing chat dialogs in your choice of text editor, with each line in the file containing user content, a prompt, and the expected output. A single line of JSONL may have more than one request and response, allowing you to deliver a multi-turn chat session as training data. Training data can include URLs that link to images and other multimedia content to fine-tune multimodal AI models.

If you need help creating your JSONL data, you can use the OpenAI CLI. This installs as a Python tool using pip. Once you’ve installed this, use the openai tools fine_tunes.prepare_data -f command with a local file to build a JSONL-formatted file from CSV, Excel, or JSON data. Despite not being integrated with the Azure Portal and Azure AI Foundry, it can simplify the process of preparing training data.

The more training data you have, the better. The system supports as few as 10 examples, but in practice, this is far too little. If you think you have enough training data, you’re probably wrong; Microsoft recommends having thousands of examples to effectively tune your data. That’s why tools like OpenAI’s preparation script are critical to the process; they let you generate fine-tuning training sets from large data sets like those in Microsoft Fabric or in Dataverse, using Excel’s query tools to prepare and format inputs and outputs, ready for conversion.

It’s important to have the highest-quality examples in your training set. Poor-quality data can negatively train the model, reduce accuracy, and increase errors. Building a high-quality, clean data set will take time and will require both data science and subject matter expertise to construct the necessary prompts and their expected answers.

Fine-tuning with Azure AI Foundry tools

Once you have all the prerequisites in place you can use the Create custom model tool in Azure AI Foundry to start the fine-tuning process. This can be found in the Fine-tuning section of the portal, and it will walk you through the process of tuning a base model. You should have already chosen the model you’re planning to tune, so select it from the list of available models.

Next, you need to select the training data. This can be data already stored in Azure AI Foundry or it can be uploaded as part of the process. Existing data will be stored in the Azure OpenAI Connection. Data can be imported from local files or from Azure Storage (which is likely if you’ve used Fabric to build your training data). It’s a good idea to upload on-premises training data to a Blob in advance of training, as it reduces the risk of upload issues when using Azure AI Studio forms.

The service offers the option of uploading validation data, which is formatted in a similar manner to the JSONL training data. Validation data can be useful but it’s not necessary, and if you haven’t created a suitable data set you can skip this stage.

You can now add tuning parameters. These will be used by LoRA to define the training process, from batch sizes and learning rates to the number of cycles through the training data. Other parameters include controls for the amount of drift and the reproducibility of a training job. You can choose your own values or let the process run on its defaults. The defaults will vary from run to run, based on an analysis of your training data.

You’re now ready to start training your fine-tuned model. This is a batch process, and as it requires significant resources, your job may be queued for some time. Once accepted, a run can take several hours, especially if you are working with a large, complex model and a large training data set. Azure AI Foundry’s tools allow you to see the status of a fine-tuning job, showing results, events, and the hyperparameters used.

Each pass through the training data produces a checkpoint. This is a usable version of the model with the current state of tuning so you can evaluate them with your code before the fine-tuning job completes. You will always have access to the last three outputs so you can compare different versions before deploying your final choice.

Ensuring fine-tuned models are safe

Microsoft’s own AI safety rules apply to your fine-tuned model. It is not made public until you explicitly choose to publish it, with test and evaluation in private workspaces. At the same time, your training data stays private and is not stored alongside the model, reducing the risk of confidential data leaking through prompt attacks. Microsoft will scan training data before it’s used to ensure that it doesn’t have harmful content, and will abort a job before it runs if it finds unacceptable content.

The same AI safety tool is run over your fine-tuned model once a tuning cycle has completed. A chatbot uses various tailored prompts and other attacks to try to generate harmful outputs. If your model fails, it won’t be made available, even in an evaluation sandbox.

Using fine-tuned models

Once a model has been tuned and tested, you can find it in your Azure AI Foundry portal, ready for deployment as a standard AI endpoint, using familiar Azure AI APIs and SDKs. A tuned model can only be deployed once, and if it’s not used for 15 days it will be removed and will need to be redeployed. Deployed models can run in any region that supports fine-tuning, not only the one used to train the model.

Usefully Microsoft supports the option of continuous fine-tuning, treating your existing tunings as a base model and then running the same process using new training data, perhaps based on user prompts and expected outputs rather than the actual responses generated by those inputs.

More open-ended use cases can take advantage of the preview of a new technique: Direct preference optimization. DPO uses human preferences to manage tuning, with training data provided as sample conversations with “preferred” and “non-preferred” outputs. These can be based on earlier user conversations from your logs, where outputs were not what you wanted to present to users.

Costs for fine-tuning a model vary depending on the region and the model. If you’re tuning a GPT-4o model in North Central US, expect to pay $27.50 for 1 million training tokens and $1.70 per hour to host the model once training is complete. A token is roughly equivalent to a syllable, so each word in your training set will cost an average of 2 to 3 tokens. Once the model is deployed, inferencing is priced at $2.75 per million input tokens and $11 per million output tokens. It’s worth working through a cost/benefit analysis of using a tuned model, though it can be hard to put a number on the costs associated with errors or reputational damage.

Fine-tuning in Azure AI Foundry helps you get the best output from models that have been trained on general data, allowing you to focus them on specific responses. It’s not a complete fix for unexpected outputs, but when used alongside retrieval-augmented generation (RAG) and other techniques for grounding LLM operations, it should significantly reduce risk and increase accuracy. Without training an LLM on your own custom data set, along with all the compute and resource requirements that come with that, it’s as probably as good as we’ll get for now.