Modern AI is about a lot more than chatbots, as shown by Microsoft’s Ignite 2024 pivot to using its stable of large and small language models to power autonomous agents. Much of the company’s focus was on using productivity tools and software-generated events to trigger AI-orchestrated workflows, but it also touched on the importance of multimodal inputs as a way of extending modern AI beyond keyboard and voice, out into the wider world.

It wasn’t a surprising move. Microsoft’s original in-house Azure Cognitive Services was built around a series of models focused on computer vision and audio processing. The company even used them as the basis of its Azure Percept industrial AI sensors and to deliver AI-ready camera hardware for developers.

Understand the world with AI

Much of Cognitive Services is intended to provide AI-powered understanding of the world, using computer vision to categorize objects in images and video, and audio processing to isolate significant events. The tools also support speech recognition and transcription, alongside optical character recognition, making multimedia content computer-readable. Simple APIs offer RESTful asynchronous connections to service endpoints, along with the tools needed to customize and tune models.

You can obviously use Cognitive Services as part of an agentic workflow, as an input to a framework like Semantic Kernel or wrapped as a Copilot Studio connector. Now there’s a single service that builds on newer models to support content inputs from documents, images, video, and audio. There’s no need to build prompts; the Azure AI Content Understanding service comes ready for use.

Add multimodal input processing to agent workflows

Azure AI Content Understanding gives you one place to process diverse inputs, delivering output in a standard format that’s ready for an agent’s workflow. Outputs can give your application an understanding of user intent, backed by a strongly typed schema that quickly gets data into a format ready for your code.

Perhaps the biggest value of a tool like this is its ability to convert unstructured data into structured, strongly typed information, with additional insights that help you take full advantage of the data. For example, when processing a conversation or a meeting, content is broken into logical sections and tagged by speaker.
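To make that concrete, here’s an illustrative sketch of what a speaker-tagged meeting result might look like in code. The field names and structure are assumptions for illustration, not the service’s documented output schema:

```python
# Illustrative shape of a speaker-tagged conversation result.
# All field names here are hypothetical, not documented schema.
meeting_result = {
    "sections": [
        {
            "speaker": "Speaker 1",
            "startTime": "00:00:12",
            "endTime": "00:01:05",
            "transcript": "Let's review the Q3 deliverables...",
            "topic": "Project status",
        },
        {
            "speaker": "Speaker 2",
            "startTime": "00:01:05",
            "endTime": "00:02:30",
            "transcript": "The API migration is on track...",
            "topic": "Engineering update",
        },
    ]
}
```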

Under the hood, the public preview of Azure AI Content Understanding is built on a set of generative AI models that accept multimodal inputs, with tools that let you define the outputs. It uses a template model that works in Azure AI Foundry: you can use prebuilt templates from Microsoft or build your own from scratch. The service automatically selects the right model for your input and emits structured content ready for use in an agent workflow.

An agentic AI workflow can use the service at any point in its operation. It can ingest a meeting recording, analyze its content, and then trigger actions across Microsoft 365, putting summaries and the transcript in Microsoft SharePoint, extracting action items and adding them to individual and team calendars, even updating deliverables in Microsoft Project. Work that may have taken team members hours can be automated, allowing them to concentrate on project tasks rather than administration.

JSON documents and REST calls

As with most Azure products, getting started is relatively simple; the Azure AI Content Understanding service is part of an Azure AI Services resource. This allows you to use multiple services with the same credentials, helping you keep track of billing and simplifying key and token management.
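Assuming you’ve already created an Azure AI Services resource, a minimal Python setup might look like the following sketch. The environment variable names are placeholders; the Ocp-Apim-Subscription-Key header is the standard way to pass an Azure AI Services resource key, with Microsoft Entra tokens as an alternative:

```python
import os

import requests

# Endpoint and key come from your Azure AI Services resource;
# the environment variable names here are placeholders.
ENDPOINT = os.environ["AZURE_AI_SERVICES_ENDPOINT"].rstrip("/")
API_KEY = os.environ["AZURE_AI_SERVICES_KEY"]

# One set of credentials works across the services in the resource.
HEADERS = {
    "Ocp-Apim-Subscription-Key": API_KEY,
    "Content-Type": "application/json",
}
```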

At the heart of the Azure AI Content Understanding service are analyzer templates: JSON documents that describe and structure the data you want to extract from your inputs, for example, defining the expected fields in common business documents and ensuring they’re correctly typed. If you’re building a template for analyzing a document, you specify the fields you want to extract; for an invoice, you’d include the vendor, the invoice number, a list of items and prices, and a total.
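As a sketch, an invoice analyzer template might look something like the following Python dictionary, ready to be serialized as JSON. The schema keys shown here are illustrative; check the Content Understanding documentation for the exact template format:

```python
# A sketch of an invoice analyzer template. The exact schema keys
# are assumptions; verify them against the documented format.
invoice_template = {
    "description": "Extract key fields from invoices",
    "scenario": "document",
    "fieldSchema": {
        "fields": {
            "VendorName": {
                "type": "string",
                "description": "Name of the vendor issuing the invoice",
            },
            "InvoiceNumber": {
                "type": "string",
                "description": "Unique invoice identifier",
            },
            "Items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "Description": {"type": "string"},
                        "Price": {"type": "number"},
                    },
                },
            },
            "Total": {
                "type": "number",
                "description": "Invoice total including tax",
            },
        }
    },
}
```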

There’s no need to label fields in sample documents; the underlying model has been trained on many different document types. Once your analyzer template has been uploaded, all you need to do is deliver your content and parse the JSON response.
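Continuing the earlier setup sketch, uploading the template could look like this. The URL path and api-version string reflect the preview REST surface and should be verified against the current API reference:

```python
import requests  # reuses ENDPOINT, HEADERS, invoice_template from above

ANALYZER_ID = "my-invoice-analyzer"  # hypothetical analyzer name

# Create (or update) the analyzer by PUTting the template.
resp = requests.put(
    f"{ENDPOINT}/contentunderstanding/analyzers/{ANALYZER_ID}",
    params={"api-version": "2024-12-01-preview"},
    headers=HEADERS,
    json=invoice_template,
)
resp.raise_for_status()  # creation is asynchronous; poll if you need to block
```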

You can build an analyzer yourself, using tools like curl to upload the template or constructing HTTP requests in Postman, but it’s a lot easier to work with the tools built into Azure AI Foundry. One necessary feature is still missing: a cross-language SDK. For now, if you’re writing code to work with an Azure AI Content Understanding endpoint, you need to be comfortable building and managing REST calls, wrapping them in your own methods.
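Such a wrapper might look like this sketch, which reuses the endpoint, headers, and analyzer ID defined earlier. The :analyze path and the Operation-Location polling pattern follow common Azure REST conventions, but treat both as assumptions to verify:

```python
import time

import requests


def analyze(file_url: str, analyzer_id: str = ANALYZER_ID) -> dict:
    """Submit content by URL and poll until analysis completes.

    A minimal wrapper of the kind a future SDK might provide; verify
    the path, api-version, and status values against the API docs.
    """
    resp = requests.post(
        f"{ENDPOINT}/contentunderstanding/analyzers/{analyzer_id}:analyze",
        params={"api-version": "2024-12-01-preview"},
        headers=HEADERS,
        json={"url": file_url},
    )
    resp.raise_for_status()
    operation_url = resp.headers["Operation-Location"]

    # Results arrive asynchronously, so poll the operation URL.
    while True:
        poll = requests.get(operation_url, headers=HEADERS)
        poll.raise_for_status()
        body = poll.json()
        if body.get("status") in ("Succeeded", "Failed"):
            return body
        time.sleep(2)
```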

Building a content analyzer in Azure AI Foundry

Here you start with a sample of the content you want to analyze. Upload the sample to Azure AI Foundry, and the service will suggest templates from its library based on your document. Choose the most appropriate one and edit it to add your own fields and types; it’s a good idea to add descriptions to your edited schema to help with debugging and to support other developers. Once you’ve saved your customized schema, you can test the analyzer against a selection of sample documents. When you’re satisfied, Azure AI Foundry builds your analyzer and generates endpoint URLs to add to your code.

The sample templates are split across the four content categories: text, image, audio, and video. Some, like retail inventory management or media asset management, are industry-specific, and Microsoft will likely add more as different use cases emerge. If you’ve used any of the Azure Cognitive Services in the past, you should find this a lot easier to use, with support for more complex documents and other content.

Each analyzer is a pipeline in its own right, processing inputs, extracting content, and then providing insights as well as application-ready information. There’s more to the process than basic recognition, and the document analyzer add-on tools offer more features, including the ability to recognize and process barcodes and mathematical formulas. The service will process handwritten content as well as typed text.
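If you’re defining templates in code, these add-ons would presumably be switched on in the template itself. The config keys below are assumptions for illustration, not documented names:

```python
# Hypothetical add-on switches; confirm the exact key names in the
# service reference before relying on them.
invoice_template["config"] = {
    "enableBarcode": True,   # detect and decode barcodes in scanned pages
    "enableFormula": True,   # extract mathematical formulas from documents
}
```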

Microsoft provides a detailed list of supported document formats and file types, as well as limits on how much data can be processed. If you’re uploading video or audio, you can only process up to four hours at a time, and there’s a limit of 1,000 pages or images for document and image analysis.

Other limits are enforced per file type and depend on whether you include the file to be analyzed in the request or simply provide a URL for where the file is stored. The latter option is likely best for most cases, especially if data is held in Azure Blob Storage, where you can provide the blob address and keep storage and data transfer costs to a minimum.
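Using the analyze() wrapper sketched earlier, a blob-based call is just a URL. The storage account, container path, and SAS token here are placeholders; the URL simply needs to be readable by the service:

```python
# Passing a blob URL keeps the request small and the data inside Azure.
result = analyze(
    "https://mystore.blob.core.windows.net/invoices/inv-1042.pdf?<sas-token>"
)
print(result.get("status"))
```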

Autonomous AI applications need high-quality input data

With services like Azure AI Content Understanding and Azure AI Search, Microsoft is providing the basic framework for building complex AI applications quickly. Here, you’re generating high-quality input data from unstructured, unlabeled content and using it in conjunction with well-defined, searchable grounding data to reduce the risk of erroneous output.

Microsoft has designed Azure AI Content Understanding for use in autonomous systems. Results are tagged with confidence levels that control how an AI agent workflow processes them, triggering alerts when content is hard to identify. Additional templates designed to identify malicious or illegal content can provide a basic content-moderation tool as part of a consumer-facing service.
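A simple confidence gate in Python might look like the following sketch; the 0.8 threshold and the result field names are assumptions to tune for your own schema and risk tolerance:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune to your risk tolerance


def flag_for_human_review(name: str, field: dict) -> None:
    # Stand-in for whatever alerting your workflow uses.
    print(f"REVIEW: {name} (confidence {field.get('confidence')})")


def accept_field(name: str, field: dict) -> None:
    # Stand-in for the downstream workflow step.
    print(f"OK: {name}")


def route(result: dict) -> None:
    """Route extracted fields by confidence; field names are assumptions."""
    for name, field in result.get("fields", {}).items():
        if field.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
            flag_for_human_review(name, field)
        else:
            accept_field(name, field)
```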

Being able to process and manage content is key to improving accuracy and reducing risk, important requirements for any agent-powered workflow. Providing strongly typed data at the start of a workflow, extracted from unstructured sources, will speed up operations and let you mix AI and conventional code, as well as provide inputs to a low-code Copilot Studio agent.

With Azure AI Content Understanding, Microsoft is using a new generation of multimodal AI models to boost familiar Cognitive Services functions. For now, as it’s in preview, the service is free, giving you the opportunity to learn how to take advantage of these new tools in your code.