AWS has added a new feature, dubbed cross-region inferencing, to its Amazon Bedrock generative AI service to help developers automatically route inference requests coming into the service during traffic spikes in AI workloads.

Cross-region inferencing, which is now generally available and comes as a no-cost option for developers using on-demand mode in Bedrock, dynamically routes traffic across multiple AWS regions to maintain availability for each request from Bedrock-powered applications and to deliver better performance during high-usage periods.

The on-demand mode in Bedrock lets developers pay for what they use with no long-term commitments. This contrasts with batch mode, in which developers submit a set of prompts as a single input file and receive responses as a single output file, allowing them to get large-scale predictions in a single run.
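For illustration only, an on-demand request might look like the following Python sketch using the AWS SDK for Python (boto3) and the Bedrock Runtime Converse API; the region, model ID, and prompt are placeholder examples rather than values taken from AWS's announcement.

```python
import boto3

# On-demand mode: each request is billed per token used, with no long-term commitment.
# The model ID below is an example; substitute a model your account has access to.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example single-region model ID
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize cross-region inference in one sentence."}],
        }
    ],
)

# The Converse API returns the assistant's reply under output -> message -> content.
print(response["output"]["message"]["content"][0]["text"])
```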

“By opting in, developers no longer have to spend time and effort predicting demand fluctuations,” the company wrote in a blog post.

“Moreover, this capability prioritizes the connected Amazon Bedrock API source/primary region when possible, helping to minimize latency and improve responsiveness. As a result, customers can enhance their applications’ reliability, performance, and efficiency,” it added.

Developers can start using cross-region inferencing through either the APIs or the Amazon Bedrock console to define the primary region and the set of secondary regions to which requests will flow in case of traffic spikes.

As part of the launch of this feature, developers will have the choice to select a US-based model or an EU-based model, each of which will include two to three preset regions from these geographic locations.

Currently, models available for cross-region inferencing include Claude 3.5 Sonnet and the Claude 3 family of large language models (LLMs): Haiku, Sonnet, and Opus.
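To illustrate the API route described above, opting in amounts to addressing requests to a geography-scoped inference profile rather than a single-region model ID. The following sketch assumes a US-scoped profile ID for Claude 3.5 Sonnet; actual profile IDs should be taken from the Bedrock console or documentation, and an EU-scoped profile would follow the same pattern with European regions.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed example: a US-geography inference profile for Claude 3.5 Sonnet.
# Requests addressed to this profile can be served from the primary region or,
# during traffic spikes, rerouted to one of the preset secondary US regions.
cross_region_profile_id = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"

response = bedrock_runtime.converse(
    modelId=cross_region_profile_id,
    messages=[
        {"role": "user", "content": [{"text": "Hello from a cross-region request."}]}
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```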

Latency comes as a caveat

AWS has pointed out that the feature will try to serve a request from the primary region first before moving to a secondary region, and that additional latency is incurred when rerouting happens.

“…in our testing, it has been a double-digit milliseconds latency add,” the company wrote.

To be clear, for cross-region inferencing, developers and enterprises pay the same per-token price for individual models as listed for their primary or source region.

For this feature, AWS said it would not charge enterprise users for data transfer, encryption, network usage, or potential differences in price per million tokens across models.

The cloud services provider further pointed out that enterprises need to be careful about their data residency and privacy requirements.

“Although none of the customer data is stored in either the primary or secondary region(s) when using cross-region inference, it’s important to consider that inference data will be processed and transmitted beyond the primary region,” the company wrote.

Snowflake appears to be the only other LLM service provider to have introduced cross-region inferencing.

Earlier this month, Snowflake made the feature available as part of its AI and ML features.

On the other hand, rival cloud service providers, such as Google Cloud and Microsoft, offer similar features across their database, infrastructure, and older machine learning services.

While Google Cloud offers similar features across its other services such as Cloud Run and BigQuery, Microsoft offers inferencing endpoints as a serverless option via its Azure Machine Learning service.

As part of BigQuery, Google Cloud allows cross-region dataset replication. Similarly, Azure offers enterprises the option to replicate data across cloud regions.