AWS-Designed Inferentia Chips Boost Alexa Performance
Almost two years after unveiling its Inferentia high-performance machine learning inference chips, Amazon has nearly completed migrating the bulk of its Alexa text-to-speech ML inference workloads to Inf1 instances on its AWS EC2 platform. The move to infrastructure powered by the newer chips was made to capture significant cost savings and performance gains over the GPU-based EC2 instances the workloads formerly ran on, according to the company.
Running on Inf1 instances, the Alexa text-to-speech workloads see 25% lower end-to-end latency and 30% lower operating costs compared with the previously used GPU-based instances, Amazon announced on Thursday (Nov. 12). The lower latency also allows Alexa engineers to use more complex algorithms to improve the Alexa experience for users.
The Amazon Inferentia chip design was unveiled at the AWS re:Invent conference in November 2018, while Inf1 instances followed a year later with an announcement at re:Invent in 2019.
The Inferentia chips allow Inf1 instances to provide up to 30% higher throughput and up to 45% lower cost per inference compared to GPU-based G4 instances, which previously were the most economical AWS cloud instances for ML inference, according to Amazon.
“While training rightfully receives a lot of attention, inference actually accounts for the majority of complexity and the cost – for every dollar spent on training, up to nine are spent on inference – of running machine learning in production, which can limit much broader usage and stall innovation,” an AWS spokesperson told EnterpriseAI. “EC2 instances – virtual servers running in the AWS cloud – offer a broad range of instance types powered by a variety of processors to meet specific workload requirements. AWS offers EC2 instances based on the fastest processors from Intel, cost optimized instances with AMD processors, the most powerful GPU instances from Nvidia, as well as instances based on AWS Inferentia chips, a high performance machine learning inference chip designed by AWS.”
While the lower latency benefits Alexa engineers in their text-to-speech work, customers across a diverse set of industries are seeing similar benefits as they turn to machine learning to address common use cases such as personalized shopping recommendations, fraud detection in financial transactions, customer engagement via chatbots and more, the AWS spokesperson said. “Many of these customers are evolving their use of machine learning from running experiments to scaling up production machine learning workloads where performance and efficiency really matter. Customers want high performance for their machine learning applications in order to deliver the best possible end user experience.”
That’s where the Inf1 instances provide answers, the spokesperson added. “With Inf1 instances, customers can run large scale machine learning inference to perform tasks like image recognition, speech recognition, natural language processing, personalization, and fraud detection at the lowest cost in the cloud.”
Inferencing is the science and practice of making predictions using a trained ML model. The Inferentia chips were developed by Israel-based Annapurna Labs, which Amazon acquired in 2015. Each Inferentia chip provides hundreds of TOPS (tera operations per second) of inference throughput, and the chips can be banded together to harness thousands of TOPS.
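As a generic illustration of that practice – not Alexa's actual models or any AWS-specific tooling – the sketch below runs a single prediction through an off-the-shelf pretrained image classifier in PyTorch; the model and input are placeholders chosen only to show the pattern of inference.

```python
# A minimal, generic illustration of ML inference: one forward pass through an
# already-trained model. This is not Alexa's workload; the model and input are
# placeholders used only to show the pattern.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained classifier
model.eval()                              # switch to inference behavior

dummy_input = torch.rand(1, 3, 224, 224)  # one fake 224x224 RGB image
with torch.no_grad():                     # gradients are not needed at inference time
    logits = model(dummy_input)

print("predicted class index:", logits.argmax(dim=1).item())
```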
The Inferentia chips are designed for larger workloads that consume entire GPUs or require lower latency. The chips provide hundreds of TOPS each and thousands of TOPS per Amazon EC2 Inf1 instance, and they support multiple frameworks, including TensorFlow, Apache MXNet and PyTorch, as well as multiple data types, including INT8 and mixed-precision FP16 and bfloat16. The Inferentia chips complement Amazon Elastic Inference, which lets customers avoid running workloads on whole Amazon EC2 P2 or P3 instances when utilization is relatively low. Instead, developers can run on a smaller, general-purpose Amazon EC2 instance and use Elastic Inference to provision the right amount of GPU performance, resulting in up to 75% cost savings.
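For context, below is a rough sketch of how a PyTorch model is typically prepared for Inf1 using the AWS Neuron SDK's torch-neuron integration: the model is traced and compiled ahead of time, then the compiled artifact is loaded for serving on an Inf1 instance. Package and API details vary by Neuron release, so treat the exact calls as an approximation of the workflow rather than a canonical recipe.

```python
# Rough sketch: compiling a PyTorch model for Inf1 with the AWS Neuron SDK
# (torch-neuron). API details vary by Neuron release; this approximates the
# workflow rather than prescribing it.
import torch
import torch_neuron  # registers the torch.neuron namespace (requires the Neuron SDK)
from torchvision import models

model = models.resnet50(pretrained=True)   # stand-in model; Alexa's TTS models are not public
model.eval()

example = torch.zeros(1, 3, 224, 224)      # example input used for tracing
neuron_model = torch.neuron.trace(model, example_inputs=[example])  # compile for Inferentia
neuron_model.save("resnet50_neuron.pt")    # load this artifact on an Inf1 instance for serving
```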
Rob Enderle, principal research analyst at Enderle Group, told EnterpriseAI that Inf1 instances could similarly help other enterprise customers as well. “It is a more focused part so think better, faster, and cheaper,” said Enderle. “Focused inference parts generally scale better and have lower overhead, depending on loading. The related digital assistant should be more responsive and less likely to get the command or question wrong.”
Amazon Alexa’s cloud-based voice service powers Amazon Echo devices and is connected to more than 100 million devices, according to Amazon. Alexa has three main inference workloads – Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and Text-to-Speech (TTS) – with the TTS workloads initially running on GPU-based instances. As a user talks to a speech-enabled computer or voice-based service such as Alexa, the ASR phase converts the audio signal into text, the NLU stage interprets the question and generates a smart response, and the TTS phase converts the text into speech signals to generate audio for the user. The Alexa team decided to move to the Inf1 instances as quickly as possible to improve the customer experience and reduce the service’s compute cost, according to the company.
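To make that flow concrete, here is a schematic sketch of the three-stage pipeline described above. The function names and bodies are hypothetical placeholders, not Amazon code; in production each stage is a large neural model served at scale, and only the TTS stage corresponds to the workload Amazon says it moved to Inf1.

```python
# Schematic of the ASR -> NLU -> TTS flow described above. All functions are
# hypothetical placeholders standing in for large production models.
def recognize_speech(audio: bytes) -> str:
    """ASR: convert the audio signal into text."""
    return "what's the weather today"                  # placeholder transcription

def understand(text: str) -> str:
    """NLU: interpret the question and generate a response."""
    return "Today will be sunny with a high of 70."    # placeholder response

def synthesize_speech(response_text: str) -> bytes:
    """TTS: convert the response text into speech audio (the Inf1-hosted stage)."""
    return response_text.encode("utf-8")               # placeholder "audio" bytes

def handle_utterance(audio: bytes) -> bytes:
    text = recognize_speech(audio)              # stage 1: ASR
    response_text = understand(text)            # stage 2: NLU
    return synthesize_speech(response_text)     # stage 3: TTS

print(handle_utterance(b"\x00\x01"))  # demo call with dummy audio bytes
```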