

Researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT-IBM Watson AI Lab recently proposed Hardware-Aware Transformers (HAT), an AI model training technique built on Google's Transformer architecture. They claim that HAT can achieve a threefold inference speedup on devices like the Raspberry Pi 4 while producing models 3.7 times smaller than a baseline Transformer.

Google's Transformer is widely used in natural language processing (and even some computer vision) tasks because of its cutting-edge performance. Nevertheless, Transformers remain challenging to deploy on edge devices because of their computation cost; on a Raspberry Pi, translating a sentence of only 30 words requires 13 GFLOPs (13 billion floating-point operations) and takes 20 seconds. This obviously limits the architecture's usefulness for developers and companies integrating language AI into mobile apps and services.

The researchers' solution employs neural architecture search (NAS), a method for automating AI model design. HAT searches for edge-device-optimized Transformers by first training a Transformer "supernet," called a SuperTransformer, that contains many weight-sharing sub-Transformers. Because all of the sub-Transformers are trained simultaneously inside the supernet, the performance of each one with its inherited weights serves as a proxy for how the same architecture would rank if trained from scratch. In the last step, HAT conducts an evolutionary search to find the best sub-Transformer under a hardware latency constraint, as sketched below.
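A rough Python sketch of how such a latency-constrained evolutionary search loop might look is shown below. The design space, helper callables, and hyperparameters here are illustrative assumptions for clarity, not the actual HAT implementation (which is available at https://github.com/mit-han-lab/hardware-aware-transformers).

```python
import random

# Hypothetical design space loosely modeled on a SuperTransformer's choices
# (layer counts, embedding width, feed-forward width); values are illustrative.
DESIGN_SPACE = {
    "encoder_layers": [2, 3, 4, 5, 6],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
}

def sample_config():
    # Draw a random sub-Transformer configuration from the design space.
    return {k: random.choice(v) for k, v in DESIGN_SPACE.items()}

def mutate(config, prob=0.3):
    # Re-sample each dimension with some probability.
    return {k: random.choice(DESIGN_SPACE[k]) if random.random() < prob else v
            for k, v in config.items()}

def crossover(a, b):
    # Mix two parent configurations dimension by dimension.
    return {k: random.choice([a[k], b[k]]) for k in a}

def evolutionary_search(predict_latency, eval_loss, latency_limit_ms,
                        population_size=125, num_parents=25, iterations=30):
    """Search for the best sub-Transformer that meets a latency budget.

    predict_latency: config -> estimated latency (ms) on the target hardware
        (HAT uses a per-device latency model; any callable works in this sketch).
    eval_loss: config -> validation loss of the sub-Transformer using weights
        inherited from the trained SuperTransformer, so no retraining is needed.
    """
    def constrained(make):
        # Keep sampling until the candidate satisfies the latency budget.
        while True:
            cfg = make()
            if predict_latency(cfg) <= latency_limit_ms:
                return cfg

    population = [constrained(sample_config) for _ in range(population_size)]
    for _ in range(iterations):
        population.sort(key=eval_loss)
        parents = population[:num_parents]
        children = [constrained(lambda: mutate(random.choice(parents)))
                    for _ in range((population_size - num_parents) // 2)]
        children += [constrained(lambda: crossover(*random.sample(parents, 2)))
                     for _ in range(population_size - num_parents - len(children))]
        population = parents + children

    return min(population, key=eval_loss)  # best sub-Transformer found
```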


To test HAT's efficiency, the coauthors conducted experiments on four machine translation tasks containing between 160,000 and 43 million pairs of training sentences. They ran each model on a Raspberry Pi 4, an Intel Xeon E5-2640 CPU, and an Nvidia Titan Xp graphics card, measuring latency 300 times per model, discarding the fastest and slowest 10% of runs, and averaging the remaining 80%.
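For reference, a trimmed-mean measurement like the one described above can be reproduced with a few lines of Python. The `run_inference` callable here is a hypothetical stand-in (for example, translating a fixed sentence on the target device); the protocol itself, 300 runs with the top and bottom 10% discarded, follows the description in the paragraph above.

```python
import time

def measure_latency(run_inference, runs=300, trim=0.10):
    # Time each inference run with a high-resolution clock.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings.append(time.perf_counter() - start)

    timings.sort()
    cut = int(runs * trim)           # number of runs to drop at each end
    kept = timings[cut:runs - cut]   # the middle 80% of measurements
    return sum(kept) / len(kept)     # trimmed-mean latency in seconds
```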

According to the team, the models identified through HAT not only achieved lower latency than a conventionally trained Transformer across all three hardware platforms, but also scored higher on the popular BLEU language benchmark after 184 to 200 hours of training on a single Nvidia V100 graphics card. Compared with Google's recently proposed Evolved Transformer, one HAT model was 3.6 times smaller with a whopping 12,041 times lower search cost and no loss in performance.

“To enable low-latency inference on resource-constrained hardware platforms, we propose to design [HAT] with neural architecture search,” the coauthors wrote, noting that HAT is available in open source on GitHub. “We hope HAT can open up an avenue towards efficient Transformer deployments for real-world applications.”
