Source: Arm | Author: Arm
With the evolution of artificial intelligence (AI), running AI workloads on embedded devices with small language models (SLMs) has become an industry focus. Small language models such as Llama, Gemma, and Phi-3 have gained widespread recognition for their cost-effectiveness, efficiency, and ease of deployment on compute-constrained devices. Arm expects the number of such models to continue to grow through 2025.
Arm technology, with its advantages of high performance and low power consumption, provides an ideal operating environment for small language models, improving their efficiency and the resulting user experience. To illustrate the potential of endpoint AI in IoT and edge computing, the Arm technology team recently created a technology demonstration: the user enters a sentence, and the system expands it into a children's story. The demonstration was inspired by Microsoft's "TinyStories" paper and Andrej Karpathy's TinyLlama2 project, which used millions of short stories to train small language models to generate text.
The demo is powered by the Arm Ethos-U85 NPU and runs a small language model on embedded hardware. Although large language models (LLMs) are more widely known, small language models are gaining attention because they deliver strong performance with fewer resources at lower cost, and are also easier and cheaper to train.
Implementing a Transformer-based small language model on embedded hardware
Arm's demonstration showed Ethos-U85's ability to run generative AI on a small, low-power platform and highlighted how well small language models can perform in specific domains. The TinyLlama2 model is far simpler than the larger models from Meta and other companies, which makes it well suited to showcasing the AI performance of Ethos-U85 and an ideal choice for endpoint AI workloads.
To develop this demonstration, Arm did extensive modeling work, including creating a fully integer INT8 (and INT8x16) TinyLlama2 model and converting it to a fixed-shape TensorFlow Lite format that fits the constraints of Ethos-U85.
Arm's quantization approach shows that a fully integer language model can strike a good balance between accuracy and output quality. By quantizing the activations, normalization functions, and matrix multiplications, Arm eliminates the need for floating-point operations. Because floating-point arithmetic is costly in terms of chip area and power consumption, this is a key consideration for resource-constrained embedded devices.
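The article does not publish Arm's conversion scripts, so as a minimal sketch of the underlying idea, symmetric per-tensor INT8 quantization of a dot product can be written as follows. All names and the toy values are illustrative, not Arm's actual tooling:

```python
# Minimal sketch of symmetric per-tensor INT8 quantization (illustrative only).
# A float tensor x is mapped to int8 via q = round(x / scale), with
# scale = max(|x|) / 127, so the integer arithmetic needs no floating
# point until one final rescale of the accumulator.

def scale_for(values):
    """Symmetric per-tensor scale: map the largest magnitude to 127."""
    return max(abs(v) for v in values) / 127 or 1.0

def quantize(values, scale):
    """Quantize a list of floats to int8 with the given scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]

# Toy activation row and weight column.
acts = [0.5, -1.2, 0.8]
weights = [0.25, 0.75, -0.5]

s_a, s_w = scale_for(acts), scale_for(weights)
q_a, q_w = quantize(acts, s_a), quantize(weights, s_w)

# Integer-only dot product; the int32 accumulator is rescaled once at the end.
acc = sum(a * w for a, w in zip(q_a, q_w))
result = acc * s_a * s_w

reference = sum(a * w for a, w in zip(acts, weights))
print(f"int8 result: {result:.4f}, float reference: {reference:.4f}")
```

Even in this toy case the integer result stays close to the float reference, which is the property that lets a fully integer model preserve output quality on hardware without a floating-point unit.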
Running the language model on an FPGA platform at 32 MHz, Ethos-U85 generates 7.5 to 8 tokens per second, which is comparable to human reading speed, while consuming only a quarter of its compute resources. In a real system-on-chip (SoC), this performance could improve by up to ten times, significantly raising the speed and energy efficiency of edge-side AI.
The children's story generation feature uses an open-source version of Llama2, combined with an Ethos-U NPU backend, to run the demo on TFLite Micro. Most of the inference logic is written in C++ at the application level, and optimizing the contents of the context window improves the coherence of the story, ensuring that the AI tells it smoothly.
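The article does not detail how the context window is optimized. One common approach for fixed-shape models, sketched below purely as an assumption, is to always keep the user's prompt and drop the oldest generated tokens once the fixed window fills (all names and sizes are illustrative):

```python
# Illustrative sketch of fixed-context-window management for a small LM.
# The model input has a fixed shape of MAX_CONTEXT tokens; the prompt is
# always kept, and the oldest generated tokens are dropped first.

MAX_CONTEXT = 8  # fixed window size, matching a fixed-shape model input

def build_window(prompt_tokens, generated_tokens):
    """Return the token window fed to the model on the next step."""
    budget = MAX_CONTEXT - len(prompt_tokens)
    # Keep the prompt intact, plus only the most recent generated tokens.
    return prompt_tokens + generated_tokens[-budget:] if budget > 0 else prompt_tokens[:MAX_CONTEXT]

prompt = [1, 2, 3]            # e.g. the tokenized user sentence
story = list(range(10, 20))   # tokens generated so far

window = build_window(prompt, story)
print(window)  # prompt followed by the 5 most recent story tokens
```

Keeping the prompt pinned at the front of the window is one way to stop the model from drifting off the user's original sentence as the story grows past the window size.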
Due to hardware limitations, the team needed to adapt the Llama2 model to run efficiently on the Ethos-U85 NPU, which required carefully balancing performance and accuracy. The mixed INT8/INT16 quantization demonstrates the potential of fully integer models, encouraging the AI community to optimize generative models for edge-side devices and driving the adoption of neural networks on energy-efficient platforms such as Ethos-U85.
The Arm Ethos-U85 delivers superior performance
The Ethos-U85's multiply-accumulate (MAC) engine scales from 128 to 2,048 units, and the NPU improves energy efficiency by 20% over the previous-generation Ethos-U65. Another notable advance over the previous generation is native support for Transformer networks.
Ethos-U85 enables partners using previous Ethos-U NPUs to migrate seamlessly and leverage their existing investment in Arm-based machine learning (ML) tools. Thanks to its excellent energy efficiency and performance, Ethos-U85 is becoming increasingly popular with developers.
In its 2,048-MAC configuration, Ethos-U85 can reach 4 TOPS. For the demonstration, Arm used a smaller 512-MAC configuration on an FPGA platform and ran a 15-million-parameter TinyLlama2 small language model at 32 MHz.
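As a rough sanity check on these figures, the headline throughput follows from the MAC count. The calculation below assumes a 1 GHz SoC clock and two operations (one multiply, one add) per MAC per cycle; neither assumption is stated in the article:

```python
# Back-of-the-envelope throughput estimate (assumptions noted above).
OPS_PER_MAC = 2  # one multiply + one add per MAC per cycle

# 2,048 MACs at an assumed 1 GHz SoC clock:
soc_tops = 2048 * OPS_PER_MAC * 1_000_000_000 / 1e12
print(f"2,048 MACs @ 1 GHz ~ {soc_tops:.1f} TOPS")   # ~ 4.1 TOPS

# 512 MACs at the demo's 32 MHz FPGA clock:
fpga_gops = 512 * OPS_PER_MAC * 32_000_000 / 1e9
print(f"512 MACs @ 32 MHz ~ {fpga_gops:.1f} GOPS")
```

The two results line up with the article's claims: roughly 4 TOPS for the full configuration, and a peak budget several orders of magnitude smaller on the 32 MHz FPGA, which makes the 7.5 to 8 tokens-per-second result more striking.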
This capability highlights the potential of embedding AI directly into devices. Despite limited memory (320 KB of SRAM for caching and 32 MB for storage), Ethos-U85 handles such workloads efficiently, laying the foundation for widespread use of small language models and other AI applications in deeply embedded systems.
Bringing generative AI to embedded devices
Developers need more advanced tools to handle the complexity of edge-side AI, and Arm addresses this need with Ethos-U85 and its support for Transformer-based models. As edge-side AI grows in importance in embedded applications, Ethos-U85 is enabling new use cases ranging from language models to advanced vision tasks.
The Ethos-U85 NPU delivers the performance and energy efficiency that innovative edge solutions require. Arm's demonstration marks important progress in bringing generative AI to embedded devices and highlights how practical it is to deploy small language models on the Arm platform.
Arm is opening up new opportunities for edge-side AI across a wide range of applications, making Ethos-U85 a key driver of a new generation of smart, low-power devices.