There is a clear battle underway among the major players in the PC market about the definition of what makes an AI PC. It's a battle that extends to how Microsoft and other OEMs interpret that definition as well. The reality is that an AI PC needs to be able to run AI workloads locally, whether that's using a CPU, GPU or neural processing unit. Microsoft has already introduced the Copilot key as part of its plans to combine GPUs, CPUs and NPUs with cloud-based functionality to enable Windows AI experiences.
The bigger reality is that AI developers and the PC industry at large cannot afford to run AI in the cloud in perpetuity. More to the point, local AI compute is necessary for sustainable growth. And while not all workloads are the same, the NPU has become a new and popular destination for many next-generation AI workloads.
What Is An NPU?
At its core, an NPU is a specialized accelerator for AI workloads. This means it is fundamentally different from a CPU or a GPU because it does not run the operating system or process graphics, but it can easily assist in doing both when those workloads are accelerated using neural networks. Neural networks are heavily dependent on matrix multiplication tasks, which means that most NPUs are designed to do matrix multiplication at extremely low power in a massively parallel way.
GPUs can do the same, which is one reason they are very popular for neural network tasks in the cloud today. However, GPUs can be very power-hungry in accomplishing this task, whereas NPUs have proven themselves to be much more power-efficient. In short, NPUs can perform selected AI tasks quickly, efficiently and for more sustained workloads.
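To make the matrix multiplication point concrete, here is a minimal sketch in plain Python of a single neural network layer expressed as a matrix multiply. The function names, the sample numbers and the layer shape are illustrative assumptions, not anything taken from a vendor's NPU. The point is that every output value is an independent sum of multiply-accumulate steps, which is the kind of work an NPU or GPU can spread across many parallel units.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    # Each output element depends only on one row of `a` and one column of `b`,
    # so all rows * cols elements could be computed at the same time.
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def mac_count(m, k, n):
    """Number of multiply-accumulate operations in an (m x k) by (k x n) product."""
    return m * k * n


# Hypothetical toy layer: 2 input samples, 3 features each, 2 output neurons.
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[0.5, -1.0],
           [1.0, 0.0],
           [0.0, 2.0]]

outputs = matmul(inputs, weights)
print(outputs)             # the layer's raw outputs, before any activation function
print(mac_count(2, 3, 2))  # multiply-accumulates needed for this tiny layer
```

Real models repeat this pattern with matrices that have thousands of rows and columns, which is why the operation count, and the power spent per operation, matters so much.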
The NPU's Evolution
Some of the earliest efforts in building NPUs came from the world of neuromorphic computing, where many different companies tried to build processors based on the architecture of the human brain and nervous system. However, most of those efforts never panned out, and many were pruned out of existence. Other efforts were born out of the evolution of digital signal processors, which were originally created to convert analog signals such as sound into digital signals. Companies including Xilinx (now part of AMD) and Qualcomm took this approach, repurposing some or all of their DSPs into AI engines. Ironically, Qualcomm already had an NPU in 2013 called the Zeroth, which was about a decade too early. I wrote about its transition from dedicated hardware to software in 2016.
One of the advantages of DSPs is that they have traditionally been highly programmable while also having very low power consumption. Combining these two benefits with matrix multiplication has led companies to the NPU in many cases. I learned about DSPs in my early days with an electronic prototype design firm that worked a lot with TI's DSPs in the mid-2000s. In the past, Xilinx called its AI accelerator a DPU, while Intel called it a vision processing unit as a legacy from its acquisition of low-power AI accelerator maker Movidius. All of these have something in common, in that they all come from a processor designed to analyze analog signals (e.g., sound or imagery) and process those signals quickly and at extremely low power.
Qualcomm's NPU
As for Qualcomm, I have personally witnessed its journey from the Hexagon DSP to the Hexagon NPU, during which the company has continually invested in incremental improvements for every generation. Now Qualcomm's NPU is powerful enough to claim 45 TOPS of AI performance on its own. In fact, as far back as 2017, Qualcomm was talking about AI performance inside the Hexagon DSP, and about leveraging it alongside the GPU for AI workloads. While there were no performance claims for the Hexagon 682 inside the Snapdragon 835 SoC, which shipped that year, the Snapdragon 845 of 2018 included a Hexagon 685 capable of a whopping 3 TOPS thanks to a technology called HVX. By the time Qualcomm put the Hexagon 698 inside the Snapdragon 865 in 2019, the component was no longer being called a DSP; now it was a fifth-generation "AI engine," which means that the current Snapdragon 8 Gen 3 and Snapdragon X Elite are Qualcomm's ninth generation of AI engines.
The Rest Of The AI PC NPU Landscape
Not all NPUs are the same. In fact, we still don't fully understand what everyone's NPU architectures are, nor how fast they run, which keeps us from being able to fully compare them. That said, Intel has been very open about the NPU in the Intel Core Ultra model code-named Meteor Lake. Right now, Apple's M3 Neural Engine ships with 18 TOPS of AI performance, while Intel's NPU has 11 and the XDNA NPU in AMD's Ryzen 8040 (a.k.a. Hawk Point) has 16 TOPS. These numbers all seem low when you compare them to Qualcomm's Snapdragon X Elite, which has an NPU-only TOPS of 45 and a complete system TOPS of 75. In fact, Meteor Lake's complete system TOPS is 34, while the Ryzen 8040 is 39, both of which are lower than Qualcomm's NPU-only performance. While I expect Intel and AMD to downplay the role of the NPU initially and Qualcomm to play it up, it does seem that the landscape may become much more interesting at the end of this year moving into early next year.
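The comparison above is simple arithmetic, and a short Python sketch makes it easy to check. It uses only the TOPS figures quoted in this section; the dictionary layout and function names are my own illustrative choices, and no system-level figure is given here for Apple's M3, so it is left empty rather than guessed.

```python
# TOPS figures quoted in the paragraph above; None means no figure was given.
CHIPS = {
    "Apple M3": {"npu": 18, "system": None},
    "Intel Core Ultra (Meteor Lake)": {"npu": 11, "system": 34},
    "AMD Ryzen 8040 (Hawk Point)": {"npu": 16, "system": 39},
    "Qualcomm Snapdragon X Elite": {"npu": 45, "system": 75},
}


def npu_share(chip):
    """Fraction of the quoted system TOPS that comes from the NPU alone."""
    if chip["system"] is None:
        return None
    return chip["npu"] / chip["system"]


def systems_below(npu_tops):
    """Chips whose complete system TOPS is lower than the given NPU-only figure."""
    return sorted(
        name
        for name, chip in CHIPS.items()
        if chip["system"] is not None and chip["system"] < npu_tops
    )


for name, chip in CHIPS.items():
    share = npu_share(chip)
    label = "n/a" if share is None else f"{share:.0%}"
    print(f"{name}: NPU {chip['npu']} TOPS, NPU share of system TOPS {label}")

# Which complete systems fall short of the Snapdragon X Elite's NPU on its own?
print(systems_below(CHIPS["Qualcomm Snapdragon X Elite"]["npu"]))
```

Keep in mind that TOPS claims are vendor-reported peak numbers, so this is a comparison of claims rather than of measured performance.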
Shifting Apps From The Cloud To The NPU
While the CPU and GPU are still extremely relevant for everyday use in PCs, the NPU has become the center of attention for many in the industry as an area for differentiation. One open question is whether the NPU is relevant enough to justify being a technology focus and, if so, how much performance is enough to deliver an adequate experience. In the bigger picture, I believe that NPUs and their TOPS performance have already become a major battlefield within the PC sector. This is especially true if you consider how many applications might target the NPU simultaneously, and possibly bog it down if there isn't enough performance headroom.
With so much focus on the NPU inside the AI PC, it makes sense that there must be applications that take advantage of that NPU to justify its existence. Today, most AI applications live in the cloud because that's where most AI compute resides. As more of these applications shift from the cloud to a hybrid model, there will be an increased dependency on local NPUs to offload AI functions from the cloud. Additionally, there will be applications that require higher levels of security for which IT simply won't allow data to leave the local machine; these applications will be entirely dependent on local compute. Ironically, I believe that one of those key application areas will be security itself, given that security has traditionally been one of the biggest resource hogs for enterprise systems.
As time progresses, more LLMs and other models will be quantized in ways that will enable them to have a smaller footprint on the local device while also improving accuracy. This will enable more on-device AI that has a much better contextual understanding of the local device's data, and that performs with lower latency. I also believe that while some AI applications will initially deploy as hybrid apps, there will still be some IT organizations that want to deploy on-device first; the earliest versions of those applications will likely not be as optimized as possible and will likely take up more compute, driving more demand for higher TOPS from AI chips.
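For readers who haven't seen quantization up close, the following Python sketch shows one simple form of it: symmetric 8-bit quantization of a handful of weights. This is an illustrative assumption on my part, not the scheme any particular model or vendor uses, and the sample weights are made up. It shows the basic trade: each weight shrinks from a 32-bit float to a single byte, at the cost of a small rounding error.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)    # small integers that each fit in one byte
print(worst_error)  # rounding error, bounded by about half the scale factor
```

Production quantization methods are considerably more sophisticated (per-channel scales, calibration data, 4-bit formats and so on), which is where the accuracy gains over time are expected to come from.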
Increasing Momentum
However, the race for NPU dominance and relevance has only just begun. Qualcomm's Snapdragon X Elite is expected to be the NPU TOPS leader when the company launches it in the middle of this year, but the company will not be alone. AMD has already committed to delivering 40 TOPS of NPU performance in its next-generation Strix Point Ryzen processors due early next year, while at its recent Vision 2024 conference Intel claimed 100 TOPS of platform-level AI performance for the Lunar Lake chips due in Q4 of 2024. (Recall that Qualcomm's Snapdragon X Elite claims 75 TOPS across the GPU, CPU and NPU.) While it isn't official, there is an understanding across the PC ecosystem that Microsoft put a requirement on its silicon vendor partners to deliver at least 40 TOPS of NPU AI performance for running Copilot locally.
One item of note is that most companies are apparently not scaling their NPU performance based on product tier; rather, NPU performance is the same across all platforms. This means that developers can target a single NPU per vendor, which is good news for the developers because optimizing for an NPU is still quite an undertaking. Thankfully, there are low-level APIs such as DirectML and frameworks including ONNX that will hopefully help reduce the burden on developers so they don't have to target every type of NPU on their own. That said, I do believe that each chip vendor will also have its own set of APIs and SDKs that can help developers take even more advantage of the performance and power savings of their NPUs.
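As a rough illustration of how a common runtime can spare developers from targeting each NPU by hand, here is a Python sketch built around ONNX Runtime's notion of execution providers. The provider names are ones ONNX Runtime uses, but the preference order, the helper function and the model file name in the comments are assumptions made for this example. Which providers are actually present depends on the ONNX Runtime build and the drivers installed on a given machine.

```python
# Assumed preference order: vendor NPU backends first, then DirectML, then CPU.
PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",       # Qualcomm
    "OpenVINOExecutionProvider",  # Intel
    "VitisAIExecutionProvider",   # AMD
    "DmlExecutionProvider",       # DirectML
    "CPUExecutionProvider",       # always-available fallback
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Return the preferred providers that are available, in preference order."""
    chosen = [name for name in preferred if name in available]
    # Keep the CPU fallback so inference still runs without an accelerator.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Example with a hypothetical machine that only exposes DirectML and the CPU.
print(pick_providers(["CPUExecutionProvider", "DmlExecutionProvider"]))

# With ONNX Runtime installed, the same helper could feed a real session
# ("model.onnx" is a placeholder file name):
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

The application code stays the same whichever accelerator ends up doing the work, which is exactly the burden reduction described above. Squeezing out the last bit of performance would still mean reaching for each vendor's own SDK.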
Wrapping Up
The NPU is quickly becoming the new focus for an industry looking for ways to address the costs and latency that come with cloud-based AI computing. While some companies already have high-performance NPUs, there is a clear and very pressing desire for OEMs to use processors that include NPUs with at least 40 TOPS. There will be an accelerated shift towards on-device AI, which will likely start with hybrid apps and models and in time shift towards mostly on-device computing. This does mean that the NPU will matter less early on for some platforms, but having a less powerful NPU may also translate to not delivering the best possible AI PC experiences.
There are still a lot of unknowns about the complete AI PC vision, especially considering how many different vendors are involved, but I hear that a lot of things will get cleared up at Microsoft's Build conference in late May. That said, I believe the battle for the AI PC will likely drag on well into 2025 as more chip vendors and OEMs adopt faster and more capable NPUs.