Thanks to innovations like DeepSeek, training AI models has become cheaper. However, inference is becoming more demanding as we ask AI to think harder before answering our questions. Nvidia, Groq, and Cerebras Systems (clients of Cambrian-AI Research) have all released massive accelerators and infrastructure to support this trend. I suspect we will hear more from Nvidia next week about inference than about training, spanning clouds, robots, and cars. Jensen Huang has said this reasoning style of inference processing is 100 times more computationally demanding than one-shot inference. In a recent experiment, I found that reasoning can be as much as 200 times more expensive, but the answers are far more intelligent and far more valuable!
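To see where a multiplier like that can come from, consider that a reasoning model generates a long chain-of-thought trace before its final answer, and output tokens are what you pay for. The back-of-the-envelope sketch below uses entirely hypothetical token counts and pricing, chosen only to illustrate the arithmetic, not figures from my experiment:

```python
# Back-of-the-envelope sketch: why reasoning inference can cost ~200x more
# than one-shot inference. All token counts and the price below are
# hypothetical placeholders for illustration only.

PRICE_PER_MILLION_OUTPUT_TOKENS = 10.00  # assumed $/1M output tokens

one_shot_tokens = 75       # a short, direct answer (assumed)
reasoning_tokens = 15_000  # chain-of-thought trace plus answer (assumed)

one_shot_cost = one_shot_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS
reasoning_cost = reasoning_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS

print(f"one-shot:   ${one_shot_cost:.6f}")
print(f"reasoning:  ${reasoning_cost:.6f}")
print(f"multiplier: {reasoning_tokens / one_shot_tokens:.0f}x")  # -> 200x
```

The cost scales almost linearly with output tokens, so a model that "thinks" for thousands of tokens before answering is directly that much more expensive to serve.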
Cerebras Takes Inference To a New Level
Cerebras Systems, the creator of wafer-scale, Frisbee-sized AI chips, has announced plans to build six new data centers since entering the “high-value” token business. The company claims it will become the largest provider of such inference services globally by the end of this year. Some of the new data centers are already partially up and running, and the footprint will soon expand to France and Canada. The aggregate capacity of these systems, which will number in the thousands, will exceed 40 million Llama 70B tokens per second.
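For a rough sense of scale, the sketch below converts that aggregate figure into simultaneous full-speed user streams. The 40 million tokens-per-second number is Cerebras's claim; the per-user generation rate is my assumption for illustration, not a published spec:

```python
# Rough scale check on the claimed fleet-wide capacity.
AGGREGATE_TOKENS_PER_SEC = 40_000_000  # Cerebras's claimed Llama 70B capacity
ASSUMED_TOKENS_PER_USER = 2_000        # assumed per-stream generation rate

concurrent_streams = AGGREGATE_TOKENS_PER_SEC / ASSUMED_TOKENS_PER_USER
print(f"~{concurrent_streams:,.0f} simultaneous full-speed streams")  # ~20,000
```

Even under these assumptions, that is tens of thousands of users each receiving tokens far faster than they can read them.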
High-value tokens carry more contextual information and are typically more important for understanding the overall meaning of a text. They often represent key concepts, rare words, or specialized terminology. High-value tokens consume more computational resources and may cost more to process. This is because they typically require more attention from the model and contribute more significantly to the final output. Low-value tokens, which are more common and less informationally dense, usually require fewer processing resources. Clearly, Cerebras is targeting problems that are a good fit for its wafer-scale approach to AI.
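One way to build intuition for this distinction is a simple inverse-frequency heuristic: common filler words score low, while rare, specialized terms score high. This is purely an illustrative toy, not how production models actually allocate compute or price tokens:

```python
from collections import Counter

# Illustrative heuristic only: treat rarer tokens as "higher value."
# Real LLMs do not score tokens this way; this just shows why rare,
# specialized terms stand out from common filler words.

corpus = (
    "the model reads the text and the model attends to the text "
    "wafer-scale inference accelerates transformer attention"
).split()

counts = Counter(corpus)
total = sum(counts.values())

# Inverse-frequency score: "the" scores low, "wafer-scale" scores high.
scores = {token: total / count for token, count in counts.items()}

for token, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{token:15s} score={score:5.1f}")
```

Run it and the specialized terms ("wafer-scale", "transformer") top the list while "the" lands at the bottom, mirroring the high-value versus low-value split described above.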
This level of performance in delivering high-value tokens is attracting new enterprise customers that also need elastic capacity. AlphaSense, a leading market intelligence platform, for example, has moved to Cerebras Inference, replacing a top-three closed-source AI model provider. Cerebras has also landed Perplexity, Mistral, Hugging Face, and other users of high-value inferencing, to whom it delivers inference performance 10 to 20 times faster than the alternatives.
The Inference Revolution Is Just Beginning
Next week at GTC, we will hear more from Nvidia about “high-value” tokens, as the inference market overtakes training in global revenue. Markets such as autonomous vehicles, robots, and sovereign data centers all depend on fast inference, and Nvidia does not plan to let that market pass it by. The high-value concept is new, and platforms like Cerebras and Nvidia's NVL72 are ideal for delivering it.