The Next Phase of Competition in AI

As performance differences between models lessen, AI defensibility comes down to compute, talent, data and distribution.

To no one’s surprise, AI models have come a long way since 2022 and have improved dramatically since 2019, when generalist NLP models were first trained at Meta and Google. But it’s also clear that the benchmarks we commonly use to measure model performance, and therefore model quality, are leveling off. Shattering state-of-the-art benchmarks used to mean that a model improved from 27% to 60% on MMLU, an increase of more than 2x (RoBERTa to GPT-3). Now, it can mean a model went from 88.7% to 90%, an increase of about 1.01x (GPT4o to Gemini Ultra).


A common interpretation is that model performance is stagnating, and that we’re finally at the beginning of the end of the fabled scaling laws. Our view is a bit different: what got us here may or may not get us to AGI, but that doesn’t mean the end of progress. Though we believe models will continue to improve, the pure performance differences between models may become less pronounced. As this happens, maintaining advantage will increasingly come down to companies’ ability to carve out defensibility along the evolving, underlying dimensions of AI competition: compute, talent, data and distribution.

Compute: Companies will compete to maximize inference efficiency, to the benefit of app developers everywhere

Compute will continue to matter and GPU scarcity will persist. Training the next generation of foundation models will be resource intensive, but as production use cases proliferate, GPU demand for training will be exceeded by GPU demand for inference. Over the next few years, companies will compete to maximize inference efficiency, to the benefit of app developers everywhere.

The GPU scarcity conversation often centers on immense demand for GPUs in model training. Bigger models are just better. And bigger models require more GPUs. This trend isn’t going away, especially as several labs race to build the next generation of intelligence. However, it doesn’t tell the full story.

GPT-5 is supposedly being trained on anywhere from 25k to 50k Microsoft Azure H100 GPUs. Azure is reported to have about 150k H100 GPUs and is adding more every day. While it’s notable that up to one-third of those GPUs are being used to train a single model, what about the others? Why is there still scarcity when we know other super-large model providers are using their own infrastructure? The answer is inference.

Current models are really good! Real workloads are proliferating on the backs of OpenAI and Anthropic, and open models like Llama-3 and Mixtral have democratized intelligence for nearly all developers. Increased competition is reducing pricing power in inference, leading to even more inference (it’s getting a lot cheaper to build AI apps!).

These trends increasingly point to model providers and developers focusing on maximizing GPU efficiency at inference time. This is happening in two ways. The first is small model offerings: large flagship models are increasingly accompanied by smaller, inference-optimized siblings. GPT4o-mini is one example. Rumored to be about 8B parameters, versus a rumored 1.8T for GPT4o, it is roughly 40x cheaper to run than the full GPT4o. Improvements in fine tuning suggest that any tradeoff in multi-modal quality is shrinking, and on language benchmarks, GPT4o-mini is roughly comparable to Llama 3 70B and Claude 2 (rumored to be 137B parameters, or nearly 18x its size).
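
For app developers, one concrete way to bank this price/quality spread is a cascade: route requests to the small model first and escalate only when its answer looks uncertain. Here is a minimal sketch, where call_model is a hypothetical stand-in for any provider SDK and the confidence heuristic is purely illustrative (in practice it might come from logprobs or a verifier model):

```python
def call_model(name: str, prompt: str) -> tuple[str, float]:
    # Toy stub returning (response, confidence); replace with real API calls.
    canned = {"small-model": ("draft answer", 0.65),
              "flagship-model": ("thorough answer", 0.95)}
    return canned[name]

def answer(prompt: str, threshold: float = 0.8) -> str:
    response, confidence = call_model("small-model", prompt)  # cheap first pass
    if confidence >= threshold:
        return response                                       # the ~40x cheaper path
    response, _ = call_model("flagship-model", prompt)        # escalate only when needed
    return response

print(answer("Summarize this contract clause."))
```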

The second is increased use of direct inference optimizations. Quantization, FlashAttention and speculative decoding are already mainstays in inference-optimized models, but sparsity is the next frontier. Analysis of BERT showed that only about 10% of the elements in the QK^T attention computation contribute meaningfully to the final attention matrix. The challenge is finding the right 10% without degrading performance. Sparsely-Sharded Attention, which divides the input context across attention heads, holds promise. Similarly, mixture-of-experts (MoE) models are gaining popularity because they have sparsity built in: only a small subset of “experts” is activated for any given input.
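
To make the sparsity idea concrete, here is a toy PyTorch sketch that keeps only the top ~10% of attention scores per query and masks out the rest. Note that this naive version still computes the full QK^T matrix before discarding entries; the research challenge described above is identifying the right entries without materializing them all.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep_frac=0.10):
    """Toy sparse attention: keep the top `keep_frac` of scores per query."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5         # full QK^T / sqrt(d)
    keep = max(1, int(keep_frac * scores.size(-1)))     # e.g. ~10% of the keys
    kth_best = scores.topk(keep, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth_best, float("-inf"))  # drop the rest
    return F.softmax(scores, dim=-1) @ v                # attend over survivors only

q = k = v = torch.randn(128, 64)    # (seq_len, d_head)
out = topk_sparse_attention(q, k, v)
```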

There’s no denying that supporting the inference and training demands of next-generation models will require a hard compute build-out: GPUs, the steel racks they sit in, plumbing, HVAC and cooling, energy management, power supply and literal concrete poured to build the data centers that house all of these components. This will take time. In the interim, the focus will be on optimizing with what’s available. This should all come as welcome news to app developers. Even when compute constraints ease, this research will yield a broad selection of models across a range of quality and price points.

Talent: Mission alignment matters more than ever

At one point, money was enough to lure talent. Now, as well-funded labs compete with one another, a new dimension is emerging: mission alignment.

Talent density creates momentum: the resonant feeling of working with the very best people on something special yields compounding rewards. For a while, the research center of mass was in academia. Now, it is distinctly not.

In the 2010s, private labs bought up much of the talent from university research departments. In 2013, outstanding paper winners at ICML, NeurIPS, and ICLR were all from academic teams. By 2023, over 50% of winning teams had non-academic contributors, and 70% of the most-cited AI/ML papers are now written by non-academic labs.

For private labs, plucking researchers from academia was relatively easy: they offered similar intellectual freedom with much higher compensation. But now, competition is shifting. Instead of pulling researchers away from an under-funded lab at Stanford, companies must pull them away from a seven-figure compensation package at OpenAI or DeepMind. Financial reward is necessary, but insufficient.

At first, Google/DeepMind was a talent vortex. Then, drawn by more freedom, greater potential reward and less bureaucracy, researchers made OpenAI the center of mass. Many still consider OpenAI the leader in talent, but several notable departures point to an additional motivation: mission alignment.

This is especially true for post-training talent. Mere hours after cofounder Ilya Sutskever left OpenAI, Jan Leike, head of the Superalignment team focused on aligning models to human values (safety), followed him out the door. Several members of Jan’s team and others from across the company have since decamped to Anthropic. Musk’s xAI has also lured high-quality researchers away from OpenAI by promising to build models that are more “rebellious”.

Alignment and post-training affect how a model feels to users, and the researchers working on them are increasingly choosing to move to places that build models aligned with their personal beliefs. Navigating this will be a challenge for companies. The “right” perspective is not yet apparent, but not having a perspective is no longer an option.

Data: Organic alignment data and proprietary alignment techniques are the next frontiers for training

Organic alignment data is a necessary input to continuous model improvement, extending the generalizability of models pre-trained on publicly available data.

Attention existed long before 2017. The key insight of the ‘Attention is All You Need’ paper was that attention no longer needed to be attached to an RNN: attention-based encoder-decoder blocks were sufficient on their own. Removing the recurrence enabled parallelized training, which allowed for greater model sizes, which in turn called for massive amounts of data to train those larger models.
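
An illustrative PyTorch sketch of why that mattered (the shapes, and the use of x as query, key and value at once, are arbitrary simplifications): the RNN update must walk the sequence one token at a time, while attention handles every position in a single batched matmul.

```python
import torch

seq_len, d = 128, 64
x = torch.randn(seq_len, d)

# RNN-style: each step depends on the previous hidden state,
# so training must traverse the sequence token by token.
W = torch.randn(d, d)
h = torch.zeros(d)
hiddens = []
for t in range(seq_len):               # inherently sequential
    h = torch.tanh(x[t] + h @ W)
    hiddens.append(h)

# Attention-style: all positions attend to all others at once,
# so the whole sequence can be trained in parallel.
scores = (x @ x.T) / d ** 0.5          # Q = K = V = x, for brevity
out = torch.softmax(scores, dim=-1) @ x
```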

Most pre-training data comes from three places: web-scale scrapes, licensing deals with publishers and synthetic data. A fourth category is emerging: organic alignment data.

Post-training alignment is employed to help models reason over proprietary tasks ill-described by public data, such as interpreting a dense technical document on the air consumption dynamics of a Pratt & Whitney jet engine. At the moment, the solution appears to be fine tuning and RAG, but both require a ton of clean data. Once that data is exhausted, further improvement becomes challenging. The key is alignment: using feedback from the fluid dynamics engineers using the model to guide the system towards correctness over time. By getting even imperfect applications into the hands of skilled users, companies will be able to collect troves of this proprietary data.
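
What might capturing that organic alignment data look like in practice? Here is a hypothetical sketch, with illustrative field names of our own invention, of an app logging each expert interaction as a record that can later feed fine tuning or preference optimization:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class AlignmentRecord:
    # One expert interaction, captured as future alignment data.
    prompt: str             # e.g. a question about a dense technical document
    model_response: str
    expert_rating: int      # thumbs up/down or a 1-5 score from the skilled user
    expert_correction: str | None = None  # the "right" answer, when provided
    timestamp: float = 0.0

def log_feedback(record: AlignmentRecord, path: str = "alignment_data.jsonl"):
    # Append-only log; each line becomes a candidate training example later.
    record.timestamp = time.time()
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```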

Alignment techniques have improved in step to capitalize on this new data. Initial iterations like RLHF and DPO capture coarse preferences, but simple A/B response comparisons can nudge the model in the wrong direction: what if both answers are bad? Contextual AI’s Kahneman-Tversky Optimization (KTO) attempts to capture objective value, not simply comparative value. APO + CLAIR adds yet another technique, comparing a response A to a minimally modified A’ to tease out the specific changes that yield a desirable response.
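
For intuition, here is a heavily simplified sketch of a KTO-style objective, assuming per-response log-probabilities have already been computed. It omits the paper’s per-class weights and batch-level KL estimate (kl_baseline stands in as a constant), so treat it as a reading of the idea rather than the reference implementation. The key difference from DPO is visible in the inputs: each example carries a single binary label, not an A/B pair.

```python
import torch

def kto_style_loss(policy_logp, ref_logp, is_desirable,
                   beta=0.1, kl_baseline=0.0):
    # Implicit reward: how much more likely the policy makes this
    # response than a frozen reference model does (as in DPO).
    reward = beta * (policy_logp - ref_logp)
    # Desirable responses are pushed above the baseline, undesirable below.
    value = torch.where(is_desirable,
                        torch.sigmoid(reward - kl_baseline),
                        torch.sigmoid(kl_baseline - reward))
    return (1 - value).mean()

# Toy batch: per-response log-probs and binary desirability labels.
policy_logp = torch.tensor([-12.3, -45.1, -20.7])
ref_logp    = torch.tensor([-14.0, -40.2, -21.0])
labels      = torch.tensor([True, False, True])
loss = kto_style_loss(policy_logp, ref_logp, labels)
```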

As alignment researchers say, it will always be easier to “evaluate” answers than to “generate” more high-quality ones. As high-value, low-data-quantity use cases proliferate, there is an opportunity to build a differentiated advantage through organic alignment data and proprietary alignment techniques. The key will be getting AI-native apps into the hands of skilled users and driving high-repeat usage.

Distribution: Moving beyond pure performance

In a world of thousands of models, each with plummeting inference prices, differentiation is driven by distribution and product.

During the rapid ascent up the quality curve over the last two years, model performance was all that mattered. Having a better model was a nearly guaranteed way to get users to try the product and potentially convert. Now, quality differences are compressing, and app developers are increasingly agnostic about general model benchmarks, instead opting for their own contextual evaluations.

If there is no objectively true answer to the question “Which model is best?”, then competition shifts to the products that can capture users most effectively and retain them by owning a core workflow. In other words, AI apps competing against other AI apps will look very similar to SaaS apps competing against other SaaS apps in the 2010s. Can the model produce a desirable tone, pre-empt the next task in the workflow and make the user feel like they’re getting work done faster? These are the questions that will define the next generation of AI product experience.

Great companies will stand out by executing relentlessly

Competition in AI is evolving across multiple fronts—compute, talent, data and distribution. The race to develop more efficient inference systems will reshape how applications harness AI, reducing barriers to entry for new developers while increasing demand for specialized hardware. At the same time, attracting top talent will increasingly depend on aligning individual motivations with the larger mission of AI advancements, making culture and vision as important as compensation.

In this new landscape, access to high-quality data, especially organic alignment data, becomes a strategic differentiator that allows companies to continuously fine-tune their models. However, as the models themselves become less differentiated, the true competitive edge will lie in distribution and sticky, opinionated products. Ultimately, every company is trying to build a flywheel. Better products win better users who generate more alignment data, leading to better models that underlie better products.

Raw capabilities will continue to matter. Better performance will be driven by more data, thrown at bigger models, running on larger compute stacks, inside more powerful data centers. But better models will only offer a seat at the table. Organizations that pull all the pieces together to create a differentiated end-to-end AI experience will be the ones that capture the most value in this next chapter of AI competition.