Selective State Space Models: Solving the Cost-Quality Tradeoff

As AI is increasingly used in production scenarios, costs are mounting. Are alternative architectures the solution?

Daniel LaBruna
2 min read August 12, 2024
Domain Insights Infra

One of the great drawbacks of attention-based Transformer models is the computational cost of inference. Training can be parallelized across the sequence, but during autoregressive inference each new token must attend to every token before it: per-token cost and the key-value cache grow with context length, and generating a full sequence scales quadratically with it, limiting practical context window sizes. Flash Attention (recently updated for the H100) helps and has become a standard, but it improves constants and memory traffic rather than removing the quadratic scaling itself.
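To make that scaling concrete, here is a toy NumPy sketch of KV-cached decoding. The dimensions and random tensors are made up, and real decoders batch this with optimized kernels, but the shape of the cost is the same: step t attends over t+1 cached keys, so generating L tokens does on the order of L² work.

```python
# Toy illustration (not a real model): with a KV cache, each new token attends
# over all previous tokens, so step t costs O(t) and generating L tokens costs
# O(L^2) overall. Dimensions and random values here are illustrative only.
import numpy as np

d = 64          # head dimension (arbitrary for this sketch)
L = 1024        # number of tokens to generate

rng = np.random.default_rng(0)
keys, values = [], []
total_dot_products = 0

for t in range(L):
    q = rng.standard_normal(d)           # query for the new token
    keys.append(rng.standard_normal(d))  # cache grows by one K and one V
    values.append(rng.standard_normal(d))

    K = np.stack(keys)                   # shape (t+1, d)
    V = np.stack(values)
    scores = K @ q / np.sqrt(d)          # t+1 dot products at step t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    _ = weights @ V                      # attention output for this token

    total_dot_products += t + 1

print(total_dot_products)                # L*(L+1)/2 = 524,800: quadratic in L
```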

So, when the original Mamba SSM paper promised both parallelizable training and inference that scales linearly with input length, there was understandable excitement. Structured state space models (SSMs) had existed before, and they sidestep attention's inference problem: they compress the context into a fixed-size state carried from token to token, so inference scales linearly with sequence length. But their state dynamics were static, fixed after training and unable to respond to the content of the input, and that content-dependent selectivity was exactly what only attention seemed to provide. It appeared that you could have one or the other: quality or cost.

Mamba (and Selective SSMs broadly) challenged this paradigm: their state dynamics vary with the input (akin to attention), yet training can still be parallelized (via a hardware-aware parallel scan) and inference still scales linearly with sequence length. The result is a model that generates with quality comparable to similarly sized Transformers but with dramatically better inference efficiency.
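For intuition, here is a minimal, single-channel sketch of the kind of recurrence a selective SSM runs at inference time. The dimensions, initializations, and projections are illustrative assumptions, not Mamba's actual implementation, which adds multi-channel states, a parallel scan for training, and fused GPU kernels; the point is that the parameters depend on each token while the per-token work stays constant.

```python
# Simplified selective-SSM-style recurrence: the state has a fixed size, so
# per-token inference cost does not grow with context length. All values and
# projections below are made up for illustration.
import numpy as np

d_state = 16                                 # hidden state size (arbitrary)
rng = np.random.default_rng(0)

# The "selective" part: B, C and the step size are functions of the input,
# unlike earlier structured SSMs where they were fixed after training.
w_B = rng.standard_normal(d_state) * 0.1
w_C = rng.standard_normal(d_state) * 0.1
A = -np.exp(rng.standard_normal(d_state))    # stable, diagonal state matrix

def step(h, x):
    """One token of recurrent inference: fixed-size state, constant work."""
    dt = np.log1p(np.exp(0.1 * x))           # input-dependent step size (softplus)
    A_bar = np.exp(dt * A)                   # discretized state transition
    B = w_B * x                              # input-dependent input projection
    C = w_C * x                              # input-dependent output projection
    h = A_bar * h + dt * B * x               # update the state (no KV cache)
    y = C @ h                                # scalar output for this token
    return h, y

h = np.zeros(d_state)
for x in rng.standard_normal(1024):          # a toy one-channel input sequence
    h, y = step(h, x)                        # cost per token never grows with context
```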

Development has continued since the paper was published, with researchers experimenting with an improved Mamba-2 model, the compact Zamba model (well suited to edge deployment), and hybrid attention + SSM models that claim state-of-the-art performance. While these early results have been promising enough to motivate further research, it will be interesting to see whether Selective SSMs become the accepted solution to the inference cost-quality tradeoff.

The answer will likely depend more on transformers than SSMs. Researchers continue to eke out incremental improvements with updated Flash and Sparse Attention mechanisms, as well as speculative decoding. But will these methods reach a theoretical limit? And if they do, will the cost-quality tradeoff become severe enough to motivate the widespread adoption of an alternative architecture? Only time will tell.

If you’re using these models in the field, send me your thoughts at dlabruna@baincapital.com.
