Selective State Space Models: Solving the Cost-Quality Tradeoff

As AI is increasingly used in production scenarios, costs are mounting. Are alternative architectures the solution?

Daniel LaBruna
2 min read August 12, 2024
Domain Insights Infra

One of the great drawbacks of attention-based Transformer models is the computational cost of inference. Training can be parallelized across the sequence, but during autoregressive inference each new token must attend to every token before it: per-token cost and the key-value cache grow with context length, and generating a full sequence scales quadratically with it, limiting practical context window sizes. Flash Attention (recently updated for the H100) helps and has become a standard, but it improves constants and memory traffic rather than removing the quadratic scaling itself.
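To make that scaling concrete, here is a toy NumPy sketch of KV-cached decoding. The dimensions and random tensors are made up, and real decoders batch this with optimized kernels, but the shape of the cost is the same: step t attends over t+1 cached keys, so generating L tokens does on the order of L² work.

```python
# Toy illustration (not a real model): with a KV cache, each new token attends
# over all previous tokens, so step t costs O(t) and generating L tokens costs
# O(L^2) overall. Dimensions and random values here are illustrative only.
import numpy as np

d = 64          # head dimension (arbitrary for this sketch)
L = 1024        # number of tokens to generate

rng = np.random.default_rng(0)
keys, values = [], []
total_dot_products = 0

for t in range(L):
    q = rng.standard_normal(d)           # query for the new token
    keys.append(rng.standard_normal(d))  # cache grows by one K and one V
    values.append(rng.standard_normal(d))

    K = np.stack(keys)                   # shape (t+1, d)
    V = np.stack(values)
    scores = K @ q / np.sqrt(d)          # t+1 dot products at step t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    _ = weights @ V                      # attention output for this token

    total_dot_products += t + 1

print(total_dot_products)                # L*(L+1)/2 = 524,800: quadratic in L
```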

So, when the original Mamba SSM paper promised both parallelizable training and inference that scales linearly with input length, there was understandable excitement. Structured state space models (SSMs) had existed before, and they sidestep attention's inference problem: they compress the context into a fixed-size state carried from token to token, so inference scales linearly with sequence length. But their state dynamics were static, fixed after training and unable to respond to the content of the input, and that content-dependent selectivity was exactly what only attention seemed to provide. It appeared that you could have one or the other: quality or cost.

Mamba (and Selective SSMs broadly) challenged this paradigm: their state dynamics vary with the input (akin to attention), yet training can still be parallelized (via a hardware-aware parallel scan) and inference still scales linearly with sequence length. The result is a model that generates with quality comparable to similarly sized Transformers but with dramatically better inference efficiency.
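For intuition, here is a minimal, single-channel sketch of the kind of recurrence a selective SSM runs at inference time. The dimensions, initializations, and projections are illustrative assumptions, not Mamba's actual implementation, which adds multi-channel states, a parallel scan for training, and fused GPU kernels; the point is that the parameters depend on each token while the per-token work stays constant.

```python
# Simplified selective-SSM-style recurrence: the state has a fixed size, so
# per-token inference cost does not grow with context length. All values and
# projections below are made up for illustration.
import numpy as np

d_state = 16                                 # hidden state size (arbitrary)
rng = np.random.default_rng(0)

# The "selective" part: B, C and the step size are functions of the input,
# unlike earlier structured SSMs where they were fixed after training.
w_B = rng.standard_normal(d_state) * 0.1
w_C = rng.standard_normal(d_state) * 0.1
A = -np.exp(rng.standard_normal(d_state))    # stable, diagonal state matrix

def step(h, x):
    """One token of recurrent inference: fixed-size state, constant work."""
    dt = np.log1p(np.exp(0.1 * x))           # input-dependent step size (softplus)
    A_bar = np.exp(dt * A)                   # discretized state transition
    B = w_B * x                              # input-dependent input projection
    C = w_C * x                              # input-dependent output projection
    h = A_bar * h + dt * B * x               # update the state (no KV cache)
    y = C @ h                                # scalar output for this token
    return h, y

h = np.zeros(d_state)
for x in rng.standard_normal(1024):          # a toy one-channel input sequence
    h, y = step(h, x)                        # cost per token never grows with context
```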

Development has continued since the paper was published, with researchers experimenting with an improved Mamba-2 model, the compact Zamba model (well suited to edge deployment), and hybrid attention + SSM models that claim state-of-the-art performance. While these early results have been promising enough to motivate further research, it will be interesting to see whether Selective SSMs become the accepted solution to the inference cost-quality tradeoff.

The answer will likely depend more on transformers than SSMs. Researchers continue to eke out incremental improvements with updated Flash and Sparse Attention mechanisms, as well as speculative decoding. But will these methods reach a theoretical limit? And if they do, will the cost-quality tradeoff become severe enough to motivate the widespread adoption of an alternative architecture? Only time will tell.

If you’re using these models in the field, send me your thoughts at dlabruna@baincapital.com.
