MAMBA: The Architecture That Could Outperform Transformers in Language Modeling.

Shakthi Warnakualsuriya
6 min read · Jun 10, 2024


I first learned about the Mamba architecture from a YouTube video a while ago. It piqued my interest at the time, but it slipped my mind amid academic work, other projects, and some project-related research I stumbled upon. Recently, however, I saw the research paper on Mamba 2 (Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality), published on May 31, 2024, and my curiosity was reignited. The potential of Mamba 2 compelled me to revisit this fascinating topic and share my insights through this blog.

For seven years, transformers have dominated the world of language modelling. Their self-attention mechanism has been the backbone of numerous advancements in AI. However, a new architecture, named MAMBA, promises to outperform transformers in language modelling tasks while using significantly less compute. This architecture could potentially reshape the landscape of AI language models.

What Makes Mamba Special?

MAMBA’s architecture builds on the principles of state-space models (SSMs) and recurrent neural networks (RNNs). Traditional transformers require computation that grows quadratically with input length, which becomes inefficient for long sequences. MAMBA scales far more gracefully: for an input sequence of `n` tokens it uses roughly `O(n log n)` compute during training (and runs step by step in linear time at inference), whereas transformers use `O(n²)`.
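To get a rough feel for that gap, here is a back-of-the-envelope comparison of the two growth rates (purely illustrative; constants, memory traffic and hardware effects are ignored):

```python
import math

# Illustrative operation counts only: O(n log n) for a scan-style recurrence
# versus O(n^2) for self-attention, with all constant factors ignored.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    scan_ops = n * math.log2(n)
    attention_ops = n ** 2
    print(f"n={n:>9,}  n log n = {scan_ops:>14,.0f}  "
          f"n^2 = {attention_ops:>16,.0f}  ratio = {attention_ops / scan_ops:,.1f}x")
```

At a million tokens the quadratic term is already tens of thousands of times larger, which is why attention becomes the bottleneck for very long contexts.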

The Mechanics Behind Mamba

MAMBA’s design is grounded in the principles of state-space models (SSMs) and recurrent neural networks (RNNs), with significant enhancements that address the limitations of both traditional RNNs and transformers.

State-Space Models and Selectivity

State-space models (SSMs) are known for their efficiency in handling sequences, modelling long-range dependencies through a structured, recurrent formulation. However, traditional SSMs struggle with discrete data such as text because they cannot perform content-based reasoning: their dynamics are the same regardless of which token they are currently reading. MAMBA overcomes this by introducing a selection mechanism that allows the model to selectively propagate or forget information depending on the current token. This selective state-space model can dynamically adjust to the input, making it more versatile and effective for language tasks.
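As a minimal sketch of the idea in NumPy: the "gates" below are computed from the current token, so the update rule itself changes per token. The matrices `W_a` and `W_b` are hypothetical stand-ins; Mamba's actual selective parameterisation (of Δ, B and C) is more involved.

```python
import numpy as np

def selective_recurrence(x, W_a, W_b):
    """Toy selective recurrence: the keep/write gates depend on the current token."""
    d = x.shape[1]
    h = np.zeros(d)
    states = []
    for x_t in x:
        keep_t = 1.0 / (1.0 + np.exp(-(W_a @ x_t)))   # per-token "keep" gate in (0, 1)
        write_t = W_b @ x_t                            # per-token write strength
        h = keep_t * h + write_t * x_t                 # element-wise selective update
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                            # 8 tokens, dimension 4
W_a, W_b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(selective_recurrence(x, W_a, W_b).shape)         # (8, 4)
```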

Linear Recurrent Layers

MAMBA employs linear recurrent layers, which differ significantly from the non-linear operations typically used in traditional RNNs. These layers use linear functions for the recurrence operation, simplifying the computation and avoiding vanishing and exploding gradients. The linear recurrence operator is defined by the equation:

`h_t = W_y h_(t-1) + W_x x_t`

where `W_y` and `W_x` are learned matrices, `h_t` is the hidden state at time `t`, and `x_t` is the input vector at time `t`. By making these operations linear, MAMBA ensures that the gradients remain stable during training, allowing the model to capture long-range dependencies without the usual training difficulties associated with RNNs.
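Written out sequentially, that recurrence is nothing more than the loop below (a NumPy sketch with hypothetical shapes; the point is simply that the update contains no non-linearity such as tanh or sigmoid):

```python
import numpy as np

def linear_recurrence(x, W_y, W_x):
    """Sequential form of h_t = W_y h_(t-1) + W_x x_t."""
    h = np.zeros(W_y.shape[0])
    hidden_states = []
    for x_t in x:                       # one step per token: O(n) sequential steps
        h = W_y @ h + W_x @ x_t         # purely linear update, no activation function
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))            # 16 tokens, dimension 4
W_y = 0.9 * np.eye(4)                   # eigenvalues below 1 keep the state stable
W_x = rng.normal(size=(4, 4))
print(linear_recurrence(x, W_y, W_x).shape)   # (16, 4)
```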

Parallel Computation and Efficiency

One of the key innovations in MAMBA is its ability to perform these linear recurrences in parallel, drastically improving computation efficiency. This is achieved through a parallel scan algorithm that computes the cumulative operations efficiently. For a sequence of length `n`, this algorithm reduces the computational complexity to `O(n log n)`, making it much more scalable than the O(n²) complexity of transformers.
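The trick is easiest to see on a scalar toy version of the recurrence, `h_t = a_t * h_(t-1) + b_t`: composing two such steps yields another step of the same form, so the composition is associative and prefixes can be combined in `log2(n)` rounds. Below is a minimal NumPy sketch of a Hillis-Steele style scan (an illustration of the principle, not Mamba's actual fused CUDA kernel):

```python
import numpy as np

def combine(left, right):
    """Associative composition of two steps of h_t = a_t * h_(t-1) + b_t."""
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def parallel_scan(a, b):
    """Hillis-Steele inclusive scan: log2(n) rounds, each fully vectorisable."""
    a, b = a.copy(), b.copy()
    n, shift = len(a), 1
    while shift < n:
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])    # identity padding
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = combine((a_prev, b_prev), (a, b))
        shift *= 2
    return b                         # b[t] now equals h_t (with h_0 = 0)

# Check against the naive sequential loop
rng = np.random.default_rng(0)
n = 16
a, b = rng.uniform(0.5, 1.0, n), rng.normal(size=n)
h, sequential = 0.0, []
for t in range(n):
    h = a[t] * h + b[t]
    sequential.append(h)
print(np.allclose(parallel_scan(a, b), sequential))   # True
```

Each of the `log2(n)` rounds touches all `n` positions, which is where the `O(n log n)` total work comes from, while the depth of the computation drops from `n` sequential steps to `log2(n)`.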

Matrix Diagonalization

To further enhance computational efficiency, Mamba uses matrix diagonalization. Almost every square matrix can be factored into the product of an invertible matrix `P`, a diagonal matrix `D`, and `P^-1`. This allows MAMBA to perform matrix multiplications more efficiently, as operations on diagonal matrices are significantly faster. Specifically, MAMBA represents the recurrent weight matrix in diagonalized form, reducing the complexity of matrix operations to `O(d log n)`, where `d` is the dimension of the vectors.
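A quick NumPy illustration of why the diagonal form helps: once the recurrent matrix is written as `P D P^-1`, applying it repeatedly only ever requires element-wise powers of the diagonal entries (this assumes the matrix is diagonalisable, which, as noted, holds for almost every square matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) / 2        # a generic (almost surely diagonalisable) matrix

# Diagonalise: A = P @ diag(eigenvalues) @ P^-1
eigenvalues, P = np.linalg.eig(A)
P_inv = np.linalg.inv(P)

# Applying the recurrence k times needs A^k, which collapses to element-wise
# powers of the eigenvalues instead of repeated full matrix multiplications.
k = 10
A_pow_direct = np.linalg.matrix_power(A, k)
A_pow_diag = (P @ np.diag(eigenvalues ** k) @ P_inv).real   # imaginary parts ~ 0
print(np.allclose(A_pow_direct, A_pow_diag))                # True
```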

Output Vector Expansion

In addition to improving recurrence operations, MAMBA expands the size of the output vectors by a factor of 16. This expansion allows the model to store more information from previous inputs, significantly enhancing its ability to handle complex sequences. Despite this increase in output size, Mamba maintains its efficiency by optimizing data transfer within GPU memory. The entire MAMBA operation, including the expansion and subsequent reduction of output vectors, is computed in a single block within high-performance memory, minimizing transfer times and maintaining overall computational efficiency.
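The expand-then-reduce pattern can be sketched as follows; the projection matrices and dimensions here are purely hypothetical, and the factor of 16 is taken from the description above. The key property is that the large intermediate state is created and consumed inside one block, so it never needs to be written out to slower memory:

```python
import numpy as np

d_model, expand = 64, 16                       # expansion factor of 16 per the text
d_state = d_model * expand

rng = np.random.default_rng(0)
W_up = 0.02 * rng.normal(size=(d_state, d_model))    # expand: d -> 16d
W_down = 0.02 * rng.normal(size=(d_model, d_state))  # reduce back: 16d -> d

x_t = rng.normal(size=d_model)
h_t = W_up @ x_t            # large intermediate state, lives only inside the block
y_t = W_down @ h_t          # output returned at the original model dimension
print(h_t.shape, y_t.shape)                          # (1024,) (64,)
```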

Hardware-Aware Algorithm

MAMBA’s architecture is designed with modern hardware in mind. By carefully managing the memory hierarchy on GPUs, MAMBA avoids unnecessary data transfers between different levels of memory, which can be a significant bottleneck. The selective state-space model computes the necessary operations directly in the fast GPU memory, ensuring that the expanded states are only materialized when needed. This hardware-aware approach ensures that MAMBA can achieve its theoretical performance improvements in practical, real-world scenarios.

Stability and Training

Training stability is another critical aspect of MAMBA’s design. Traditional RNNs suffer from unstable gradients, which can either vanish or explode as they propagate through the network. MAMBA mitigates this issue by carefully initializing the recurrent weights and incorporating mechanisms to maintain stable gradients. Specifically, the weights are parameterized in a way that ensures they start close to 1, providing a stable foundation for training. This careful initialization, combined with the linear nature of the recurrences, allows MAMBA to learn effectively even with long sequences.
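One simple way to see what "weights close to 1" buys you (a hedged sketch, not Mamba's exact parameterisation): keep each diagonal recurrent weight as `a = exp(-delta)` for a small positive `delta`, so every weight sits just below 1 and repeated application decays gradually instead of exploding or vanishing after a handful of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

delta = rng.uniform(0.001, 0.1, size=8)   # hypothetical initialisation range
a = np.exp(-delta)                        # all recurrent weights roughly in (0.90, 0.999)
print(np.round(a, 3))

# After 100 steps the signal has neither exploded nor collapsed to zero everywhere:
# channels with a ~ 0.999 still retain most of their state, others forget faster,
# giving the model a mixture of short and long memory timescales.
print(np.round(a ** 100, 3))
```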

Real-World Performance

Mamba has demonstrated impressive results in various benchmarks. In language modelling, the Mamba-3B model outperforms transformers of the same size and matches the performance of transformers twice its size. It also excels in other domains, such as audio and genomics, proving its versatility as a sequence model backbone.

Challenges and Controversies

Despite its promising results, Mamba has not been free from controversy, which is perhaps unsurprising given that it directly challenges the Transformer, the architecture widely regarded as the state of the art for language modelling. The paper detailing Mamba’s architecture was rejected from ICLR 2024, a prestigious machine learning conference. Reviewers criticized the lack of evaluation on benchmarks such as the Long Range Arena, which arguably has little to do with language modelling, and questioned the paper’s focus on language modelling alone. These criticisms sparked a debate within the AI community about the fairness and accuracy of the peer review process.

Conclusion

The introduction of Mamba marked a significant milestone in the evolution of AI language models. Its innovative architecture and efficient computation methods offer a promising alternative to transformers. Seeing the recent research on Mamba 2 has reignited my interest, and I look forward to exploring it further. Stay tuned for my next blog post, where I will delve deeper into Mamba 2 and share more insights as I learn about this exciting new development.

References

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

MAMBA 2 — Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

MAMBA from Scratch: Neural Nets Better and Faster than Transformers

Transformer paper — Attention Is All You Need
