Exploring State-Space-Model Based Language Model
in Music Generation

Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, and Yi-Hsuan Yang

Graduate Institute of Communication Engineering, National Taiwan University, Taiwan

weijaw2000@gmail.com

Abstract

The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we aim to explore the potential of Mamba-based architectures for music generation. We adopt discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook suffices to capture the majority of semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that SiMBA achieves faster convergence and generates outputs closer to ground truth under limited-resource settings, highlighting the promise of SSMs for efficient and expressive music generation.

Demo Audio Samples

Reconstructed Audio With Different Layers of RVQ Tokens by DAC

The audio samples below are reconstructed by DAC using different numbers of RVQ token layers.

Original Audio	1 Layer	3 Layers	6 Layers	9 Layers (Full)
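To make the layer comparison concrete, the following is a minimal NumPy sketch of residual vector quantization, where each layer quantizes the residual left by the previous layers and decoding with only the first k layers gives a coarser reconstruction. This is a toy illustration, not DAC itself: the codebooks here are sampled from the data rather than learned, the frames are random vectors rather than audio latents, and all function names and sizes (`build_codebooks`, `rvq_encode`, `rvq_decode`, 9 layers of 32 entries) are hypothetical choices made for this example.

```python
import numpy as np


def nearest(x, codebook):
    """Return, for each row of x, the index of the closest codebook entry."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def build_codebooks(frames, num_layers, codebook_size, seed=0):
    """Toy codebook construction (assumption: real codecs learn these).

    Each layer's codebook is sampled from the current residuals. Entry 0 is
    forced to the zero vector, so adding a layer can never increase the error.
    """
    rng = np.random.default_rng(seed)
    residual = frames.copy()
    codebooks = []
    for _ in range(num_layers):
        picks = rng.choice(len(residual), size=codebook_size, replace=False)
        codebook = residual[picks].copy()
        codebook[0] = 0.0
        codebooks.append(codebook)
        residual = residual - codebook[nearest(residual, codebook)]
    return codebooks


def rvq_encode(frames, codebooks):
    """Quantize frames layer by layer; returns codes of shape (layers, frames)."""
    residual = frames.copy()
    codes = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        residual = residual - codebook[idx]
    return np.stack(codes)


def rvq_decode(codes, codebooks, num_layers):
    """Reconstruct frames from the first num_layers layers of codes only."""
    dim = codebooks[0].shape[1]
    out = np.zeros((codes.shape[1], dim))
    for codebook, idx in zip(codebooks[:num_layers], codes[:num_layers]):
        out += codebook[idx]
    return out


# Random stand-in for encoder latents: 256 frames of dimension 8.
frames = np.random.default_rng(1).normal(size=(256, 8))
codebooks = build_codebooks(frames, num_layers=9, codebook_size=32)
codes = rvq_encode(frames, codebooks)

# Mean squared reconstruction error when keeping 1, 3, 6, or all 9 layers.
errors = {}
for k in (1, 3, 6, 9):
    recon = rvq_decode(codes, codebooks, num_layers=k)
    errors[k] = float(((frames - recon) ** 2).mean())
    print(f"{k} layer(s): MSE = {errors[k]:.4f}")
```

In this sketch the first layer removes the largest share of the error and later layers refine the remainder, which mirrors the coarse-to-fine structure that the 1, 3, 6, and 9 layer columns are meant to let listeners judge by ear.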

Prefix SiMBA vs. Cross Transformer (Randomly Sampled)

This section presents audio generated by Prefix SiMBA and Cross Transformer at 85k training steps. As mentioned in the paper, we model only a single DAC codebook, so you may refer to the “1 Layer” column in the table above as the quality baseline. All text prompts are randomly selected from MusicCaps, and the audio samples shown here are randomly selected as well.

Sampled at 85k Training Steps
Text Prompt	Prefix SiMBA	Cross Transformer
The low quality recording features a Champeta song that consists of wooden percussive elements, electric guitar solo melody, funky electric rhythm guitar chords, shimmering shakers and hi hats and smooth bass guitar. It sounds funky and like something you would dance to in a bar.
Someone is playing a song through speakers. The song contains digital drums with strong loud rhythmic hits rising up in pitch and a kick on every beat. A guitar-like sound is playing an arpeggio melody with other string instruments. This is an amateur recording and of poor sound-quality. This song may be playing at a rave party.
This live recording features an instrumental song. This Regional Mexican song features an accordion playing the main melody. This is accompanied by an acoustic guitar strumming chords. A double bass plays the bass notes. A tambourine acts as the percussion. This song is in an upbeat mood. The song can be played at the entrance of a carnival.
The low quality recording features a manically played piano melody over punchy kick and snare hits, followed by uptempo hi hats and shimmering open hats. It sounds aggressive, mani and thin, as it lacks low frequencies.
This song is an instrumental. The tempo is medium with a melodious keyboard harmony, rhythmic guitar accompaniment, punchy drumming, subtle bass line, synth arrangement and tambourine beats. The music is soft, mellow, ambient, pleasant, uplifting, and mellifluous.
Someone is playing a track from speakers. This song contains a strong e-bass playing a funky bassline along with a funky drum groove. Then a piano comes in playing a jazzy melody in one scale accompanied by a synth brass sound swelling into existence and playing a short rise before leaving again. This is an amateur recording but of decent audio-quality. This song may be playing in a jazzbar.
The song is an instrumental. The song is medium tempo with a guitar playing solo, steady drum rhythm, groovy bass line and steady bass line. The song is funky and groovy in nature. The song is an ad jingle.
This is a thrilling orchestral piece that feels epic, suspenseful and intense. There are string instruments like violins and cellos, and timpani for percussion. There is one over-bearing vibrational sound that overwhelms the whole arrangement - it is played once and rings for three seconds.
An acoustic drum is playing a faster groove along with a bassline and a harmonica playing a chord on the offbeat and in a lower key. A e-guitar is playing a melody in the mid range. There are some background noises that remind me of someone brushing his/her teeth. This song may be playing live at a bar.
This music is an Electronic dance instrumental. The tempo is medium fast with punchy drum beats and groovy electronic arrangements.The music is buoyant, energetic, electric, pulsating, youthful and vigorous. The thumpy, rhythmic bass and drumming gives it a groovy dance beat.
