
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
Under review 2026
We propose a hybrid Auto-Regressive (AR) and Non-Auto-Regressive (NAR) architecture for coarse-to-fine music generation. Our approach employs a State-Space Model (SSM) as the language model to generate coarse tokens, followed by a pre-trained diffusion model for fine-grained refinement. By leveraging the linear scaling of SSMs, our model achieves significantly higher training efficiency compared to traditional Transformer-based architectures.
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
Under review 2026
We propose a hybrid Auto-Regressive (AR) and Non-Auto-Regressive (NAR) architecture for coarse-to-fine music generation. Our approach employs a State-Space Model (SSM) as the language model to generate coarse tokens, followed by a pre-trained diffusion model for fine-grained refinement. By leveraging the linear scaling of SSMs, our model achieves significantly higher training efficiency compared to traditional Transformer-based architectures.
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
International Society for Music Information Retrieval, Late Breaking Demo 2025
We investigates the potential of Mamba-based State Space Models (SSMs) as an efficient alternative to Transformers for text-to-music generation. By adopting a single-layer codebook representation and adapting the SiMBA architecture into a decoder, the proposed model achieves significantly faster convergence and produces outputs closer to the ground truth under limited-resource settings. The findings demonstrate that SSMs offer a promising path for developing efficient and expressive music language models that maintain high performance with lower computational overhead.
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
International Society for Music Information Retrieval, Late Breaking Demo 2025
We investigates the potential of Mamba-based State Space Models (SSMs) as an efficient alternative to Transformers for text-to-music generation. By adopting a single-layer codebook representation and adapting the SiMBA architecture into a decoder, the proposed model achieves significantly faster convergence and produces outputs closer to the ground truth under limited-resource settings. The findings demonstrate that SSMs offer a promising path for developing efficient and expressive music language models that maintain high performance with lower computational overhead.
Fang-Duo Tsai, Shih-Lun Wu, Wei-Jaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang
International Conference on Machine Learning 2025
We propose MuseControlLite, a lightweight fine-tuning mechanism that uses rotary positional embeddings and decoupled cross-attention to achieve precise, time-varying control over music generation. This model achieves superior melody accuracy while requiring nearly 7 times fewer trainable parameters than state-of-the-art ControlNet-based architectures. It is the first framework to simultaneously handle musical attribute control (melody, rhythm, and dynamics) alongside reference audio for seamless inpainting and outpainting.
Fang-Duo Tsai, Shih-Lun Wu, Wei-Jaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang
International Conference on Machine Learning 2025
We propose MuseControlLite, a lightweight fine-tuning mechanism that uses rotary positional embeddings and decoupled cross-attention to achieve precise, time-varying control over music generation. This model achieves superior melody accuracy while requiring nearly 7 times fewer trainable parameters than state-of-the-art ControlNet-based architectures. It is the first framework to simultaneously handle musical attribute control (melody, rhythm, and dynamics) alongside reference audio for seamless inpainting and outpainting.