DM-Codec: Distilling Multimodal Representations for Speech Tokenization

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_3119340308

About this item

Author / Creator

Ahasan, Md Mubtasim; Fahim, Md; Mohiuddin, Tasnim; A K M Mahbubur Rahman; Chadha, Aman; Iqbal, Tariq; Amin, M Ashraful; Islam, Md Mofijul; and Amin Ahsan Ali

Publisher

Ithaca: Cornell University Library, arXiv.org

Journal title

arXiv.org, 2024-10

Language

English

Formats

Articles

More information

Scope and Contents

Contents

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec....
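The abstract reports results as Word Error Rate (WER) and Word Information Lost (WIL). As a rough illustration of what these two metrics measure, the sketch below computes both from a word-level Levenshtein alignment, using the standard definitions WER = (S + D + I) / N_ref and WIL = 1 - H² / (N_ref · N_hyp), where H, S, D and I are the counts of hits, substitutions, deletions and insertions. The function names `align_counts` and `wer_wil` are hypothetical, and this is not the authors' evaluation code: the speech recogniser and text normalisation used in the paper are not described in this record.

```python
def align_counts(ref_words, hyp_words):
    """Count (hits, substitutions, deletions, insertions) along one
    minimum-edit-distance alignment of two word lists."""
    n, m = len(ref_words), len(hyp_words)
    # cost[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner, classifying each step.
    # When several optimal alignments exist, this tie-breaking (match,
    # then substitution, then deletion, then insertion) picks one of them.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref_words[i - 1] == hyp_words[j - 1]
                and cost[i][j] == cost[i - 1][j - 1]):
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def wer_wil(reference, hypothesis):
    """Return (WER, WIL) for two whitespace-tokenised transcripts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    hits, subs, dels, ins = align_counts(ref_words, hyp_words)
    wer = (subs + dels + ins) / len(ref_words)
    if not hyp_words:
        return wer, 1.0  # nothing recognised: all word information is lost
    wil = 1.0 - (hits * hits) / (len(ref_words) * len(hyp_words))
    return wer, wil


wer, wil = wer_wil("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3), round(wil, 3))  # 0.333 0.467
```

In the usage line, the hypothesis has one substituted word and one deleted word against a six-word reference, so WER is 2/6, and with four hits WIL is 1 - 16/30. Lower values are better for both metrics.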

Alternative Titles

Full title

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Authors, Artists and Contributors

Author / Creator

Ahasan, Md Mubtasim
Fahim, Md
Mohiuddin, Tasnim
A K M Mahbubur Rahman
Chadha, Aman
Iqbal, Tariq
Amin, M Ashraful
Islam, Md Mofijul
Amin Ahsan Ali

Identifiers

Primary Identifiers

Record Identifier

TN_cdi_proquest_journals_3119340308

Permalink

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_3119340308

Other Identifiers

E-ISSN

2331-8422

How to access this item

Full text available
