Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

About this item

Full title

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Publisher

Cham: Springer International Publishing

Journal title

EURASIP Journal on Audio, Speech, and Music Processing, 2024-05, Vol. 2024, Iss. 1, Article 28

Language

English

More information

Scope and Contents

Contents

In the era of advanced text-to-speech (TTS) systems capable of generating high-fidelity, human-like speech from a reference speech sample, voice cloning (VC), also known as zero-shot TTS (ZS-TTS), stands out as an important subtask. A primary challenge in VC is maintaining speech quality and speaker similarity with limited reference data for a specific speaker. However, existing VC systems often rely on naive combinations of embedded speaker vectors for speaker control, which compromises the capture of speaking style, voiceprint, and semantic accuracy. To overcome this, we introduce the Two-branch Speaker Control Module (TSCM), a novel and highly adaptable voice cloning module designed to precisely process speaker and style control for a target speaker. Our method uses an advanced fusion of local-level features from a Gated Convolutional Network (GCN) and utterance-level features from a gated recurrent unit (GRU) to enhance speaker control. We demonstrate the effectiveness of TSCM by integrating it into advanced TTS systems such as FastSpeech 2 and VITS, significantly improving their performance. Experimental results show that TSCM enables accurate voice cloning for a target speaker with minimal data, through either zero-shot adaptation or few-shot fine-tuning of pretrained TTS models. Furthermore, our TSCM-based VITS (TSCM-VITS) shows superior performance in zero-shot scenarios compared to existing state-of-the-art VC systems, even with basic dataset configurations. Our method's superiority is validated through comprehensive subjective and objective evaluations. A demonstration of our system is available at
https://great-research.github.io/tsct-tts-demo/
, providing practical insights into its application and effectiveness.
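
The abstract describes a two-branch fusion: frame-level features from a gated convolutional branch are combined with an utterance-level embedding from a GRU branch to drive speaker control. The PyTorch sketch below illustrates one plausible wiring of such a module; the class names, layer sizes, and gated fusion scheme are illustrative assumptions, not the paper's actual TSCM implementation.

# Minimal sketch (assumed design, not the authors' code): fuse local gated-conv
# features with an utterance-level GRU embedding into a speaker control sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedConvBranch(nn.Module):
    """Local-level branch: 1-D gated convolutions (GLU) over mel frames."""

    def __init__(self, in_dim: int = 80, hidden: int = 256, layers: int = 3):
        super().__init__()
        self.convs = nn.ModuleList()
        dim = in_dim
        for _ in range(layers):
            # Each conv emits 2*hidden channels; GLU splits them into
            # content and gate halves: content * sigmoid(gate).
            self.convs.append(nn.Conv1d(dim, hidden * 2, kernel_size=5, padding=2))
            dim = hidden

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, in_dim) -> (batch, frames, hidden)
        x = mel.transpose(1, 2)
        for conv in self.convs:
            x = F.glu(conv(x), dim=1)
        return x.transpose(1, 2)


class GRUBranch(nn.Module):
    """Utterance-level branch: GRU over frames, last hidden state as embedding."""

    def __init__(self, in_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        _, h_n = self.gru(mel)        # h_n: (1, batch, hidden)
        return h_n.squeeze(0)         # (batch, hidden)


class TwoBranchSpeakerControl(nn.Module):
    """Fuse local and utterance-level reference features (assumed gated fusion)."""

    def __init__(self, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.local = GatedConvBranch(mel_dim, hidden)
        self.utter = GRUBranch(mel_dim, hidden)
        self.gate = nn.Linear(hidden * 2, hidden)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        local = self.local(ref_mel)                     # (B, T, H)
        utter = self.utter(ref_mel).unsqueeze(1)        # (B, 1, H)
        utter = utter.expand(-1, local.size(1), -1)     # broadcast over frames
        gate = torch.sigmoid(self.gate(torch.cat([local, utter], dim=-1)))
        return gate * local + (1.0 - gate) * utter      # (B, T, H)


if __name__ == "__main__":
    ref = torch.randn(2, 120, 80)       # two reference utterances, 120 mel frames
    ctrl = TwoBranchSpeakerControl()(ref)
    print(ctrl.shape)                   # torch.Size([2, 120, 256])

In a FastSpeech 2 or VITS style pipeline, such a per-frame control sequence could be added to or concatenated with the text encoder output before decoding; the paper's exact integration points may differ.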

Authors, Artists and Contributors

Identifiers

Primary Identifiers

Record Identifier

TN_cdi_doaj_primary_oai_doaj_org_article_e08805403ca34028a8d742b9a22ab0cc

Permalink

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_doaj_primary_oai_doaj_org_article_e08805403ca34028a8d742b9a22ab0cc

Other Identifiers

ISSN

1687-4722, 1687-4714

E-ISSN

1687-4722

DOI

10.1186/s13636-024-00351-9
