The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations.
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
Kenichi FUJITA
NTT Corporation
Atsushi ANDO
NTT Corporation
Yusuke IJIMA
NTT Corporation
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Kenichi FUJITA, Atsushi ANDO, Yusuke IJIMA, "Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis" in IEICE TRANSACTIONS on Information and Systems,
vol. E107-D, no. 1, pp. 93-104, January 2024, doi: 10.1587/transinf.2023EDP7039.
Abstract: This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2023EDP7039/_p
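The abstract describes embeddings extracted from phoneme sequences and their durations by a speaker identification model. As a rough illustration of that idea only (the paper defines the actual architecture), the following minimal PyTorch sketch shows one plausible reading: a recurrent encoder over (phoneme, log-duration) pairs trained as a speaker classifier, whose bottleneck output serves as the rhythm-based speaker embedding. All names and dimensions here (RhythmSpeakerEncoder, NUM_PHONEMES, EMBED_DIM, etc.) are hypothetical, not taken from the paper.

import torch
import torch.nn as nn

# Hypothetical sizes; the paper's actual configuration may differ.
NUM_PHONEMES = 50    # assumed phoneme inventory
NUM_SPEAKERS = 100   # assumed number of training speakers
EMBED_DIM = 64       # assumed speaker-embedding size

class RhythmSpeakerEncoder(nn.Module):
    """Speaker encoder that sees only phonemes and durations (no spectra)."""
    def __init__(self):
        super().__init__()
        self.phone_emb = nn.Embedding(NUM_PHONEMES, 32)
        # Each time step sees a phoneme embedding plus one log-duration scalar.
        self.rnn = nn.GRU(32 + 1, 128, batch_first=True)
        self.to_embedding = nn.Linear(128, EMBED_DIM)
        self.classifier = nn.Linear(EMBED_DIM, NUM_SPEAKERS)

    def forward(self, phonemes, durations):
        # phonemes: (B, T) int64 IDs; durations: (B, T) seconds
        log_dur = durations.clamp_min(1e-3).log().unsqueeze(-1)
        x = torch.cat([self.phone_emb(phonemes), log_dur], dim=-1)
        _, h = self.rnn(x)                # final hidden state: (1, B, 128)
        emb = self.to_embedding(h[-1])    # (B, EMBED_DIM) rhythm embedding
        return emb, self.classifier(emb)  # logits for speaker-ID training

model = RhythmSpeakerEncoder()
phonemes = torch.randint(0, NUM_PHONEMES, (2, 20))
durations = torch.rand(2, 20) * 0.2 + 0.02
emb, logits = model(phonemes, durations)
print(emb.shape, logits.shape)  # torch.Size([2, 64]) torch.Size([2, 100])

After training with a cross-entropy speaker-ID loss, the classifier head would be discarded and emb used to condition the duration model of a multi-speaker synthesizer; the 15.2% EER quoted in the abstract suggests such duration-only features carry real, if moderate, speaker information.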
@ARTICLE{e107-d_1_93,
author={Kenichi FUJITA and Atsushi ANDO and Yusuke IJIMA},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis},
year={2024},
volume={E107-D},
number={1},
pages={93-104},
abstract={This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.},
doi={10.1587/transinf.2023EDP7039},
ISSN={1745-1361},
month={January},}
TY - JOUR
TI - Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 93
EP - 104
AU - Kenichi FUJITA
AU - Atsushi ANDO
AU - Yusuke IJIMA
PY - 2024
DO - 10.1587/transinf.2023EDP7039
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E107-D
IS - 1
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - 2024/01
AB - This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
ER -