mirror of
https://www.modelscope.cn/iic/speech_sambert-hifigan_tts_zh-cn_16k.git
synced 2026-04-02 10:22:54 +08:00
update new multi voice modelcard
This commit is contained in:
163
README.md
@@ -1,5 +1,164 @@
---
tasks:
- text-to-speech
domain:
- audio
frameworks:
- tensorflow
- pytorch
backbone:
- transformer
metrics:
- MOS
license: Apache License 2.0
tags:
- Alibaba
- tts
- hifigan
- sambert
- text-to-speech
- zhcn
widgets:
- task: text-to-speech
  inputs:
  - type: text
    name: input
    title: 文本
    validator:
      max_words: 30
  examples:
  - name: 1
    title: 示例1
    inputs:
    - name: input
      data: 北京今天天气怎么样
  inferencespec:
    cpu: 4 # number of CPUs
    memory: 8192
    gpu: 1 # number of GPUs
    gpu_memory: 8192
---

#### Clone with HTTP
* http://www.modelscope.cn/damo/speech_sambert-hifigan_tts_zh-cn_16k.git

# Sambert-Hifigan Model Introduction

## Framework Description
Concatenative synthesis and parametric synthesis are the two main Text-To-Speech (TTS) approaches. Since parametric TTS systems have seen wide adoption in recent years, only the parametric approach is covered here.

A parametric TTS system consists of two major modules: a front end and a back end.
The front end includes text normalization, word segmentation, polyphone prediction, grapheme-to-phoneme conversion, and prosody prediction. It parses the input text into linguistic features such as phonemes, tones, pauses, and positions.
The back end consists of a duration model, an acoustic model, and a vocoder, and converts those linguistic features into speech. The duration model predicts, from the linguistic features, the duration of each modeling unit (e.g., each phoneme); the acoustic model predicts acoustic features from the linguistic features and durations; and the vocoder converts the acoustic features into the speech waveform.
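
The front-end / back-end split described above can be sketched as a toy pipeline. This is purely illustrative: every function, feature name, and value below is hypothetical, not part of the actual model's code.

```python
# Toy sketch of a parametric TTS pipeline (hypothetical names and values).

def front_end(text):
    # Parse text into linguistic features: phonemes, tones, pauses, positions.
    # A real front end does normalization, segmentation, polyphone and
    # prosody prediction; here we fake one "phoneme" per character.
    return [{"phoneme": ch, "tone": 1, "pause": False, "position": i}
            for i, ch in enumerate(text)]

def duration_model(features):
    # Predict a frame count per modeling unit (phoneme); constant here.
    return [5 for _ in features]

def acoustic_model(features, durations):
    # Predict acoustic feature frames from linguistic features + durations.
    frames = []
    for feat, dur in zip(features, durations):
        frames.extend([feat["position"]] * dur)  # dummy 1-dim "acoustics"
    return frames

def vocoder(frames):
    # Convert acoustic frames to a waveform; placeholder conversion.
    return [float(f) for f in frames]

def synthesize(text):
    feats = front_end(text)
    durs = duration_model(feats)
    return vocoder(acoustic_model(feats, durs))

wav = synthesize("abc")
print(len(wav))  # 3 phonemes x 5 frames = 15 samples
```

The point is only the data flow: text → linguistic features → durations → acoustic frames → waveform.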

The overall system structure is shown in [Figure 1]:

![System structure](description/tts-system.png)

For the front-end module we combine models with rules to handle text flexibly across a variety of scenarios; for the back-end module we use SAM-BERT + HIFIGAN to provide highly expressive streaming synthesis.

### Acoustic Model: SAM-BERT
The acoustic model in the back end is our in-house SAM-BERT, which models duration and acoustics jointly. Its structure is shown in [Figure 2]:

1. The backbone uses a Self-Attention Mechanism (SAM) to increase modeling capacity.
2. The encoder is initialized from BERT, bringing in more textual information and improving synthesized prosody.
3. The variance adaptor predicts coarse-grained phoneme-level prosody contours (pitch, energy, duration), which the decoder then models at fine-grained frame level. Duration prediction also accounts for its correlation with pitch and energy and is combined with an autoregressive structure, further improving prosodic naturalness.
4. The decoder uses a PNCA AR-Decoder [@li2020robutrans], which naturally supports streaming synthesis.

![SAM-BERT structure](description/sambert.png)
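
The coarse-to-fine idea in point 3 rests on expanding phoneme-level features to frame level according to predicted durations (a FastSpeech-style "length regulator"; the real variance adaptor does considerably more). A minimal sketch with made-up feature vectors:

```python
def length_regulate(phoneme_feats, durations):
    # Repeat each phoneme-level feature vector for its predicted number of
    # frames, turning coarse phoneme-level prosody into a frame-level
    # sequence the decoder can then refine.
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)
    return frames

feats = [[0.1, 0.2], [0.3, 0.4]]         # two phonemes, 2-dim features
frames = length_regulate(feats, [2, 3])  # predicted durations: 2 and 3 frames
print(len(frames))  # 5 frames total
```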

### Vocoder: HIFI-GAN
The vocoder in the back end is HIFI-GAN. In this GAN-based approach, a discriminator guides the training of the vocoder (the generator). Compared with classic autoregressive, sample-by-sample cross-entropy training, this scheme trains more naturally and has clear advantages in both generation efficiency and quality. Its structure is shown in [Figure 3]:

![HIFI-GAN structure](description/hifigan.png)

Building on the open-source HIFI-GAN work [1], we tuned the model structure for the 16 kHz and 48 kHz sampling rates, and added both a low-latency streaming generation mode based on causal convolutions and a chunk-based streaming mechanism. Together with the acoustic model, this supports real-time streaming synthesis on CPUs, GPUs, and other hardware.
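
The causal-convolution streaming idea can be illustrated with a plain 1-D convolution: each output sample depends only on current and past inputs, so carrying the last `kernel_size - 1` input samples between chunks reproduces the offline result exactly. A toy sketch (not the model's actual vocoder code, which applies the same principle to neural network layers):

```python
def causal_conv_stream(chunks, kernel):
    # Stream a 1-D causal convolution chunk by chunk. The carried-over
    # "history" (last len(kernel)-1 input samples) is all the state needed
    # for the chunked output to match the full-signal output.
    k = len(kernel)
    history = [0.0] * (k - 1)          # initial left padding / context
    out = []
    for chunk in chunks:
        buf = history + chunk
        for i in range(len(chunk)):
            window = buf[i:i + k]      # current sample + k-1 past samples
            out.append(sum(w * x for w, x in zip(kernel, window)))
        history = buf[len(chunk):]     # keep last k-1 samples for next chunk
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.5, 0.25, 0.25]
full = causal_conv_stream([signal], kernel)
chunked = causal_conv_stream([signal[:2], signal[2:5], signal[5:]], kernel)
assert chunked == full  # streaming output matches the offline output
```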

## Usage and Scope

How to use:
* Direct inference. The input is a Dict whose keys are test-case labels and whose values are the texts to synthesize. By default the first speaker listed in voices.json under the voices folder is used.

Scope of use:
* Intended for Mandarin Chinese speech synthesis. Input text should be UTF-8 encoded, and its overall length should not exceed 30 characters.

Target scenarios:
* All kinds of speech synthesis tasks, such as dubbing, virtual anchors, and digital humans.

### How to Use
Currently only Linux is supported; Windows and macOS are not. The model can be used once ModelScope-lib is installed. For the supported speaker names, see voices.json under the voices folder.

#### Code Example
```Python
from scipy.io.wavfile import write

from modelscope.pipelines import pipeline
from modelscope.pipelines.outputs import OutputKeys
from modelscope.utils.constant import Tasks

single_test_case_label = 'test_case_label_0'
text = '待合成文本'  # the text to synthesize
voice = 'zhitian_emo'
model_id = 'damo/speech_sambert-hifigan_tts_zhcn_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)

# Input Dict: test-case label -> text to synthesize;
# the speaker name is passed under the 'voice' key.
test_cases = {single_test_case_label: text}
test_cases['voice'] = voice
output = sambert_hifigan_tts(test_cases)

# save the synthesized 16 kHz waveform to output.wav
write('output.wav', 16000, output[OutputKeys.OUTPUT_WAV][single_test_case_label])
```

### Model Limitations and Possible Bias
* These speakers support mixed Chinese and English input; text normalization (TN) rules are for Chinese.
* Currently supported speakers: zhitian_emo, zhiyan_emo, zhizhe_emo, zhibei_emo.

## Training Data
Not available.

## Training Procedure
Not available.

### Preprocessing
Not available.

## Evaluation and Results
Not available.

## Citation
If you find this model helpful, please consider citing the related papers below:

```BibTeX
@inproceedings{li2020robutrans,
  title={Robutrans: A robust transformer-based text-to-speech model},
  author={Li, Naihan and Liu, Yanqing and Wu, Yu and Liu, Shujie and Zhao, Sheng and Liu, Ming},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={05},
  pages={8228--8235},
  year={2020}
}
```

```BibTeX
@article{devlin2018bert,
  title={BERT: Pre-training of deep bidirectional transformers for language understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```

```BibTeX
@article{kong2020hifi,
  title={HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis},
  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  pages={17022--17033},
  year={2020}
}
```

- [1] https://github.com/jik876/hifi-gan
127
configuration.json
Normal file
@@ -0,0 +1,127 @@
{
    "framework": "tensorflow",
    "task": "text-to-speech",
    "model": {
        "type": "sambert-hifigan",
        "lang_type": "zhcn",
        "sample_rate": 16000,
        "am": {
            "cleaners": "english_cleaners",

            "num_mels": 80,
            "sample_rate": 16000,
            "frame_shift_ms": 12.5,

            "embedding_dim": 512,
            "encoder_n_conv_layers": 3,
            "encoder_filters": 256,
            "encoder_kernel_size": 5,

            "encoder_num_layers": 8,
            "encoder_num_units": 128,
            "encoder_num_heads": 8,
            "encoder_ffn_inner_dim": 1024,
            "encoder_dropout": 0.1,
            "encoder_attention_dropout": 0.1,
            "encoder_relu_dropout": 0.1,
            "encoder_projection_units": 32,

            "predictor_filter_size": 41,
            "predictor_fsmn_num_layers": 3,
            "predictor_dnn_num_layers": 0,
            "predictor_num_memory_units": 128,
            "predictor_ffn_inner_dim": 256,
            "predictor_dropout": 0.1,
            "predictor_shift": 0,

            "predictor_prenet_units": [128, 128],
            "predictor_lstm_units": 128,

            "prenet_units": [256, 256],
            "prenet_proj_units": 128,

            "decoder_num_layers": 12,
            "decoder_num_units": 128,
            "decoder_num_heads": 8,
            "decoder_ffn_inner_dim": 1024,
            "decoder_dropout": 0.1,
            "decoder_attention_dropout": 0.1,
            "decoder_relu_dropout": 0.1,

            "outputs_per_step": 3,

            "postnet_filter_size": 41,
            "postnet_fsmn_num_layers": 4,
            "postnet_dnn_num_layers": 0,
            "postnet_num_memory_units": 256,
            "postnet_ffn_inner_dim": 512,
            "postnet_dropout": 0.1,
            "postnet_shift": 17,
            "postnet_lstm_units": 128,

            "dur_scale": 1.0,

            "batch_size": 32,
            "adam_beta1": 0.9,
            "adam_beta2": 0.999,
            "initial_learning_rate": 0.002,
            "decay_learning_rate": true,
            "use_cmudict": false,

            "lfeat_type_list": "sy,tone,syllable_flag,word_segment,emo_category,speaker",

            "guided_attention": false,
            "guided_attention_2g_squared": 0.08,
            "guided_attention_loss_weight": 1.0,

            "free_run": false,

            "X_band_width": 40,
            "H_band_width": 40,

            "max_len": 900
        },
        "vocoder": {
            "resblock": "1",
            "num_gpus": 1,
            "batch_size": 16,
            "learning_rate": 0.0002,
            "adam_b1": 0.8,
            "adam_b2": 0.99,
            "lr_decay": 0.999,
            "seed": 1234,

            "upsample_rates": [10, 5, 2, 2],
            "upsample_kernel_sizes": [20, 11, 4, 4],
            "upsample_initial_channel": 256,
            "resblock_kernel_sizes": [3, 7, 11],
            "resblock_dilation_sizes": [[1, 3, 5, 7], [1, 3, 5, 7], [1, 3, 5, 7]],

            "segment_size": 6400,
            "num_mels": 80,
            "num_freq": 1025,
            "n_fft": 2048,
            "hop_size": 200,
            "win_size": 1000,

            "sampling_rate": 16000,

            "fmin": 0,
            "fmax": 8000,
            "fmax_for_loss": null,

            "num_workers": 4,

            "dist_config": {
                "dist_backend": "nccl",
                "dist_url": "tcp://localhost:54312",
                "world_size": 1
            }
        }
    },
    "pipeline": {
        "type": "sambert-hifigan-tts"
    }
}
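
One consistency worth noting in the configuration above: the product of the vocoder's upsample_rates equals hop_size, i.e. one mel frame is upsampled to 200 waveform samples, which at 16 kHz corresponds to the acoustic model's 12.5 ms frame_shift_ms. A quick check:

```python
import math

# Values copied from the "vocoder" and "am" sections of configuration.json.
upsample_rates = [10, 5, 2, 2]
hop_size = 200
sampling_rate = 16000

# HiFi-GAN upsamples one mel frame by the product of its upsample rates,
# so that product must equal the mel hop size.
assert math.prod(upsample_rates) == hop_size

frame_shift_ms = 1000 * hop_size / sampling_rate
print(frame_shift_ms)  # 12.5, matching the am section's "frame_shift_ms"
```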
BIN
description/hifigan.png
Normal file
Binary file not shown. After: Size 140 KiB
BIN
description/sambert.png
Normal file
Binary file not shown. After: Size 137 KiB
BIN
description/tts-system.png
Normal file
Binary file not shown. After: Size 107 KiB
BIN
resource.zip
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
voices.zip
(Stored with Git LFS)
Normal file
Binary file not shown.