diff --git a/README.md b/README.md
index 452413f..301f223 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,170 @@
 ---
+tasks:
+- text-to-speech
+domain:
+- audio
+frameworks:
+- pytorch
+backbone:
+- transformer
+metrics:
+- MOS
 license: Apache License 2.0
+tags:
+- Alibaba
+- tts
+- hifigan
+- sambert
+- text-to-speech
+- Sichuan
+- 16k
+widgets:
+  - task: text-to-speech
+    inputs:
+      - type: text
+        name: input
+        title: 文本
+        validator:
+          max_words: 300
+    examples:
+      - name: 1
+        title: 示例1
+        inputs:
+          - name: input
+            data: 北京今天天气怎么样
+    inferencespec:
+      cpu: 4 # number of CPUs
+      memory: 8192
+      gpu: 1 # number of GPUs
+      gpu_memory: 8192
 ---
-###### 该模型当前使用的是默认介绍模版,处于“预发布”阶段,页面仅限所有者可见。
-###### 请根据[模型贡献文档说明](https://www.modelscope.cn/docs/%E5%A6%82%E4%BD%95%E6%92%B0%E5%86%99%E5%A5%BD%E7%94%A8%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%8D%A1%E7%89%87),及时完善模型卡片内容。ModelScope平台将在模型卡片完善后展示。谢谢您的理解。
-#### Clone with HTTP
-```bash
- git clone https://www.modelscope.cn/speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k.git
-```
\ No newline at end of file
+
+# Sambert-Hifigan Model Introduction
+
+## Framework Description
+Concatenative synthesis and parametric synthesis are the two main Text-To-Speech (TTS) approaches. Parametric TTS systems have been widely adopted in recent years, so only the parametric approach is described here.
+
+A parametric TTS system consists of two major modules: a frontend and a backend.
+The frontend covers text normalization, word segmentation, polyphone disambiguation, grapheme-to-phoneme conversion, and prosody prediction; it parses the input text into linguistic features such as phonemes, tones, pauses, and positions.
+The backend consists of a duration model, an acoustic model, and a vocoder, and converts those linguistic features into speech. The duration model predicts the duration of each modeling unit (e.g., a phoneme) from the linguistic features; the acoustic model predicts acoustic features from the linguistic features and durations; the vocoder then converts the acoustic features into the speech waveform.
+
+The system architecture is shown in Figure 1:
+
+![System architecture](description/tts-system.jpg)
+
+The frontend combines models with rules to handle text from diverse scenarios flexibly, while the backend uses SAM-BERT + HIFIGAN to provide highly expressive streaming synthesis.
+
+### Acoustic model: SAM-BERT
+The backend acoustic model is our in-house SAM-BERT, which models duration and acoustics jointly. Its structure is shown in Figure 2:
+
+1. The backbone uses a Self-Attention Mechanism (SAM) to strengthen the model's capacity.
+2. The encoder is initialized from BERT, injecting richer text information and improving synthesized prosody.
+3. The Variance Adaptor predicts coarse-grained, phoneme-level prosody contours (pitch, energy, duration), which the decoder then refines at frame level. Duration prediction also exploits its correlation with pitch and energy through an autoregressive structure, further improving prosodic naturalness.
+4. The decoder uses a PNCA AR-Decoder [@li2020robutrans], which naturally supports streaming synthesis.
+
+![SAM-BERT architecture](description/sambert.jpg)
+
+### Vocoder: HIFI-GAN
+The backend vocoder is HIFI-GAN: following the GAN paradigm, a discriminator guides the training of the vocoder (the generator). Compared with classic autoregressive, sample-by-sample cross-entropy training, this training scheme is more natural and has clear advantages in both generation efficiency and quality. Its structure is shown in Figure 3:
+
+![System architecture](description/hifigan.jpg)
+
+On top of the open-source HIFI-GAN work [2], we tuned the model architecture for 16 kHz and 48 kHz sampling rates, and added a causal-convolution-based low-latency streaming mechanism as well as chunk-based streaming, which together with the acoustic model supports real-time streaming synthesis on CPU, GPU, and other hardware.
+
+## Usage and Scope
+
+How to use:
+* Run inference directly on input text
+
+Scope of use:
+* Intended for Sichuanese speech synthesis; input text must be UTF-8 encoded, and the overall length should not exceed about 30 characters
+
+Target scenarios:
+* Speech synthesis tasks in general, such as dubbing, virtual anchors, and digital humans
+
+### How to use
+Currently only Linux is supported; Windows and macOS are not supported yet.
+Finetune with the [KAN-TTS](https://github.com/AlibabaResearch/KAN-TTS) codebase. For details, see:
+
+[SAM-BERT training tutorial](https://github.com/AlibabaResearch/KAN-TTS/wiki/training_sambert)
+
+[HIFI-GAN training tutorial](https://github.com/AlibabaResearch/KAN-TTS/wiki/training_hifigan)
+
+Training through MaaS-lib is not supported yet; stay tuned.
+
+#### Code example
+```python
+from scipy.io.wavfile import write
+
+from modelscope.outputs import OutputKeys
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+text = '待合成文本'
+model_id = 'speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k'
+sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id, model_revision='v1.0.0')
+output = sambert_hifigan_tts(input=text)
+pcm = output[OutputKeys.OUTPUT_PCM]
+write('output.wav', 16000, pcm)
+```
+
+### Model limitations and possible bias
+* This voice speaks Sichuanese; text normalization (TN) rules are for Chinese
+
+## Training Data
+Trained on about 11.2 hours of recordings from a single speaker, mainly in Sichuanese.
+
+## Training Procedure
+Training data consist of audio (.wav), text transcriptions (.txt), and phone duration annotations (.interval). Training from random initialization requires at least 2 hours of data; for datasets under 2 hours, initialize the parameters from the multi-speaker pretrained model. Training the acoustic model (AM) takes 1-2 days; training the vocoder takes 5-7 days.
+
+### Preprocessing
+Training extracts acoustic features (mel spectrograms) from the audio; phone durations are converted from time units into frame counts using the frame shift in the config; text transcriptions are mapped to one-hot indices according to the phone set, tone classes, and boundary classes in the config.
+
+## Citation
+If you find this model helpful, please consider citing the related papers:
+
+```BibTeX
+@inproceedings{li2020robutrans,
+  title={Robutrans: A robust transformer-based text-to-speech model},
+  author={Li, Naihan and Liu, Yanqing and Wu, Yu and Liu, Shujie and Zhao, Sheng and Liu, Ming},
+  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
+  volume={34},
+  number={05},
+  pages={8228--8235},
+  year={2020}
+}
+```
+
+```BibTeX
+@article{devlin2018bert,
+  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
+  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+  journal={arXiv preprint arXiv:1810.04805},
+  year={2018}
+}
+```
+
+```BibTeX
+@article{kong2020hifi,
+  title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},
+  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
+  journal={Advances in Neural Information Processing Systems},
+  volume={33},
+  pages={17022--17033},
+  year={2020}
+}
+```
+
+This model builds on the following implementations:
+- [1] [ming024's FastSpeech2 Implementation](https://github.com/ming024/FastSpeech2)
+- [2] [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
+- [3] [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
+- [4] [mozilla/TTS](https://github.com/mozilla/TTS)
+- [5] [espnet/espnet](https://github.com/espnet/espnet)
+
diff --git a/basemodel_16k/hifigan/ckpt/checkpoint_340000.pth b/basemodel_16k/hifigan/ckpt/checkpoint_340000.pth
new file mode 100644
index 0000000..e936f4e
--- /dev/null
+++ b/basemodel_16k/hifigan/ckpt/checkpoint_340000.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0231b7e43162142b6bec4ce0ca147e6e4e355154e27f79cf8e26527c905683a6
+size 907676870
diff --git a/basemodel_16k/hifigan/config.yaml b/basemodel_16k/hifigan/config.yaml
new file mode 100644
index 0000000..4b59b91
--- /dev/null
+++ b/basemodel_16k/hifigan/config.yaml
@@ -0,0 +1,131 @@
+Loss:
+  discriminator_adv_loss:
+    enable: true
+    params: {average_by_discriminators: false}
+    weights: 1.0
+  feat_match_loss:
+    enable: true
+    params: {average_by_discriminators: false, average_by_layers: false}
+    weights: 2.0
+  generator_adv_loss:
+    enable: true
+    params: {average_by_discriminators: false}
+    weights: 1.0
+  mel_loss:
+    enable: true
+    params: {fft_size: 2048, fmax: 8000, fmin: 0, fs: 16000, hop_size: 200, log_base: null,
+      num_mels: 80, win_length: 1000, window: hann}
+    weights: 45.0
+  stft_loss: {enable: false}
+  subband_stft_loss:
+    enable: false
+    params:
+      fft_sizes: [384, 683, 171]
+      hop_sizes: [35, 75, 15]
+      win_lengths: [150, 300, 60]
+      window: hann_window
+Model:
+  Generator:
+    optimizer:
+      params:
+        betas: [0.5, 0.9]
+        lr: 0.0002
+        weight_decay: 0.0
+      type: Adam
+    params:
+      bias: true
+      causal: false
+      channels: 256
+      in_channels: 80
+      kernel_size: 7
+      nonlinear_activation: LeakyReLU
+      nonlinear_activation_params: {negative_slope: 0.1}
+      out_channels: 1
+      resblock_dilations:
+      - [1, 3, 5, 7]
+      - [1, 3, 5, 7]
+      - [1, 3, 5, 7]
+      resblock_kernel_sizes: [3, 7, 11]
+      upsample_kernal_sizes: [20, 11, 4, 4]
+      upsample_scales: [10, 5, 2, 2]
+      use_weight_norm: true
+    scheduler:
+      params:
+        gamma: 0.5
+        milestones: [200000, 400000, 600000, 800000]
+      type: MultiStepLR
+  MultiPeriodDiscriminator:
+    optimizer:
+      params:
+        betas: [0.5, 0.9]
+        lr: 0.0002
+        weight_decay: 0.0
+      type: Adam
+    params:
+      discriminator_params:
+        bias: true
+        channels: 32
+        downsample_scales: [3, 3, 3, 3, 1]
+        in_channels: 1
+        kernel_sizes: [5, 3]
+        max_downsample_channels: 1024
+        nonlinear_activation: LeakyReLU
+        nonlinear_activation_params: {negative_slope: 0.1}
+        out_channels: 1
+        use_spectral_norm: false
+      periods: [2, 3, 5, 7, 11]
+    scheduler:
+      params:
+        gamma: 0.5
+        milestones: [200000, 400000, 600000, 800000]
+      type: MultiStepLR
+  MultiScaleDiscriminator:
+    optimizer:
+      params:
+        betas: [0.5, 0.9]
+        lr: 0.0002
+        weight_decay: 0.0
+      type: Adam
+    params:
+      discriminator_params:
+        bias: true
+        channels: 128
+        downsample_scales: [4, 4, 4, 4, 1]
+        in_channels: 1
+        kernel_sizes: [15, 41, 5, 3]
+        max_downsample_channels: 1024
+        max_groups: 16
+        nonlinear_activation: LeakyReLU
+        nonlinear_activation_params: {negative_slope: 0.1}
+        out_channels: 1
+      downsample_pooling: DWT
+      downsample_pooling_params: {kernel_size: 4, padding: 2, stride: 2}
+      follow_official_norm: true
+      scales: 3
+    scheduler:
+      params:
+        gamma: 0.5
+        milestones: [200000, 400000, 600000, 800000]
+      type: MultiStepLR
+allow_cache: true
+audio_config: {fmax: 8000.0, fmin: 0.0, hop_length: 200, max_norm: 1.0, min_level_db: -100.0,
+  n_fft: 2048, n_mels: 80, norm_type: mean_std, num_workers: 16, phone_level_feature: true,
+  preemphasize: false, ref_level_db: 20, sampling_rate: 16000, symmetric: false, trim_silence: true,
+  trim_silence_threshold_db: 60, wav_normalize: true, win_length: 1000}
+batch_max_steps: 9600
+batch_size: 16
+create_time: '2022-12-26 11:11:35'
+discriminator_grad_norm: -1
+discriminator_train_start_steps: 0
+eval_interval_steps: 10000
+generator_grad_norm: -1
+generator_train_start_steps: 1
+git_revision_hash: 388243c0c173756d1eb34783c02cec4c302cdc25
+log_interval_steps: 1000
+model_type: hifigan
+num_save_intermediate_results: 4
+num_workers: 2
+pin_memory: true
+remove_short_samples: false
+save_interval_steps: 20000
+train_max_steps: 2500000
diff --git a/basemodel_16k/sambert/ckpt/checkpoint_980000.pth b/basemodel_16k/sambert/ckpt/checkpoint_980000.pth
new file mode 100644
index 0000000..a7feb62
--- /dev/null
+++ b/basemodel_16k/sambert/ckpt/checkpoint_980000.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5fc4e9e6baa9a4d1db663183003ff568a28d0e89c05b296e4a83ad4ca7102b36
+size 149428316
diff --git a/basemodel_16k/sambert/config.yaml b/basemodel_16k/sambert/config.yaml
new file mode 100644
index 0000000..8a16420
--- /dev/null
+++ b/basemodel_16k/sambert/config.yaml
@@ -0,0 +1,79 @@
+Loss:
+  MelReconLoss:
+    enable: true
+    params: {loss_type: mae}
+  ProsodyReconLoss:
+    enable: true
+    params: {loss_type: mae}
+Model:
+  KanTtsSAMBERT:
+    optimizer:
+      params:
+        betas: [0.9, 0.98]
+        eps: 1.0e-09
+        lr: 0.001
+        weight_decay: 0.0
+      type: Adam
+    params:
+      MAS: false
+      decoder_attention_dropout: 0.1
+      decoder_dropout: 0.1
+      decoder_ffn_inner_dim: 1024
+      decoder_num_heads: 8
+      decoder_num_layers: 12
+      decoder_num_units: 128
+      decoder_prenet_units: [256, 256]
+      decoder_relu_dropout: 0.1
+      dur_pred_lstm_units: 128
+      dur_pred_prenet_units: [128, 128]
+      embedding_dim: 512
+      emotion_units: 32
+      encoder_attention_dropout: 0.1
+      encoder_dropout: 0.1
+      encoder_ffn_inner_dim: 1024
+      encoder_num_heads: 8
+      encoder_num_layers: 8
+      encoder_num_units: 128
+      encoder_projection_units: 32
+      encoder_relu_dropout: 0.1
+      max_len: 800
+      num_mels: 80
+      outputs_per_step: 3
+      postnet_dropout: 0.1
+      postnet_ffn_inner_dim: 512
+      postnet_filter_size: 41
+      postnet_fsmn_num_layers: 4
+      postnet_lstm_units: 128
+      postnet_num_memory_units: 256
+      postnet_shift: 17
+      predictor_dropout: 0.1
+      predictor_ffn_inner_dim: 256
+      predictor_filter_size: 41
+      predictor_fsmn_num_layers: 3
+      predictor_lstm_units: 128
+      predictor_num_memory_units: 128
+      predictor_shift: 0
+      speaker_units: 32
+    scheduler:
+      params: {warmup_steps: 4000}
+      type: NoamLR
+allow_cache: true
+audio_config: {fmax: 8000.0, fmin: 0.0, hop_length: 200, max_norm: 1.0, min_level_db: -100.0,
+  n_fft: 2048, n_mels: 80, norm_type: mean_std, num_workers: 16, phone_level_feature: true,
+  preemphasize: false, ref_level_db: 20, sampling_rate: 16000, symmetric: false, trim_silence: true,
+  trim_silence_threshold_db: 60, wav_normalize: true, win_length: 1000}
+batch_size: 32
+create_time: '2022-12-26 11:05:43'
+eval_interval_steps: 10000
+git_revision_hash: 388243c0c173756d1eb34783c02cec4c302cdc25
+grad_norm: 1.0
+linguistic_unit: {cleaners: english_cleaners, language: Sichuan, lfeat_type_list: 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category',
+  speaker_list: xiaoyue}
+log_interval_steps: 1000
+model_type: sambert
+num_save_intermediate_results: 4
+num_workers: 4
+pin_memory: false
+remove_short_samples: false
+save_interval_steps: 20000
+train_max_steps: 1000000
diff --git a/configuration.json b/configuration.json
new file mode 100644
index 0000000..c8a1fc5
--- /dev/null
+++ b/configuration.json
@@ -0,0 +1,129 @@
+{
+    "framework": "Tensorflow",
+    "task" : "text-to-speech",
+    "model" : {
+        "type" : "sambert-hifigan",
+        "lang_type" : "zhcn",
+        "sample_rate" : 16000,
+        "am": {
+            "am": {
+                "max_len": 800,
+
+                "embedding_dim": 512,
+                "encoder_num_layers": 8,
+                "encoder_num_heads": 8,
+                "encoder_num_units": 128,
+                "encoder_ffn_inner_dim": 1024,
+                "encoder_dropout": 0.1,
+                "encoder_attention_dropout": 0.1,
+                "encoder_relu_dropout": 0.1,
+                "encoder_projection_units": 32,
+
+                "speaker_units": 32,
+                "emotion_units": 32,
+
+                "predictor_filter_size": 41,
+                "predictor_fsmn_num_layers": 3,
+                "predictor_num_memory_units": 128,
+                "predictor_ffn_inner_dim": 256,
+                "predictor_dropout": 0.1,
+                "predictor_shift": 0,
+                "predictor_lstm_units": 128,
+                "dur_pred_prenet_units": [128, 128],
+                "dur_pred_lstm_units": 128,
+
+                "decoder_prenet_units": [256, 256],
+                "decoder_num_layers": 12,
+                "decoder_num_heads": 8,
+                "decoder_num_units": 128,
+                "decoder_ffn_inner_dim": 1024,
+                "decoder_dropout": 0.1,
+                "decoder_attention_dropout": 0.1,
+                "decoder_relu_dropout": 0.1,
+
+                "outputs_per_step": 3,
+                "num_mels": 80,
+
+                "postnet_filter_size": 41,
+                "postnet_fsmn_num_layers": 4,
+                "postnet_num_memory_units": 256,
+                "postnet_ffn_inner_dim": 512,
+                "postnet_dropout": 0.1,
+                "postnet_shift": 17,
+                "postnet_lstm_units": 128
+            },
+
+            "audio": {
+                "frame_shift_ms": 12.5
+            },
+
+            "linguistic_unit": {
+                "cleaners": "english_cleaners",
+                "lfeat_type_list": "sy,tone,syllable_flag,word_segment,emo_category,speaker_category",
+                "sy": "dict/sy_dict.txt",
+                "tone": "dict/tone_dict.txt",
+                "syllable_flag": "dict/syllable_flag_dict.txt",
+                "word_segment": "dict/word_segment_dict.txt",
+                "emo_category": "dict/emo_category_dict.txt",
+                "speaker_category": "dict/speaker_dict.txt"
+            },
+
+            "num_gpus": 1,
+            "batch_size": 32,
+            "group_size": 1024,
+            "learning_rate": 0.001,
+            "adam_b1": 0.9,
+            "adam_b2": 0.98,
+            "seed": 1234,
+
+            "num_workers": 4,
+
+            "dist_config": {
+                "dist_backend": "nccl",
+                "dist_url": "tcp://localhost:11111",
+                "world_size": 1
+            }
+
+        },
+        "vocoder" : {
+            "resblock": "1",
+            "num_gpus": 1,
+            "batch_size": 16,
+            "learning_rate": 0.0002,
+            "adam_b1": 0.8,
+            "adam_b2": 0.99,
+            "lr_decay": 0.999,
+            "seed": 1234,
+
+            "upsample_rates": [10,5,2,2],
+            "upsample_kernel_sizes": [20,10,4,4],
+            "upsample_initial_channel": 256,
+            "resblock_kernel_sizes": [3,7,11],
+            "resblock_dilation_sizes": [[1,3,5,7], [1,3,5,7], [1,3,5,7]],
+
+            "segment_size": 6400,
+            "num_mels": 80,
+            "num_freq": 1025,
+            "n_fft": 2048,
+            "hop_size": 200,
+            "win_size": 1000,
+
+            "sampling_rate": 16000,
+
+            "fmin": 0,
+            "fmax": 8000,
+            "fmax_for_loss": null,
+
+            "num_workers": 4,
+
+            "dist_config": {
+                "dist_backend": "nccl",
+                "dist_url": "tcp://localhost:54312",
+                "world_size": 1
+            }
+        }
+    },
+    "pipeline": {
+        "type": "sambert-hifigan-tts"
+    }
+}
diff --git a/description/hifigan.jpg b/description/hifigan.jpg
new file mode 100644
index 0000000..1d7cc17
Binary files /dev/null and b/description/hifigan.jpg differ
diff --git a/description/sambert.jpg b/description/sambert.jpg
new file mode 100644
index 0000000..7c9e97e
Binary files /dev/null and b/description/sambert.jpg differ
diff --git a/description/tts-system.jpg b/description/tts-system.jpg
new file mode 100644
index 0000000..d3139f1
Binary files /dev/null and b/description/tts-system.jpg differ
diff --git a/resource.zip b/resource.zip
new file mode 100644
index 0000000..2d6f7b6
--- /dev/null
+++ b/resource.zip
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3d7eb353f4a09aa7ef826145ff4dd7aaa80c6c970cc7055c9d016d67771ad0fe
+size 247832832
diff --git a/voices.zip b/voices.zip
new file mode 100644
index 0000000..7c5f091
--- /dev/null
+++ b/voices.zip
@-0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4e198a316c6235d15286639fbec1a73668d9957a46279e4e39661c69d6a9f41f
+size 89586408