mirror of
https://www.modelscope.cn/speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k.git
synced 2026-04-02 18:42:52 +08:00
[add]Sichuan model pth
171
README.md
@@ -1,9 +1,170 @@
---
tasks:
- text-to-speech
domain:
- audio
frameworks:
- pytorch
backbone:
- transformer
metrics:
- MOS
license: Apache License 2.0
tags:
- Alibaba
- tts
- hifigan
- sambert
- text-to-speech
- Sichuan
- 16k
widgets:
- task: text-to-speech
  inputs:
  - type: text
    name: input
    title: Text
    validator:
      max_words: 300
  examples:
  - name: 1
    title: Example 1
    inputs:
    - name: input
      data: 北京今天天气怎么样
  inferencespec:
    cpu: 4 # number of CPUs
    memory: 8192
    gpu: 1 # number of GPUs
    gpu_memory: 8192
---
###### This model currently uses the default introduction template and is in the "pre-release" stage; the page is visible only to its owner.
###### Please complete the model card according to the [model contribution guide](https://www.modelscope.cn/docs/%E5%A6%82%E4%BD%95%E6%92%B0%E5%86%99%E5%A5%BD%E7%94%A8%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%8D%A1%E7%89%87). The ModelScope platform will display the model card once it is completed. Thank you for your understanding.
#### Clone with HTTP
```bash
git clone https://www.modelscope.cn/speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k.git
```

# Sambert-Hifigan Model Introduction

## Framework Description

Concatenative synthesis and parametric synthesis are the two main Text-To-Speech (TTS) approaches. Parametric TTS systems have seen wide adoption in recent years, so only the parametric approach is covered here.
A parametric TTS system consists of two major modules: a frontend and a backend.

The frontend covers text normalization, word segmentation, polyphone disambiguation, grapheme-to-phoneme conversion, and prosody prediction; it parses the input text into linguistic features such as phonemes, tones, pauses, and positions.

The backend comprises a duration model, an acoustic model, and a vocoder, and converts the linguistic features into speech. The duration model predicts how long each modeling unit (e.g., a phoneme) lasts; the acoustic model predicts acoustic features from the linguistic features and durations; and the vocoder converts the acoustic features into the speech waveform.
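The three backend stages above can be sketched as a toy pipeline (an illustrative sketch only; the function names and fixed durations are hypothetical stand-ins, not the KAN-TTS API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LinguisticFeature:
    phoneme: str
    tone: int
    pause_after: bool  # prosodic boundary after this phoneme

def duration_model(feats: List[LinguisticFeature]) -> List[int]:
    # Toy stand-in: a real duration model predicts frames from the features.
    return [12 if f.pause_after else 8 for f in feats]

def acoustic_model(feats: List[LinguisticFeature],
                   durations: List[int]) -> List[List[float]]:
    # Toy stand-in: one 80-dim mel frame per predicted frame.
    return [[0.0] * 80 for d in durations for _ in range(d)]

def vocoder(mel_frames: List[List[float]], hop_length: int = 200) -> List[float]:
    # Toy stand-in: each mel frame expands to hop_length waveform samples.
    return [0.0] * (len(mel_frames) * hop_length)

feats = [LinguisticFeature('b', 3, False), LinguisticFeature('ei', 3, True)]
durs = duration_model(feats)       # frames per phoneme
mel = acoustic_model(feats, durs)  # frame-level acoustic features
wav = vocoder(mel)                 # waveform samples (16 kHz, hop 200)
```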

The overall system structure is shown in [Figure 1]:

![System structure](description/tts-system.jpg)

The frontend combines models with rules to handle text flexibly across scenarios, while the backend uses SAM-BERT + HIFIGAN to deliver highly expressive streaming synthesis.

### Acoustic Model: SAM-BERT

In the backend, the acoustic model is our self-developed SAM-BERT, which models duration and acoustics jointly. Its structure is shown in [Figure 2]:

1. The backbone uses a Self-Attention Mechanism (SAM) to strengthen the model's capacity.
2. The encoder is initialized from BERT, injecting more textual information and improving the prosody of the synthesized speech.
3. The Variance Adaptor makes coarse-grained, phoneme-level predictions of the prosody contours (pitch, energy, duration), which the decoder then refines at frame level. Duration prediction also exploits its correlation with pitch and energy and uses an autoregressive structure, further improving the naturalness of the prosody.
4. The decoder is a PNCA AR-Decoder [@li2020robutrans], which naturally supports streaming synthesis.

![SAM-BERT structure](description/sambert.jpg)

### Vocoder: HIFI-GAN

In the backend, the vocoder is HIFI-GAN, which uses discriminators in a GAN framework to guide the training of the vocoder (the generator). Compared with classic autoregressive, sample-by-sample cross-entropy training, this scheme is more natural and has clear advantages in both generation efficiency and quality. Its structure is shown in [Figure 3]:

![HIFI-GAN structure](description/hifigan.jpg)

Building on the open-source HIFI-GAN work [1], we tuned the model structure for the 16k and 48k sampling rates and added both a low-latency streaming mode based on causal convolutions and a chunk-based streaming mode; combined with the acoustic model, these support real-time streaming synthesis on CPU, GPU, and other hardware.
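Chunk streaming can be pictured as vocoding fixed-size spans of mel frames and emitting audio as each span completes (a sketch under the assumption of a causal, chunk-independent vocoder; `stream_synthesis` is a hypothetical name, not the actual API):

```python
from typing import Callable, Iterator, List

def stream_synthesis(mel_frames: List[List[float]],
                     vocode: Callable[[List[List[float]]], List[float]],
                     chunk_frames: int = 20) -> Iterator[List[float]]:
    """Yield waveform chunks as soon as each mel chunk is vocoded."""
    for start in range(0, len(mel_frames), chunk_frames):
        # Causal convolutions need no future context, so each chunk
        # can be converted independently with low latency.
        yield vocode(mel_frames[start:start + chunk_frames])

# Toy vocoder: 200 waveform samples (one hop) per mel frame.
toy_vocode = lambda chunk: [0.0] * (len(chunk) * 200)
mel = [[0.0] * 80 for _ in range(50)]             # 50 mel frames
chunks = list(stream_synthesis(mel, toy_vocode))  # 3 chunks: 20 + 20 + 10 frames
```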

## Usage and Scope

Usage:
* Feed text directly into the pipeline for inference

Scope:
* Suitable for Sichuan-dialect speech synthesis; input text must be UTF-8 encoded and is recommended to stay under 30 characters overall

Target scenarios:
* Speech synthesis tasks in general, e.g., dubbing, virtual anchors, and digital humans

### How to Use

Currently only Linux is supported; Windows and Mac are not supported yet.

Please fine-tune with the [KAN-TTS](https://github.com/AlibabaResearch/KAN-TTS) codebase. For details, see:

[Sambert training tutorial](https://github.com/AlibabaResearch/KAN-TTS/wiki/training_sambert)

[Hifigan training tutorial](https://github.com/AlibabaResearch/KAN-TTS/wiki/training_hifigan)

Training with MaaS-lib is not supported yet; stay tuned.

#### Code Example
```Python
from scipy.io.wavfile import write

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

text = '待合成文本'  # placeholder: the text to synthesize
model_id = 'speech_tts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id, model_revision='v1.0.0')
output = sambert_hifigan_tts(input=text)
# The pipeline returns 16 kHz PCM samples; save them as a WAV file.
pcm = output[OutputKeys.OUTPUT_PCM]
write('output.wav', 16000, pcm)
```

### Model Limitations and Possible Bias

* This voice supports the Sichuan dialect; the text normalization (TN) rules are for Chinese

## Training Data

Trained on about 11.2 hours of data from a single speaker, mainly in the Sichuan dialect.

## Model Training Procedure

The required training data formats are: audio (.wav), text annotation (.txt), and phoneme duration annotation (.interval). Training from random initialization requires more than 2 hours of data; for datasets under 2 hours, initialize the parameters from the multi-speaker pretrained model. Training the AM takes 1-2 days; training the vocoder takes 5-7 days.
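A minimal sanity check that every utterance has its .wav/.txt/.interval triple might look like this (the flat one-directory layout is an assumption for illustration, not the documented KAN-TTS layout):

```python
from pathlib import Path
from typing import List

def find_incomplete_utterances(data_dir: str) -> List[str]:
    """Return ids of .wav files missing their .txt or .interval partner."""
    missing = []
    for wav in Path(data_dir).glob('*.wav'):
        if not (wav.with_suffix('.txt').exists()
                and wav.with_suffix('.interval').exists()):
            missing.append(wav.stem)
    return sorted(missing)
```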

### Preprocessing

Training requires extracting acoustic features (mel spectrograms) from the audio files; phoneme durations are converted from time units to frame counts using the frame length in the config; text annotations are converted to one-hot indices according to the phoneme set, tone classes, and boundary classes in the config.
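The duration conversion follows directly from the audio config (hop_length 200 at a 16 kHz sampling rate, i.e. a 12.5 ms frame shift); a sketch:

```python
def seconds_to_frames(duration_s: float,
                      sampling_rate: int = 16000,
                      hop_length: int = 200) -> int:
    """Convert a phoneme duration in seconds to a frame count."""
    # One frame per hop: 200 samples at 16 kHz = 12.5 ms per frame.
    return round(duration_s * sampling_rate / hop_length)
```

For example, a 0.1 s phoneme spans 8 frames at this frame shift.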

## Citation

If you find this model helpful, please consider citing the related papers:
```BibTeX
@inproceedings{li2020robutrans,
  title={Robutrans: A robust transformer-based text-to-speech model},
  author={Li, Naihan and Liu, Yanqing and Wu, Yu and Liu, Shujie and Zhao, Sheng and Liu, Ming},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={05},
  pages={8228--8235},
  year={2020}
}
```

```BibTeX
@article{devlin2018bert,
  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```

```BibTeX
@article{kong2020hifi,
  title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},
  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  pages={17022--17033},
  year={2020}
}
```

This model draws on the following implementations:
- [1] [ming024's FastSpeech2 Implementation](https://github.com/ming024/FastSpeech2)
- [2] [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
- [3] [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [4] [mozilla/TTS](https://github.com/mozilla/TTS)
- [5] [espnet/espnet](https://github.com/espnet/espnet)
BIN
basemodel_16k/hifigan/ckpt/checkpoint_340000.pth
(Stored with Git LFS)
Normal file
Binary file not shown.
131
basemodel_16k/hifigan/config.yaml
Normal file
@@ -0,0 +1,131 @@
Loss:
  discriminator_adv_loss:
    enable: true
    params: {average_by_discriminators: false}
    weights: 1.0
  feat_match_loss:
    enable: true
    params: {average_by_discriminators: false, average_by_layers: false}
    weights: 2.0
  generator_adv_loss:
    enable: true
    params: {average_by_discriminators: false}
    weights: 1.0
  mel_loss:
    enable: true
    params: {fft_size: 2048, fmax: 8000, fmin: 0, fs: 16000, hop_size: 200, log_base: null,
      num_mels: 80, win_length: 1000, window: hann}
    weights: 45.0
  stft_loss: {enable: false}
  subband_stft_loss:
    enable: false
    params:
      fft_sizes: [384, 683, 171]
      hop_sizes: [35, 75, 15]
      win_lengths: [150, 300, 60]
      window: hann_window
Model:
  Generator:
    optimizer:
      params:
        betas: [0.5, 0.9]
        lr: 0.0002
        weight_decay: 0.0
      type: Adam
    params:
      bias: true
      causal: false
      channels: 256
      in_channels: 80
      kernel_size: 7
      nonlinear_activation: LeakyReLU
      nonlinear_activation_params: {negative_slope: 0.1}
      out_channels: 1
      resblock_dilations:
      - [1, 3, 5, 7]
      - [1, 3, 5, 7]
      - [1, 3, 5, 7]
      resblock_kernel_sizes: [3, 7, 11]
      upsample_kernal_sizes: [20, 11, 4, 4]
      upsample_scales: [10, 5, 2, 2]
      use_weight_norm: true
    scheduler:
      params:
        gamma: 0.5
        milestones: [200000, 400000, 600000, 800000]
      type: MultiStepLR
  MultiPeriodDiscriminator:
    optimizer:
      params:
        betas: [0.5, 0.9]
        lr: 0.0002
        weight_decay: 0.0
      type: Adam
    params:
      discriminator_params:
        bias: true
        channels: 32
        downsample_scales: [3, 3, 3, 3, 1]
        in_channels: 1
        kernel_sizes: [5, 3]
        max_downsample_channels: 1024
        nonlinear_activation: LeakyReLU
        nonlinear_activation_params: {negative_slope: 0.1}
        out_channels: 1
        use_spectral_norm: false
      periods: [2, 3, 5, 7, 11]
    scheduler:
      params:
        gamma: 0.5
        milestones: [200000, 400000, 600000, 800000]
      type: MultiStepLR
  MultiScaleDiscriminator:
    optimizer:
      params:
        betas: [0.5, 0.9]
        lr: 0.0002
        weight_decay: 0.0
      type: Adam
    params:
      discriminator_params:
        bias: true
        channels: 128
        downsample_scales: [4, 4, 4, 4, 1]
        in_channels: 1
        kernel_sizes: [15, 41, 5, 3]
        max_downsample_channels: 1024
        max_groups: 16
        nonlinear_activation: LeakyReLU
        nonlinear_activation_params: {negative_slope: 0.1}
        out_channels: 1
      downsample_pooling: DWT
      downsample_pooling_params: {kernel_size: 4, padding: 2, stride: 2}
      follow_official_norm: true
      scales: 3
    scheduler:
      params:
        gamma: 0.5
        milestones: [200000, 400000, 600000, 800000]
      type: MultiStepLR
allow_cache: true
audio_config: {fmax: 8000.0, fmin: 0.0, hop_length: 200, max_norm: 1.0, min_level_db: -100.0,
  n_fft: 2048, n_mels: 80, norm_type: mean_std, num_workers: 16, phone_level_feature: true,
  preemphasize: false, ref_level_db: 20, sampling_rate: 16000, symmetric: false, trim_silence: true,
  trim_silence_threshold_db: 60, wav_normalize: true, win_length: 1000}
batch_max_steps: 9600
batch_size: 16
create_time: '2022-12-26 11:11:35'
discriminator_grad_norm: -1
discriminator_train_start_steps: 0
eval_interval_steps: 10000
generator_grad_norm: -1
generator_train_start_steps: 1
git_revision_hash: 388243c0c173756d1eb34783c02cec4c302cdc25
log_interval_steps: 1000
model_type: hifigan
num_save_intermediate_results: 4
num_workers: 2
pin_memory: true
remove_short_samples: false
save_interval_steps: 20000
train_max_steps: 2500000
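One invariant worth noting in the config above: the generator's `upsample_scales` must multiply to the `hop_length` in `audio_config`, so that one mel frame expands to exactly one hop of waveform samples. A quick check of that arithmetic:

```python
from math import prod

upsample_scales = [10, 5, 2, 2]  # Generator params above
hop_length = 200                 # audio_config above
assert prod(upsample_scales) == hop_length  # each mel frame -> 200 samples
```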
BIN
basemodel_16k/sambert/ckpt/checkpoint_980000.pth
(Stored with Git LFS)
Normal file
Binary file not shown.
79
basemodel_16k/sambert/config.yaml
Normal file
@@ -0,0 +1,79 @@
Loss:
  MelReconLoss:
    enable: true
    params: {loss_type: mae}
  ProsodyReconLoss:
    enable: true
    params: {loss_type: mae}
Model:
  KanTtsSAMBERT:
    optimizer:
      params:
        betas: [0.9, 0.98]
        eps: 1.0e-09
        lr: 0.001
        weight_decay: 0.0
      type: Adam
    params:
      MAS: false
      decoder_attention_dropout: 0.1
      decoder_dropout: 0.1
      decoder_ffn_inner_dim: 1024
      decoder_num_heads: 8
      decoder_num_layers: 12
      decoder_num_units: 128
      decoder_prenet_units: [256, 256]
      decoder_relu_dropout: 0.1
      dur_pred_lstm_units: 128
      dur_pred_prenet_units: [128, 128]
      embedding_dim: 512
      emotion_units: 32
      encoder_attention_dropout: 0.1
      encoder_dropout: 0.1
      encoder_ffn_inner_dim: 1024
      encoder_num_heads: 8
      encoder_num_layers: 8
      encoder_num_units: 128
      encoder_projection_units: 32
      encoder_relu_dropout: 0.1
      max_len: 800
      num_mels: 80
      outputs_per_step: 3
      postnet_dropout: 0.1
      postnet_ffn_inner_dim: 512
      postnet_filter_size: 41
      postnet_fsmn_num_layers: 4
      postnet_lstm_units: 128
      postnet_num_memory_units: 256
      postnet_shift: 17
      predictor_dropout: 0.1
      predictor_ffn_inner_dim: 256
      predictor_filter_size: 41
      predictor_fsmn_num_layers: 3
      predictor_lstm_units: 128
      predictor_num_memory_units: 128
      predictor_shift: 0
      speaker_units: 32
    scheduler:
      params: {warmup_steps: 4000}
      type: NoamLR
allow_cache: true
audio_config: {fmax: 8000.0, fmin: 0.0, hop_length: 200, max_norm: 1.0, min_level_db: -100.0,
  n_fft: 2048, n_mels: 80, norm_type: mean_std, num_workers: 16, phone_level_feature: true,
  preemphasize: false, ref_level_db: 20, sampling_rate: 16000, symmetric: false, trim_silence: true,
  trim_silence_threshold_db: 60, wav_normalize: true, win_length: 1000}
batch_size: 32
create_time: '2022-12-26 11:05:43'
eval_interval_steps: 10000
git_revision_hash: 388243c0c173756d1eb34783c02cec4c302cdc25
grad_norm: 1.0
linguistic_unit: {cleaners: english_cleaners, language: Sichuan, lfeat_type_list: 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category',
  speaker_list: xiaoyue}
log_interval_steps: 1000
model_type: sambert
num_save_intermediate_results: 4
num_workers: 4
pin_memory: false
remove_short_samples: false
save_interval_steps: 20000
train_max_steps: 1000000
129
configuration.json
Normal file
@@ -0,0 +1,129 @@
{
    "framework": "Tensorflow",
    "task" : "text-to-speech",
    "model" : {
        "type" : "sambert-hifigan",
        "lang_type" : "zhcn",
        "sample_rate" : 16000,
        "am": {
            "am": {
                "max_len": 800,

                "embedding_dim": 512,
                "encoder_num_layers": 8,
                "encoder_num_heads": 8,
                "encoder_num_units": 128,
                "encoder_ffn_inner_dim": 1024,
                "encoder_dropout": 0.1,
                "encoder_attention_dropout": 0.1,
                "encoder_relu_dropout": 0.1,
                "encoder_projection_units": 32,

                "speaker_units": 32,
                "emotion_units": 32,

                "predictor_filter_size": 41,
                "predictor_fsmn_num_layers": 3,
                "predictor_num_memory_units": 128,
                "predictor_ffn_inner_dim": 256,
                "predictor_dropout": 0.1,
                "predictor_shift": 0,
                "predictor_lstm_units": 128,
                "dur_pred_prenet_units": [128, 128],
                "dur_pred_lstm_units": 128,

                "decoder_prenet_units": [256, 256],
                "decoder_num_layers": 12,
                "decoder_num_heads": 8,
                "decoder_num_units": 128,
                "decoder_ffn_inner_dim": 1024,
                "decoder_dropout": 0.1,
                "decoder_attention_dropout": 0.1,
                "decoder_relu_dropout": 0.1,

                "outputs_per_step": 3,
                "num_mels": 80,

                "postnet_filter_size": 41,
                "postnet_fsmn_num_layers": 4,
                "postnet_num_memory_units": 256,
                "postnet_ffn_inner_dim": 512,
                "postnet_dropout": 0.1,
                "postnet_shift": 17,
                "postnet_lstm_units": 128
            },

            "audio": {
                "frame_shift_ms": 12.5
            },

            "linguistic_unit": {
                "cleaners": "english_cleaners",
                "lfeat_type_list": "sy,tone,syllable_flag,word_segment,emo_category,speaker_category",
                "sy": "dict/sy_dict.txt",
                "tone": "dict/tone_dict.txt",
                "syllable_flag": "dict/syllable_flag_dict.txt",
                "word_segment": "dict/word_segment_dict.txt",
                "emo_category": "dict/emo_category_dict.txt",
                "speaker_category": "dict/speaker_dict.txt"
            },

            "num_gpus": 1,
            "batch_size": 32,
            "group_size": 1024,
            "learning_rate": 0.001,
            "adam_b1": 0.9,
            "adam_b2": 0.98,
            "seed": 1234,

            "num_workers": 4,

            "dist_config": {
                "dist_backend": "nccl",
                "dist_url": "tcp://localhost:11111",
                "world_size": 1
            }

        },
        "vocoder" : {
            "resblock": "1",
            "num_gpus": 1,
            "batch_size": 16,
            "learning_rate": 0.0002,
            "adam_b1": 0.8,
            "adam_b2": 0.99,
            "lr_decay": 0.999,
            "seed": 1234,

            "upsample_rates": [10,5,2,2],
            "upsample_kernel_sizes": [20,10,4,4],
            "upsample_initial_channel": 256,
            "resblock_kernel_sizes": [3,7,11],
            "resblock_dilation_sizes": [[1,3,5,7], [1,3,5,7], [1,3,5,7]],

            "segment_size": 6400,
            "num_mels": 80,
            "num_freq": 1025,
            "n_fft": 2048,
            "hop_size": 200,
            "win_size": 1000,

            "sampling_rate": 16000,

            "fmin": 0,
            "fmax": 8000,
            "fmax_for_loss": null,

            "num_workers": 4,

            "dist_config": {
                "dist_backend": "nccl",
                "dist_url": "tcp://localhost:54312",
                "world_size": 1
            }
        }
    },
    "pipeline": {
        "type": "sambert-hifigan-tts"
    }
}
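The vocoder settings above are mutually consistent: `frame_shift_ms` in the am audio block equals `hop_size / sampling_rate`. A quick check of that arithmetic:

```python
sampling_rate = 16000  # "sampling_rate" above
hop_size = 200         # "hop_size" above
win_size = 1000        # "win_size" above
frame_shift_ms = 1000 * hop_size / sampling_rate
window_ms = 1000 * win_size / sampling_rate
# The 12.5 ms frame shift matches "frame_shift_ms": 12.5 in the am audio
# block; the analysis window is 62.5 ms.
```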
BIN
description/hifigan.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 51 KiB
BIN
description/sambert.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 67 KiB
BIN
description/tts-system.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 64 KiB
BIN
resource.zip
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
voices.zip
(Stored with Git LFS)
Normal file
Binary file not shown.