mirror of
https://www.modelscope.cn/speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k.git
synced 2026-04-02 18:42:52 +08:00
[add]Sichuan model pth
171
README.md
@@ -1,9 +1,170 @@
---
tasks:
- text-to-speech
domain:
- audio
frameworks:
- pytorch
backbone:
- transformer
metrics:
- MOS
license: Apache License 2.0
tags:
- Alibaba
- tts
- hifigan
- sambert
- text-to-speech
- Sichuan
- 16k
widgets:
- task: text-to-speech
  inputs:
  - type: text
    name: input
    title: Text
    validator:
      max_words: 300
  examples:
  - name: 1
    title: Example 1
    inputs:
    - name: input
      data: 北京今天天气怎么样
  inferencespec:
    cpu: 4 # number of CPUs
    memory: 8192
    gpu: 1 # number of GPUs
    gpu_memory: 8192
---
###### This model currently uses the default introduction template and is in the "pre-release" stage; the page is visible only to its owner.
###### Please complete the model card according to the [model contribution guide](https://www.modelscope.cn/docs/%E5%A6%82%E4%BD%95%E6%92%B0%E5%86%99%E5%A5%BD%E7%94%A8%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%8D%A1%E7%89%87). The ModelScope platform will display the model card once it is completed. Thank you for your understanding.
#### Clone with HTTP
```bash
git clone https://www.modelscope.cn/speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k.git
```

# Sambert-Hifigan Model Introduction

## Framework Description

Concatenative synthesis and parametric synthesis are the two main Text-To-Speech (TTS) approaches. Parametric TTS systems have seen wide adoption in recent years, so only the parametric approach is covered here.
A parametric TTS system consists of two major modules: a frontend and a backend.

The frontend covers text normalization, word segmentation, polyphone disambiguation, grapheme-to-phoneme conversion, and prosody prediction; it parses the input text into linguistic features such as phonemes, tones, pauses, and positions.

The backend comprises a duration model, an acoustic model, and a vocoder, and converts the linguistic features into speech. The duration model predicts how long each modeling unit (e.g., a phoneme) lasts; the acoustic model predicts acoustic features from the linguistic features and durations; and the vocoder converts the acoustic features into the speech waveform.
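The three backend stages above can be sketched as a toy pipeline (an illustrative sketch only; the function names and fixed durations are hypothetical stand-ins, not the KAN-TTS API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LinguisticFeature:
    phoneme: str
    tone: int
    pause_after: bool  # prosodic boundary after this phoneme

def duration_model(feats: List[LinguisticFeature]) -> List[int]:
    # Toy stand-in: a real duration model predicts frames from the features.
    return [12 if f.pause_after else 8 for f in feats]

def acoustic_model(feats: List[LinguisticFeature],
                   durations: List[int]) -> List[List[float]]:
    # Toy stand-in: one 80-dim mel frame per predicted frame.
    return [[0.0] * 80 for d in durations for _ in range(d)]

def vocoder(mel_frames: List[List[float]], hop_length: int = 200) -> List[float]:
    # Toy stand-in: each mel frame expands to hop_length waveform samples.
    return [0.0] * (len(mel_frames) * hop_length)

feats = [LinguisticFeature('b', 3, False), LinguisticFeature('ei', 3, True)]
durs = duration_model(feats)       # frames per phoneme
mel = acoustic_model(feats, durs)  # frame-level acoustic features
wav = vocoder(mel)                 # waveform samples (16 kHz, hop 200)
```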

The overall system structure is shown in [Figure 1]:

![System structure](description/tts-system.jpg)

The frontend combines models with rules to handle text flexibly across scenarios, while the backend uses SAM-BERT + HIFIGAN to deliver highly expressive streaming synthesis.

### Acoustic Model: SAM-BERT

In the backend, the acoustic model is our self-developed SAM-BERT, which models duration and acoustics jointly. Its structure is shown in [Figure 2]:

1. The backbone uses a Self-Attention Mechanism (SAM) to strengthen the model's capacity.
2. The encoder is initialized from BERT, injecting more textual information and improving the prosody of the synthesized speech.
3. The Variance Adaptor makes coarse-grained, phoneme-level predictions of the prosody contours (pitch, energy, duration), which the decoder then refines at frame level. Duration prediction also exploits its correlation with pitch and energy and uses an autoregressive structure, further improving the naturalness of the prosody.
4. The decoder is a PNCA AR-Decoder [@li2020robutrans], which naturally supports streaming synthesis.

![SAM-BERT structure](description/sambert.jpg)

### Vocoder: HIFI-GAN

In the backend, the vocoder is HIFI-GAN, which uses discriminators in a GAN framework to guide the training of the vocoder (the generator). Compared with classic autoregressive, sample-by-sample cross-entropy training, this scheme is more natural and has clear advantages in both generation efficiency and quality. Its structure is shown in [Figure 3]:

![HIFI-GAN structure](description/hifigan.jpg)

Building on the open-source HIFI-GAN work [1], we tuned the model structure for the 16k and 48k sampling rates and added both a low-latency streaming mode based on causal convolutions and a chunk-based streaming mode; combined with the acoustic model, these support real-time streaming synthesis on CPU, GPU, and other hardware.
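Chunk streaming can be pictured as vocoding fixed-size spans of mel frames and emitting audio as each span completes (a sketch under the assumption of a causal, chunk-independent vocoder; `stream_synthesis` is a hypothetical name, not the actual API):

```python
from typing import Callable, Iterator, List

def stream_synthesis(mel_frames: List[List[float]],
                     vocode: Callable[[List[List[float]]], List[float]],
                     chunk_frames: int = 20) -> Iterator[List[float]]:
    """Yield waveform chunks as soon as each mel chunk is vocoded."""
    for start in range(0, len(mel_frames), chunk_frames):
        # Causal convolutions need no future context, so each chunk
        # can be converted independently with low latency.
        yield vocode(mel_frames[start:start + chunk_frames])

# Toy vocoder: 200 waveform samples (one hop) per mel frame.
toy_vocode = lambda chunk: [0.0] * (len(chunk) * 200)
mel = [[0.0] * 80 for _ in range(50)]             # 50 mel frames
chunks = list(stream_synthesis(mel, toy_vocode))  # 3 chunks: 20 + 20 + 10 frames
```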

## Usage and Scope

Usage:
* Feed text directly into the pipeline for inference

Scope:
* Suitable for Sichuan-dialect speech synthesis; input text must be UTF-8 encoded and is recommended to stay under 30 characters overall

Target scenarios:
* Speech synthesis tasks in general, e.g., dubbing, virtual anchors, and digital humans

### How to Use

Currently only Linux is supported; Windows and Mac are not supported yet.

Please fine-tune with the [KAN-TTS](https://github.com/AlibabaResearch/KAN-TTS) codebase. For details, see:

[Sambert training tutorial](https://github.com/AlibabaResearch/KAN-TTS/wiki/training_sambert)

[Hifigan training tutorial](https://github.com/AlibabaResearch/KAN-TTS/wiki/training_hifigan)

Training with MaaS-lib is not supported yet; stay tuned.

#### Code Example
```Python
from scipy.io.wavfile import write

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

text = '待合成文本'  # placeholder: the text to synthesize
model_id = 'speech_tts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id, model_revision='v1.0.0')
output = sambert_hifigan_tts(input=text)
# The pipeline returns 16 kHz PCM samples; save them as a WAV file.
pcm = output[OutputKeys.OUTPUT_PCM]
write('output.wav', 16000, pcm)
```

### Model Limitations and Possible Bias

* This voice supports the Sichuan dialect; the text normalization (TN) rules are for Chinese

## Training Data

Trained on about 11.2 hours of data from a single speaker, mainly in the Sichuan dialect.

## Model Training Procedure

The required training data formats are: audio (.wav), text annotation (.txt), and phoneme duration annotation (.interval). Training from random initialization requires more than 2 hours of data; for datasets under 2 hours, initialize the parameters from the multi-speaker pretrained model. Training the AM takes 1-2 days; training the vocoder takes 5-7 days.
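A minimal sanity check that every utterance has its .wav/.txt/.interval triple might look like this (the flat one-directory layout is an assumption for illustration, not the documented KAN-TTS layout):

```python
from pathlib import Path
from typing import List

def find_incomplete_utterances(data_dir: str) -> List[str]:
    """Return ids of .wav files missing their .txt or .interval partner."""
    missing = []
    for wav in Path(data_dir).glob('*.wav'):
        if not (wav.with_suffix('.txt').exists()
                and wav.with_suffix('.interval').exists()):
            missing.append(wav.stem)
    return sorted(missing)
```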

### Preprocessing

Training requires extracting acoustic features (mel spectrograms) from the audio files; phoneme durations are converted from time units to frame counts using the frame length in the config; text annotations are converted to one-hot indices according to the phoneme set, tone classes, and boundary classes in the config.
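The duration conversion follows directly from the audio config (hop_length 200 at a 16 kHz sampling rate, i.e. a 12.5 ms frame shift); a sketch:

```python
def seconds_to_frames(duration_s: float,
                      sampling_rate: int = 16000,
                      hop_length: int = 200) -> int:
    """Convert a phoneme duration in seconds to a frame count."""
    # One frame per hop: 200 samples at 16 kHz = 12.5 ms per frame.
    return round(duration_s * sampling_rate / hop_length)
```

For example, a 0.1 s phoneme spans 8 frames at this frame shift.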

## Citation

If you find this model helpful, please consider citing the related papers:
```BibTeX
@inproceedings{li2020robutrans,
  title={Robutrans: A robust transformer-based text-to-speech model},
  author={Li, Naihan and Liu, Yanqing and Wu, Yu and Liu, Shujie and Zhao, Sheng and Liu, Ming},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={05},
  pages={8228--8235},
  year={2020}
}
```

```BibTeX
@article{devlin2018bert,
  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```

```BibTeX
@article{kong2020hifi,
  title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},
  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  pages={17022--17033},
  year={2020}
}
```

This model draws on the following implementations:
- [1] [ming024's FastSpeech2 Implementation](https://github.com/ming024/FastSpeech2)
- [2] [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
- [3] [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [4] [mozilla/TTS](https://github.com/mozilla/TTS)
- [5] [espnet/espnet](https://github.com/espnet/espnet)
BIN
basemodel_16k/hifigan/ckpt/checkpoint_340000.pth
(Stored with Git LFS)
Normal file
Binary file not shown.
131
basemodel_16k/hifigan/config.yaml
Normal file
@@ -0,0 +1,131 @@
Loss:
  discriminator_adv_loss:
    enable: true
    params: {average_by_discriminators: false}
    weights: 1.0
  feat_match_loss:
    enable: true
    params: {average_by_discriminators: false, average_by_layers: false}
    weights: 2.0
  generator_adv_loss:
    enable: true
    params: {average_by_discriminators: false}
    weights: 1.0
  mel_loss:
    enable: true
    params: {fft_size: 2048, fmax: 8000, fmin: 0, fs: 16000, hop_size: 200, log_base: null,
      num_mels: 80, win_length: 1000, window: hann}
    weights: 45.0
  stft_loss: {enable: false}
  subband_stft_loss:
    enable: false
    params:
      fft_sizes: [384, 683, 171]
      hop_sizes: [35, 75, 15]
      win_lengths: [150, 300, 60]
      window: hann_window
Model:
  Generator:
    optimizer:
      params:
        betas: [0.5, 0.9]
        lr: 0.0002
        weight_decay: 0.0
      type: Adam
    params:
      bias: true
      causal: false
      channels: 256
      in_channels: 80
      kernel_size: 7
      nonlinear_activation: LeakyReLU
      nonlinear_activation_params: {negative_slope: 0.1}
      out_channels: 1
      resblock_dilations:
      - [1, 3, 5, 7]
      - [1, 3, 5, 7]
      - [1, 3, 5, 7]
      resblock_kernel_sizes: [3, 7, 11]
      upsample_kernal_sizes: [20, 11, 4, 4]
      upsample_scales: [10, 5, 2, 2]
      use_weight_norm: true
    scheduler:
      params:
        gamma: 0.5
        milestones: [200000, 400000, 600000, 800000]
      type: MultiStepLR
  MultiPeriodDiscriminator:
    optimizer:
      params:
        betas: [0.5, 0.9]
        lr: 0.0002
        weight_decay: 0.0
      type: Adam
    params:
      discriminator_params:
        bias: true
        channels: 32
        downsample_scales: [3, 3, 3, 3, 1]
        in_channels: 1
        kernel_sizes: [5, 3]
        max_downsample_channels: 1024
        nonlinear_activation: LeakyReLU
        nonlinear_activation_params: {negative_slope: 0.1}
        out_channels: 1
        use_spectral_norm: false
      periods: [2, 3, 5, 7, 11]
    scheduler:
      params:
        gamma: 0.5
        milestones: [200000, 400000, 600000, 800000]
      type: MultiStepLR
  MultiScaleDiscriminator:
    optimizer:
      params:
        betas: [0.5, 0.9]
        lr: 0.0002
        weight_decay: 0.0
      type: Adam
    params:
      discriminator_params:
        bias: true
        channels: 128
        downsample_scales: [4, 4, 4, 4, 1]
        in_channels: 1
        kernel_sizes: [15, 41, 5, 3]
        max_downsample_channels: 1024
        max_groups: 16
        nonlinear_activation: LeakyReLU
        nonlinear_activation_params: {negative_slope: 0.1}
        out_channels: 1
      downsample_pooling: DWT
      downsample_pooling_params: {kernel_size: 4, padding: 2, stride: 2}
      follow_official_norm: true
      scales: 3
    scheduler:
      params:
        gamma: 0.5
        milestones: [200000, 400000, 600000, 800000]
      type: MultiStepLR
allow_cache: true
audio_config: {fmax: 8000.0, fmin: 0.0, hop_length: 200, max_norm: 1.0, min_level_db: -100.0,
  n_fft: 2048, n_mels: 80, norm_type: mean_std, num_workers: 16, phone_level_feature: true,
  preemphasize: false, ref_level_db: 20, sampling_rate: 16000, symmetric: false, trim_silence: true,
  trim_silence_threshold_db: 60, wav_normalize: true, win_length: 1000}
batch_max_steps: 9600
batch_size: 16
create_time: '2022-12-26 11:11:35'
discriminator_grad_norm: -1
discriminator_train_start_steps: 0
eval_interval_steps: 10000
generator_grad_norm: -1
generator_train_start_steps: 1
git_revision_hash: 388243c0c173756d1eb34783c02cec4c302cdc25
log_interval_steps: 1000
model_type: hifigan
num_save_intermediate_results: 4
num_workers: 2
pin_memory: true
remove_short_samples: false
save_interval_steps: 20000
train_max_steps: 2500000
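One invariant worth noting in the config above: the generator's `upsample_scales` must multiply to the `hop_length` in `audio_config`, so that one mel frame expands to exactly one hop of waveform samples. A quick check of that arithmetic:

```python
from math import prod

upsample_scales = [10, 5, 2, 2]  # Generator params above
hop_length = 200                 # audio_config above
assert prod(upsample_scales) == hop_length  # each mel frame -> 200 samples
```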
BIN
basemodel_16k/sambert/ckpt/checkpoint_980000.pth
(Stored with Git LFS)
Normal file
Binary file not shown.
79
basemodel_16k/sambert/config.yaml
Normal file
@@ -0,0 +1,79 @@
Loss:
  MelReconLoss:
    enable: true
    params: {loss_type: mae}
  ProsodyReconLoss:
    enable: true
    params: {loss_type: mae}
Model:
  KanTtsSAMBERT:
    optimizer:
      params:
        betas: [0.9, 0.98]
        eps: 1.0e-09
        lr: 0.001
        weight_decay: 0.0
      type: Adam
    params:
      MAS: false
      decoder_attention_dropout: 0.1
      decoder_dropout: 0.1
      decoder_ffn_inner_dim: 1024
      decoder_num_heads: 8
      decoder_num_layers: 12
      decoder_num_units: 128
      decoder_prenet_units: [256, 256]
      decoder_relu_dropout: 0.1
      dur_pred_lstm_units: 128
      dur_pred_prenet_units: [128, 128]
      embedding_dim: 512
      emotion_units: 32
      encoder_attention_dropout: 0.1
      encoder_dropout: 0.1
      encoder_ffn_inner_dim: 1024
      encoder_num_heads: 8
      encoder_num_layers: 8
      encoder_num_units: 128
      encoder_projection_units: 32
      encoder_relu_dropout: 0.1
      max_len: 800
      num_mels: 80
      outputs_per_step: 3
      postnet_dropout: 0.1
      postnet_ffn_inner_dim: 512
      postnet_filter_size: 41
      postnet_fsmn_num_layers: 4
      postnet_lstm_units: 128
      postnet_num_memory_units: 256
      postnet_shift: 17
      predictor_dropout: 0.1
      predictor_ffn_inner_dim: 256
      predictor_filter_size: 41
      predictor_fsmn_num_layers: 3
      predictor_lstm_units: 128
      predictor_num_memory_units: 128
      predictor_shift: 0
      speaker_units: 32
    scheduler:
      params: {warmup_steps: 4000}
      type: NoamLR
allow_cache: true
audio_config: {fmax: 8000.0, fmin: 0.0, hop_length: 200, max_norm: 1.0, min_level_db: -100.0,
  n_fft: 2048, n_mels: 80, norm_type: mean_std, num_workers: 16, phone_level_feature: true,
  preemphasize: false, ref_level_db: 20, sampling_rate: 16000, symmetric: false, trim_silence: true,
  trim_silence_threshold_db: 60, wav_normalize: true, win_length: 1000}
batch_size: 32
create_time: '2022-12-26 11:05:43'
eval_interval_steps: 10000
git_revision_hash: 388243c0c173756d1eb34783c02cec4c302cdc25
grad_norm: 1.0
linguistic_unit: {cleaners: english_cleaners, language: Sichuan, lfeat_type_list: 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category',
  speaker_list: xiaoyue}
log_interval_steps: 1000
model_type: sambert
num_save_intermediate_results: 4
num_workers: 4
pin_memory: false
remove_short_samples: false
save_interval_steps: 20000
train_max_steps: 1000000
129
configuration.json
Normal file
@@ -0,0 +1,129 @@
{
    "framework": "Tensorflow",
    "task" : "text-to-speech",
    "model" : {
        "type" : "sambert-hifigan",
        "lang_type" : "zhcn",
        "sample_rate" : 16000,
        "am": {
            "am": {
                "max_len": 800,

                "embedding_dim": 512,
                "encoder_num_layers": 8,
                "encoder_num_heads": 8,
                "encoder_num_units": 128,
                "encoder_ffn_inner_dim": 1024,
                "encoder_dropout": 0.1,
                "encoder_attention_dropout": 0.1,
                "encoder_relu_dropout": 0.1,
                "encoder_projection_units": 32,

                "speaker_units": 32,
                "emotion_units": 32,

                "predictor_filter_size": 41,
                "predictor_fsmn_num_layers": 3,
                "predictor_num_memory_units": 128,
                "predictor_ffn_inner_dim": 256,
                "predictor_dropout": 0.1,
                "predictor_shift": 0,
                "predictor_lstm_units": 128,
                "dur_pred_prenet_units": [128, 128],
                "dur_pred_lstm_units": 128,

                "decoder_prenet_units": [256, 256],
                "decoder_num_layers": 12,
                "decoder_num_heads": 8,
                "decoder_num_units": 128,
                "decoder_ffn_inner_dim": 1024,
                "decoder_dropout": 0.1,
                "decoder_attention_dropout": 0.1,
                "decoder_relu_dropout": 0.1,

                "outputs_per_step": 3,
                "num_mels": 80,

                "postnet_filter_size": 41,
                "postnet_fsmn_num_layers": 4,
                "postnet_num_memory_units": 256,
                "postnet_ffn_inner_dim": 512,
                "postnet_dropout": 0.1,
                "postnet_shift": 17,
                "postnet_lstm_units": 128
            },

            "audio": {
                "frame_shift_ms": 12.5
            },

            "linguistic_unit": {
                "cleaners": "english_cleaners",
                "lfeat_type_list": "sy,tone,syllable_flag,word_segment,emo_category,speaker_category",
                "sy": "dict/sy_dict.txt",
                "tone": "dict/tone_dict.txt",
                "syllable_flag": "dict/syllable_flag_dict.txt",
                "word_segment": "dict/word_segment_dict.txt",
                "emo_category": "dict/emo_category_dict.txt",
                "speaker_category": "dict/speaker_dict.txt"
            },

            "num_gpus": 1,
            "batch_size": 32,
            "group_size": 1024,
            "learning_rate": 0.001,
            "adam_b1": 0.9,
            "adam_b2": 0.98,
            "seed": 1234,

            "num_workers": 4,

            "dist_config": {
                "dist_backend": "nccl",
                "dist_url": "tcp://localhost:11111",
                "world_size": 1
            }

        },
        "vocoder" : {
            "resblock": "1",
            "num_gpus": 1,
            "batch_size": 16,
            "learning_rate": 0.0002,
            "adam_b1": 0.8,
            "adam_b2": 0.99,
            "lr_decay": 0.999,
            "seed": 1234,

            "upsample_rates": [10,5,2,2],
            "upsample_kernel_sizes": [20,10,4,4],
            "upsample_initial_channel": 256,
            "resblock_kernel_sizes": [3,7,11],
            "resblock_dilation_sizes": [[1,3,5,7], [1,3,5,7], [1,3,5,7]],

            "segment_size": 6400,
            "num_mels": 80,
            "num_freq": 1025,
            "n_fft": 2048,
            "hop_size": 200,
            "win_size": 1000,

            "sampling_rate": 16000,

            "fmin": 0,
            "fmax": 8000,
            "fmax_for_loss": null,

            "num_workers": 4,

            "dist_config": {
                "dist_backend": "nccl",
                "dist_url": "tcp://localhost:54312",
                "world_size": 1
            }
        }
    },
    "pipeline": {
        "type": "sambert-hifigan-tts"
    }
}
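The vocoder settings above are mutually consistent: `frame_shift_ms` in the am audio block equals `hop_size / sampling_rate`. A quick check of that arithmetic:

```python
sampling_rate = 16000  # "sampling_rate" above
hop_size = 200         # "hop_size" above
win_size = 1000        # "win_size" above
frame_shift_ms = 1000 * hop_size / sampling_rate
window_ms = 1000 * win_size / sampling_rate
# The 12.5 ms frame shift matches "frame_shift_ms": 12.5 in the am audio
# block; the analysis window is 62.5 ms.
```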
BIN
description/hifigan.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 51 KiB
BIN
description/sambert.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 67 KiB
BIN
description/tts-system.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 64 KiB
BIN
resource.zip
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
voices.zip
(Stored with Git LFS)
Normal file
Binary file not shown.