mirror of
https://www.modelscope.cn/FunAudioLLM/Fun-CosyVoice3-0.5B-2512.git
synced 2026-04-02 23:12:53 +08:00
update
44
README.md
@@ -1,8 +1,3 @@
----
-frameworks:
-- ""
-tasks: []
----
 [](https://github.com/Akshay090/svg-banners)
 
 ## 👉🏻 CosyVoice 👈🏻
@@ -17,7 +12,7 @@ tasks: []
 
 **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
 ### Key Features
-- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shan1dong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
+- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
 - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
 - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
 - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
@@ -65,23 +60,25 @@ tasks: []
 - [x] Fastapi server and client
 
 ## Evaluation
-| Model | CER (%) ↓ (test-zh) | WER (%) ↓ (test-en) | CER (%) ↓ (test-hard) |
-|-----|------------------|------------------|------------------|
-| Human | 1.26 | 2.14 | - |
-| F5-TTS | 1.53 | 2.00 | 8.67 |
-| SparkTTS | 1.20 | 1.98 | - |
-| Seed-TTS | 1.12 | 2.25 | 7.59 |
-| CosyVoice2 | 1.45 | 2.57 | 6.83 |
-| FireRedTTS-2 | 1.14 | 1.95 | - |
-| IndexTTS2 | 1.01 | 1.52 | 7.12 |
-| VibeVoice | 1.16 | 3.04 | - |
-| HiggsAudio | 1.79 | 2.44 | - |
-| MiniMax-Speech | 0.83 | 1.65 | - |
-| VoxPCM | 0.93 | 1.85 | 8.87 |
-| GLM-TTS | 1.03 | - | - |
-| GLM-TTS_RL | 0.89 | - | - |
-| Fun-CosyVoice3-0.5B-2512 | 1.21 | 2.24 | 6.71 |
-| Fun-CosyVoice3-0.5B-2512_RL | 0.81 | 1.68 | 5.44 |
+
+| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
+| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
+| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
+| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
+| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
+| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
+| FireRedTTS 2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
+| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
+| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
+| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
+| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
+| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
+| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
+| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
+| Fun-CosyVoice3-0.5B | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
+| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
 
 
 ## Install
@@ -252,4 +249,3 @@ You can also scan the QR code to join our official Dingding chat group.
 
 ## Disclaimer
 The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
-