From 60ce579dc25bfe50222f6eac323b7cb0ea00315e Mon Sep 17 00:00:00 2001
From: "lyuxiang.lx"
Date: Mon, 15 Dec 2025 06:49:33 +0000
Subject: [PATCH] update

---
 README.md | 44 ++++++++++++++++++++------------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 175c08f..6b81b5f 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,3 @@
----
-frameworks:
-- ""
-tasks: []
----
 [![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)](https://github.com/Akshay090/svg-banners)
 
 ## 👉🏻 CosyVoice 👈🏻
@@ -17,7 +12,7 @@ tasks: []
 **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
 
 ### Key Features
--- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shan1dong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
+- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
 - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
 - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
 - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
@@ -65,23 +60,25 @@ tasks: []
 - [x] Fastapi server and client
 
 ## Evaluation
-| Model | CER (%) ↓ (test-zh) | WER (%) ↓ (test-en) | CER (%) ↓ (test-hard) |
-|-----|------------------|------------------|------------------|
-| Human | 1.26 | 2.14 | - |
-| F5-TTS | 1.53 | 2.00 | 8.67 |
-| SparkTTS | 1.20 | 1.98 | - |
-| Seed-TTS | 1.12 | 2.25 | 7.59 |
-| CosyVoice2 | 1.45 | 2.57 | 6.83 |
-| FireRedTTS-2 | 1.14 | 1.95 | - |
-| IndexTTS2 | 1.01 | 1.52 | 7.12 |
-| VibeVoice | 1.16 | 3.04 | - |
-| HiggsAudio | 1.79 | 2.44 | - |
-| MiniMax-Speech | 0.83 | 1.65 | - |
-| VoxPCM | 0.93 | 1.85 | 8.87 |
-| GLM-TTS | 1.03 | - | - |
-| GLM-TTS_RL | 0.89 | - | - |
-| Fun-CosyVoice3-0.5B-2512 | 1.21 | 2.24 | 6.71 |
-| Fun-CosyVoice3-0.5B-2512_RL | 0.81 | 1.68 | 5.44 |
+
+| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
+| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
+| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
+| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
+| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
+| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
+| FireRedTTS 2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
+| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
+| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
+| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
+| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
+| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
+| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
+| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
+| Fun-CosyVoice3-0.5B | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
+| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
 
 ## Install
 
@@ -252,4 +249,3 @@ You can also scan the QR code to join our official Dingding chat group.
 
 ## Disclaimer
 The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
-