Mirror of https://www.modelscope.cn/FunAudioLLM/Fun-CosyVoice3-0.5B-2512.git
Synced 2026-04-02 15:02:53 +08:00
add llm.rl.pt
1
.gitattributes
vendored
@@ -52,3 +52,4 @@ hift.pt filter=lfs diff=lfs merge=lfs -text
 flow.decoder.estimator.fp32.onnx filter=lfs diff=lfs merge=lfs -text
 CosyVoice-BlankEN/model.safetensors filter=lfs diff=lfs merge=lfs -text
 campplus.onnx filter=lfs diff=lfs merge=lfs -text
+llm.rl.pt filter=lfs diff=lfs merge=lfs -text
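Note on the `.gitattributes` hunk: the added line routes `llm.rl.pt` through Git LFS, like the other model files. The sketch below is one way to list the LFS-tracked patterns in such a file. The helper name `lfs_patterns` is hypothetical, and the parsing is deliberately simplified (it assumes patterns contain no quoted spaces), so treat it as an illustration rather than a full `.gitattributes` parser.

```python
# Attribute lines copied from the hunk above (post-commit state).
GITATTRIBUTES = """\
hift.pt filter=lfs diff=lfs merge=lfs -text
flow.decoder.estimator.fp32.onnx filter=lfs diff=lfs merge=lfs -text
CosyVoice-BlankEN/model.safetensors filter=lfs diff=lfs merge=lfs -text
campplus.onnx filter=lfs diff=lfs merge=lfs -text
llm.rl.pt filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text: str) -> list[str]:
    """Return the path patterns whose attributes include filter=lfs.

    Hypothetical helper: assumes whitespace-separated fields and
    unquoted patterns, which holds for the lines shown in this diff.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


tracked = lfs_patterns(GITATTRIBUTES)
print(tracked)
```

Running this against a checked-out `.gitattributes` (read with `open(...).read()`) should show whether a newly added weight file is covered before it is committed.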
50
README.md
@@ -1,8 +1,3 @@
----
-frameworks:
-- ""
-tasks: []
----
 [](https://github.com/Akshay090/svg-banners)
 
 ## 👉🏻 CosyVoice 👈🏻
@@ -15,22 +10,15 @@ tasks: []
 
 ## Highlight🔥
 
-**CosyVoice 2.0** has been released! Compared to version 1.0, the new version offers more accurate, more stable, faster, and better speech generation capabilities.
-### Multilingual
-- **Supported Language**: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
-- **Crosslingual & Mixlingual**:Support zero-shot voice cloning for cross-lingual and code-switching scenarios.
-### Ultra-Low Latency
-- **Bidirectional Streaming Support**: CosyVoice 2.0 integrates offline and streaming modeling technologies.
-- **Rapid First Packet Synthesis**: Achieves latency as low as 150ms while maintaining high-quality audio output.
-### High Accuracy
-- **Improved Pronunciation**: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
-- **Benchmark Achievements**: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
-### Strong Stability
-- **Consistency in Timbre**: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
-- **Cross-language Synthesis**: Marked improvements compared to version 1.0.
-### Natural Experience
-- **Enhanced Prosody and Sound Quality**: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
-- **Emotional and Dialectal Flexibility**: Now supports more granular emotional controls and accent adjustments.
+**CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
+### Key Features
+- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
+- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
+- **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
+- **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
+- **Bi-Streaming**: Support both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output.
+- **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
+
 
 ## Roadmap
 
@@ -71,6 +59,25 @@ tasks: []
 - [x] WeTextProcessing support when ttsfrd is not available
 - [x] Fastapi server and client
 
+## Evaluation
+| Model | CER (%) ↓ (test-zh) | WER (%) ↓ (test-en) | CER (%) ↓ (test-hard) |
+|-----|------------------|------------------|------------------|
+| Human | 1.26 | 2.14 | - |
+| F5-TTS | 1.53 | 2.00 | 8.67 |
+| SparkTTS | 1.20 | 1.98 | - |
+| Seed-TTS | 1.12 | 2.25 | 7.59 |
+| CosyVoice2 | 1.45 | 2.57 | 6.83 |
+| FireRedTTS-2 | 1.14 | 1.95 | - |
+| IndexTTS2 | 1.01 | 1.52 | 7.12 |
+| VibeVoice | 1.16 | 3.04 | - |
+| HiggsAudio | 1.79 | 2.44 | - |
+| MiniMax-Speech | 0.83 | 1.65 | - |
+| VoxPCM | 0.93 | 1.85 | 8.87 |
+| GLM-TTS | 1.03 | - | - |
+| GLM-TTS_RL | 0.89 | - | - |
+| CosyVoice3 | 1.21 | 2.24 | 6.71 |
+| CosyVoice3_RL | 0.81 | 1.68 | 5.44 |
+
 
 ## Install
 
@@ -240,4 +247,3 @@ You can also scan the QR code to join our official Dingding chat group.
 
 ## Disclaimer
 The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
-
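Note on the Evaluation table added to README.md: the CosyVoice3 and CosyVoice3_RL rows can be compared as relative error reductions. The sketch below uses only the numbers from those two rows. The function name `relative_reduction` is an illustrative choice, not something the repository defines.

```python
# Error rates (%) copied from the CosyVoice3 and CosyVoice3_RL rows
# of the Evaluation table in the README diff; lower is better.
scores = {
    "CosyVoice3":    {"test-zh": 1.21, "test-en": 2.24, "test-hard": 6.71},
    "CosyVoice3_RL": {"test-zh": 0.81, "test-en": 1.68, "test-hard": 5.44},
}


def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` lowers the error rate relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline


for test_set in ("test-zh", "test-en", "test-hard"):
    base = scores["CosyVoice3"][test_set]
    rl = scores["CosyVoice3_RL"][test_set]
    print(f"{test_set}: {base} -> {rl} ({relative_reduction(base, rl):.1f}% lower)")
```

On these figures the RL variant's error rate is about 33% lower on test-zh, 25% lower on test-en, and 19% lower on test-hard than the base CosyVoice3 row.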
4
llm.pt
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d9565e41454e860768447f4ec5c5244a06b1ccdffdc7890fa593d907c93eebcc
-size 2024669130
+oid sha256:69f43bd545131c30e98947fb360ea8b4dc9916d8e83dded7757c7ea4f5a24970
+size 2024669519
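Note on the `llm.pt` hunk: the file in the repository is a Git LFS pointer, so the diff only shows the new SHA-256 object ID and byte size. A downloaded copy can be checked against those two fields. The sketch below is a minimal example under that assumption. The helper names `parse_pointer` and `matches_pointer` are hypothetical, as is the local path in the usage comment.

```python
import hashlib
from pathlib import Path

# New pointer contents, copied from the "+" side of the hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:69f43bd545131c30e98947fb360ea8b4dc9916d8e83dded7757c7ea4f5a24970
size 2024669519
"""


def parse_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of an LFS pointer into a dict."""
    return dict(line.split(" ", 1) for line in text.splitlines() if line.strip())


def matches_pointer(path: Path, pointer: dict[str, str], chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and SHA-256 agree with the pointer."""
    algo, _, expected = pointer["oid"].partition(":")
    if algo != "sha256" or path.stat().st_size != int(pointer["size"]):
        return False  # cheap size check first, before hashing ~2 GB
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in chunks so the whole file is never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


pointer = parse_pointer(POINTER)
print(pointer["oid"], pointer["size"])
# Usage (hypothetical local path): matches_pointer(Path("llm.pt"), pointer)
```

Comparing the size before hashing keeps the check fast when a download was truncated, which is the more common failure for a file of this size.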