WER is measured with Whisper large-v3. The rows for steps 15,000 to 235,000 are a single-speaker evaluation (Costel / Literature Narrator) on 15 sentences with default generation parameters. The final row is a 5-voice, 18-sentence evaluation with optimized generation parameters, so it is not directly comparable to the rows above it.

| Step | WER | Diacritics WER | Common WER | Notes |
|---|---|---|---|---|
| 15,000 | 17.4% | 12.2% | 31.5% | First checkpoint |
| 55,000 | 19.3% | 13.6% | 35.1% | |
| 60,000 | 16.9% | 9.3% | 37.8% | |
| 75,000 | 13.0% | 7.8% | 27.5% | Early best |
| 80,000 | 15.8% | 11.5% | 27.5% | |
| 90,000 | 14.9% | 8.9% | 31.7% | |
| 100,000 | 18.7% | 10.9% | 40.3% | |
| 110,000 | 34.6% | 18.2% | 79.6% | Instability spike |
| 120,000 | 19.3% | 15.6% | 29.6% | |
| 130,000 | 22.0% | 17.8% | 33.5% | |
| 140,000 | 21.1% | 18.8% | 27.5% | |
| 150,000 | 13.9% | 8.3% | 29.4% | |
| 160,000 | 22.5% | 19.3% | 31.2% | |
| 170,000 | 14.0% | 9.1% | 27.7% | |
| 180,000 | 15.2% | 10.7% | 27.7% | |
| 190,000 | 17.3% | 11.3% | 33.6% | |
| 195,000 | 12.3% | 6.8% | 27.7% | Best single-speaker so far |
| 200,000 | 13.2% | 6.5% | 31.7% | |
| 205,000 | 13.2% | 6.6% | 31.5% | |
| 210,000 | 15.1% | 9.7% | 29.9% | |
| 215,000 | 12.0% | 7.0% | 25.8% | |
| 220,000 | 14.7% | 9.2% | 29.6% | |
| 225,000 | 12.9% | 7.6% | 27.5% | |
| 230,000 | 11.8% | 5.4% | 29.6% | Best raw single-speaker |
| 235,000 | 12.0% | 6.3% | 27.5% | Last checkpoint (single-speaker) |
| 235,000 | 5.1% | | | Final: 5 voices, 18 sentences, optimized generation |
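
For readers who want to reproduce numbers of this kind, the following is a minimal sketch of a word-level WER computation. It assumes the evaluation compares a Whisper transcript of each generated utterance against the reference sentence. The source does not say how text is normalized or which WER tool was used, so the lowercasing and whitespace tokenization here, along with the function names `edit_distance`, `word_error_rate`, and `corpus_wer`, are illustrative assumptions rather than the project's actual pipeline.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def tokenize(text: str) -> List[str]:
    # Assumed normalization: lowercase and split on whitespace only.
    # Diacritics are kept, so "școală" and "scoala" count as different words.
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one sentence: edit distance divided by reference length."""
    ref = tokenize(reference)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over a test set: total errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = tokenize(reference)
        errors += edit_distance(ref, tokenize(hypothesis))
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("references must contain at least one word")
    return errors / ref_words
```

In this sketch, `corpus_wer` pools errors across all sentences instead of averaging per-sentence scores, so long sentences weigh more than short ones. With only 15 to 18 test sentences, a handful of misrecognized words shifts the result by whole percentage points, which is worth keeping in mind when comparing adjacent checkpoints.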