F5-TTS is an open-source, non-autoregressive text-to-speech (TTS) system that generates natural and expressive speech from text inputs. Using flow matching with a Diffusion Transformer (DiT), it eliminates the need for complex components like duration models or phoneme alignment. The system is designed for efficient training and inference, achieving a real-time factor (RTF) of 0.15, which is a significant improvement over previous diffusion-based TTS models.
Configuration Options
Model Settings
- Models Directory: Specify the directory where F5-TTS models are stored. By default, models are downloaded to Data/HuggingFace. You can set a custom directory and refresh to update the models list.
- Model: Set the path for the F5-TTS base model. This folder must contain both the model file (.safetensors or .pt) and the vocab.txt file.
  - Example: hf:SWivid/F5-TTS:F5TTS_Base/model_1200000.safetensors,F5TTS_Base/vocab.txt
- Vocos Model: Specify the path for the Vocos decoder model. You can also use a HuggingFace repository path.
  - Example: hf:charactr/vocos-mel-24khz
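The two examples above suggest a reference format of hf:<repository>, optionally followed by a colon and a comma-separated list of files. The following is a minimal sketch of how such a string could be split apart, assuming that format holds in general. The helper parse_hf_reference is hypothetical and written for illustration only; it is not part of Voxta or F5-TTS.

```python
def parse_hf_reference(ref):
    """Split an 'hf:' reference into (repository, [files]).

    Assumed format, inferred only from the two examples in this section:
        hf:<repository>[:<file>,<file>...]
    """
    prefix = "hf:"
    if not ref.startswith(prefix):
        raise ValueError(f"not a HuggingFace reference: {ref!r}")
    # Everything up to the next ':' is the repository; the rest is a file list.
    repo, _, file_part = ref[len(prefix):].partition(":")
    files = [name.strip() for name in file_part.split(",") if name.strip()]
    return repo, files


repo, files = parse_hf_reference(
    "hf:SWivid/F5-TTS:F5TTS_Base/model_1200000.safetensors,F5TTS_Base/vocab.txt"
)
print(repo)
print(files)
```

Under this reading, the Model example names one repository and two files (the checkpoint and vocab.txt), while the Vocos example names a repository only.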
Defaults
- Default Female Voice: The default audio file to use when a female character does not have a specific voice configuration.
  - Example: sampleme.wav
- Default Male Voice: The default audio file to use when a male character does not have a specific voice configuration.
  - Example: ajwphotographic.wav
- Default Speed: Adjust the speed of the generated speech for characters without specific configurations. Extreme values may cause artifacts or truncation.
  - Example: 0.90
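The fallback behavior described above can be sketched as a simple lookup. This is an illustrative sketch only: resolve_voice, DEFAULT_VOICES, and DEFAULT_SPEED are hypothetical names, the values are the example values from this section, and the source does not say what happens for characters that are neither female nor male, so the sketch returns None in that case.

```python
# Hypothetical defaults, using the example values from this section.
DEFAULT_VOICES = {
    "female": "sampleme.wav",
    "male": "ajwphotographic.wav",
}
DEFAULT_SPEED = 0.90


def resolve_voice(gender, voice=None, speed=None):
    """Return (voice_file, speed), using defaults for anything not configured."""
    if voice is None:
        # Unspecified in the source for other genders, so this may be None.
        voice = DEFAULT_VOICES.get(gender)
    if speed is None:
        speed = DEFAULT_SPEED
    return voice, speed


print(resolve_voice("female"))
print(resolve_voice("male", voice="custom.wav", speed=1.0))
```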
Device Settings
- Use Cuda: Enable GPU usage for faster processing. If disabled, the CPU will be used.
Advanced Settings
- NFE Steps: Number of function evaluations (NFE) the ODE solver performs during sampling. Higher values improve accuracy but increase inference time.
  - Default: 16
- Target RMS: Defines the target root mean square (RMS) energy for audio, ensuring consistent loudness levels.
  - Default: 0.10
- CFG Strength: Classifier-Free Guidance (CFG) strength parameter. Adjust this to balance between the diversity and fidelity of audio outputs.
  - Default: 2
- Sway Sampling Coef: Coefficient for Sway Sampling, which adjusts the distribution of flow steps during inference. Negative values concentrate more steps early in the flow, where foundational speech features like alignment are established.
  - Default: -1
- Cross Fade Duration: Specifies the duration for cross-fading between segments during audio generation.
  - Note: Voxta does not support audio crossfading.
  - Default: 0
- ODE Method: Choose the Ordinary Differential Equation (ODE) solver method for flow step integration.
  - Default: Euler
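To make the interaction between NFE Steps, CFG Strength, Sway Sampling Coef, and the Euler ODE method more concrete, here is a minimal sketch on a toy problem. It uses the sway sampling warp from the F5-TTS paper, t = u + s * (cos(pi/2 * u) - 1 + u), and the usual classifier-free guidance combination. The function names are hypothetical, the toy velocity field stands in for the neural network, and it is an assumption that Voxta passes these settings to F5-TTS unchanged.

```python
import math


def sway_schedule(nfe_steps, sway_coef):
    """Time points in [0, 1] for the solver, warped by sway sampling.

    Applies t = u + s * (cos(pi/2 * u) - 1 + u) to a uniform grid u.
    """
    grid = []
    for i in range(nfe_steps + 1):
        u = i / nfe_steps
        grid.append(u + sway_coef * (math.cos(math.pi / 2 * u) - 1 + u))
    return grid


def guided_velocity(v_cond, v_uncond, cfg_strength):
    """Classifier-free guidance: move the conditional prediction further
    away from the unconditional one, scaled by cfg_strength."""
    return v_cond + cfg_strength * (v_cond - v_uncond)


def euler_sample(x0, velocity, nfe_steps=16, sway_coef=-1.0):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = x0
    ts = sway_schedule(nfe_steps, sway_coef)
    for t0, t1 in zip(ts, ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)  # one function evaluation per step
    return x


# Toy problem: dx/dt = x with x(0) = 1, whose exact value at t = 1 is e.
for steps in (8, 16, 32):
    approx = euler_sample(1.0, lambda x, t: x, nfe_steps=steps)
    print(steps, abs(math.e - approx))
```

In this toy setting the error shrinks as the step count grows, which mirrors the accuracy versus inference time trade-off of NFE Steps. With a coefficient of -1 the midpoint of the uniform grid maps to roughly 0.29, so more than half of the steps are spent in the early part of the flow.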
Options
- Extend Shorter Than: When the input text is too short, F5-TTS may cut off the output. If the text is shorter than this value, additional ... will be prepended to improve results.
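The padding rule above can be sketched as follows. This is a minimal illustration, assuming the threshold counts characters and that the filler is joined to the text with a space; neither detail is stated in the source, and extend_short_text is a hypothetical helper name.

```python
def extend_short_text(text, extend_shorter_than, filler="..."):
    """Prepend filler when the text is shorter than the threshold.

    Assumptions: the threshold is measured in characters, and the filler
    is separated from the text by a single space.
    """
    if len(text) < extend_shorter_than:
        return f"{filler} {text}"
    return text


print(extend_short_text("Hi.", 10))
print(extend_short_text("A longer sentence.", 10))
```

Text at or above the threshold is returned unchanged, so the setting only affects very short inputs.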