Sesame Conversational Speech Model

Sesame CSM (Conversational Speech Model) is Sesame's speech generation model. It generates RVQ (residual vector quantization) audio codes from text and audio inputs.
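For readers unfamiliar with the term, the sketch below illustrates what "RVQ codes" are in general. It is a toy example, not Sesame's codec or any Voxta API: the codebooks, the two-dimensional vectors, and the function names are all hypothetical, and real audio codecs use learned codebooks over much larger vectors. Each stage picks the nearest codebook entry and hands the leftover error (the residual) to the next stage, so one input becomes a short list of integer codes.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Sum the selected entry from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]

codes = rvq_encode([1.2, 0.8], CODEBOOKS)   # one integer per stage
approx = rvq_decode(codes, CODEBOOKS)       # approximation of the input
```

Adding more stages shrinks the reconstruction error, which is why codecs of this kind emit several codes per audio frame.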

Sesame CSM is currently marked experimental: the runtime and configuration options may change.

Setup

Add the service

Go to Manage Services → + Add Services → Sesame Conversational Speech Model → Add. Voxta installs the Python runtime and model weights automatically on first use.

Pick a voice

In the Sesame CSM configuration, browse the available voices and select one.
