Hermes Voice Mode Guide
Set up speech-to-speech voice chat with Hermes Agent on Telegram, CLI push-to-talk, and Discord voice channels — all with free providers.
Overview
Hermes Agent supports voice interaction across three modes:
| Mode | Platform | Style | Real-time | Status |
|---|---|---|---|---|
| Voice messages | Telegram | Send voice note → hear reply | No | ✅ |
| Push-to-talk | CLI | Hold Ctrl+B, speak, release | Near real-time | ✅ |
| Voice channels | Discord | Bot joins VC, multi-user | Yes | Setup needed |
All three run on entirely free providers: faster-whisper for transcription and Edge TTS for speech output. No API keys required.
How It Works
Your voice → microphone → faster-whisper (STT) → transcribed text
↓
You hear voice ← speakers ← Edge TTS ← LLM generates response
On Telegram this is turn-based (send a voice note, wait for the reply). In the CLI and Discord, transcription and TTS run as you speak: near real-time in the CLI and real-time in Discord voice channels.
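The flow above can be sketched as a single voice turn. This is a minimal illustration, not Hermes's actual implementation: the names `transcribe_local`, `speak_edge`, and `voice_turn` are hypothetical, and the two helper functions assume the `faster-whisper` and `edge-tts` packages are installed.

```python
from typing import Callable, Tuple


def transcribe_local(path: str, model_size: str = "base") -> str:
    """Transcribe an audio file with faster-whisper (imported lazily)."""
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)


def speak_edge(text: str, out_path: str, voice: str = "en-US-AriaNeural") -> None:
    """Synthesize text to an audio file with Edge TTS (imported lazily)."""
    import asyncio

    import edge_tts

    asyncio.run(edge_tts.Communicate(text, voice).save(out_path))


def voice_turn(
    audio_path: str,
    stt: Callable[[str], str],
    llm: Callable[[str], str],
    tts: Callable[[str, str], None],
    out_path: str = "reply.mp3",
) -> Tuple[str, str]:
    """Run one turn: audio file -> transcript -> LLM reply -> spoken reply."""
    transcript = stt(audio_path)
    reply = llm(transcript)
    tts(reply, out_path)
    return transcript, reply
```

Passing the three stages in as callables keeps the turn logic independent of any one provider, so `transcribe_local` and `speak_edge` could be swapped for a hosted STT or TTS backend without touching `voice_turn`.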
STT (Speech-to-Text) Setup
Free: local faster-whisper
pip install faster-whisper
hermes config set stt.provider local
Config in ~/.hermes/config.yaml:
stt:
  enabled: true
  provider: local
  local:
    model: base    # tiny, base, small, medium, large-v3
    language: ''   # leave empty for auto-detect
First run downloads the model (~140MB for base). base balances speed and accuracy well. Use tiny if you need faster transcriptions, small or medium for better accuracy.
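As a rough sketch of how these settings could map onto faster-whisper, the helper below turns the `stt` config into a model name and a language argument. The function name `local_stt_options` is hypothetical, the list of model names is copied from the config comment above, and treating an empty `language` as `None` (auto-detect) is an assumption about how the setting is interpreted.

```python
from typing import Any, Dict, Optional, Tuple

# Model names taken from the comment in the config example.
VALID_MODELS = ("tiny", "base", "small", "medium", "large-v3")


def local_stt_options(stt_cfg: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Return (model_name, language) from the `stt` section of the config."""
    local = stt_cfg.get("local", {})
    model = local.get("model", "base")
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {VALID_MODELS}")
    # An empty string means "auto-detect", which faster-whisper expresses as None.
    language = local.get("language") or None
    return model, language
```

The result could then be used as `WhisperModel(model).transcribe(path, language=language)`.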
Paid alternatives
| Provider | Env var | Notes |
|---|---|---|
| Groq Whisper | GROQ_API_KEY | Free tier available |
| OpenAI Whisper | VOICE_TOOLS_OPENAI_KEY | Paid |
| Mistral Voxtral | MISTRAL_API_KEY | Paid |
TTS (Text-to-Speech) Setup
Free: Edge TTS
hermes config set tts.provider edge
Config:
tts:
  provider: edge
  edge:
    voice: en-US-AriaNeural
Other natural voices: en-US-GuyNeural, en-US-JennyNeural, en-GB-SoniaNeural, en-AU-NatashaNeural.
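To browse more voices than the handful listed here, the `edge-tts` package exposes an async `list_voices()` call. The sketch below filters its result by locale; the helper names `voices_for_locale` and `fetch_english_voices` are hypothetical, and the second function assumes `edge-tts` is installed and the network is reachable.

```python
import asyncio
from typing import Dict, Iterable, List


def voices_for_locale(voices: Iterable[Dict[str, str]], locale_prefix: str) -> List[str]:
    """Return sorted voice short names whose locale starts with the prefix."""
    return sorted(
        voice["ShortName"]
        for voice in voices
        if voice["Locale"].startswith(locale_prefix)
    )


def fetch_english_voices() -> List[str]:
    """Query Edge TTS for its voice list and keep the English ones."""
    import edge_tts  # imported lazily so the filter works without the package

    return voices_for_locale(asyncio.run(edge_tts.list_voices()), "en-")
```

Any short name returned this way can be used as the `voice` value in the config above.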
Streaming TTS
Hermes supports streaming TTS — audio starts playing before the full response is generated. This dramatically reduces perceived latency. Enable with:
tts:
  streaming: true
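One common way to get this effect is to cut the LLM's token stream into sentences and hand each one to the TTS engine as soon as it completes. The generator below is an illustrative sketch of that idea only; the guide does not describe how Hermes chunks its output, and `sentences_from_stream` is a hypothetical name.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_BREAK.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

The regex is deliberately simple and will also split after abbreviations such as "Dr."; a production chunker would need a more careful boundary rule.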
Hallucination Filter
A built-in filter prevents garbled/hallucinated speech from being spoken aloud. If the STT output looks like noise, it’s silently discarded.
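The guide does not say how the filter decides what counts as noise. As a hypothetical sketch, one approach is to use the per-segment statistics that faster-whisper reports (`no_speech_prob`, `avg_logprob`, `compression_ratio`). The thresholds below mirror Whisper's default decoding thresholds and are assumptions here, as are the names `SegmentStats` and `looks_like_noise`.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Stand-in for the fields of a faster-whisper segment used below."""
    text: str
    no_speech_prob: float
    avg_logprob: float
    compression_ratio: float


def looks_like_noise(
    seg: SegmentStats,
    no_speech_max: float = 0.6,
    logprob_min: float = -1.0,
    compression_max: float = 2.4,
) -> bool:
    """Heuristically flag a transcribed segment as noise or hallucination."""
    if not seg.text.strip():
        return True
    # Probably silence, and the decoder was not confident in its text.
    if seg.no_speech_prob > no_speech_max and seg.avg_logprob < logprob_min:
        return True
    # Highly repetitive text compresses well, a typical hallucination symptom.
    return seg.compression_ratio > compression_max
```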
Paid alternatives
| Provider | Env var | Notes |
|---|---|---|
| ElevenLabs | ELEVENLABS_API_KEY | Free tier, most natural |
| OpenAI TTS | VOICE_TOOLS_OPENAI_KEY | alloy, echo, fable voices |
| MiniMax | MINIMAX_API_KEY | Paid |
After changing providers, restart the gateway:
hermes gateway restart
Mode 1: Telegram Voice Messages
Already configured if you’re using the Telegram gateway.
Send a voice message in chat → Hermes transcribes with faster-whisper → responds with Edge TTS audio.
Commands in chat:
- /voice on: enable voice replies
- /voice tts: always respond with voice
- /voice off: text only
Mode 2: CLI Push-to-Talk
Start Hermes in the terminal and enable voice:
hermes
/voice on
Hold Ctrl+B to record. Hermes auto-detects silence and stops recording when you finish speaking. Your words are transcribed, sent to the LLM, and the response is spoken back through your speakers.
Config options in config.yaml:
voice:
  record_key: ctrl+b          # Hold to talk
  max_recording_seconds: 120
  auto_tts: false             # Set true to always hear voice
  beep_enabled: true          # Beep when recording starts
  silence_threshold: 200
  silence_duration: 3.0       # Seconds of silence before auto-stop
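To show how `silence_threshold`, `silence_duration`, and `max_recording_seconds` might interact, here is a small sketch of silence-based auto-stop. It assumes the threshold is an RMS amplitude over 16-bit PCM samples, which the guide does not state; the function names are hypothetical.

```python
import math
from array import array
from typing import Iterable


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of native-endian 16-bit PCM."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def seconds_until_auto_stop(
    frames: Iterable[bytes],
    frame_seconds: float,
    silence_threshold: float = 200,
    silence_duration: float = 3.0,
    max_recording_seconds: float = 120,
) -> float:
    """Return how long recording runs before silence or the time cap ends it."""
    elapsed = 0.0
    quiet = 0.0
    heard_speech = False
    for frame in frames:
        elapsed += frame_seconds
        if rms(frame) >= silence_threshold:
            heard_speech = True
            quiet = 0.0
        elif heard_speech:
            # Only count silence after the user has started speaking.
            quiet += frame_seconds
        if (heard_speech and quiet >= silence_duration) or elapsed >= max_recording_seconds:
            break
    return elapsed
```

If background noise keeps the recording from stopping, raising `silence_threshold` is the setting to try first under this model.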
This feels closer to ChatGPT Voice Mode than Telegram because transcription runs locally and replies play straight through your speakers, with no voice-note round trip. Edge TTS itself still needs an internet connection.
CLI Voice commands
| Command | Effect |
|---|---|
| /voice on | Enable voice-to-voice mode |
| /voice off | Disable voice, text only |
| /voice tts | Toggle TTS output on/off |
| /voice status | Show current state |
Mode 3: Discord Voice Channels
The most immersive option — the bot joins a Discord voice channel, listens to everyone speaking, transcribes in real-time, processes through the LLM, and speaks replies back in the voice channel. Multiple people can talk and the bot will hear everyone.
This is the closest Hermes gets to ChatGPT’s Advanced Voice Mode.
Prerequisites
- A Discord bot application at https://discord.com/developers/applications
- Bot token with voice permissions
- Bot invited to your server with the correct permissions integer
Step 1: Create the Discord Application
- Go to https://discord.com/developers/applications
- Click New Application, name it (e.g., “Xena”)
- Go to Bot tab → Reset Token → copy the token
- Add to ~/.hermes/.env:
DISCORD_BOT_TOKEN=your_token_here
Step 2: Enable Privileged Gateway Intents
In Bot tab → Privileged Gateway Intents, enable all three:
| Intent | Purpose |
|---|---|
| Presence Intent | Detect user online/offline status |
| Server Members Intent | Resolve usernames in voice channels |
| Message Content Intent | Read message content in text channels |
Step 3: Add Voice Permissions
In Installation → Default Install Settings → Guild Install, add these permissions:
| Permission | Purpose | Required |
|---|---|---|
| Connect | Join voice channels | Yes |
| Speak | Play TTS audio in VC | Yes |
| Use Voice Activity | Detect when users are speaking | Yes |
Permissions integers:
| Level | Integer | Includes |
|---|---|---|
| Text only | 274878286912 | View Channels, Send Messages, Read History, Embeds, Attachments, Threads, Reactions |
| Text + Voice | 274881432640 | All above + Connect, Speak |
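The two integers differ only by the voice bits, which can be checked with Discord's documented permission flags. In the sketch below, the bit positions for Connect, Speak, and Use Voice Activity come from Discord's permissions reference; the helper name `has_permission` is illustrative. The Text + Voice integer covers Connect and Speak only, so Use Voice Activity has to be added separately if you want it in the invite link.

```python
# Bit flags from Discord's permissions documentation.
CONNECT = 1 << 20   # join voice channels
SPEAK = 1 << 21     # play audio in voice channels
USE_VAD = 1 << 25   # use voice activity detection

TEXT_ONLY = 274878286912                   # value from the table above
TEXT_AND_VOICE = TEXT_ONLY | CONNECT | SPEAK


def has_permission(value: int, flag: int) -> bool:
    """Return True if every bit of `flag` is set in `value`."""
    return value & flag == flag


print(TEXT_AND_VOICE)                           # 274881432640
print(has_permission(TEXT_AND_VOICE, USE_VAD))  # False
print(TEXT_AND_VOICE | USE_VAD)                 # 274914987072
```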
Step 4: Invite the Bot
https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&permissions=274881432640&scope=bot+applications.commands
Replace YOUR_APP_ID with your Application ID from the Developer Portal → General Information.
Note: Re-inviting the bot to a server it’s already in will update its permissions without removing it. You won’t lose any data.
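If you prefer to generate the invite link rather than edit it by hand, a short sketch using only the standard library follows. The function name `invite_url` is hypothetical; the base URL, scopes, and default permissions integer are the ones shown above.

```python
from urllib.parse import urlencode


def invite_url(app_id: str, permissions: int = 274881432640) -> str:
    """Build the OAuth2 invite URL for a bot with the given permissions."""
    query = urlencode(
        {
            "client_id": app_id,
            "permissions": permissions,
            # The space between scopes is encoded as "+" by urlencode.
            "scope": "bot applications.commands",
        }
    )
    return f"https://discord.com/oauth2/authorize?{query}"


print(invite_url("YOUR_APP_ID"))
```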
Step 5: Configure Hermes Discord Platform
Add to ~/.hermes/config.yaml:
discord:
  require_mention: true
  auto_thread: true
  thread_require_mention: false
  history_backfill: true
Step 6: Restart and Configure Gateway
hermes gateway restart
hermes gateway setup # Select Discord platform
Using Voice Channels
- Join a voice channel in your Discord server
- In a text channel, type /join and the bot joins your VC
- Speak normally and the bot transcribes, processes, and speaks back
- Type /leave to disconnect
Discord Voice Commands
| Command | Effect |
|---|---|
| /join | Bot joins the voice channel you’re in |
| /leave | Bot disconnects from voice |
| /voice on | Enable voice replies in text channels |
| /voice off | Disable voice replies |
How It Works
- Join detection — bot monitors who’s in the VC
- Speech detection — Voice Activity detects when a user is speaking
- Audio capture — raw audio streamed to STT engine
- Processing — transcribed text → LLM → response
- Acknowledgments — “Let me look into that…” plays while LLM processes
- TTS playback — spoken response streamed to voice channel
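The acknowledgment step in the list above can be sketched with asyncio: start the LLM call, and if it has not finished after a short delay, play a filler phrase before the real reply. This is an illustration under assumptions, not Hermes's code. The name `reply_with_ack` and the `ack_after` delay are hypothetical, and `generate` and `play` stand in for the LLM call and the voice-channel playback.

```python
import asyncio
from typing import Awaitable, Callable


async def reply_with_ack(
    generate: Callable[[], Awaitable[str]],
    play: Callable[[str], Awaitable[None]],
    ack_text: str = "Let me look into that...",
    ack_after: float = 1.0,
) -> str:
    """Play an acknowledgment if the reply takes longer than `ack_after` seconds."""
    task = asyncio.ensure_future(generate())
    done, _pending = await asyncio.wait({task}, timeout=ack_after)
    if not done:
        # The LLM is still working, so fill the silence.
        await play(ack_text)
    reply = await task
    await play(reply)
    return reply
```

Fast replies skip the filler entirely, which avoids an acknowledgment talking over an answer that is already available.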
Echo Prevention
The bot automatically filters out its own audio output from the input stream to prevent echo loops.
Troubleshooting
“No audio device found” (CLI):
python3 -c "import sounddevice; print(sounddevice.query_devices())"
Bot doesn’t respond in Discord server channels: Enable Message Content Intent in Developer Portal → Bot → Privileged Gateway Intents.
Bot joins VC but doesn’t hear me: Enable the Use Voice Activity permission and the Server Members Intent.
Bot responds in text but not voice channel: Check Speak permission. Also verify the bot’s audio output isn’t muted in Discord.
Whisper returns garbage text:
The hallucination filter should catch this. Try a larger model (small or medium).
Phone Calls
Not natively supported by Hermes. SMS is available via Twilio (TWILIO_ACCOUNT_SID + TWILIO_AUTH_TOKEN), but no voice calling, PSTN, or SIP integration exists yet.
A custom bridge could theoretically be built (Twilio Voice webhook → Hermes API), but it would be a from-scratch integration project.
Configuration Reference
Full config.yaml voice section
# STT (Speech-to-Text)
stt:
  enabled: true
  provider: local        # local, groq, openai, mistral
  local:
    model: base
    language: ''

# TTS (Text-to-Speech)
tts:
  provider: edge         # edge, elevenlabs, openai, minimax, mistral, neutts
  streaming: false       # Enable for lower perceived latency
  edge:
    voice: en-US-AriaNeural

# Voice interaction (CLI)
voice:
  record_key: ctrl+b
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0
Provider comparison
| Feature | faster-whisper (local) | Groq | OpenAI |
|---|---|---|---|
| Cost | Free | Free tier | Paid |
| Speed | Fast (GPU-accelerated) | Very fast | Fast |
| Accuracy | Good (base) | Very good | Excellent |
| Internet | Model download only | Required | Required |
| Feature | Edge TTS | ElevenLabs | OpenAI |
|---|---|---|---|
| Cost | Free | Free tier | Paid |
| Naturalness | Good | Excellent | Very good |
| Voices | ~10 English | 100+ | 6 |
| Streaming | Yes | Yes | Yes |