Hermes Voice Mode Guide

Set up speech-to-speech voice chat with Hermes Agent on Telegram, CLI push-to-talk, and Discord voice channels — all with free providers.

Updated 17/06/2026
Tags: Hermes, Voice, STT, TTS, faster-whisper, Edge TTS, Discord, Telegram

Overview

Hermes Agent supports voice interaction across three modes:

Mode	Platform	Style	Real-time	Status
Voice messages	Telegram	Send voice note → hear reply	No	✅
Push-to-talk	CLI	Hold Ctrl+B, speak, release	Near real-time	✅
Voice channels	Discord	Bot joins VC, multi-user	Yes	Setup needed

All three run on entirely free providers: faster-whisper for transcription and Edge TTS for speech output. No API keys required.

How It Works

Your voice → microphone → faster-whisper (STT) → transcribed text
                                                       ↓
You hear voice ← speakers ← Edge TTS ← LLM generates response

On Telegram this is turn-based: you send a voice note and wait for the reply. In the CLI and Discord it is streaming: transcription and TTS run continuously, so replies arrive in real time or close to it.
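
The flow above can be read as three pluggable stages. The sketch below is illustrative only and is not Hermes's internal code: `voice_turn` and its three callables are hypothetical names, standing in for faster-whisper, the LLM, and Edge TTS.

```python
from typing import Callable, Optional


def voice_turn(
    audio_path: str,
    transcribe: Callable[[str], str],  # STT stage, e.g. faster-whisper
    generate: Callable[[str], str],    # LLM stage
    speak: Callable[[str], str],       # TTS stage, returns a path to the reply audio
) -> Optional[str]:
    """Run one voice turn; return the reply audio path, or None if nothing was heard."""
    text = transcribe(audio_path).strip()
    if not text:
        return None  # nothing intelligible, so skip the LLM and TTS stages
    reply = generate(text)
    return speak(reply)
```

Keeping each stage behind a plain callable mirrors the way the guide lets you swap STT and TTS providers without touching the rest of the chain.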

STT (Speech-to-Text) Setup

Free: local faster-whisper

pip install faster-whisper
hermes config set stt.provider local

Config in ~/.hermes/config.yaml:

stt:
  enabled: true
  provider: local
  local:
    model: base          # tiny, base, small, medium, large-v3
    language: ''         # leave empty for auto-detect

The first run downloads the model (~140MB for base). The base model balances speed and accuracy well; use tiny for faster transcription, or small or medium for better accuracy.
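
To see what the `model` and `language` settings correspond to, here is a minimal sketch that calls faster-whisper directly. It assumes a CPU-only machine (`device="cpu"`, `compute_type="int8"`), and the mapping of an empty `language` string to auto-detect is an assumption about how the config is interpreted, not confirmed Hermes behavior. `normalize_language` and `transcribe_file` are hypothetical helper names.

```python
from typing import Optional


def normalize_language(value: str) -> Optional[str]:
    """Map an empty config value to None, which faster-whisper treats as auto-detect."""
    value = value.strip()
    return value or None


def transcribe_file(path: str, model_size: str = "base", language: str = "") -> str:
    """Transcribe one audio file with faster-whisper and return the joined text."""
    # Imported lazily so this module loads without the package installed.
    from faster_whisper import WhisperModel

    # CPU with int8 quantization is an assumption for machines without a GPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, language=normalize_language(language))
    # `segments` is a lazy generator; iterating it runs the actual decoding.
    return " ".join(segment.text.strip() for segment in segments)
```

Calling `transcribe_file("note.wav", model_size="tiny")` (with a hypothetical `note.wav`) is a quick way to compare model sizes on your own hardware before changing the config.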

Paid alternatives

Provider	Env var	Notes
Groq Whisper	`GROQ_API_KEY`	Free tier available
OpenAI Whisper	`VOICE_TOOLS_OPENAI_KEY`	Paid
Mistral Voxtral	`MISTRAL_API_KEY`	Paid

TTS (Text-to-Speech) Setup

Free: Edge TTS

hermes config set tts.provider edge

Config:

tts:
  provider: edge
  edge:
    voice: en-US-AriaNeural

Other natural voices: en-US-GuyNeural, en-US-JennyNeural, en-GB-SoniaNeural, en-AU-NatashaNeural.
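
To audition voices outside Hermes, the sketch below uses the `edge-tts` Python package. Whether Hermes uses this package internally is an assumption, and the helper names are hypothetical. Both async functions need an internet connection.

```python
def voices_for_locale(voices: list, locale: str) -> list:
    """Return sorted short names (e.g. 'en-US-AriaNeural') for one locale."""
    return sorted(v["ShortName"] for v in voices if v.get("Locale") == locale)


async def list_locale_voices(locale: str = "en-US") -> list:
    """Fetch the voice catalogue and keep one locale."""
    import edge_tts  # lazy import; needs `pip install edge-tts`

    return voices_for_locale(await edge_tts.list_voices(), locale)


async def synthesize(text: str, voice: str = "en-US-AriaNeural",
                     out_path: str = "reply.mp3") -> str:
    """Synthesize text to an MP3 file and return its path."""
    import edge_tts

    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path
```

Run either coroutine with `asyncio.run(...)`, for example `asyncio.run(synthesize("Hello from Hermes", "en-GB-SoniaNeural"))`.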

Streaming TTS

Hermes supports streaming TTS — audio starts playing before the full response is generated. This dramatically reduces perceived latency. Enable with:

tts:
  streaming: true
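
One common way to get this effect is to cut the LLM's token stream into sentences and hand each one to TTS as soon as it is complete. The sketch below shows that idea only; it is an assumption about the general technique, not a description of how Hermes implements streaming, and `sentences_from_stream` is a hypothetical name.

```python
import re
from typing import Iterable, Iterator

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends
```

Each yielded sentence can be synthesized and queued for playback while the model is still generating the next one, which is where the latency saving comes from.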

Hallucination Filter

A built-in filter checks STT output before it is used: if a transcription looks like noise or a hallucination, it is silently discarded, so it never produces a spoken reply.
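
The guide does not document the filter's rules. As a rough illustration of what such a check can look like, the sketch below rejects empty output, a few stock phrases that Whisper-family models are commonly reported to emit on silence, and long runs of one repeated word. The phrase list, the `max_repeat` value, and the function name are all assumptions.

```python
import re

# Illustrative only; not Hermes's actual list.
KNOWN_ARTIFACTS = {
    "thanks for watching",
    "thank you for watching",
    "please subscribe",
}


def looks_hallucinated(text: str, max_repeat: int = 4) -> bool:
    """Heuristically flag transcriptions that are probably noise."""
    # Lowercase and strip punctuation (keeping apostrophes) before comparing.
    normalized = re.sub(r"[^\w\s']", "", text.lower()).strip()
    if not normalized:
        return True
    if normalized in KNOWN_ARTIFACTS:
        return True
    # Flag the same word repeated max_repeat or more times in a row.
    words = normalized.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeat:
            return True
    return False
```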

Paid alternatives

Provider	Env var	Notes
ElevenLabs	`ELEVENLABS_API_KEY`	Free tier, most natural
OpenAI TTS	`VOICE_TOOLS_OPENAI_KEY`	`alloy`, `echo`, `fable` voices
MiniMax	`MINIMAX_API_KEY`	Paid

After changing providers, restart the gateway:

hermes gateway restart

Mode 1: Telegram Voice Messages

Already configured if you’re using the Telegram gateway.

Send a voice message in chat → Hermes transcribes with faster-whisper → responds with Edge TTS audio.

Commands in chat:

/voice on — enable voice replies
/voice tts — always respond with voice
/voice off — text only

Mode 2: CLI Push-to-Talk

Start Hermes in the terminal and enable voice:

hermes
/voice on

Hold Ctrl+B to record. Hermes auto-detects silence and stops recording when you finish speaking. Your words are transcribed, sent to the LLM, and the response is spoken back through your speakers.

Config options in config.yaml:

voice:
  record_key: ctrl+b        # Hold to talk
  max_recording_seconds: 120
  auto_tts: false           # Set true to always hear voice
  beep_enabled: true        # Beep when recording starts
  silence_threshold: 200
  silence_duration: 3.0     # Seconds of silence before auto-stop
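
To make `silence_threshold` and `silence_duration` concrete, here is a minimal sketch of silence-based auto-stop. It assumes the threshold is an RMS amplitude over 16-bit signed PCM samples, which the guide does not state, and `rms` and `should_stop` are hypothetical names rather than Hermes functions.

```python
import math
from array import array


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit signed PCM samples."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_stop(frames, frame_seconds: float,
                silence_threshold: float = 200,
                silence_duration: float = 3.0) -> bool:
    """True once the trailing run of quiet frames lasts at least silence_duration."""
    quiet_frames = 0
    for frame in reversed(frames):
        if rms(frame) >= silence_threshold:
            break  # most recent loud frame found; stop counting
        quiet_frames += 1
    return quiet_frames * frame_seconds >= silence_duration
```

Under this reading, raising `silence_threshold` helps in a noisy room, and lowering `silence_duration` makes the recorder stop sooner after you finish speaking.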

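This feels closer to ChatGPT Voice Mode than Telegram does, because transcription runs locally and latency is very low. Note that Edge TTS synthesizes speech through Microsoft's online service, so speech output still needs an internet connection.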

CLI Voice commands

Command	Effect
`/voice on`	Enable voice-to-voice mode
`/voice off`	Disable voice, text only
`/voice tts`	Toggle TTS output on/off
`/voice status`	Show current state

Mode 3: Discord Voice Channels

The most immersive option — the bot joins a Discord voice channel, listens to everyone speaking, transcribes in real-time, processes through the LLM, and speaks replies back in the voice channel. Multiple people can talk and the bot will hear everyone.

This is the closest Hermes gets to ChatGPT’s Advanced Voice Mode.

Prerequisites

A Discord bot application at https://discord.com/developers/applications
Bot token with voice permissions
Bot invited to your server with the correct permissions integer

Step 1: Create the Discord Application

Go to https://discord.com/developers/applications
Click New Application, name it (e.g., “Xena”)
Go to Bot tab → Reset Token → copy the token
Add to ~/.hermes/.env:

DISCORD_BOT_TOKEN=your_token_here

Step 2: Enable Privileged Gateway Intents

In Bot tab → Privileged Gateway Intents, enable all three:

Intent	Purpose
Presence Intent	Detect user online/offline status
Server Members Intent	Resolve usernames in voice channels
Message Content Intent	Read message content in text channels

Step 3: Add Voice Permissions

In Installation → Default Install Settings → Guild Install, add these permissions:

Permission	Purpose	Required
Connect	Join voice channels	Yes
Speak	Play TTS audio in VC	Yes
Use Voice Activity	Detect when users are speaking	Yes

Permissions integers:

Level	Integer	Includes
Text only	`274878286912`	View Channels, Send Messages, Read History, Embeds, Attachments, Threads, Reactions
Text + Voice	`274881432640`	All above + Connect, Speak
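
The integers above are bitwise ORs of Discord's permission flags. The sketch below rebuilds both values from the documented bit positions so you can check or extend them. The constant names follow Discord's flag names; `TEXT_ONLY` and `TEXT_AND_VOICE` are labels chosen here for illustration.

```python
# Bit positions as listed in Discord's permission flag documentation.
ADD_REACTIONS = 1 << 6
VIEW_CHANNEL = 1 << 10
SEND_MESSAGES = 1 << 11
EMBED_LINKS = 1 << 14
ATTACH_FILES = 1 << 15
READ_MESSAGE_HISTORY = 1 << 16
USE_EXTERNAL_EMOJIS = 1 << 18
CONNECT = 1 << 20
SPEAK = 1 << 21
USE_VAD = 1 << 25  # shown as "Use Voice Activity" in the Discord UI
SEND_MESSAGES_IN_THREADS = 1 << 38

TEXT_ONLY = (
    ADD_REACTIONS | VIEW_CHANNEL | SEND_MESSAGES | EMBED_LINKS | ATTACH_FILES
    | READ_MESSAGE_HISTORY | USE_EXTERNAL_EMOJIS | SEND_MESSAGES_IN_THREADS
)
TEXT_AND_VOICE = TEXT_ONLY | CONNECT | SPEAK

print(TEXT_ONLY)       # 274878286912
print(TEXT_AND_VOICE)  # 274881432640
```

Note that 274881432640 covers Connect and Speak but not Use Voice Activity. If you want that bit in the invite as well, `TEXT_AND_VOICE | USE_VAD` evaluates to 274914987072.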

Step 4: Invite the Bot

https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&permissions=274881432640&scope=bot+applications.commands

Replace YOUR_APP_ID with your Application ID from the Developer Portal → General Information.

Note: Re-inviting the bot to a server it’s already in will update its permissions without removing it. You won’t lose any data.
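
If you manage several bots, the invite link can be generated rather than edited by hand. A minimal sketch using only the standard library; `invite_url` is a hypothetical helper, and the default permissions value is the Text + Voice integer from Step 3.

```python
from urllib.parse import urlencode


def invite_url(app_id: str, permissions: int = 274881432640) -> str:
    """Build the OAuth2 invite link for a bot application."""
    query = urlencode({
        "client_id": app_id,
        "permissions": permissions,
        "scope": "bot applications.commands",  # the space is encoded as '+'
    })
    return f"https://discord.com/oauth2/authorize?{query}"
```

For example, `invite_url("YOUR_APP_ID")` reproduces the URL shown in this step.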

Step 5: Configure Hermes Discord Platform

Add to ~/.hermes/config.yaml:

discord:
  require_mention: true
  auto_thread: true
  thread_require_mention: false
  history_backfill: true

Step 6: Restart and Configure Gateway

hermes gateway restart
hermes gateway setup   # Select Discord platform

Using Voice Channels

Join a voice channel in your Discord server
In a text channel, type /join — the bot joins your VC
Speak normally — the bot transcribes, processes, and speaks back
Type /leave to disconnect

Discord Voice Commands

Command	Effect
`/join`	Bot joins the voice channel you’re in
`/leave`	Bot disconnects from voice
`/voice on`	Enable voice replies in text channels
`/voice off`	Disable voice replies

How It Works

Join detection — bot monitors who’s in the VC
Speech detection — Voice Activity detects when a user is speaking
Audio capture — raw audio streamed to STT engine
Processing — transcribed text → LLM → response
Acknowledgments — “Let me look into that…” plays while LLM processes
TTS playback — spoken response streamed to voice channel
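
The acknowledgment step can be sketched with asyncio: start the LLM call, and play a filler phrase only if the answer is not ready within a short delay. This is an illustration under assumptions, not Hermes's implementation. `reply_with_ack`, the `generate` and `play` callables, and the 1.5 second delay are all hypothetical.

```python
import asyncio

ACK_TEXT = "Let me look into that…"


async def reply_with_ack(generate, play, prompt: str, ack_after: float = 1.5) -> str:
    """Speak the LLM reply, preceded by an acknowledgment if the LLM is slow."""
    task = asyncio.create_task(generate(prompt))
    # Wait up to ack_after seconds; asyncio.wait does not cancel the task on timeout.
    done, _pending = await asyncio.wait({task}, timeout=ack_after)
    if not done:
        await play(ACK_TEXT)  # fill the silence while the LLM is still working
    reply = await task
    await play(reply)
    return reply
```

A fast reply skips the acknowledgment entirely, so the filler phrase is heard only when there would otherwise be a noticeable pause.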

Echo Prevention

The bot automatically filters out its own audio output from the input stream to prevent echo loops.
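
The guide does not say how the filtering is done. One plausible approach, since Discord attributes incoming voice audio to individual speakers, is to drop every frame attributed to the bot's own user ID before it reaches STT. The sketch below shows that idea; `VoiceFrame` and `without_own_audio` are hypothetical names, not part of Hermes or any Discord library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class VoiceFrame:
    user_id: int  # Discord user the audio is attributed to
    pcm: bytes    # raw audio payload


def without_own_audio(frames: Iterable[VoiceFrame], bot_user_id: int) -> Iterator[VoiceFrame]:
    """Yield only frames spoken by someone other than the bot."""
    for frame in frames:
        if frame.user_id != bot_user_id:
            yield frame
```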

Troubleshooting

“No audio device found” (CLI):

python3 -c "import sounddevice; print(sounddevice.query_devices())"
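
The one-liner lists every audio device, including output-only ones. To narrow the list to devices that can actually record, a small sketch like the following may help. It relies on the `name` and `max_input_channels` fields that `sounddevice.query_devices()` returns for each device; the helper names are hypothetical.

```python
def input_devices(devices) -> list:
    """Names of devices that can record (at least one input channel)."""
    return [d["name"] for d in devices if d.get("max_input_channels", 0) > 0]


def find_microphones() -> list:
    """Query the system's audio devices and keep the ones usable as a microphone."""
    import sounddevice  # lazy import; needs `pip install sounddevice`

    return input_devices(sounddevice.query_devices())
```

If `find_microphones()` returns an empty list, the problem is at the operating system or driver level rather than in Hermes.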

Bot doesn’t respond in Discord server channels: Enable Message Content Intent in Developer Portal → Bot → Privileged Gateway Intents.

Bot joins VC but doesn’t hear me: Enable Voice Activity permission and Server Members Intent.

Bot responds in text but not voice channel: Check Speak permission. Also verify the bot’s audio output isn’t muted in Discord.

Whisper returns garbage text: The hallucination filter should catch this. Try a larger model (small or medium).

Phone Calls

Not natively supported by Hermes. SMS is available via Twilio (configured with TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN), but no voice calling, PSTN, or SIP integration exists yet.

A custom bridge could theoretically be built (Twilio Voice webhook → Hermes API), but it would be a from-scratch integration project.

Configuration Reference

Full `config.yaml` voice section

# STT (Speech-to-Text)
stt:
  enabled: true
  provider: local       # local, groq, openai, mistral
  local:
    model: base
    language: ''

# TTS (Text-to-Speech)
tts:
  provider: edge        # edge, elevenlabs, openai, minimax, mistral, neutts
  streaming: false      # Enable for lower perceived latency
  edge:
    voice: en-US-AriaNeural

# Voice interaction (CLI)
voice:
  record_key: ctrl+b
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0

Provider comparison

Feature	faster-whisper (local)	Groq	OpenAI
Cost	Free	Free tier	Paid
Speed	Fast (GPU-accelerated)	Very fast	Fast
Accuracy	Good (`base`)	Very good	Excellent
Internet	Model download only	Required	Required

Feature	Edge TTS	ElevenLabs	OpenAI
Cost	Free	Free tier	Paid
Naturalness	Good	Excellent	Very good
Voices	~10 English	100+	6
Streaming	Yes	Yes	Yes
