Hermes Voice Mode Guide

Set up speech-to-speech voice chat with Hermes Agent on Telegram, CLI push-to-talk, and Discord voice channels — all with free providers.

Updated 17/06/2026
Tags: Hermes, Voice, STT, TTS, faster-whisper, Edge TTS, Discord, Telegram

Overview

Hermes Agent supports voice interaction across three modes:

Mode | Platform | Style | Real-time | Status
Voice messages | Telegram | Send voice note → hear reply | No |
Push-to-talk | CLI | Hold Ctrl+B, speak, release | Near real-time |
Voice channels | Discord | Bot joins VC, multi-user | Yes | Setup needed

All three run on entirely free providers: faster-whisper for transcription and Edge TTS for speech output. No API keys required.

How It Works

Your voice → microphone → faster-whisper (STT) → transcribed text

You hear voice ← speakers ← Edge TTS ← LLM generates response

On Telegram this is turn-based: you send a voice note and wait for the reply. In the CLI and in Discord the pipeline streams, so transcription and TTS happen in real time or close to it.

STT (Speech-to-Text) Setup

Free: local faster-whisper

pip install faster-whisper
hermes config set stt.provider local

Config in ~/.hermes/config.yaml:

stt:
  enabled: true
  provider: local
  local:
    model: base          # tiny, base, small, medium, large-v3
    language: ''         # leave empty for auto-detect

First run downloads the model (~140MB for base). base balances speed and accuracy well. Use tiny if you need faster transcriptions, small or medium for better accuracy.
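To confirm that faster-whisper works on your machine independently of Hermes, a minimal sketch like the following may help. It calls the faster-whisper Python API directly; the helper names, the CPU/int8 settings, and any file path you pass in are assumptions for illustration and are not part of Hermes.

```python
def join_segments(segments):
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe(path, model_size="base", language=None):
    """Transcribe an audio file with faster-whisper.

    language=None asks the model to auto-detect the language,
    which mirrors leaving `language` empty in config.yaml.
    """
    # Imported lazily so join_segments() works without the package installed.
    from faster_whisper import WhisperModel

    # CPU + int8 is an assumed conservative default; adjust if you have a GPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path, language=language)
    return info.language, join_segments(segments)
```

Calling transcribe("sample.wav") on a short recording of your own should return the detected language code and the transcript. The first call downloads the model, as described above.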

Alternative STT providers (cloud):

Provider | Env var | Notes
Groq Whisper | GROQ_API_KEY | Free tier available
OpenAI Whisper | VOICE_TOOLS_OPENAI_KEY | Paid
Mistral Voxtral | MISTRAL_API_KEY | Paid

TTS (Text-to-Speech) Setup

Free: Edge TTS

hermes config set tts.provider edge

Config:

tts:
  provider: edge
  edge:
    voice: en-US-AriaNeural

Other natural voices: en-US-GuyNeural, en-US-JennyNeural, en-GB-SoniaNeural, en-AU-NatashaNeural.
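To audition a voice before putting it in config.yaml, you can call the edge-tts Python package directly. This is a standalone sketch, not how Hermes invokes the provider internally; the function name and the output file name are assumptions.

```python
import asyncio


async def synthesize(text, voice="en-US-AriaNeural", out_path="reply.mp3"):
    """Render text to an MP3 file with the edge-tts package."""
    # Imported lazily; install with: pip install edge-tts
    import edge_tts

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)
    return out_path
```

Running asyncio.run(synthesize("Hello from Hermes.", voice="en-GB-SoniaNeural")) should write reply.mp3 to the current directory. It needs an internet connection, because Edge TTS is an online service.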

Streaming TTS

Hermes supports streaming TTS: audio starts playing before the full response has been generated, which reduces perceived latency, most noticeably on long replies. Enable it with:

tts:
  streaming: true
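One common way to implement streaming TTS is to cut the LLM's token stream into sentences and hand each sentence to the speech engine as soon as it is complete. The sketch below illustrates that idea only; it is an assumption about the general technique, not Hermes' actual implementation, and the function name is hypothetical.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks):
    """Yield complete sentences as soon as they appear in a text stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


stream = ["Hel", "lo there. How", " are you? I am", " fine"]
print(list(sentences_from_stream(stream)))
# ['Hello there.', 'How are you?', 'I am fine']
```

The first sentence can be spoken while later ones are still being generated, which is where the latency gain comes from.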

Hallucination Filter

A built-in filter keeps garbled or hallucinated transcriptions from being acted on and answered aloud. If the STT output looks like noise, it is silently discarded.
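The source does not describe how the filter decides what counts as noise. As a rough illustration of the kind of heuristics such a filter might use, the sketch below drops empty transcripts, phrases that Whisper models are commonly reported to emit on silence, and highly repetitive output. The phrase list, the threshold, and the function name are all assumptions.

```python
from collections import Counter

# Phrases commonly reported as Whisper output on silence (assumed list).
KNOWN_HALLUCINATIONS = {"thanks for watching", "thank you for watching"}


def looks_like_noise(text, max_repeat_ratio=0.6):
    """Return True if a transcript should be discarded rather than answered."""
    cleaned = text.strip().lower().strip(".!?, ")
    if not cleaned:
        return True
    if cleaned in KNOWN_HALLUCINATIONS:
        return True
    words = cleaned.split()
    # One word dominating a longer transcript suggests a decoding loop.
    if len(words) >= 4:
        most_common = Counter(words).most_common(1)[0][1]
        if most_common / len(words) > max_repeat_ratio:
            return True
    return False
```

A real filter would likely also look at model confidence scores; this version works on text alone to stay self-contained.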

Alternative TTS providers:

Provider | Env var | Notes
ElevenLabs | ELEVENLABS_API_KEY | Free tier, most natural
OpenAI TTS | VOICE_TOOLS_OPENAI_KEY | alloy, echo, fable voices
MiniMax | MINIMAX_API_KEY | Paid

After changing providers, restart the gateway:

hermes gateway restart

Mode 1: Telegram Voice Messages

Already configured if you’re using the Telegram gateway.

Send a voice message in chat → Hermes transcribes with faster-whisper → responds with Edge TTS audio.

Commands in chat:

Mode 2: CLI Push-to-Talk

Start Hermes in the terminal and enable voice:

hermes
/voice on

Hold Ctrl+B to record. Hermes auto-detects silence and stops recording when you finish speaking. Your words are transcribed, sent to the LLM, and the response is spoken back through your speakers.

Config options in config.yaml:

voice:
  record_key: ctrl+b        # Hold to talk
  max_recording_seconds: 120
  auto_tts: false           # Set true to always hear voice
  beep_enabled: true        # Beep when recording starts
  silence_threshold: 200
  silence_duration: 3.0     # Seconds of silence before auto-stop

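This feels closer to ChatGPT Voice Mode than Telegram does, because transcription runs locally and there is no messaging round trip. Note that Edge TTS is an online service, so speech output still needs an internet connection.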
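The silence_threshold and silence_duration options above suggest a level-based auto-stop. The sketch below shows one plausible reading: recording stops once the input level has stayed under the threshold for silence_duration seconds after speech was first heard. Treating the threshold as an RMS level of 16-bit samples is an assumption, as is the function name; the source does not state the units.

```python
def find_stop_frame(frame_levels, frame_seconds,
                    silence_threshold=200, silence_duration=3.0):
    """Return the index of the frame at which recording would auto-stop.

    frame_levels: loudness per audio frame (assumed: RMS of 16-bit samples).
    Returns None if the silence window is never reached.
    """
    silent_for = 0.0
    heard_speech = False
    for i, level in enumerate(frame_levels):
        if level >= silence_threshold:
            heard_speech = True
            silent_for = 0.0
        elif heard_speech:
            # Only count silence after the user has started talking.
            silent_for += frame_seconds
            if silent_for >= silence_duration:
                return i
    return None
```

With half-second frames, one loud frame followed by six quiet ones stops at index 6, that is, after three seconds of silence. If recordings cut off mid-sentence, raising silence_duration is the first thing to try.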

CLI Voice commands

Command | Effect
/voice on | Enable voice-to-voice mode
/voice off | Disable voice, text only
/voice tts | Toggle TTS output on/off
/voice status | Show current state

Mode 3: Discord Voice Channels

The most immersive option — the bot joins a Discord voice channel, listens to everyone speaking, transcribes in real-time, processes through the LLM, and speaks replies back in the voice channel. Multiple people can talk and the bot will hear everyone.

This is the closest Hermes gets to ChatGPT’s Advanced Voice Mode.

Prerequisites

  1. A Discord bot application at https://discord.com/developers/applications
  2. Bot token with voice permissions
  3. Bot invited to your server with the correct permissions integer

Step 1: Create the Discord Application

  1. Go to https://discord.com/developers/applications
  2. Click New Application, name it (e.g., “Xena”)
  3. Go to Bot tab → Reset Token → copy the token
  4. Add to ~/.hermes/.env:
DISCORD_BOT_TOKEN=your_token_here

Step 2: Enable Privileged Gateway Intents

In Bot tab → Privileged Gateway Intents, enable all three:

Intent | Purpose
Presence Intent | Detect user online/offline status
Server Members Intent | Resolve usernames in voice channels
Message Content Intent | Read message content in text channels

Step 3: Add Voice Permissions

In Installation → Default Install Settings → Guild Install, add these permissions:

Permission | Purpose | Required
Connect | Join voice channels | Yes
Speak | Play TTS audio in VC | Yes
Use Voice Activity | Detect when users are speaking | Yes

Permissions integers:

Level | Integer | Includes
Text only | 274878286912 | View Channels, Send Messages, Read History, Embeds, Attachments, Threads, Reactions
Text + Voice | 274881432640 | All above + Connect, Speak
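A permissions integer is a bitwise OR of individual permission flags, so the two values above can be checked by hand. The sketch below uses the flag positions from Discord's permission documentation. It also shows that the Text + Voice integer covers Connect and Speak but not the Use Voice Activity bit; if your setup needs that permission in the invite, the last value includes it.

```python
# Bit flags from Discord's permission documentation.
CONNECT = 1 << 20            # join voice channels
SPEAK = 1 << 21              # play audio in voice channels
USE_VAD = 1 << 25            # "Use Voice Activity"

TEXT_ONLY = 274878286912     # text-only integer from the table above

text_and_voice = TEXT_ONLY | CONNECT | SPEAK
with_voice_activity = text_and_voice | USE_VAD

print(text_and_voice)        # 274881432640
print(with_voice_activity)   # 274914987072
```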

Step 4: Invite the Bot

https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&permissions=274881432640&scope=bot+applications.commands

Replace YOUR_APP_ID with your Application ID from the Developer Portal → General Information.

Note: Re-inviting the bot to a server it’s already in will update its permissions without removing it. You won’t lose any data.
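If you manage several bots or permission levels, the invite URL above can be generated rather than edited by hand. This is a small convenience sketch; the function name is hypothetical, and the default permissions value is the Text + Voice integer from this guide.

```python
from urllib.parse import urlencode


def build_invite_url(app_id, permissions=274881432640):
    """Build the OAuth2 invite URL for a bot with slash-command support."""
    query = urlencode({
        "client_id": app_id,
        "permissions": permissions,
        # urlencode turns the space into "+", giving bot+applications.commands
        "scope": "bot applications.commands",
    })
    return f"https://discord.com/oauth2/authorize?{query}"


print(build_invite_url("YOUR_APP_ID"))
```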

Step 5: Configure Hermes Discord Platform

Add to ~/.hermes/config.yaml:

discord:
  require_mention: true
  auto_thread: true
  thread_require_mention: false
  history_backfill: true

Step 6: Restart and Configure Gateway

hermes gateway restart
hermes gateway setup   # Select Discord platform

Using Voice Channels

  1. Join a voice channel in your Discord server
  2. In a text channel, type /join — the bot joins your VC
  3. Speak normally — the bot transcribes, processes, and speaks back
  4. Type /leave to disconnect

Discord Voice Commands

Command | Effect
/join | Bot joins the voice channel you’re in
/leave | Bot disconnects from voice
/voice on | Enable voice replies in text channels
/voice off | Disable voice replies

How It Works

  1. Join detection — bot monitors who’s in the VC
  2. Speech detection — Voice Activity detects when a user is speaking
  3. Audio capture — raw audio streamed to STT engine
  4. Processing — transcribed text → LLM → response
  5. Acknowledgments — “Let me look into that…” plays while LLM processes
  6. TTS playback — spoken response streamed to voice channel
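Steps 3 to 6 above can be sketched as a single asynchronous turn handler. This is an illustration of the flow under stated assumptions, not Hermes' code: stt, llm, tts and play are hypothetical caller-supplied async callables, and playing the acknowledgment only when the LLM takes longer than ack_after seconds is a design choice made for this example.

```python
import asyncio


async def handle_utterance(audio, stt, llm, tts, play,
                           ack_text="Let me look into that...", ack_after=1.0):
    """Run one voice turn: transcribe, think, and speak the reply.

    stt, llm, tts and play are caller-supplied async callables (hypothetical).
    """
    text = await stt(audio)
    if not text:
        return None  # nothing intelligible was said

    reply_task = asyncio.create_task(llm(text))
    # If the LLM is slow, fill the gap with a short spoken acknowledgment.
    done, _ = await asyncio.wait({reply_task}, timeout=ack_after)
    if not done:
        await play(await tts(ack_text))

    reply = await reply_task
    await play(await tts(reply))
    return reply
```

Starting the LLM call as a task before playing the acknowledgment means the filler audio overlaps with processing instead of adding to the wait.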

Echo Prevention

The bot automatically filters out its own audio output from the input stream to prevent echo loops.

Troubleshooting

“No audio device found” (CLI):

python3 -c "import sounddevice; print(sounddevice.query_devices())"

Bot doesn’t respond in Discord server channels: Enable Message Content Intent in Developer Portal → Bot → Privileged Gateway Intents.

Bot joins VC but doesn’t hear me: Enable the Use Voice Activity permission and the Server Members Intent.

Bot responds in text but not voice channel: Check Speak permission. Also verify the bot’s audio output isn’t muted in Discord.

Whisper returns garbage text: The hallucination filter should catch this. Try a larger model (small or medium).

Phone Calls

Not natively supported by Hermes. SMS is available via Twilio (TWILIO_ACCOUNT_SID + TWILIO_AUTH_TOKEN), but no voice calling, PSTN, or SIP integration exists yet.

A custom bridge could theoretically be built (Twilio Voice webhook → Hermes API), but it would be a from-scratch integration project.

Configuration Reference

Full config.yaml voice section

# STT (Speech-to-Text)
stt:
  enabled: true
  provider: local       # local, groq, openai, mistral
  local:
    model: base
    language: ''

# TTS (Text-to-Speech)
tts:
  provider: edge        # edge, elevenlabs, openai, minimax, mistral, neutts
  streaming: false      # Enable for lower perceived latency
  edge:
    voice: en-US-AriaNeural

# Voice interaction (CLI)
voice:
  record_key: ctrl+b
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0

Provider comparison

Feature | faster-whisper (local) | Groq | OpenAI
Cost | Free | Free tier | Paid
Speed | Fast (GPU-accelerated) | Very fast | Fast
Accuracy | Good (base) | Very good | Excellent
Internet | Model download only | Required | Required

Feature | Edge TTS | ElevenLabs | OpenAI
Cost | Free | Free tier | Paid
Naturalness | Good | Excellent | Very good
Voices | ~10 English | 100+ | 6
Streaming | Yes | Yes | Yes