Architecture
Talkative Lobster is an Electron desktop app. The main process handles voice processing (STT and TTS) and gateway communication, while the renderer process displays the UI, captures microphone input, and plays back audio.
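The two processes talk over IPC through a contextBridge preload script. As a rough sketch (the channel and method names below are hypothetical, not the app's actual API), the exposed bridge might look like this:

```ts
// preload.ts -- hypothetical sketch of the IPC surface exposed to the renderer.
// Channel names ("voice:audio-chunk", "voice:state") are illustrative only.
import { contextBridge, ipcRenderer } from "electron";

contextBridge.exposeInMainWorld("voiceBridge", {
  // Renderer pushes captured audio chunks to the main process.
  sendAudioChunk: (chunk: ArrayBuffer) =>
    ipcRenderer.send("voice:audio-chunk", chunk),

  // Renderer subscribes to state-machine updates from the orchestrator.
  onStateChange: (handler: (state: string) => void) =>
    ipcRenderer.on("voice:state", (_event, state) => handler(state)),
});
```

On the renderer side, the VAD's speech-end callback would then hand the recorded chunk to something like window.voiceBridge.sendAudioChunk(...).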
Process Overview
┌─────────────────────────────────────────────────┐
│ Main Process │
│ │
│ Orchestrator ── coordinates all engines │
│ │ │
│ ├── Voice State Machine │
│ │ idle → listening → processing │
│ │ → thinking → speaking │
│ │ │
│ ├── STT (Speech-to-Text) │
│ │ ElevenLabs / Whisper / whisper.cpp │
│ │ │
│ ├── TTS (Text-to-Speech) │
│ │ ElevenLabs / VOICEVOX │
│ │ / Kokoro / Piper │
│ │ │
│ └── Gateway Client │
│ WebSocket → OpenClaw LLM gateway │
│ │
└────────────── IPC (contextBridge) ──────────────┘
│
┌───────────────────────┴─────────────────────────┐
│ Renderer Process │
│ │
│ Voice View ── main conversation UI │
│ └── Waveform ── audio visualization │
│ Setup Modal ── settings & connectivity checks │
│ │
│ VAD (Voice Activity Detection) │
│ └── Silero neural network model │
│ Speaker Monitor ── filters out system audio │
│ Audio Playback ── TTS + aizuchi audio │
│ │
└─────────────────────────────────────────────────┘
Data Flow
- Microphone → VAD detects speech start/end
- Audio chunks → sent to main process via IPC
- STT → converts audio to text
- Orchestrator → sends text to OpenClaw gateway via WebSocket
- LLM → streams response tokens back
- TTS → synthesizes audio from response text
- Renderer → plays audio through speakers
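The middle of that pipeline can be sketched as a single main-process turn. The SttEngine, TtsEngine, and GatewayClient interfaces below are assumptions for illustration only; the actual OpenClaw WebSocket protocol and engine APIs are not shown here.

```ts
// orchestrator-turn.ts -- schematic sketch of one conversation turn in the main
// process. SttEngine, TtsEngine, and GatewayClient are hypothetical interfaces.
interface SttEngine { transcribe(audio: Buffer): Promise<string>; }
interface TtsEngine { synthesize(text: string): Promise<Buffer>; }
interface GatewayClient {
  // Sends the user text and yields response tokens as the LLM streams them.
  streamCompletion(text: string): AsyncIterable<string>;
}

async function runTurn(
  audio: Buffer,
  stt: SttEngine,
  gateway: GatewayClient,
  tts: TtsEngine,
  playInRenderer: (speech: Buffer) => void, // e.g. sent back over IPC
): Promise<void> {
  // processing: speech-to-text
  const userText = await stt.transcribe(audio);

  // thinking: stream the LLM response from the gateway
  let reply = "";
  for await (const token of gateway.streamCompletion(userText)) {
    reply += token;
  }

  // speaking: synthesize and hand the audio to the renderer for playback
  const speech = await tts.synthesize(reply);
  playInRenderer(speech);
}
```

In practice TTS could begin before the full response has streamed; the sequential version above simply mirrors the state order (processing → thinking → speaking).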
Voice State Machine
The conversation lifecycle is managed by a state machine:
| State | Description |
|---|---|
| idle | Waiting for user to speak |
| listening | Speech detected, recording audio |
| processing | Converting speech to text |
| thinking | Waiting for LLM response |
| speaking | Playing AI response audio |
Transitions happen automatically. The user can interrupt during speaking by starting to talk, which transitions back to listening.
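A minimal sketch of such a state machine, including that barge-in transition (state names follow the table above; the events and transition function are illustrative, not the app's actual implementation):

```ts
// voice-state-machine.ts -- illustrative sketch of the conversation lifecycle.
type VoiceState = "idle" | "listening" | "processing" | "thinking" | "speaking";

type VoiceEvent =
  | "speechStart"      // VAD detects the user talking
  | "speechEnd"        // VAD detects silence after speech
  | "transcriptReady"  // STT finished
  | "responseComplete" // LLM finished streaming
  | "playbackDone";    // TTS audio finished playing

function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (state) {
    case "idle":
      return event === "speechStart" ? "listening" : state;
    case "listening":
      return event === "speechEnd" ? "processing" : state;
    case "processing":
      return event === "transcriptReady" ? "thinking" : state;
    case "thinking":
      return event === "responseComplete" ? "speaking" : state;
    case "speaking":
      // Barge-in: the user starting to talk interrupts playback.
      if (event === "speechStart") return "listening";
      return event === "playbackDone" ? "idle" : state;
  }
}
```

For example, nextState("speaking", "speechStart") returns "listening", which is the interrupt path described above.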