Voice Architecture

Key components

Retell.ai: session management + real‑time ASR; provides interruption/breakpoint signals and streams text to Leena AI
Voice Orchestrator (perimeter fast path): GPT‑4.1 configured to:
- Instantly acknowledge/clarify
- Provide micro‑responses
- Gate & forward complex turns to the core
Core Orchestrator (Leena Autonomous Agent): planning, tool selection, execution, knowledge grounding, and long‑running job coordination.
Text-to-Speech (TTS): Retell.ai primary; hot‑swappable secondaries for automatic failover; voice persona per customer.

Request Lifecycle

Call start: Client app opens secure WebSocket (WSS) to Retell.ai; audio is streamed bi‑directionally.
Real‑time ASR: Retell.ai emits partial/final transcripts to Leena over WSS (encrypted in transit).
Perimeter fast‑ack: Voice Orchestrator consumes streaming text; issues instant natural acks (e.g. "Got it—checking your PTO") without blocking on the core.
Routing:
- Simple/safe turns (greetings, confirmations): the Voice Orchestrator answers directly.
- Transactional/complex turns: forwarded to the Core Orchestrator.
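The routing rule above can be sketched as a small function. This is an illustrative sketch only: the intent labels, the `Route` enum, and `route_turn` are hypothetical names, and it assumes an intent label has already been produced upstream (the document does not describe how the Voice Orchestrator classifies a turn).

```python
from enum import Enum


class Route(Enum):
    VOICE = "voice_orchestrator"  # perimeter fast path answers directly
    CORE = "core_orchestrator"    # planning, tools, long-running jobs


# Hypothetical intent labels; the real label set is not given in this document.
SIMPLE_INTENTS = {"greeting", "confirmation"}


def route_turn(intent: str) -> Route:
    """Send simple/safe turns to the fast path and everything else to the Core."""
    if intent in SIMPLE_INTENTS:
        return Route.VOICE
    # Assumption: unknown intents default to the Core, the conservative choice.
    return Route.CORE
```

Defaulting unknown intents to the Core is a design choice made for this sketch: it trades a little latency for never letting the fast path answer something it should not.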
Execution (Core): intent grounding -> plan -> tool/API calls (e.g. HRIS). Context kept minimal (see latency section) and cached when safe.
Progress updates: Core streams machine‑readable status ("checking KB", "calling Time‑Off API"), which Voice Orchestrator rephrases as natural speech.
Response delivery: Final text -> TTS (Retell.ai preferred) -> audio frames to Retell.ai -> back to user over the same WSS channel.
Observability: For each call, Leena stores metadata, audio, chronological transcripts, summaries, and sentiment for analytics and debugging, in the customer’s hosting region.
Barge‑in/interrupts:
- Retell.ai signals user interrupt events mid‑utterance.
- Voice Orchestrator arbitrates: continue / cancel / queue relative to the Core’s in‑flight job, using Core state to determine if rollback is possible.
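One possible shape for the continue / cancel / queue decision is sketched below. The policy itself is an assumption made for illustration, as are the names `CoreJobState`, `Decision`, and `arbitrate`; the document states only that the three outcomes exist and that Core state determines whether rollback is possible.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"  # let the in-flight Core job finish
    CANCEL = "cancel"      # abort the in-flight Core job
    QUEUE = "queue"        # handle the new utterance after the job completes


@dataclass
class CoreJobState:
    """Hypothetical snapshot of the Core's in-flight job."""
    can_roll_back: bool


def arbitrate(job: CoreJobState, supersedes_job: bool) -> Decision:
    """Assumed policy for an interrupt that arrives while a Core job is running."""
    if not supersedes_job:
        # The interrupt adds a request rather than replacing the current one.
        return Decision.QUEUE
    if job.can_roll_back:
        # The user changed course and the job can be undone safely.
        return Decision.CANCEL
    # The job cannot be undone safely, so it runs to completion.
    return Decision.CONTINUE
```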
Auto termination: A call is terminated automatically in two cases:
- The user is silent for more than 2 minutes.
- The call runs longer than 40 minutes (this limit is configurable).
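The two termination rules can be expressed as a single check over timestamps. This is a minimal sketch: `TerminationPolicy` and `termination_reason` are hypothetical names, and times are plain seconds so the logic can be tested without a real clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TerminationPolicy:
    # Only the duration cap is described as configurable in this document;
    # the silence threshold is a field here purely for convenience.
    max_silence_s: float = 2 * 60.0
    max_call_s: float = 40 * 60.0


def termination_reason(
    now: float,
    call_started_at: float,
    last_user_speech_at: float,
    policy: TerminationPolicy = TerminationPolicy(),
) -> Optional[str]:
    """Return why the call should end, or None if it should continue."""
    if now - call_started_at > policy.max_call_s:
        return "max_duration"
    if now - last_user_speech_at > policy.max_silence_s:
        return "silence"
    return None
```

Both comparisons are strict, matching "more than 2 minutes" and "longer than 40 minutes": a call that is exactly at a limit continues.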

Latency design (why it feels instant)

We achieve near‑instant acks (shorter than a natural human pause), smooth turn‑taking, and fast task completion via:

Perimeter fast path using GPT‑4.1 to produce immediate, context‑aware acks and simple replies.
Streaming ASR: text tokens arrive as the user speaks; no end‑of‑speech blocking.
Progressive disclosure: user hears meaningful updates while the Core executes.
Context diet: Core strictly limits prompt/context size per turn; uses selective retrieval and memoized tool results (caching) where permissible.
Parallelism: plan & tool‑prep in parallel with TTS buffering of non‑critical preambles.
Vendor locality roadmap: future regional ASR/TTS to reduce RTT where needed.
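The fast‑path and parallelism ideas above amount to starting the Core's work and speaking the acknowledgement without waiting for it. A minimal sketch with Python's `asyncio` follows; `core_execute`, `handle_turn`, and `run_demo` are hypothetical stand‑ins, and the sleep simulates planning and tool calls rather than any real Leena or Retell.ai API.

```python
import asyncio
from typing import Callable, List


async def core_execute(utterance: str) -> str:
    """Stand-in for the Core: planning plus tool/API calls."""
    await asyncio.sleep(0.05)  # simulated work
    return f"Result for: {utterance}"


async def handle_turn(utterance: str, speak: Callable[[str], None]) -> None:
    # Schedule the Core job first so it runs concurrently with the ack.
    core_task = asyncio.create_task(core_execute(utterance))
    # The ack is spoken immediately, without blocking on the Core.
    speak("Got it, checking that now.")
    # Only the final answer waits for the Core to finish.
    speak(await core_task)


def run_demo() -> List[str]:
    spoken: List[str] = []
    asyncio.run(handle_turn("How much PTO do I have?", spoken.append))
    return spoken
```

In this sketch `speak` is any callable that hands text to TTS; collecting into a list makes the ordering (ack first, result second) easy to check.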

Dialogue quality: prosody, sentiment, and control

Naturalness default: modern TTS voices sound humanlike even without SSML.
Production control: for long or sensitive reads (spelling, reset steps), we apply SSML (pace, pauses, emphasis, repeat) on top of vendor defaults.
Sentiment alignment: Voice Orchestrator infers user sentiment and mirrors tone (e.g. upbeat for holidays, calm for issues) by choosing phrasing and SSML cues.
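As an illustration of the SSML controls mentioned above, the sketch below builds markup that reads a code character by character at a slower pace with pauses. It uses standard W3C SSML elements (`speak`, `prosody`, `break`, `say-as`); which of these a given TTS vendor honors is an assumption to verify, and `spell_out_ssml` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(label: str, code: str, pause_ms: int = 300) -> str:
    """Wrap a label and a spelled-out code in SSML with a slow rate and pauses."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Each character is read individually, with a pause between characters.
    chars = pause.join(
        f'<say-as interpret-as="characters">{escape(ch)}</say-as>' for ch in code
    )
    return f'<speak><prosody rate="slow">{escape(label)} {chars}</prosody></speak>'
```

Escaping the label and each character keeps user‑supplied text such as `&` or `<` from breaking the markup.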

Security, privacy, and data residency

Transport security: All hops use WSS (TLS), bi‑directional.
- Client to Retell.ai (audio up)
- Retell.ai to Leena (transcripts)
- Leena to TTS to Retell.ai (audio down)
Retention controls at the subprocessor: Retell.ai is configured not to persist call data beyond a minimal operational window; Leena stores authoritative logs.
Hosting region: Leena persists audio/transcripts in the same region as the customer’s Leena tenant.
Constraint: Retell.ai currently hosts in US regions only, which can block customers with strict data‑localization requirements (e.g. some Middle East countries, EU‑only mandates). However, Retell.ai does not retain call data such as call logs beyond the minimal operational window noted above.
Roadmap: evaluate in‑house / open‑source voice stack with pluggable ASR/TTS to unlock additional hosting regions and full data‑plane control.
