Voice Architecture

Key components

  • Retell.ai: session management + real‑time ASR; provides interruption/breakpoint signals and streams text to Leena AI

  • Voice Orchestrator (perimeter fast path): GPT‑4.1 configured to:
    • Instantly acknowledge or clarify
    • Provide micro‑responses
    • Gate and forward complex turns to the Core
  • Core Orchestrator (Leena Autonomous Agent): planning, tool selection, execution, knowledge grounding, and long‑running job coordination.

  • Text-to-Speech (TTS): Retell.ai primary; hot‑swappable secondaries for automatic failover; voice persona per customer.
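The TTS failover described above can be sketched as follows. This is a minimal illustration, not Leena's actual implementation: providers are modeled as plain callables, and the names `synthesize_with_failover`, `flaky_primary`, and `healthy_secondary` are hypothetical stand-ins rather than any vendor API.

```python
from typing import Callable, Sequence

# Assumption: a TTS provider is modeled as "text in, audio bytes out".
TTSProvider = Callable[[str], bytes]


class AllProvidersFailed(RuntimeError):
    """Raised when neither the primary nor any secondary could synthesize."""


def synthesize_with_failover(text: str, providers: Sequence[TTSProvider]) -> bytes:
    """Try providers in priority order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider(text)
        except Exception as exc:  # real code would catch vendor-specific errors
            errors.append(exc)
    raise AllProvidersFailed(f"{len(errors)} provider(s) failed: {errors!r}")


# Hypothetical stubs standing in for a failing primary and a healthy secondary.
def flaky_primary(text: str) -> bytes:
    raise TimeoutError("primary TTS unavailable")


def healthy_secondary(text: str) -> bytes:
    return b"audio:" + text.encode("utf-8")


audio = synthesize_with_failover("Hello", [flaky_primary, healthy_secondary])
```

Ordering the provider list by priority keeps the swap logic in configuration: changing the primary is a reordering, not a code change.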

Request Lifecycle

  • Call start: Client app opens secure WebSocket (WSS) to Retell.ai; audio is streamed bi‑directionally.
  • Real‑time ASR: Retell.ai emits partial/final transcripts to Leena over WSS (encrypted in transit).
  • Perimeter fast‑ack: Voice Orchestrator consumes streaming text; issues instant natural acks (e.g. "Got it—checking your PTO") without blocking on the core.
  • Routing:
    • Simple/safe turns (greetings, confirmations): the Voice Orchestrator answers directly.
    • Transactional/complex turns: forwarded to the Core Orchestrator.
  • Execution (Core): intent grounding -> plan -> tool/API calls (e.g. HRIS). Context kept minimal (see latency section) and cached when safe.
  • Progress updates: Core streams machine‑readable status ("checking KB", "calling Time‑Off API"), which Voice Orchestrator rephrases as natural speech.
  • Response delivery: Final text -> TTS (Retell.ai preferred) -> audio frames to Retell.ai -> back to user over the same WSS channel.
  • Observability: For each call, Leena stores metadata, audio, chronological transcripts, summaries, and sentiment for analytics and debugging, in the customer’s hosting region.
  • Barge‑in/interrupts:
    • Retell.ai signals user interrupt events mid‑utterance.
    • Voice Orchestrator arbitrates: continue / cancel / queue relative to the Core’s in‑flight job, using Core state to determine if rollback is possible.
  • Auto‑termination: a call is terminated automatically in two cases:
    • If the user is silent for more than 2 minutes
    • If the call runs longer than 40 minutes (this limit is configurable)
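The routing and auto-termination rules in this lifecycle can be sketched as below. This is an illustrative sketch under stated assumptions: the intent labels, the function names, and the idea that an upstream step has already labeled the turn are all hypothetical. Only the two destinations and the 2-minute and 40-minute thresholds come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: an upstream step has already labeled the turn with an intent.
# These labels are hypothetical examples of "simple/safe" turns.
SIMPLE_INTENTS = {"greeting", "confirmation"}


def route_turn(intent: str) -> str:
    """Simple/safe turns stay on the fast path; everything else goes to the Core."""
    return "voice_orchestrator" if intent in SIMPLE_INTENTS else "core_orchestrator"


@dataclass(frozen=True)
class TerminationPolicy:
    # The text only states that the call-length limit is configurable;
    # exposing the silence limit as a field is a convenience of this sketch.
    max_silence_s: float = 2 * 60.0   # user silent for more than 2 minutes
    max_call_s: float = 40 * 60.0     # call longer than 40 minutes


def termination_reason(
    now: float,
    call_started_at: float,
    last_user_speech_at: float,
    policy: TerminationPolicy = TerminationPolicy(),
) -> Optional[str]:
    """Return why the call should end, or None if it should continue."""
    if now - last_user_speech_at > policy.max_silence_s:
        return "silence_timeout"
    if now - call_started_at > policy.max_call_s:
        return "max_duration"
    return None
```

All timestamps are seconds on a single monotonic clock, so the check can be run on every timer tick without depending on wall-clock time.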

Latency design (why it feels instant)

We achieve near‑instant acks (shorter than a typical human conversational pause), smooth turn‑taking, and fast task completion via:

  • Perimeter fast path using GPT‑4.1 to produce immediate, context‑aware acks and simple replies.
  • Streaming ASR: text tokens arrive as the user speaks; no end‑of‑speech blocking.
  • Progressive disclosure: user hears meaningful updates while the Core executes.
  • Context diet: Core strictly limits prompt/context size per turn; uses selective retrieval and memoized tool results (caching) where permissible.
  • Parallelism: plan & tool‑prep in parallel with TTS buffering of non‑critical preambles.
  • Vendor locality roadmap: future regional ASR/TTS to reduce RTT where needed.
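The fast-ack and progressive-disclosure pattern above can be sketched with asyncio. This is a toy model, not the production pipeline: `run_core`, `speak`, `handle_turn`, the phrase table, and the simulated delays are all hypothetical. The point is only that the Core task starts before the ack is spoken, so the ack never blocks on it.

```python
import asyncio
from typing import List

# Hypothetical mapping from machine-readable Core statuses to natural speech.
STATUS_PHRASES = {
    "checking KB": "I'm looking that up in the knowledge base.",
    "calling Time-Off API": "I'm pulling up your time-off details.",
}


async def speak(text: str, spoken: List[str]) -> None:
    """Stand-in for TTS playback: records the utterance after a short delay."""
    await asyncio.sleep(0.01)
    spoken.append(text)


async def run_core(request: str, events: asyncio.Queue) -> None:
    """Stand-in for the Core: emits status events, then a final answer."""
    for status in ("checking KB", "calling Time-Off API"):
        await asyncio.sleep(0.02)  # simulated tool latency
        await events.put(("status", status))
    await events.put(("final", "Here is your PTO balance."))  # placeholder text


async def handle_turn(request: str) -> List[str]:
    spoken: List[str] = []
    events: asyncio.Queue = asyncio.Queue()
    # Start the Core first so it works while the ack is being spoken.
    core_task = asyncio.create_task(run_core(request, events))
    await speak("Got it, checking your PTO.", spoken)
    while True:
        kind, payload = await events.get()
        if kind == "final":
            await speak(payload, spoken)
            break
        await speak(STATUS_PHRASES.get(payload, "Still working on it."), spoken)
    await core_task
    return spoken


utterances = asyncio.run(handle_turn("How much PTO do I have?"))
```

The queue decouples the Core's pace from the speech pace: status events that arrive while an utterance is still playing simply wait their turn.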

Dialogue quality: prosody, sentiment, and control

  • Naturalness default: modern TTS voices sound humanlike even without SSML.
  • Production control: for long or sensitive reads (spelling, reset steps), we apply SSML (pace, pauses, emphasis, repeat) on top of vendor defaults.
  • Sentiment alignment: Voice Orchestrator infers user sentiment and mirrors tone (e.g. upbeat for holidays, calm for issues) by choosing phrasing and SSML cues.
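As an illustration of the SSML production control above, the sketch below builds markup that spells out a value slowly, with pauses, and repeats it. The function name and defaults are hypothetical. The tags used (`speak`, `break`, `prosody`, `say-as`) are standard SSML elements, but support for individual tags and attribute values varies by TTS vendor, so treat this as an assumption to verify against the chosen provider.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(label: str, value: str, pause_ms: int = 300, repeat: int = 2) -> str:
    """Build SSML that reads `value` character by character, `repeat` times."""
    # One <say-as> per character, separated by short pauses; whitespace is skipped.
    chars = f'<break time="{pause_ms}ms"/>'.join(
        f'<say-as interpret-as="characters">{escape(ch)}</say-as>'
        for ch in value
        if not ch.isspace()
    )
    one_read = f'<prosody rate="slow">{chars}</prosody>'
    # A longer pause separates the label from the value and each repetition.
    long_pause = f'<break time="{2 * pause_ms}ms"/>'
    body = long_pause.join([one_read] * repeat)
    return f"<speak>{escape(label)}{long_pause}{body}</speak>"


ssml = spell_out_ssml("Your temporary code is", "A7 K2")
```

Escaping the label and each character keeps the markup well-formed even when the text contains `&` or `<`.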

Security, privacy, and data residency

  • Transport security: All hops use WSS (TLS), bi‑directional.
    • Client to Retell.ai (audio up)
    • Retell.ai to Leena (transcripts)
    • Leena to TTS to Retell.ai (audio down)
  • Retention controls at the subprocessor: Retell.ai is configured not to persist call data beyond a minimal operational window; Leena stores authoritative logs.
  • Hosting region: Leena persists audio/transcripts in the same region as the customer’s Leena tenant.
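  • Constraint: Retell.ai currently hosts in US regions only, which can block customers with strict data‑localization requirements (e.g. some Middle East countries, EU‑only mandates). However, per the retention controls above, Retell.ai does not retain call data such as call logs beyond a minimal operational window.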
  • Roadmap: evaluate in‑house / open‑source voice stack with pluggable ASR/TTS to unlock additional hosting regions and full data‑plane control.