Voice Call
Beyond chatting with the Leena AI agent, users can also interact with it through a voice call, which offers all the capabilities available in chat. Voice calls are designed to provide a more natural, convenient, and accessible way to get the help you need, especially for complex requests and for executing enterprise workflows.
How to Start a Voice Call
Getting started is simple. Just follow these steps:
- Locate the Icon: In the chat interface of the Leena AI web app, you will see a "Start Voice Call" icon beside the chat composer.
- Grant Permission: The first time you use this feature, your browser will ask for permission to access your microphone. Please grant access to proceed.
- Start Talking: Once the call connects, you can start talking to the virtual agent and describe your request or issue aloud.
What to Expect During Your Call
To make your experience as smooth as possible, we've included several helpful features:
- Transcription: As you speak, your words are transcribed and displayed in the chat window in near real time. The agent's spoken responses also appear as text, giving you a complete record of the conversation for later reference, including any links provided in a response.
- Interactive Content: If the virtual agent needs you to click a link, fill out a form, or press a button, these elements will appear directly in the chat interface for you to interact with during the call.
- Call Controls: You have full control over the call. You can mute or unmute your microphone and end the call whenever you wish.
Tips for the Best Experience
To ensure the best performance and accuracy, please keep the following in mind:
- Internet Connection: The quality of the voice call depends on your internet bandwidth.
- Microphone Quality: A clear audio input from your microphone will improve the accuracy of the transcription.
- Background Noise: Try to make calls from a quiet environment, as background noise can affect the agent's ability to understand you.
What to Expect in Later Phases
The current release is phase 1. More capabilities are planned:
- More Languages: Voice currently supports English only. We plan to support more languages once English has stabilized.
- Voice Customization: In the future, the voice agent will be customizable with different voices and possibly regional dialects.
- Phone Calling: Voice calls in the virtual assistant are the first step. The longer-term goal is a phone number that users can call to interact with the Leena AI agent.
Behind the Scenes
Key components
- Retell.ai: session management and real-time ASR; provides interruption/breakpoint signals and streams text to Leena AI.
- Voice Orchestrator (perimeter fast path): GPT-4.1 configured to:
  - Instantly acknowledge or clarify
  - Provide micro-responses
  - Gate and forward complex turns to the core
- Core Orchestrator (Leena Autonomous Agent): planning, tool selection, execution, knowledge grounding, and long-running job coordination.
- Text-to-Speech (TTS): Retell.ai primary; hot-swappable secondaries for automatic failover; voice persona per customer.
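The gating role of the Voice Orchestrator can be pictured with a small sketch. It is an illustration only: in the system described here the decision is made by GPT-4.1, and the exact-match keyword check below is a stand-in so that the example runs on its own. The names `Route`, `SIMPLE_PHRASES`, and `route_turn` are hypothetical and do not come from the product.

```python
from enum import Enum


class Route(Enum):
    VOICE_ORCHESTRATOR = "voice_orchestrator"  # answered on the fast path
    CORE_ORCHESTRATOR = "core_orchestrator"    # forwarded for planning and tool calls


# Hypothetical examples of simple/safe turns (greetings, confirmations).
SIMPLE_PHRASES = {"hi", "hello", "hey", "thanks", "thank you", "yes", "no", "ok", "okay"}


def route_turn(transcript: str) -> Route:
    """Decide which orchestrator should handle a final transcript.

    Stand-in heuristic: a turn counts as simple only if, after normalisation,
    it exactly matches a known greeting or confirmation. Everything else is
    treated as transactional/complex and goes to the core.
    """
    normalised = transcript.strip().lower().rstrip(".!?")
    if normalised in SIMPLE_PHRASES:
        return Route.VOICE_ORCHESTRATOR
    return Route.CORE_ORCHESTRATOR
```

Sending anything unrecognised to the core is the conservative default in this sketch: a misrouted greeting costs a little latency, while a misrouted transaction answered on the fast path could produce a wrong answer.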
Request Lifecycle
- Call start: Client app opens secure WebSocket (WSS) to Retell.ai; audio is streamed bi‑directionally.
- Real‑time ASR: Retell.ai emits partial/final transcripts to Leena over WSS (encrypted in transit).
- Perimeter fast‑ack: Voice Orchestrator consumes streaming text; issues instant natural acks (e.g. "Got it—checking your PTO") without blocking on the core.
- Routing:
  - Simple/safe turns (greetings, confirmations): the Voice Orchestrator answers directly.
  - Transactional/complex turns: forwarded to the Core Orchestrator.
- Execution (Core): intent grounding -> plan -> tool/API calls (e.g. HRIS). Context kept minimal (see latency section) and cached when safe.
- Progress updates: Core streams machine‑readable status ("checking KB", "calling Time‑Off API"), which Voice Orchestrator rephrases as natural speech.
- Response delivery: Final text -> TTS (Retell.ai preferred) -> audio frames to Retell.ai -> back to user over the same WSS channel.
- Observability: Per call, Leena stores metadata, audio, transcripts (chronological), summaries, and sentiment for analytics and debugging (in the customer's hosting region).
- Barge-in/interrupts:
  - Retell.ai signals user interrupt events mid-utterance.
  - Voice Orchestrator arbitrates: continue / cancel / queue relative to the Core's in-flight job, using Core state to determine if rollback is possible.
- Auto Termination: The call is terminated automatically in two cases:
  - The user is silent for more than 2 minutes.
  - The call runs longer than 40 minutes (this limit is configurable).
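The two auto-termination rules can be expressed as a small check run against the call's timestamps. This is a minimal sketch under stated assumptions, not the actual implementation: `CallLimits` and `termination_reason` are hypothetical names, times are plain seconds, and exposing the silence limit as a setting is an assumption (the text above only says the 40-minute cap is configurable).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallLimits:
    # The 40-minute cap is described as configurable in the text above.
    max_call_seconds: float = 40 * 60
    # Assumption: the 2-minute silence limit is exposed here only for illustration.
    max_silence_seconds: float = 2 * 60


def termination_reason(
    now: float,
    call_started_at: float,
    last_user_speech_at: float,
    limits: Optional[CallLimits] = None,
) -> Optional[str]:
    """Return why the call should end, or None if it should continue."""
    limits = limits or CallLimits()
    if now - call_started_at > limits.max_call_seconds:
        return "max_duration_exceeded"
    if now - last_user_speech_at > limits.max_silence_seconds:
        return "user_silent"
    return None
```

For example, `termination_reason(now=121, call_started_at=0, last_user_speech_at=0)` returns `"user_silent"`, while the same call at `now=120` continues because the limit is "more than" 2 minutes.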
Latency design (why it feels instant)
We achieve near-instant acknowledgements (shorter than a natural human pause), smooth turn-taking, and fast task completion through:
- Perimeter fast path using GPT‑4.1 to produce immediate, context‑aware acks and simple replies.
- Streaming ASR: text tokens arrive as the user speaks; no end‑of‑speech blocking.
- Progressive disclosure: user hears meaningful updates while the Core executes.
- Context diet: Core strictly limits prompt/context size per turn; uses selective retrieval and memoized tool results (caching) where permissible.
- Parallelism: plan & tool‑prep in parallel with TTS buffering of non‑critical preambles.
- Vendor locality roadmap: future regional ASR/TTS to reduce RTT where needed.
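The combination of a fast acknowledgement and progressive disclosure can be sketched with `asyncio`. Everything here is a stand-in chosen for illustration: `core_execute` simulates the Core with short sleeps, `STATUS_PHRASES` is a hypothetical mapping from machine-readable status codes to spoken phrasing, and `speak` is any callable that would hand text to TTS.

```python
import asyncio
from typing import Callable, List

# Hypothetical mapping from machine-readable Core status to natural speech.
STATUS_PHRASES = {
    "checking_kb": "I'm looking that up in the knowledge base.",
    "calling_time_off_api": "I'm checking your time-off balance now.",
}


async def core_execute(request: str, on_status: Callable[[str], None]) -> str:
    """Stand-in for the Core: emits status codes while it 'works'."""
    on_status("checking_kb")
    await asyncio.sleep(0.01)  # simulated retrieval
    on_status("calling_time_off_api")
    await asyncio.sleep(0.01)  # simulated tool/API call
    return f"Done: {request}"


async def handle_turn(request: str, speak: Callable[[str], None]) -> None:
    """Acknowledge immediately, narrate progress, then deliver the result."""

    def narrate(code: str) -> None:
        # Rephrase the status code; fall back to a generic update if unknown.
        speak(STATUS_PHRASES.get(code, "Still working on it."))

    # Schedule the Core job as a background task; the ack below is emitted
    # without waiting for that job to finish.
    job = asyncio.create_task(core_execute(request, narrate))
    speak("Got it, checking that for you.")
    speak(await job)


def run_demo() -> List[str]:
    spoken: List[str] = []
    asyncio.run(handle_turn("PTO balance", spoken.append))
    return spoken
```

In this sketch the user-facing order is always ack, then progress updates, then the final answer, which mirrors the lifecycle described earlier. A real system would stream audio rather than append strings to a list.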
Dialogue quality: prosody, sentiment, and control
- Naturalness default: modern TTS voices sound humanlike even without SSML (Speech Synthesis Markup Language).
- Production control: for long or sensitive reads (spelling, reset steps), we apply SSML (pace, pauses, emphasis, repeat) on top of vendor defaults.
- Sentiment alignment: Voice Orchestrator infers user sentiment and mirrors tone (e.g. upbeat for holidays, calm for issues) by choosing phrasing and SSML cues.
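As an illustration of the SSML controls mentioned above (pace, pauses, emphasis, repeat), the sketch below builds markup for reading a code or password character by character. The elements used (`speak`, `break`, `prosody`, `emphasis`, `say-as`) are standard W3C SSML, but which of them a particular TTS vendor honors is not confirmed by this document, and the helper name `spell_out_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(intro: str, code: str, pause_ms: int = 400, repeat: bool = True) -> str:
    """Build SSML that reads `code` character by character at a slow pace."""
    # Escape user-supplied text so characters like "<" or "&" stay valid XML.
    spelled = f'<say-as interpret-as="characters">{escape(code)}</say-as>'
    slow = f'<prosody rate="slow">{spelled}</prosody>'
    pause = f'<break time="{pause_ms}ms"/>'
    parts = [escape(intro), pause, slow]
    if repeat:
        # Repeat the sensitive value once, with emphasis on the cue phrase.
        parts += [pause, '<emphasis level="moderate">Once more:</emphasis>', slow]
    return "<speak>" + " ".join(parts) + "</speak>"
```

Calling `spell_out_ssml("Your temporary password is", "A7Q2")` yields a single `<speak>` document in which the password is read slowly twice, separated by pauses.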
Security, privacy, and data residency
- Transport security: All hops use WSS (TLS), bi-directional:
  - Client to Retell.ai (audio up)
  - Retell.ai to Leena (transcripts)
  - Leena to TTS to Retell.ai (audio down)
- Retention controls at the subprocessor: Retell.ai is configured not to persist call data beyond a minimal operational window; Leena stores authoritative logs.
- Hosting region: Leena persists audio/transcripts in the same region as the customer’s Leena tenant.
- Constraint: Retell.ai currently hosts in US regions only, which can block customers with strict data-localization requirements (e.g. some Middle East countries, EU-only mandates). However, Retell.ai is configured not to retain call data such as call logs.
- Roadmap: evaluate in‑house / open‑source voice stack with pluggable ASR/TTS to unlock additional hosting regions and full data‑plane control.
