These endpoints are called internally by the Go API Gateway's chef_mode_voice.go handler. Client applications connect via WebSocket at /api/v1/chef-mode/voice/ws through the gateway rather than calling these endpoints directly.
Voice AI Chef Mode API
REST endpoints powering the Chef Mode voice assistant for hands-free kitchen operation. Chef Mode enables kitchen staff to interact with AI using voice commands while keeping their hands free during food preparation and service.
The system supports recipe guidance, ingredient substitutions, cooking timers, plating suggestions, food safety information, and station-specific assistance (grill, saute, fry, pastry, prep).
Issue References: #492 (Chef Mode Voice WebSocket), Epic #705 (Conversational AI Interface)
Overview
| Attribute | Value |
|---|---|
| Base Path | /api/ai/voice |
| Router Tag | voice-ai |
| Authentication | Internal service-to-service (Go Gateway to Python Analytics) |
| AI Model Tier | T2 (Gemini 2.0 Flash), chosen for low-latency responses |
| Max Context | Last 10 conversation turns |
| Max Response Tokens | 500 (concise kitchen-friendly answers) |
Architecture Flow
Flutter Client (KDS Shell)
|
| WebSocket: /api/v1/chef-mode/voice/ws
v
Go API Gateway (chef_mode_voice.go)
|
| REST: POST /api/ai/voice/stream
| REST: POST /api/ai/voice/query
v
Python Analytics Service (voice_ai_chef_routes.py)
|
| Speech-to-Text / Text-to-Speech
| ACP AI Router (T2 tier)
v
AI Response returned to client
Health Check
Check if the Voice AI Chef Mode service is healthy and available.
GET /api/ai/voice/health
Response
{
"status": "healthy",
"service": "voice-ai-chef"
}
| Field | Type | Description |
|---|---|---|
| status | string | Service health status: "healthy" or "unhealthy" |
| service | string | Service identifier |
Stream Audio for Transcription
Send a base64-encoded audio chunk for speech-to-text transcription. The Go API Gateway calls this endpoint for each audio chunk received over the WebSocket connection from the Flutter client.
POST /api/ai/voice/stream
Content-Type: application/json
Request Body
{
"session_id": "voice-1708012345678",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
"encoding": "LINEAR16",
"sample_rate_hertz": 16000
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| session_id | string | Yes | -- | Voice session identifier |
| audio | string | Yes | -- | Base64-encoded audio data |
| encoding | string | No | "LINEAR16" | Audio encoding format |
| sample_rate_hertz | integer | No | 16000 | Audio sample rate in Hz |
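As an illustration of the request shape above, the following is a minimal sketch of how a caller might serialize one audio chunk. The function name `build_stream_request` is hypothetical and not part of the service; only the field names and defaults come from the table.

```python
import base64
import json


def build_stream_request(
    session_id: str,
    audio_chunk: bytes,
    encoding: str = "LINEAR16",
    sample_rate_hertz: int = 16000,
) -> str:
    """Serialize one raw audio chunk as a JSON body for POST /api/ai/voice/stream."""
    payload = {
        "session_id": session_id,
        # The audio field carries base64 text, not raw bytes.
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
        "encoding": encoding,
        "sample_rate_hertz": sample_rate_hertz,
    }
    return json.dumps(payload)
```

Omitting `encoding` and `sample_rate_hertz` from the call yields the documented defaults, so the helper only needs the session ID and the raw bytes in the common case.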
Response
{
"session_id": "voice-1708012345678",
"transcript": "How long should I sear the ribeye?",
"is_final": true,
"confidence": 0.94,
"latency_ms": 185
}
| Field | Type | Description |
|---|---|---|
| session_id | string | Voice session identifier |
| transcript | string or null | Transcribed text, or null if no speech detected |
| is_final | boolean | Whether this is a final transcription result |
| confidence | float | Transcription confidence score (0.0 to 1.0) |
| latency_ms | integer | Processing time in milliseconds |
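A caller typically needs to decide whether a given response is worth forwarding as a query. The sketch below shows one way to do that from the fields above. The helper name `final_transcript` and the `min_confidence` threshold of 0.5 are assumptions for illustration; the source does not specify a confidence cutoff.

```python
from typing import Optional


def final_transcript(response: dict, min_confidence: float = 0.5) -> Optional[str]:
    """Return the transcript only if it is final, non-empty, and confident enough."""
    transcript = response.get("transcript")
    # A null transcript means no speech was detected in this chunk.
    if not response.get("is_final") or not transcript:
        return None
    # Hypothetical cutoff: drop low-confidence results instead of querying on them.
    if response.get("confidence", 0.0) < min_confidence:
        return None
    return transcript.strip() or None
```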
Process Text Query
Send a text query to the Chef Mode AI assistant and receive a contextual response. This endpoint is called by the Go API Gateway after transcription completes, or when the user sends a text query directly.
The assistant uses a kitchen-specialized system prompt and the T2 model tier (Gemini 2.0 Flash) to keep response latency low during service.
POST /api/ai/voice/query
Content-Type: application/json
Request Body
{
"session_id": "voice-1708012345678",
"text": "How long should I sear the ribeye?",
"context": [
{
"role": "user",
"content": "What temp for medium-rare ribeye?",
"timestamp": "2026-02-20T18:30:00Z"
},
{
"role": "assistant",
"content": "For medium-rare ribeye, pull it off the heat at 130F internal. It will carry over to about 135F while resting.",
"timestamp": "2026-02-20T18:30:01Z"
}
],
"tenant_id": "550e8400-e29b-41d4-a716-446655449100",
"location_id": "550e8400-e29b-41d4-a716-446655449110",
"station": "grill"
}
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string | Yes | Voice session identifier |
| text | string | Yes | Text query from the user |
| context | array | No | Previous conversation turns (max 10 retained) |
| context[].role | string | Yes | "user" or "assistant" |
| context[].content | string | Yes | Message content |
| context[].timestamp | string | No | ISO 8601 timestamp |
| tenant_id | string | Yes | Tenant identifier |
| location_id | string | No | Location identifier |
| station | string | No | Kitchen station: "grill", "saute", "fry", "pastry", "prep", etc. |
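The required and optional fields above can be assembled as in the following sketch. `build_query_request` and its validation rules are hypothetical: the table only states which fields are required, so rejecting a turn with an unknown role or empty content is an assumption about sensible caller-side checks, not documented server behavior.

```python
from typing import Iterable, Optional

VALID_ROLES = {"user", "assistant"}


def build_query_request(
    session_id: str,
    text: str,
    tenant_id: str,
    context: Optional[Iterable[dict]] = None,
    location_id: Optional[str] = None,
    station: Optional[str] = None,
) -> dict:
    """Build a /query request body, leaving out optional fields that are unset."""
    turns = []
    for turn in context or []:
        # role and content are required on every turn; timestamp is optional.
        if turn.get("role") not in VALID_ROLES or not turn.get("content"):
            raise ValueError(f"invalid conversation turn: {turn!r}")
        turns.append(turn)
    body = {
        "session_id": session_id,
        "text": text,
        "context": turns,
        "tenant_id": tenant_id,
    }
    if location_id is not None:
        body["location_id"] = location_id
    if station is not None:
        body["station"] = station
    return body
```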
Response
{
"session_id": "voice-1708012345678",
"text": "For a 1-inch ribeye, sear 3-4 minutes per side on high heat. Use the hand test or a thermometer to check doneness. Let it rest 5 minutes before plating.",
"audio": null,
"latency_ms": 320
}
| Field | Type | Description |
|---|---|---|
| session_id | string | Voice session identifier |
| text | string | AI response text |
| audio | string or null | Base64-encoded audio response (reserved for future TTS integration) |
| latency_ms | integer | Processing time in milliseconds |
Station Context
When a station value is provided, the AI system prompt is augmented with station-specific context, improving the relevance of responses for that kitchen area.
| Station | Guidance Focus |
|---|---|
| grill | Searing, temperatures, timing, grill marks |
| saute | Pan techniques, heat control, sauce building |
| fry | Oil temperatures, breading, drain times |
| pastry | Baking temps, dough handling, decoration |
| prep | Knife work, mise en place, batch prep |
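The prompt augmentation described above could look roughly like the sketch below. The base prompt wording, the `build_system_prompt` name, and the sentence template are all invented for illustration; only the station names and their guidance focus come from the table. Unknown or missing stations fall back to the base prompt, which is an assumption about how unlisted values are handled.

```python
from typing import Optional

# Guidance focus per station, taken from the table above.
STATION_FOCUS = {
    "grill": "searing, temperatures, timing, grill marks",
    "saute": "pan techniques, heat control, sauce building",
    "fry": "oil temperatures, breading, drain times",
    "pastry": "baking temps, dough handling, decoration",
    "prep": "knife work, mise en place, batch prep",
}

# Hypothetical base prompt; the real system prompt is not shown in this document.
BASE_PROMPT = "You are a concise assistant for professional kitchen staff."


def build_system_prompt(station: Optional[str] = None) -> str:
    """Append station-specific focus to the base prompt when the station is known."""
    key = (station or "").strip().lower()
    focus = STATION_FOCUS.get(key)
    if focus is None:
        return BASE_PROMPT
    return f"{BASE_PROMPT} The cook is at the {key} station; focus on {focus}."
```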
Graceful Degradation
If the AI service is unavailable, the endpoint returns a fallback response with HTTP 200 instead of an error status, so the voice session can continue:
{
"session_id": "voice-1708012345678",
"text": "I'm sorry, I couldn't process your request at this time. Please try again.",
"audio": null,
"latency_ms": 5
}
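One way to implement this kind of degradation is to wrap the model call and substitute the fallback text on any failure. The sketch below is not the service's actual handler; `answer_with_fallback` and the `call_model` callable are hypothetical stand-ins for whatever invokes the AI router.

```python
import time
from typing import Callable

FALLBACK_TEXT = (
    "I'm sorry, I couldn't process your request at this time. Please try again."
)


def answer_with_fallback(
    session_id: str, text: str, call_model: Callable[[str], str]
) -> dict:
    """Return a response body, using the fallback text if the model call fails."""
    started = time.monotonic()
    try:
        reply = call_model(text)
    except Exception:
        # Swallow the failure so the caller still gets a well-formed response.
        reply = FALLBACK_TEXT
    latency_ms = int((time.monotonic() - started) * 1000)
    return {
        "session_id": session_id,
        "text": reply,
        "audio": None,
        "latency_ms": latency_ms,
    }
```

Catching every exception is a deliberate trade-off in this sketch: it keeps the session alive, but a real handler would likely also log the failure so outages remain visible.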
Transcribe and Respond
Combined one-shot endpoint that transcribes audio and processes the result with AI in a single request. Useful for simpler interaction flows where streaming is not needed.
POST /api/ai/voice/transcribe-and-respond
Content-Type: application/json
Request Body
{
"session_id": "voice-1708012345678",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
"context": [],
"tenant_id": "550e8400-e29b-41d4-a716-446655449100",
"station": "grill"
}
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string | Yes | Voice session identifier |
| audio | string | Yes | Base64-encoded audio data |
| context | array | No | Previous conversation turns |
| tenant_id | string | No | Tenant identifier |
| station | string | No | Kitchen station identifier |
Response
Returns a VoiceQueryResponse with the AI response to the transcribed audio:
{
"session_id": "voice-1708012345678",
"text": "For a 1-inch ribeye, sear 3-4 minutes per side on high heat. Let it rest 5 minutes before plating.",
"audio": null,
"latency_ms": 520
}
If transcription fails or produces no usable text, the endpoint returns a prompt to retry:
{
"session_id": "voice-1708012345678",
"text": "I didn't catch that. Could you please repeat?",
"audio": null,
"latency_ms": 150
}
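The two outcomes above (a normal answer, or a retry prompt when nothing usable was transcribed) can be sketched as a small composition of two steps. The `transcribe` and `answer` callables are hypothetical placeholders for the speech-to-text and AI query stages; the function is an illustration of the control flow, not the service's implementation.

```python
import time
from typing import Callable, Optional

RETRY_TEXT = "I didn't catch that. Could you please repeat?"


def transcribe_and_respond(
    session_id: str,
    audio_b64: str,
    transcribe: Callable[[str], Optional[str]],
    answer: Callable[[str], str],
) -> dict:
    """Transcribe audio, then answer it; ask the user to repeat if nothing usable."""
    started = time.monotonic()
    try:
        transcript = transcribe(audio_b64)
    except Exception:
        transcript = None
    cleaned = (transcript or "").strip()
    # Only call the AI step when transcription produced usable text.
    text = answer(cleaned) if cleaned else RETRY_TEXT
    return {
        "session_id": session_id,
        "text": text,
        "audio": None,
        "latency_ms": int((time.monotonic() - started) * 1000),
    }
```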
Get Session Context
Retrieve conversation context for an active voice session.
GET /api/ai/voice/sessions/{session_id}/context
Path Parameters
| Parameter | Type | Description |
|---|---|---|
| session_id | string | Voice session identifier |
Response
{
"session_id": "voice-1708012345678",
"context": [],
"message": "Session context is managed client-side"
}
| Field | Type | Description |
|---|---|---|
| session_id | string | Voice session identifier |
| context | array | Conversation turns (currently empty; context is managed client-side) |
| message | string | Status message |
Session context is currently managed by the caller rather than by this service ("client-side" from this service's perspective): the Go API Gateway keeps it in its VoiceSession struct, retains up to 10 conversation turns per session, and passes them to the /query endpoint with each request. This endpoint is reserved for future server-side session persistence via Redis or Cloud Spanner.
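The 10-turn window the gateway maintains can be modeled with a bounded queue. The gateway itself is written in Go, so the Python class below is only an illustrative analogue; `SessionContext` and its methods are hypothetical names and do not mirror the VoiceSession struct.

```python
from collections import deque
from datetime import datetime, timezone
from typing import List

MAX_CONTEXT_TURNS = 10


class SessionContext:
    """Keep only the most recent conversation turns for one voice session."""

    def __init__(self, max_turns: int = MAX_CONTEXT_TURNS) -> None:
        # A deque with maxlen silently drops the oldest turn once it is full.
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self._turns.append(
            {
                "role": role,
                "content": content,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
        )

    def as_list(self) -> List[dict]:
        """Return the turns oldest-first, ready for the context field of /query."""
        return list(self._turns)
```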
Error Handling
HTTP Status Codes
| Status | Description |
|---|---|
| 200 | Success |
| 400 | Invalid request (e.g., malformed base64 audio) |
| 500 | Internal server error |
Error Response Format
{
"detail": "Invalid base64 audio data"
}
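The 400 case for malformed audio can be illustrated with strict base64 decoding. This is a sketch under the assumption that validation is a plain strict decode; `validate_stream_audio` is a hypothetical helper that returns a status code and body rather than using any particular web framework.

```python
import base64
import binascii
from typing import Tuple


def validate_stream_audio(audio_b64: str) -> Tuple[int, dict]:
    """Return (200, {}) for decodable audio, or (400, error body) otherwise."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        base64.b64decode(audio_b64, validate=True)
    except (binascii.Error, ValueError):
        return 400, {"detail": "Invalid base64 audio data"}
    return 200, {}
```

Without `validate=True`, Python's decoder ignores stray characters, so a lenient decode could accept input that a strict check would reject.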
Common Errors
| Error | Endpoint | Cause |
|---|---|---|
| Invalid base64 audio data | /stream | The audio field contains data that is not valid base64 |
| I'm sorry, I couldn't process your request at this time. | /query | AI Gateway unreachable or returned an error (returned as 200 with fallback text) |
| I didn't catch that. Could you please repeat? | /transcribe-and-respond | Transcription failed or produced no usable text (returned as 200 with fallback text) |
Resilience Behavior
The /query and /transcribe-and-respond endpoints are designed to return fallback text responses (HTTP 200) rather than error status codes when the AI service is unavailable. This ensures the WebSocket voice session remains active and the user receives feedback, even during partial outages.
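Because fallbacks arrive as HTTP 200, a caller that wants to count or alert on degraded responses cannot rely on the status code. The sketch below assumes the only available signal is the response text itself and matches it against the documented fallback messages; `is_fallback` is a hypothetical helper, and matching on text is brittle if the wording ever changes.

```python
# Opening sentences of the two documented fallback messages.
FALLBACK_PREFIXES = (
    "I'm sorry, I couldn't process your request at this time.",
    "I didn't catch that.",
)


def is_fallback(response: dict) -> bool:
    """Return True if a 200 response carries one of the known fallback texts."""
    text = response.get("text") or ""
    # str.startswith accepts a tuple and matches any of its entries.
    return text.startswith(FALLBACK_PREFIXES)
```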
Data Models
VoiceStreamRequest
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| session_id | string | Yes | -- | Voice session ID |
| audio | string | Yes | -- | Base64-encoded audio data |
| encoding | string | No | "LINEAR16" | Audio encoding format |
| sample_rate_hertz | integer | No | 16000 | Sample rate in Hz |
VoiceStreamResponse
| Field | Type | Description |
|---|---|---|
| session_id | string | Voice session ID |
| transcript | string or null | Transcribed text |
| is_final | boolean | Whether the transcription is final |
| confidence | float | Confidence score (0.0-1.0) |
| latency_ms | integer | Processing latency in milliseconds |
VoiceQueryRequest
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string | Yes | Voice session ID |
| text | string | Yes | Text query from user |
| context | array of ConversationTurn | No | Conversation history |
| tenant_id | string | Yes | Tenant identifier |
| location_id | string | No | Location identifier |
| station | string | No | Kitchen station (grill, saute, fry, pastry, prep) |
VoiceQueryResponse
| Field | Type | Description |
|---|---|---|
| session_id | string | Voice session ID |
| text | string | AI response text |
| audio | string or null | Base64-encoded audio (reserved for future TTS) |
| latency_ms | integer | Processing latency in milliseconds |
ConversationTurn
| Field | Type | Required | Description |
|---|---|---|---|
| role | string | Yes | "user" or "assistant" |
| content | string | Yes | Message content |
| timestamp | string | No | ISO 8601 timestamp |
Related Resources
- Voice AI API - Full Voice AI API reference (Hey Maximus)
- Voice Sessions - Voice session lifecycle management
- AI Gateway - ACP AI Router and model tiers
- KDS API - Kitchen Display System integration