Internal Service API

These endpoints are called internally by the Go API Gateway's `chef_mode_voice.go` handler. Client applications connect via WebSocket at `/api/v1/chef-mode/voice/ws` through the gateway rather than calling these endpoints directly.

Voice AI Chef Mode API

REST endpoints powering the Chef Mode voice assistant for hands-free kitchen operation. Chef Mode enables kitchen staff to interact with AI using voice commands while keeping their hands free during food preparation and service.

The system supports recipe guidance, ingredient substitutions, cooking timers, plating suggestions, food safety information, and station-specific assistance (grill, saute, fry, pastry, prep).

Issue References: #492 (Chef Mode Voice WebSocket), Epic #705 (Conversational AI Interface)

Overview

| Attribute | Value |
| --- | --- |
| Base Path | `/api/ai/voice` |
| Router Tag | `voice-ai` |
| Authentication | Internal service-to-service (Go Gateway to Python Analytics) |
| AI Model Tier | T2 (Gemini 2.0 Flash) for fast, low-latency responses |
| Max Context | Last 10 conversation turns |
| Max Response Tokens | 500 (concise kitchen-friendly answers) |

Architecture Flow

```text
Flutter Client (KDS Shell)
  |
  | WebSocket: /api/v1/chef-mode/voice/ws
  v
Go API Gateway (chef_mode_voice.go)
  |
  | REST: POST /api/ai/voice/stream
  | REST: POST /api/ai/voice/query
  v
Python Analytics Service (voice_ai_chef_routes.py)
  |
  | Speech-to-Text / Text-to-Speech
  | ACP AI Router (T2 tier)
  v
AI Response returned to client
```

Health Check

Check if the Voice AI Chef Mode service is healthy and available.

```http
GET /api/ai/voice/health
```

Response

```json
{
  "status": "healthy",
  "service": "voice-ai-chef"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `status` | string | Service health status: `"healthy"` or `"unhealthy"` |
| `service` | string | Service identifier |
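A caller can treat any status other than `"healthy"` as a failure. A minimal sketch (the `is_healthy` helper is illustrative, not part of the service):

```python
import json

def is_healthy(response_body: str) -> bool:
    """Return True when the health endpoint reports a healthy service."""
    payload = json.loads(response_body)
    return payload.get("status") == "healthy"

# Example payload matching the documented response shape.
body = '{"status": "healthy", "service": "voice-ai-chef"}'
print(is_healthy(body))  # True
```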

Stream Audio for Transcription

Send a base64-encoded audio chunk for speech-to-text transcription. The Go API Gateway calls this endpoint for each audio chunk received over the WebSocket connection from the Flutter client.

```http
POST /api/ai/voice/stream
Content-Type: application/json
```

Request Body

```json
{
  "session_id": "voice-1708012345678",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
  "encoding": "LINEAR16",
  "sample_rate_hertz": 16000
}
```

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `session_id` | string | Yes | -- | Voice session identifier |
| `audio` | string | Yes | -- | Base64-encoded audio data |
| `encoding` | string | No | `"LINEAR16"` | Audio encoding format |
| `sample_rate_hertz` | integer | No | `16000` | Audio sample rate in Hz |
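As a hedged sketch of what a caller (such as the gateway) might do, the helper below builds the request body from raw PCM bytes, base64-encoding the audio and filling in the documented defaults. The function name is hypothetical, not part of the API:

```python
import base64
import json

def build_stream_request(session_id: str, pcm_bytes: bytes,
                         encoding: str = "LINEAR16",
                         sample_rate_hertz: int = 16000) -> str:
    """Build the JSON body for POST /api/ai/voice/stream.

    The audio field must be base64-encoded; encoding and sample rate
    fall back to the documented defaults.
    """
    return json.dumps({
        "session_id": session_id,
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        "encoding": encoding,
        "sample_rate_hertz": sample_rate_hertz,
    })

body = build_stream_request("voice-1708012345678", b"\x00\x01\x02\x03")
```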

Response

```json
{
  "session_id": "voice-1708012345678",
  "transcript": "How long should I sear the ribeye?",
  "is_final": true,
  "confidence": 0.94,
  "latency_ms": 185
}
```

| Field | Type | Description |
| --- | --- | --- |
| `session_id` | string | Voice session identifier |
| `transcript` | string or null | Transcribed text, or null if no speech detected |
| `is_final` | boolean | Whether this is a final transcription result |
| `confidence` | float | Transcription confidence score (0.0 to 1.0) |
| `latency_ms` | integer | Processing time in milliseconds |
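Because the endpoint can return interim results (`is_final: false`) and chunks with a null transcript, a caller typically waits for the first final, non-null transcript before querying the AI. A hypothetical sketch of that filtering logic:

```python
def collect_final_transcript(responses):
    """Return the first final transcript from a sequence of
    VoiceStreamResponse dicts, skipping interim results and
    chunks where no speech was detected (transcript is null)."""
    for resp in responses:
        if resp.get("is_final") and resp.get("transcript"):
            return resp["transcript"]
    return None

chunks = [
    {"transcript": "How long", "is_final": False, "confidence": 0.61},
    {"transcript": None, "is_final": False, "confidence": 0.0},
    {"transcript": "How long should I sear the ribeye?",
     "is_final": True, "confidence": 0.94},
]
final = collect_final_transcript(chunks)
```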

Process Text Query

Send a text query to the Chef Mode AI assistant and receive a contextual response. This endpoint is called by the Go API Gateway after transcription completes, or when the user sends a text query directly.

The AI uses a kitchen-specialized system prompt and the T2 model tier (Gemini 2.0 Flash) for fast, low-latency responses optimized for service environments.

```http
POST /api/ai/voice/query
Content-Type: application/json
```

Request Body

```json
{
  "session_id": "voice-1708012345678",
  "text": "How long should I sear the ribeye?",
  "context": [
    {
      "role": "user",
      "content": "What temp for medium-rare ribeye?",
      "timestamp": "2026-02-20T18:30:00Z"
    },
    {
      "role": "assistant",
      "content": "For medium-rare ribeye, pull it off the heat at 130F internal. It will carry over to about 135F while resting.",
      "timestamp": "2026-02-20T18:30:01Z"
    }
  ],
  "tenant_id": "550e8400-e29b-41d4-a716-446655449100",
  "location_id": "550e8400-e29b-41d4-a716-446655449110",
  "station": "grill"
}
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `session_id` | string | Yes | Voice session identifier |
| `text` | string | Yes | Text query from the user |
| `context` | array | No | Previous conversation turns (max 10 retained) |
| `context[].role` | string | Yes | `"user"` or `"assistant"` |
| `context[].content` | string | Yes | Message content |
| `context[].timestamp` | string | No | ISO 8601 timestamp |
| `tenant_id` | string | Yes | Tenant identifier |
| `location_id` | string | No | Location identifier |
| `station` | string | No | Kitchen station: `"grill"`, `"saute"`, `"fry"`, `"pastry"`, `"prep"`, etc. |
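Since only the last 10 conversation turns are retained, a caller can trim the history before sending it. A minimal sketch, assuming a hypothetical `build_query_request` helper (not part of the API):

```python
import json

MAX_CONTEXT_TURNS = 10  # matches the documented "last 10 turns" limit

def build_query_request(session_id, text, tenant_id,
                        context=None, location_id=None, station=None):
    """Build the JSON body for POST /api/ai/voice/query, keeping
    only the most recent MAX_CONTEXT_TURNS conversation turns."""
    body = {
        "session_id": session_id,
        "text": text,
        "tenant_id": tenant_id,
        "context": (context or [])[-MAX_CONTEXT_TURNS:],
    }
    if location_id:
        body["location_id"] = location_id
    if station:
        body["station"] = station
    return json.dumps(body)

history = [{"role": "user", "content": f"turn {i}"} for i in range(15)]
body = build_query_request("voice-1708012345678",
                           "How long should I sear the ribeye?",
                           "550e8400-e29b-41d4-a716-446655449100",
                           context=history, station="grill")
```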

Response

```json
{
  "session_id": "voice-1708012345678",
  "text": "For a 1-inch ribeye, sear 3-4 minutes per side on high heat. Use the hand test or a thermometer to check doneness. Let it rest 5 minutes before plating.",
  "audio": null,
  "latency_ms": 320
}
```

| Field | Type | Description |
| --- | --- | --- |
| `session_id` | string | Voice session identifier |
| `text` | string | AI response text |
| `audio` | string or null | Base64-encoded audio response (reserved for future TTS integration) |
| `latency_ms` | integer | Processing time in milliseconds |

Station Context

When a `station` value is provided, the AI system prompt is augmented with station-specific context, improving the relevance of responses for that kitchen area.

| Station | Guidance Focus |
| --- | --- |
| `grill` | Searing, temperatures, timing, grill marks |
| `saute` | Pan techniques, heat control, sauce building |
| `fry` | Oil temperatures, breading, drain times |
| `pastry` | Baking temps, dough handling, decoration |
| `prep` | Knife work, mise en place, batch prep |

Graceful Degradation

If the AI service is unavailable, the endpoint returns a fallback response instead of an error, ensuring the voice session can continue:

```json
{
  "session_id": "voice-1708012345678",
  "text": "I'm sorry, I couldn't process your request at this time. Please try again.",
  "audio": null,
  "latency_ms": 5
}
```

Transcribe and Respond

Combined one-shot endpoint that transcribes audio and processes the result with AI in a single request. Useful for simpler interaction flows where streaming is not needed.

```http
POST /api/ai/voice/transcribe-and-respond
Content-Type: application/json
```

Request Body

```json
{
  "session_id": "voice-1708012345678",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
  "context": [],
  "tenant_id": "550e8400-e29b-41d4-a716-446655449100",
  "station": "grill"
}
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `session_id` | string | Yes | Voice session identifier |
| `audio` | string | Yes | Base64-encoded audio data |
| `context` | array | No | Previous conversation turns |
| `tenant_id` | string | No | Tenant identifier |
| `station` | string | No | Kitchen station identifier |
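Conceptually, this endpoint chains the two steps above. The sketch below shows that flow with hypothetical `transcribe` and `respond` stand-ins for the real speech-to-text and AI calls:

```python
RETRY_TEXT = "I didn't catch that. Could you please repeat?"

def transcribe_and_respond(session_id, audio_b64, transcribe, respond):
    """One-shot flow: transcribe the audio, then query the AI.
    If transcription yields no usable text, return the retry prompt."""
    transcript = transcribe(audio_b64)
    if not transcript or not transcript.strip():
        return {"session_id": session_id, "text": RETRY_TEXT, "audio": None}
    return {"session_id": session_id, "text": respond(transcript),
            "audio": None}

resp = transcribe_and_respond(
    "voice-1708012345678", "UklGRiQ...",
    transcribe=lambda audio: "",            # simulated failed transcription
    respond=lambda text: "answer",
)
```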

Response

Returns a VoiceQueryResponse with the AI response to the transcribed audio:

```json
{
  "session_id": "voice-1708012345678",
  "text": "For a 1-inch ribeye, sear 3-4 minutes per side on high heat. Let it rest 5 minutes before plating.",
  "audio": null,
  "latency_ms": 520
}
```

If transcription fails or produces no usable text, the endpoint returns a prompt to retry:

```json
{
  "session_id": "voice-1708012345678",
  "text": "I didn't catch that. Could you please repeat?",
  "audio": null,
  "latency_ms": 150
}
```

Get Session Context

Retrieve conversation context for an active voice session.

```http
GET /api/ai/voice/sessions/{session_id}/context
```

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `session_id` | string | Voice session identifier |

Response

```json
{
  "session_id": "voice-1708012345678",
  "context": [],
  "message": "Session context is managed client-side"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `session_id` | string | Voice session identifier |
| `context` | array | Conversation turns (currently empty; context is managed client-side) |
| `message` | string | Status message |
> **Note:** Session context is currently managed client-side in the Go API Gateway's `VoiceSession` struct. The gateway maintains up to 10 conversation turns per session and passes them to the `/query` endpoint with each request. This endpoint is reserved for future server-side session persistence via Redis or Cloud Spanner.


Error Handling

HTTP Status Codes

| Status | Description |
| --- | --- |
| 200 | Success |
| 400 | Invalid request (e.g., malformed base64 audio) |
| 500 | Internal server error |

Error Response Format

```json
{
  "detail": "Invalid base64 audio data"
}
```

Common Errors

| Error | Endpoint | Cause |
| --- | --- | --- |
| Invalid base64 audio data | `/stream` | The `audio` field contains data that is not valid base64 |
| I'm sorry, I couldn't process your request at this time. | `/query` | AI Gateway unreachable or returned an error (returned as 200 with fallback text) |
| I didn't catch that. Could you please repeat? | `/transcribe-and-respond` | Transcription failed or produced no usable text (returned as 200 with fallback text) |

Resilience Behavior

The /query and /transcribe-and-respond endpoints are designed to return fallback text responses (HTTP 200) rather than error status codes when the AI service is unavailable. This ensures the WebSocket voice session remains active and the user receives feedback, even during partial outages.


Data Models

VoiceStreamRequest

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `session_id` | string | Yes | -- | Voice session ID |
| `audio` | string | Yes | -- | Base64-encoded audio data |
| `encoding` | string | No | `"LINEAR16"` | Audio encoding format |
| `sample_rate_hertz` | integer | No | `16000` | Sample rate in Hz |

VoiceStreamResponse

| Field | Type | Description |
| --- | --- | --- |
| `session_id` | string | Voice session ID |
| `transcript` | string or null | Transcribed text |
| `is_final` | boolean | Whether the transcription is final |
| `confidence` | float | Confidence score (0.0-1.0) |
| `latency_ms` | integer | Processing latency in milliseconds |

VoiceQueryRequest

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `session_id` | string | Yes | Voice session ID |
| `text` | string | Yes | Text query from user |
| `context` | array of ConversationTurn | No | Conversation history |
| `tenant_id` | string | Yes | Tenant identifier |
| `location_id` | string | No | Location identifier |
| `station` | string | No | Kitchen station (grill, saute, fry, pastry, prep) |

VoiceQueryResponse

| Field | Type | Description |
| --- | --- | --- |
| `session_id` | string | Voice session ID |
| `text` | string | AI response text |
| `audio` | string or null | Base64-encoded audio (reserved for future TTS) |
| `latency_ms` | integer | Processing latency in milliseconds |

ConversationTurn

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `role` | string | Yes | `"user"` or `"assistant"` |
| `content` | string | Yes | Message content |
| `timestamp` | string | No | ISO 8601 timestamp |