Accept Voice-Based Orders
In this level, you’re taking your agent hands-free.
You’ll enable customers to place pizza orders by speaking, using speech recognition (and optionally speech synthesis) so the agent can listen, understand, confirm, and create orders end-to-end.
By the end, your agent will feel like a voice kiosk: “I’d like two large pepperoni pizzas” → “Got it! Confirm?” → Order created. 🎙️🍕
📋 Tasks
- Deploy a speech-capable model in Microsoft Foundry (speech-to-text or speech-to-speech).
- Capture audio input (e.g., in the Agent Playground or your own app) and either stream it to the agent or transcribe it first.
- Parse the intent and entities (size, crust, toppings, quantity, store) from the transcript.
- Confirm the order by voice (read back the order and ask for yes/no).
- Create the order via your existing API/MCP flow after confirmation.
- Return a spoken and/or text receipt (order number + ETA).
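
To make the entity-parsing task concrete, here is a minimal, rule-based sketch of turning a transcript into a structured order. It is only an illustration of the target shape: in the workshop the deployed model would most likely do this extraction itself. The `PizzaOrder` class, the `parse_order` function, and the vocabulary lists are all hypothetical names and values chosen for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabularies; replace with your real menu data.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
SIZES = ("small", "medium", "large")
CRUSTS = ("thin", "thick", "stuffed")
TOPPINGS = ("pepperoni", "mushrooms", "olives", "onions", "ham", "pineapple")

# A number (digits or a number word) followed by at most four words and then "pizza(s)".
QUANTITY_RE = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b(?:\s+\w+){0,4}?\s+pizzas?\b"
)
# "from/at/for [the] <name> store", e.g. "from the downtown store".
STORE_RE = re.compile(r"\b(?:from|at|for)\s+(?:the\s+)?(?!the\b|store\b)(\w+)\s+store\b")


@dataclass
class PizzaOrder:
    quantity: Optional[int] = None
    size: Optional[str] = None
    crust: Optional[str] = None
    toppings: List[str] = field(default_factory=list)
    store: Optional[str] = None


def parse_order(transcript: str) -> PizzaOrder:
    """Extract order entities from a transcript; unknown fields stay None."""
    text = transcript.lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    order = PizzaOrder()

    qty = QUANTITY_RE.search(text)
    if qty:
        token = qty.group(1)
        order.quantity = int(token) if token.isdigit() else NUMBER_WORDS[token]

    order.size = next((s for s in SIZES if s in words), None)
    order.crust = next((c for c in CRUSTS if c in words), None)
    order.toppings = [t for t in TOPPINGS if t in words]

    store = STORE_RE.search(text)
    if store:
        order.store = store.group(1)
    return order


if __name__ == "__main__":
    print(parse_order("I'd like two large pepperoni pizzas from the downtown store"))
```

Fields that the transcript does not mention stay `None` (or an empty list for toppings), which gives the agent a simple signal for what to ask about next.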
✅ Pass Criteria and Requirements
- A model that supports speech input, either speech-to-text (STT) or speech-to-speech (STS), is deployed and used by the agent.
- The agent correctly identifies the pizza order from voice input (items, quantities, key options).
- The user hears/sees a confirmation that the order was successfully created (e.g., read-back + order ID).
💡 Hints & Tips
- Use Microsoft Foundry to deploy STT or Realtime (speech-to-speech) models that accept audio.
- The Agent Playground offers a UI for providing audio input, which is handy for quick testing.
- Design the conversation as a 3-step flow:
  - Listen & transcribe (capture audio, produce text)
  - Confirm (summarize order; handle corrections)
  - Create order (call your existing API/MCP; then acknowledge with order ID)
- Make confirmation explicit to avoid accidental orders (“Did you mean two large pepperoni pizzas for Store Downtown? Say ‘confirm’ to place it.”).
- Handle noisy inputs by asking clarifying questions (e.g., “How many pizzas?” or “Which store?”).
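
The confirm-then-create part of this flow, including clarifying questions and an explicit confirmation gate, might be sketched as follows. This assumes the transcript has already been parsed into a draft order. `Draft`, `next_prompt`, `handle_reply`, and the `create_order` callable are hypothetical names; `create_order` stands in for your existing API/MCP call and is assumed to return an order ID and an ETA in minutes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Draft:
    quantity: Optional[int] = None
    size: Optional[str] = None
    topping: Optional[str] = None
    store: Optional[str] = None


# One clarifying question per required field (hypothetical wording).
QUESTIONS = {
    "quantity": "How many pizzas?",
    "size": "Which size?",
    "topping": "Which topping?",
    "store": "Which store?",
}
CONFIRM_WORDS = {"confirm", "yes"}


def missing_fields(draft: Draft) -> List[str]:
    """Names of required fields the transcript did not provide."""
    return [name for name in QUESTIONS if getattr(draft, name) is None]


def next_prompt(draft: Draft) -> str:
    """Ask for the first missing field, or read the full order back."""
    missing = missing_fields(draft)
    if missing:
        return QUESTIONS[missing[0]]
    noun = "pizza" if draft.quantity == 1 else "pizzas"
    return (
        f"Did you mean {draft.quantity} {draft.size} {draft.topping} {noun} "
        f"for Store {draft.store}? Say 'confirm' to place it."
    )


def handle_reply(
    draft: Draft,
    reply: str,
    create_order: Callable[[Draft], Tuple[str, int]],
) -> str:
    """Create the order only when it is complete and explicitly confirmed."""
    if missing_fields(draft):
        return next_prompt(draft)
    if reply.strip().lower().rstrip(".!") not in CONFIRM_WORDS:
        return "Okay, I have not placed the order. What would you like to change?"
    order_id, eta_minutes = create_order(draft)  # stand-in for the API/MCP call
    return f"Order {order_id} is placed. Estimated time: {eta_minutes} minutes."
```

Passing `create_order` in as a callable keeps the confirmation logic testable without a live backend: a stub returning a fixed order ID and ETA is enough to verify that nothing is created until the user says "confirm".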