Accept Voice-Based Orders

In this level, you’re taking your agent hands-free.
You’ll enable customers to place pizza orders by speaking, using speech recognition (and optionally speech synthesis) so the agent can listen, understand, confirm, and create orders end-to-end.
By the end, your agent will feel like a voice kiosk: “I’d like two large pepperoni pizzas” → “Got it! Confirm?” → Order created. 🎙️🍕

📋 Tasks

  • Deploy a speech-capable model in Microsoft Foundry (speech-to-text or speech-to-speech).
  • Capture audio input (e.g., in the Agent Playground or your app) and stream it to the agent or transcribe it first.
  • Parse the intent and entities (size, crust, toppings, quantity, store) from the transcript.
  • Confirm the order by voice (read back the order and ask for yes/no).
  • Create the order via your existing API/MCP flow after confirmation.
  • Return a spoken and/or text receipt (order number + ETA).
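
To make the parsing task concrete, here is a minimal rule-based sketch of pulling order entities out of a transcript. In practice the agent's model would usually do this extraction; the keyword lists, the `PizzaOrder` class, and `parse_transcript` are hypothetical stand-ins for illustration, not part of the workshop's API.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabularies (assumptions); replace with your real menu.
SIZES = {"small", "medium", "large"}
CRUSTS = {"thin", "thick", "stuffed"}
TOPPINGS = {"pepperoni", "mushroom", "cheese", "olive", "ham"}
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


@dataclass
class PizzaOrder:
    quantity: Optional[int] = None
    size: Optional[str] = None
    crust: Optional[str] = None
    toppings: List[str] = field(default_factory=list)
    store: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of required fields the transcript did not supply."""
        required = {
            "quantity": self.quantity,
            "size": self.size,
            "toppings": self.toppings,
            "store": self.store,
        }
        return [name for name, value in required.items() if not value]


def parse_transcript(transcript: str) -> PizzaOrder:
    """Extract order entities from transcribed speech using keyword matching."""
    text = transcript.lower()
    order = PizzaOrder()
    for word in re.findall(r"[a-z0-9']+", text):
        # Treat simple plurals ("mushrooms") as their singular form.
        singular = word[:-1] if word.endswith("s") else word
        if order.quantity is None and word.isdigit():
            order.quantity = int(word)
        elif order.quantity is None and word in NUMBER_WORDS:
            order.quantity = NUMBER_WORDS[word]
        elif order.size is None and word in SIZES:
            order.size = word
        elif order.crust is None and word in CRUSTS:
            order.crust = word
        elif singular in TOPPINGS and singular not in order.toppings:
            order.toppings.append(singular)
    # Assumes the customer says "store <name>", e.g. "for store Downtown".
    match = re.search(r"\bstore\s+([a-z0-9]+)", text)
    if match:
        order.store = match.group(1).capitalize()
    return order


order = parse_transcript("I'd like two large pepperoni pizzas for store Downtown")
print(order)
print(parse_transcript("a thin crust mushroom pizza").missing_fields())
```

The list returned by `missing_fields()` is what the agent could use to decide which clarifying question to ask next, for example “Which store?” when `store` is missing.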

✅ Pass Criteria and Requirements

  • A model that supports speech input (STT or STS) is deployed and used by the agent.
  • The agent correctly identifies the pizza order from voice input (items, quantities, key options).
  • The user hears/sees a confirmation that the order was successfully created (e.g., read-back + order ID).

💡 Hints & Tips

  • Use Microsoft Foundry to deploy STT or Realtime (speech-to-speech) models that accept audio.
  • The Agent Playground offers a UI to provide audio input — great for quick testing.
  • Design the conversation as a 3-step flow:
    1. Listen & transcribe (capture audio, produce text)
    2. Confirm (summarize order; handle corrections)
    3. Create order (call your existing API/MCP; then acknowledge with order ID)
  • Make confirmation explicit to avoid accidental orders (“Did you mean two large pepperoni pizzas for Store Downtown? Say ‘confirm’ to place it.”).
  • Handle noisy inputs by asking clarifying questions (e.g., “How many pizzas?” or “Which store?”).
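
The confirm-then-create part of the flow above can be sketched as a small state machine. This is an illustration under stated assumptions: `ConfirmationFlow`, `classify_reply`, and the word lists are hypothetical names, and `create_order` is a placeholder callable standing in for your existing API/MCP call.

```python
import re
from typing import Callable, Optional

# Hypothetical yes/no vocabularies (assumptions); tune them for your users.
YES_WORDS = {"yes", "yeah", "yep", "confirm"}
NO_WORDS = {"no", "nope", "not", "don't", "cancel", "wrong"}


def classify_reply(reply: str) -> str:
    """Map a transcribed reply to 'yes', 'no', or 'unclear'."""
    normalized = reply.lower().replace("\u2019", "'")
    words = set(re.findall(r"[a-z']+", normalized))
    # Check negatives first so mixed replies never place an order.
    if words & NO_WORDS:
        return "no"
    if words & YES_WORDS:
        return "yes"
    return "unclear"


class ConfirmationFlow:
    """Reads back an order and creates it only after an explicit yes."""

    def __init__(self, summary: str, create_order: Callable[[], str]):
        self.summary = summary
        self.create_order = create_order  # stand-in for the API/MCP call
        self.state = "confirming"
        self.order_id: Optional[str] = None

    def prompt(self) -> str:
        return f"Did you mean {self.summary}? Say 'confirm' to place it."

    def handle_reply(self, reply: str) -> str:
        if self.state != "confirming":
            return "This order is already closed."
        decision = classify_reply(reply)
        if decision == "yes":
            self.order_id = self.create_order()
            self.state = "placed"
            return f"Order {self.order_id} created."
        if decision == "no":
            self.state = "needs_correction"
            return "Okay, I won't place that. What would you like to change?"
        # Noisy or unrecognized input: repeat the question instead of guessing.
        return "Sorry, I didn't catch that. " + self.prompt()


flow = ConfirmationFlow(
    "two large pepperoni pizzas for Store Downtown",
    create_order=lambda: "A123",  # fake order ID for the demo
)
print(flow.prompt())
print(flow.handle_reply("umm, what?"))
print(flow.handle_reply("Yes, confirm"))
```

Checking negative words before positive ones is a deliberate choice: a reply such as “don't confirm” is treated as a no, so ambiguity never results in an accidental order.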

📚 Resources