Visual Ordering

In this level, you’ll enable image-based ordering.
Users upload a photo of a pizza, your agent analyzes it (base + toppings), maps it to a menu item, and creates the order—then confirms. 📸🍕

📋 Tasks

[ ] Add the capability to accept an image as input (Playground upload or app UI).
[ ] Use the image to infer the pizza and create an order (“Order a pizza like this image”).
[ ] Extract toppings and base from the image, match the closest menu pizza, and add extra toppings.

✅ Pass Criteria

The agent identifies the pizza from the image and creates the order.
The user receives a confirmation (text and/or voice) with order details.

🛠️ Hints & Tips

Deploy a multimodal model that supports image input in Microsoft Foundry.
Use the Agent Playground to test image uploads quickly.
Design the pipeline: Image → Vision analysis (base + toppings) → Menu match → Confirmation → Order.
If confidence is low, ask clarifying questions (“Is that pepperoni or spicy salami?”).

📚 Resources

Get started with multimodal vision chat apps using Azure OpenAI