
Multimodal AI Product Intelligence Platform (FastAPI, CLIP, Qdrant, Groq)
Vision + language pipeline that understands product images: search by photo, auto-tag catalogs, and answer questions about items.
Problem
A product catalog is usually a pile of photos with thin, inconsistent metadata — searchable only by whatever someone bothered to type in. The interesting version of the problem: treat the image itself as the source of truth. Upload a product photo and the system should understand it — extract structured attributes, find visually similar items, answer questions about it, and even generate channel-ready ad copy — without a human tagging anything first.
Architecture
On ingestion, every product image takes two parallel paths: CLIP produces a dense embedding that lands in Qdrant for similarity search, while Groq's vision models turn the image into structured attributes (category, features, materials, positioning). The FastAPI layer composes those primitives into five user-facing tools (product analysis, ad-campaign generation, a "twin simulator," market research, and pricing strategy), which a Next.js frontend consumes through streaming responses.
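The two ingestion paths can be sketched with plain asyncio. This is a minimal illustration, not the project's code: `embed_image`, `describe_image`, `ingest`, and the `ProductRecord` shape are hypothetical names, and both model calls are replaced by stubs so the example runs without CLIP, Groq, or Qdrant.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ProductRecord:
    """What one ingested image contributes to the index (hypothetical shape)."""
    product_id: str
    vector: list[float]                          # dense embedding for similarity search
    payload: dict = field(default_factory=dict)  # structured attributes stored with it


async def embed_image(image_bytes: bytes) -> list[float]:
    # Stub for the CLIP image encoder: returns a dummy 4-dimensional vector.
    await asyncio.sleep(0)
    return [b / 255.0 for b in image_bytes[:4].ljust(4, b"\0")]


async def describe_image(image_bytes: bytes) -> dict:
    # Stub for the vision-LLM call: a real call would return extracted attributes.
    await asyncio.sleep(0)
    return {"category": "unknown", "features": [], "materials": [], "positioning": ""}


async def ingest(product_id: str, image_bytes: bytes) -> ProductRecord:
    # Both paths depend only on the image, so they can run concurrently.
    vector, attributes = await asyncio.gather(
        embed_image(image_bytes), describe_image(image_bytes)
    )
    return ProductRecord(product_id=product_id, vector=vector, payload=attributes)


record = asyncio.run(ingest("sku-1", b"\xff\x00\x80"))
```

In a real pipeline the resulting vector and payload would be written to the vector store as a single point, so that later searches can filter on the attributes.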
Tech decisions & trade-offs
Why CLIP + Qdrant for retrieval
CLIP embeds images and text into the same vector space, which is the whole trick: "find products like this photo" and "find products matching 'matte black mechanical keyboard'" become the same Qdrant query against the same index. Qdrant adds payload filtering on top, enabling hybrid retrieval: dense vectors for visual similarity, extracted attributes as filters. The trade-off of CLIP's general-purpose space is that it occasionally misses fine-grained distinctions (subtle differences between fashion items, for instance) that a fine-tuned model would catch; for a catalog-wide tool, generality wins.
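A toy, in-memory version of that hybrid query makes the idea concrete. It is a stand-in for the concept only and does not use Qdrant's client API; the `INDEX` contents, the `search` helper, and all vectors are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy index of (id, vector, payload). Real embeddings have hundreds of dimensions.
INDEX = [
    ("kb-1", [0.9, 0.1, 0.0], {"category": "keyboard", "color": "black"}),
    ("kb-2", [0.7, 0.3, 0.2], {"category": "keyboard", "color": "white"}),
    ("mouse-1", [0.1, 0.9, 0.2], {"category": "mouse", "color": "black"}),
]


def search(query_vector, filters=None, limit=2):
    """Rank by cosine similarity among points whose payload matches every filter."""
    filters = filters or {}
    candidates = [
        (cosine(query_vector, vec), pid)
        for pid, vec, payload in INDEX
        if all(payload.get(key) == value for key, value in filters.items())
    ]
    return [pid for _, pid in sorted(candidates, reverse=True)[:limit]]


# The query vector may come from an image or a text prompt; search() cannot tell.
photo_vector = [0.85, 0.15, 0.05]  # pretend: embedding of an uploaded photo
text_vector = [0.88, 0.12, 0.02]   # pretend: embedding of a text description

by_photo = search(photo_vector)
by_text_black_only = search(text_vector, filters={"color": "black"})
```

Because both query types reduce to a vector plus optional filters, the retrieval code has a single path regardless of whether the user typed or uploaded.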
Why Groq for the vision LLM
Attribute extraction and ad generation are interactive features: a user is watching a spinner. Groq's inference speed turns what would be a wait of many seconds into a response of roughly a second (the ad-campaign run in the demo generates five channels of copy in ~1.4s), which changes how the tool feels from "batch job" to "conversation." The trade-off is model-menu lock-in: you use what Groq hosts, not any model you like.
Why FastAPI as the orchestration layer
The service is fundamentally async fan-out: one request triggers concurrent calls to CLIP, Groq, and Qdrant, then streams a composed result back. FastAPI's native async, Pydantic-typed request/response models, and SSE streaming make that shape natural, and the OpenAPI schema it generates from those models documents the API for the frontend. Python also keeps the ML tooling (CLIP, tokenizers) in-process instead of behind another service boundary.
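The fan-out-then-stream shape can be sketched with the standard library alone. This is an assumption-laden illustration rather than the service's actual handler: `call_backend`, `sse_event`, and `stream_analysis` are hypothetical names, the backends are simulated with short sleeps, and the event names `partial` and `done` are invented.

```python
import asyncio
import json


async def call_backend(name: str, delay: float) -> dict:
    # Stub for a call to an embedding model, vision LLM, or vector store.
    await asyncio.sleep(delay)
    return {"source": name, "ok": True}


def sse_event(event: str, data: dict) -> str:
    # Server-Sent Events framing: "event:" and "data:" fields, then a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_analysis():
    # Start every backend call at once, then emit each result as soon as it is ready.
    tasks = [
        asyncio.create_task(call_backend("vision_llm", 0.02)),
        asyncio.create_task(call_backend("embedding", 0.01)),
    ]
    for finished in asyncio.as_completed(tasks):
        yield sse_event("partial", await finished)
    yield sse_event("done", {})


async def collect() -> list[str]:
    # Drain the stream the way an HTTP response would, for demonstration.
    return [chunk async for chunk in stream_analysis()]


events = asyncio.run(collect())
```

In FastAPI, an async generator like `stream_analysis()` could be passed to `StreamingResponse` with `media_type="text/event-stream"`, so the frontend starts rendering partial results before the slowest backend has finished.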