Vision Scanner Service

TypeScript/Express service that scans shelf and pantry photos to extract product information and prices using a local vision LLM running on an AMD NPU. It uses ChromaDB to store vector embeddings and Ollama to generate them, and it tiles high-resolution photos before inference.

How It Works

  1. A photo of a store shelf or pantry is uploaded
  2. The image is tiled into smaller sections for better accuracy on high-res photos
  3. Each tile is sent to the vision LLM (qwen2.5vl-it:3b via FLM proxy) for product extraction
  4. Extracted products are matched against existing entries using vector embeddings (ChromaDB + Ollama)
  5. Results are optionally enriched via the Gemini API as a fallback
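
The tiling step (step 2) can be sketched as a grid computation. This is a minimal illustration, not the code in tiling.ts: the `Tile` type, the `computeTiles` name, the 1024-pixel tile size, and the 128-pixel overlap are all assumptions made for the example.

```typescript
// Hypothetical types and defaults; the service's real tiling.ts may differ.
interface Tile {
  x: number; // left edge in pixels
  y: number; // top edge in pixels
  width: number;
  height: number;
}

/**
 * Split an image into a grid of tiles no larger than tileSize. Neighbouring
 * tiles share `overlap` pixels, so items near a tile boundary are less likely
 * to be cut in half in every tile that contains them.
 */
function computeTiles(
  imageWidth: number,
  imageHeight: number,
  tileSize = 1024,
  overlap = 128,
): Tile[] {
  if (overlap >= tileSize) {
    throw new Error("overlap must be smaller than tileSize");
  }
  const step = tileSize - overlap;
  const tiles: Tile[] = [];
  for (let y = 0; y < imageHeight; y += step) {
    for (let x = 0; x < imageWidth; x += step) {
      tiles.push({
        x,
        y,
        // Edge tiles are clipped to the image bounds.
        width: Math.min(tileSize, imageWidth - x),
        height: Math.min(tileSize, imageHeight - y),
      });
      // Stop once this tile already reaches the right edge.
      if (x + tileSize >= imageWidth) break;
    }
    // Stop once this row already reaches the bottom edge.
    if (y + tileSize >= imageHeight) break;
  }
  return tiles;
}
```

An image smaller than the tile size yields a single tile covering the whole image, so small photos pass through without being split.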

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /scan/shelf | Scan a store shelf photo (multipart: image, store_name) |
| POST | /scan/pantry | Scan a pantry photo (multipart: image) |
| POST | /enrich/product | Extract detailed product info from a single product image |
| GET | /health | Health check (reports status of vision model, Ollama, ChromaDB) |
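
As a rough client-side sketch, a shelf scan could be submitted as follows. It assumes Node.js 18+ (global `fetch`, `FormData`, and `Blob`) and assumes the service answers with JSON; the response shape is not documented here, so it is returned as `unknown`. The helper names `buildShelfScanForm` and `scanShelf` are hypothetical.

```typescript
// Build the multipart body described in the endpoint table:
// an "image" file part plus a "store_name" text field.
function buildShelfScanForm(image: Blob, fileName: string, storeName: string): FormData {
  const form = new FormData();
  form.append("image", image, fileName);
  form.append("store_name", storeName);
  return form;
}

// POST the form to /scan/shelf. fetch sets the multipart Content-Type
// (including the boundary) automatically when the body is a FormData.
async function scanShelf(baseUrl: string, form: FormData): Promise<unknown> {
  const response = await fetch(`${baseUrl}/scan/shelf`, {
    method: "POST",
    body: form,
  });
  if (!response.ok) {
    throw new Error(`Shelf scan failed with HTTP ${response.status}`);
  }
  // Assumption: the service replies with a JSON document.
  return response.json();
}
```

With the default PORT, the base URL would be `http://localhost:8002`. A pantry scan would follow the same pattern against /scan/pantry without the store_name field.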

Configuration

All configuration is via environment variables (.env file):

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 8002 | Service port |
| VISION_AI_URL | http://localhost:8000/v1/chat/completions | Vision LLM endpoint |
| VISION_AI_MODEL | qwen2.5vl-it:3b | Vision model to use |
| VISION_AI_TIMEOUT | 120000 | Timeout for vision LLM calls (ms) |
| OLLAMA_HOST | http://192.168.0.15:11434 | Ollama server for embeddings |
| OLLAMA_EMBED_MODEL | nomic-embed-text | Embedding model |
| CHROMA_HOST | http://192.168.0.15:8000 | ChromaDB server |
| GEMINI_API_KEY | (none) | Optional Gemini API key for fallback |
| GEMINI_MODEL | gemini-2.5-flash | Gemini model for fallback |
| MAX_CONCURRENT_TILES | 4 | Max parallel tile processing |
| UPLOAD_DIR | uploads | Temporary upload directory |
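
One way to read these variables with their defaults is sketched below. The `ScannerConfig` shape, its field names, and the `loadConfig` and `intFromEnv` helpers are illustrative assumptions, not the contents of config.ts; only the variable names and default values come from the table above.

```typescript
// Hypothetical config shape; field names are illustrative.
interface ScannerConfig {
  port: number;
  visionAiUrl: string;
  visionAiModel: string;
  visionAiTimeoutMs: number;
  ollamaHost: string;
  ollamaEmbedModel: string;
  chromaHost: string;
  geminiApiKey: string | undefined; // unset means the Gemini fallback is unavailable
  geminiModel: string;
  maxConcurrentTiles: number;
  uploadDir: string;
}

type Env = Record<string, string | undefined>;

// Parse an integer variable, falling back when it is unset or blank.
function intFromEnv(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const parsed = Number.parseInt(value, 10);
  if (Number.isNaN(parsed)) {
    throw new Error(`Expected an integer, got "${value}"`);
  }
  return parsed;
}

function loadConfig(env: Env): ScannerConfig {
  return {
    port: intFromEnv(env.PORT, 8002),
    visionAiUrl: env.VISION_AI_URL ?? "http://localhost:8000/v1/chat/completions",
    visionAiModel: env.VISION_AI_MODEL ?? "qwen2.5vl-it:3b",
    visionAiTimeoutMs: intFromEnv(env.VISION_AI_TIMEOUT, 120000),
    ollamaHost: env.OLLAMA_HOST ?? "http://192.168.0.15:11434",
    ollamaEmbedModel: env.OLLAMA_EMBED_MODEL ?? "nomic-embed-text",
    chromaHost: env.CHROMA_HOST ?? "http://192.168.0.15:8000",
    geminiApiKey: env.GEMINI_API_KEY || undefined, // empty string counts as unset
    geminiModel: env.GEMINI_MODEL ?? "gemini-2.5-flash",
    maxConcurrentTiles: intFromEnv(env.MAX_CONCURRENT_TILES, 4),
    uploadDir: env.UPLOAD_DIR ?? "uploads",
  };
}
```

In a running service this would typically be called once at startup as `loadConfig(process.env)`, after the .env file has been loaded. Taking the environment as a parameter keeps the function easy to test.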

Usage

npm install        # Install dependencies
npm run build      # Compile TypeScript → dist/
npm start          # Run the service
npm run dev        # Development mode with hot-reload

# Windows service
node service-install.js
node service-uninstall.js

Project Structure

src/
  server.ts      — Express app, routes
  config.ts      — Configuration from environment
  vision.ts      — Vision LLM API calls
  tiling.ts      — Image tiling for high-res photos
  shelf.ts       — Shelf scanning logic
  pantry.ts      — Pantry scanning logic
  enrich.ts      — Product info enrichment
  parsing.ts     — LLM response parsing
  embeddings.ts  — Ollama embedding generation
  chroma.ts      — ChromaDB vector storage
  matching.ts    — Product matching via embeddings
  gemini.ts      — Gemini API fallback

External Dependencies

  • FLM Proxy (localhost:8000) — Vision LLM inference on AMD NPU
  • Ollama (192.168.0.15:11434) — Embedding generation with nomic-embed-text
  • ChromaDB (192.168.0.15:8000) — Vector database for product embeddings
  • Gemini API (optional) — Fallback for product enrichment
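
The default VISION_AI_URL ends in /v1/chat/completions, which suggests an OpenAI-style chat completions interface. Under that assumption, a request for one tile might be built and sent as sketched below. The payload layout (a text part plus an `image_url` part carrying a base64 data URL) follows the OpenAI convention and has not been confirmed against the FLM proxy; the function names and the prompt are hypothetical. `AbortSignal.timeout` requires Node.js 17.3 or later.

```typescript
// Content parts in the OpenAI-style multimodal message format (assumed).
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface VisionRequest {
  model: string;
  messages: { role: "user"; content: ContentPart[] }[];
}

// Build the JSON body for one tile: an instruction plus the tile image
// embedded as a base64 data URL.
function buildVisionRequest(
  model: string,
  prompt: string,
  imageBase64: string,
  mimeType = "image/jpeg",
): VisionRequest {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: `data:${mimeType};base64,${imageBase64}` } },
        ],
      },
    ],
  };
}

// Send the request, aborting after timeoutMs (compare VISION_AI_TIMEOUT).
async function callVisionModel(
  url: string,
  body: VisionRequest,
  timeoutMs: number,
): Promise<unknown> {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!response.ok) {
    throw new Error(`Vision LLM returned HTTP ${response.status}`);
  }
  return response.json();
}
```

The raw response is returned as `unknown` because the exact reply format depends on the proxy; turning it into product records is the job the project assigns to parsing.ts.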

Environment

  • OS: Windows 11, AMD NPU hardware
  • Runtime: Node.js + TypeScript
  • Vision LLM: qwen2.5vl-it:3b served by FLM proxy on localhost:8000