# Role in Ecosystem

**ScraperControl** is the data pipeline for the Church project: it handles scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs on the Synology NAS (Docker), not Vercel.

- **Schema sync**: Handled by `npm run sync` from the `Church/` root directory. No need to manually copy schema files.
- **Coordinated deployment**: Use `npm run deploy` from the `Church/` root for full pipeline deployment.
- **Schema source of truth**: BethelGuide. Never run `prisma migrate` in ScraperControl.

---

# Claude Instructions for ScraperControl

## Project Overview

**ScraperControl** is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:

1. **Admin Dashboard** (Next.js): Job management UI at port 3001
2. **Web Scrapers**: Playwright-based scrapers for extracting mass schedules from church websites
3. **Enrichment Pipelines**: Google Places, FreeSearch, reverse geocode enrichment
4. **ChromaDB Integration**: Semantic search for deduplication, content classification, and change detection
5. **Scheduler**: Database-driven job queue for automated scraping

### Shared Database Architecture

ScraperControl and BethelGuide share the **same NAS PostgreSQL database** (192.168.0.145:5434). BethelGuide is the **schema source of truth**. After any schema change in BethelGuide:

1. Copy `BethelGuide/prisma/schema.prisma` → `ScraperControl/prisma/schema.prisma` (`npm run sync` from the `Church/` root handles this copy)
2. Run `npx prisma generate` in ScraperControl (NOT `migrate`)
3. Rebuild Docker containers if needed

---

## Tech Stack

| Layer | Technology |
|-------|------------|
| Admin UI | Next.js 16, React 19, Tailwind CSS v4 |
| Database | Shared NAS PostgreSQL (192.168.0.145:5434) |
| ORM | Prisma 7 (`@prisma/adapter-pg` + `pg` Pool) |
| Web Scraping | Playwright (headless Chromium) |
| Vector DB | ChromaDB (192.168.0.145:8000) |
| Embeddings | Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text |
| Scheduling | node-cron + database-driven job queue |
| Containerization | Docker, Docker Compose |

---

## Project Structure

```
src/
├── app/                        # Next.js Admin Dashboard (port 3001)
│   ├── page.tsx                # Main dashboard (Jobs, Scrapes, Search tabs)
│   └── api/admin/              # Admin API routes
│       ├── jobs/               # Job management (GET/POST/PATCH)
│       ├── scrape-log/         # Recently scraped churches log
│       └── freesearch-log/     # FreeSearch results log
│
├── chromadb/                   # ChromaDB integration
│   ├── client.ts               # ChromaDB client singleton
│   ├── embeddings.ts           # OpenAI-compatible embedding helper (Ollama)
│   ├── collections.ts          # Collection definitions (5 collections)
│   └── queries.ts              # Query helpers per use case
│
├── lib/                        # Core business logic
│   ├── db.ts                   # Prisma client singleton
│   ├── admin-auth.ts           # Timing-safe API key auth
│   ├── geo.ts                  # Haversine distance (minimal)
│   ├── scraper-service.ts      # Scraper orchestration
│   ├── overpass-client.ts      # OpenStreetMap Overpass API
│   ├── church-matcher.ts       # Church matching/dedup
│   └── masstimes-scraper.ts    # MassTimes.org integration
│
└── scrapers/                   # Web scraping system
    ├── base-scraper.ts         # Base class
    ├── index.ts                # Exports
    ├── registry.ts             # Strategy registry
    ├── url-discovery.ts        # Mass schedule URL finder
    ├── strategies/             # Language-specific scrapers
    │   ├── generic.ts          # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/                   # Internationalization
        ├── day-names.ts        # Day name patterns per language
        └── day-ranges.ts       # Day range parsing ("Monday-Friday")

scripts/                            # CLI scripts
├── scrape-churches.ts              # Scrape churches by language
├── scrape-masstimes.ts             # Scrape from MassTimes.org
├── import-osm-churches.ts          # Import from OpenStreetMap
├── import-osm-region.ts            # Import specific OSM region
├── enrich-with-google-places.ts    # Google Places enrichment
├── enrich-with-freesearch.ts       # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts  # Reverse geocode enrichment
├── scheduler.ts                    # Background job scheduler
├── dedup-mass-schedules.ts         # Mass schedule deduplication
├── dedup-churches.ts               # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts    # NAS → Neon production sync
├── populate-chromadb.ts            # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts                 # Test scraper on a URL
├── test-url-discovery.ts           # Test URL discovery
├── test-edge-cases.ts              # International edge case tests
└── debug/                          # Debug/investigation scripts (~44 files)
```

---

## Common Commands

```bash
# === DEVELOPMENT ===
npm run dev                # Start admin dashboard (localhost:3001)
npm run build              # Build Next.js app

# === SCRAPING ===
npm run scrape:churches    # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes   # Scrape from MassTimes.org
npm run test:scraper       # Test scraper on a URL
npm run test:discover      # Test URL discovery

# === ENRICHMENT ===
npm run enrich:places      # Google Places enrichment
npm run enrich:freesearch  # FreeSearch website enrichment

# === DATA MANAGEMENT ===
npm run dedup:masses       # Deduplicate mass schedules
npm run import:osm         # Import churches from OpenStreetMap
npm run transfer:neon      # Transfer enriched data to Neon production
npm run scheduler          # Start background job scheduler

# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all                          # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity   # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15                  # Find duplicate churches

# === DOCKER (on NAS) ===
docker compose build scraper                           # Build scraper image
docker compose --profile tools run --rm scraper        # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment   # Start background services
```

---

## ChromaDB Integration

### Collections

| Collection | Purpose | Documents |
|---|---|---|
| `church_identity` | Deduplication | `{name} {address} {city} {country}` |
| `search_results` | FreeSearch matching | `{title} {snippet} {url}` |
| `page_classification` | Content classification | Page text (first 2000 chars) |
| `schedule_sections` | Schedule detection | Text blocks with mass times |
| `page_snapshots` | Change detection | Full page text |

### Infrastructure

- **ChromaDB server**: `http://192.168.0.145:8000` (on NAS)
- **Embedding API**: `http://192.168.0.75:11434/v1` (Ollama on MacBook M1)
- **Embedding model**: `nomic-embed-text` (~270MB, fast on M1)

### Prerequisite

Ollama must be running on the MacBook with LAN access enabled:

```bash
OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull nomic-embed-text
```

---

## Docker Services

| Service | Profile | Purpose |
|---|---|---|
| app | (default) | Admin dashboard on port 3001 |
| scraper | tools | Generic scraper (on-demand) |
| scraper-english | scraper-english | English language scraper |
| scraper-french | scraper-french | French language scraper |
| scraper-german | scraper-german | German language scraper |
| scraper-italian | scraper-italian | Italian language scraper |
| scraper-spanish | scraper-spanish | Spanish language scraper |
| scraper-generic | scraper-generic | Generic fallback scraper |
| scheduler | (default) | Background job scheduler |
| freesearch-enrichment | (default) | FreeSearch enrichment daemon |

---

## Environment Variables

```env
DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
```

---

## NAS Deployment

ScraperControl is deployed on the Synology NAS at `/volume1/docker/scraper-control/`.

### Container Layout

| Container | Purpose | Port |
|-----------|---------|------|
| scraper-control-app-1 | Admin dashboard | 3001 |
| scraper-control-scheduler-1 | Job scheduler | - |
| scraper-control-freesearch-enrichment-1 | FreeSearch daemon | - |

The `db` container (`nearestmass-db-1`) is managed by BethelGuide's compose file at `/volume1/docker/nearestmass/`. ScraperControl joins the same `nearestmass_default` external Docker network. No `depends_on` is allowed, because `db` is defined in a different compose file.

### Deploying Updates

```bash
# From local machine:
bash scripts/deploy-to-nas.sh

# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'
```

### Rebuilding Admin Dashboard

```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'
```

### Important Notes

- **DO NOT** add `depends_on: db` to any service: `db` is defined in BethelGuide's compose file
- The `.env` on NAS uses host IP (`192.168.0.145:5434`) for scripts run outside Docker
- The `docker-compose.yml` environment overrides use `db:5432` (Docker DNS via shared network)
- Docker binary on NAS is at `/usr/local/bin/docker`

### NAS Docker Health

The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See `memory/nas-docker-health.md` for full inventory.

**Scheduler hardening**: Uses `detached: true` plus a process-group kill to prevent orphaned Chromium processes, `init: true` for zombie reaping, a 24h job timeout, and an 8GB memory limit.

**Maintenance**: Docker is on `/volume1` (15TB free). Run `docker builder prune -f` occasionally to keep the build cache tidy.
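
The scheduler hardening noted above (`detached: true` plus a process-group kill) can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: it assumes a POSIX host such as the NAS containers, and `runJobInProcessGroup`, `killGroup`, and `JobResult` are hypothetical names. The timeout is a parameter here; the real scheduler uses 24h.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

/** Outcome of one job run (hypothetical shape, for illustration only). */
interface JobResult {
  exitCode: number | null;
  signal: string | null;
  timedOut: boolean;
}

/** Kill the child's entire process group, including anything it spawned. */
function killGroup(child: ChildProcess): void {
  if (child.pid === undefined) return; // spawn failed, nothing to kill
  try {
    // A negative pid targets the process group instead of a single process.
    process.kill(-child.pid, "SIGKILL");
  } catch {
    // The group has already exited; nothing left to clean up.
  }
}

/**
 * Run a command as the leader of its own process group and kill the whole
 * group if it runs longer than timeoutMs.
 */
function runJobInProcessGroup(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<JobResult> {
  return new Promise((resolve, reject) => {
    // detached: true puts the child in a new process group on POSIX, so
    // browser processes it launches inherit that group id.
    const child = spawn(command, args, { detached: true, stdio: "ignore" });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      killGroup(child);
    }, timeoutMs);

    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("exit", (exitCode, signal) => {
      clearTimeout(timer);
      resolve({ exitCode, signal, timedOut });
    });
  });
}
```

A call such as `runJobInProcessGroup("npx", ["tsx", "scripts/scrape-churches.ts", "--all"], 24 * 60 * 60 * 1000)` would mirror the 24h timeout. Killing by group rather than by single pid is what keeps Chromium children from outliving a timed-out job; `init: true` in the compose file then reaps any remaining zombies.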