ScraperControl/CLAUDE.md
Albert 2c51513851 chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-12 19:11:22 -04:00


Role in Ecosystem

ScraperControl is the data pipeline for the Church project — handling scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs on the Synology NAS (Docker), not Vercel.

  • Schema sync: Handled by npm run sync from the Church/ root directory. No need to manually copy schema files.
  • Coordinated deployment: Use npm run deploy from Church/ root for full pipeline deployment.
  • Schema source of truth: BethelGuide — never run prisma migrate in ScraperControl.

Claude Instructions for ScraperControl

Project Overview

ScraperControl is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:

  1. Admin Dashboard (Next.js): Job management UI at port 3001
  2. Web Scrapers: Playwright-based scrapers for extracting mass schedules from church websites
  3. Enrichment Pipelines: Google Places, FreeSearch, reverse geocode enrichment
  4. ChromaDB Integration: Semantic search for deduplication, content classification, and change detection
  5. Scheduler: Database-driven job queue for automated scraping

Shared Database Architecture

ScraperControl and BethelGuide share the same NAS PostgreSQL database (192.168.0.145:5434). BethelGuide is the schema source of truth. After any schema change in BethelGuide:

  1. Copy BethelGuide/prisma/schema.prisma → ScraperControl/prisma/schema.prisma (npm run sync from the Church/ root handles this copy, so it rarely needs to be done by hand)
  2. Run npx prisma generate in ScraperControl (NOT migrate)
  3. Rebuild Docker containers if needed

Tech Stack

| Layer | Technology |
| --- | --- |
| Admin UI | Next.js 16, React 19, Tailwind CSS v4 |
| Database | Shared NAS PostgreSQL (192.168.0.145:5434) |
| ORM | Prisma 7 (@prisma/adapter-pg + pg Pool) |
| Web Scraping | Playwright (headless Chromium) |
| Vector DB | ChromaDB (192.168.0.145:8000) |
| Embeddings | Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text |
| Scheduling | node-cron + database-driven job queue |
| Containerization | Docker, Docker Compose |

Project Structure

src/
├── app/                      # Next.js Admin Dashboard (port 3001)
│   ├── page.tsx              # Main dashboard (Jobs, Scrapes, Search tabs)
│   └── api/admin/            # Admin API routes
│       ├── jobs/             # Job management (GET/POST/PATCH)
│       ├── scrape-log/       # Recently scraped churches log
│       └── freesearch-log/   # FreeSearch results log
│
├── chromadb/                 # ChromaDB integration
│   ├── client.ts             # ChromaDB client singleton
│   ├── embeddings.ts         # OpenAI-compatible embedding helper (Ollama)
│   ├── collections.ts        # Collection definitions (5 collections)
│   └── queries.ts            # Query helpers per use case
│
├── lib/                      # Core business logic
│   ├── db.ts                 # Prisma client singleton
│   ├── admin-auth.ts         # Timing-safe API key auth
│   ├── geo.ts                # Haversine distance (minimal)
│   ├── scraper-service.ts    # Scraper orchestration
│   ├── overpass-client.ts    # OpenStreetMap Overpass API
│   ├── church-matcher.ts     # Church matching/dedup
│   └── masstimes-scraper.ts  # MassTimes.org integration
│
└── scrapers/                 # Web scraping system
    ├── base-scraper.ts       # Base class
    ├── index.ts              # Exports
    ├── registry.ts           # Strategy registry
    ├── url-discovery.ts      # Mass schedule URL finder
    ├── strategies/           # Language-specific scrapers
    │   ├── generic.ts        # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/                 # Internationalization
        ├── day-names.ts      # Day name patterns per language
        └── day-ranges.ts     # Day range parsing ("Monday-Friday")

scripts/                      # CLI scripts
├── scrape-churches.ts        # Scrape churches by language
├── scrape-masstimes.ts       # Scrape from MassTimes.org
├── import-osm-churches.ts    # Import from OpenStreetMap
├── import-osm-region.ts      # Import specific OSM region
├── enrich-with-google-places.ts  # Google Places enrichment
├── enrich-with-freesearch.ts     # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts # Reverse geocode enrichment
├── scheduler.ts              # Background job scheduler
├── dedup-mass-schedules.ts   # Mass schedule deduplication
├── dedup-churches.ts         # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts  # NAS → Neon production sync
├── populate-chromadb.ts      # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts           # Test scraper on a URL
├── test-url-discovery.ts     # Test URL discovery
├── test-edge-cases.ts        # International edge case tests
└── debug/                    # Debug/investigation scripts (~44 files)
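
The tree above lists lib/geo.ts as a minimal Haversine distance helper. Its actual contents are not shown here, so the following is only a sketch of the standard formula. The names haversineKm and LatLon and the 6371 km mean Earth radius are assumptions for illustration, not taken from the repository.

```typescript
// Sketch of a Haversine great-circle distance helper.
// Names are hypothetical; they are not taken from src/lib/geo.ts.
interface LatLon {
  lat: number; // degrees
  lon: number; // degrees
}

const EARTH_RADIUS_KM = 6371; // mean Earth radius (assumed constant)

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

function haversineKm(a: LatLon, b: LatLon): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot before asin.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// One degree of longitude along the equator is roughly 111.19 km.
const oneDegreeKm = haversineKm({ lat: 0, lon: 0 }, { lat: 0, lon: 1 });
```

A helper like this is typically used as a cheap pre-filter (for example, only comparing churches within a few hundred metres of each other) before more expensive matching runs.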

Common Commands

# === DEVELOPMENT ===
npm run dev              # Start admin dashboard (localhost:3001)
npm run build            # Build Next.js app

# === SCRAPING ===
npm run scrape:churches  # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes # Scrape from MassTimes.org
npm run test:scraper     # Test scraper on a URL
npm run test:discover    # Test URL discovery

# === ENRICHMENT ===
npm run enrich:places    # Google Places enrichment
npm run enrich:freesearch # FreeSearch website enrichment

# === DATA MANAGEMENT ===
npm run dedup:masses     # Deduplicate mass schedules
npm run import:osm       # Import churches from OpenStreetMap
npm run transfer:neon    # Transfer enriched data to Neon production
npm run scheduler        # Start background job scheduler

# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all                    # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity  # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15            # Find duplicate churches

# === DOCKER (on NAS) ===
docker compose build scraper                                   # Build scraper image
docker compose --profile tools run --rm scraper <command>      # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment           # Start background services

ChromaDB Integration

Collections

| Collection | Purpose | Documents |
| --- | --- | --- |
| church_identity | Deduplication | {name} {address} {city} {country} |
| search_results | FreeSearch matching | {title} {snippet} {url} |
| page_classification | Content classification | Page text (first 2000 chars) |
| schedule_sections | Schedule detection | Text blocks with mass times |
| page_snapshots | Change detection | Full page text |
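
To make the church_identity row concrete, here is a rough sketch of how an identity document could be assembled and how a distance threshold such as the 0.15 passed to dedup-churches.ts might flag candidate duplicates. Everything here is an assumption for illustration: the ChurchRecord type, the function names, the use of cosine distance (the metric configured on the real collection is not stated in this document), and the toy two-dimensional vectors standing in for real embeddings. The real script queries ChromaDB rather than comparing vectors in memory.

```typescript
// Hypothetical record shape; the real Prisma model may differ.
interface ChurchRecord {
  id: string;
  name: string;
  address?: string;
  city?: string;
  country?: string;
}

// Build the "{name} {address} {city} {country}" document, skipping empty fields.
function identityDocument(c: ChurchRecord): string {
  return [c.name, c.address, c.city, c.country]
    .filter((part): part is string => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim())
    .join(" ");
}

// Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector");
  return 1 - dot / Math.sqrt(normA * normB);
}

// Return every pair of ids whose embeddings are within the threshold.
function findDuplicatePairs(
  items: { id: string; embedding: number[] }[],
  threshold: number,
): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (cosineDistance(items[i].embedding, items[j].embedding) <= threshold) {
        pairs.push([items[i].id, items[j].id]);
      }
    }
  }
  return pairs;
}

// Toy vectors: "a" and "b" point in almost the same direction, "c" does not.
const sample = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0.99, 0.05] },
  { id: "c", embedding: [0, 1] },
];
const duplicates = findDuplicatePairs(sample, 0.15); // only the a/b pair qualifies
```

A lower threshold yields fewer, more confident matches; flagged pairs would normally still be reviewed before any merge.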

Infrastructure

  • ChromaDB server: http://192.168.0.145:8000 (on NAS)
  • Embedding API: http://192.168.0.75:11434/v1 (Ollama on MacBook M1)
  • Embedding model: nomic-embed-text (~270MB, fast on M1)
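
As an illustration of how the pieces above fit together, the sketch below builds a request for an OpenAI-style /embeddings endpoint under the Embedding API base URL and parses the response. The function names and the EmbeddingRequest type are hypothetical, and the request and response shapes assume the endpoint follows the OpenAI embeddings format; the project's own embeddings.ts may be structured differently. The network call itself is left out so the snippet stays side-effect free.

```typescript
// Hypothetical request description; pass its fields to fetch() or any HTTP client.
interface EmbeddingRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a POST to "<baseUrl>/embeddings" with an OpenAI-style payload.
function buildEmbeddingRequest(baseUrl: string, model: string, texts: string[]): EmbeddingRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: texts }),
  };
}

// Extract vectors from an OpenAI-style response: { data: [{ index, embedding }] }.
// Entries are sorted by index so the output order matches the input order.
function parseEmbeddingResponse(json: unknown): number[][] {
  const data = (json as { data?: { index: number; embedding: number[] }[] } | null)?.data;
  if (!Array.isArray(data)) throw new Error("unexpected embedding response shape");
  return [...data].sort((x, y) => x.index - y.index).map((d) => d.embedding);
}

const request = buildEmbeddingRequest(
  "http://192.168.0.75:11434/v1",
  "nomic-embed-text",
  ["St. Mary Boston US"],
);
```

In practice the request would be sent with something like fetch(request.url, { method: request.method, headers: request.headers, body: request.body }) and the parsed JSON handed to parseEmbeddingResponse.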

Prerequisite

Ollama must be running on the MacBook with LAN access enabled:

OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull nomic-embed-text

Docker Services

| Service | Profile | Purpose |
| --- | --- | --- |
| app | (default) | Admin dashboard on port 3001 |
| scraper | tools | Generic scraper (on-demand) |
| scraper-english | scraper-english | English language scraper |
| scraper-french | scraper-french | French language scraper |
| scraper-german | scraper-german | German language scraper |
| scraper-italian | scraper-italian | Italian language scraper |
| scraper-spanish | scraper-spanish | Spanish language scraper |
| scraper-generic | scraper-generic | Generic fallback scraper |
| scheduler | (default) | Background job scheduler |
| freesearch-enrichment | (default) | FreeSearch enrichment daemon |

Environment Variables

DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
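
A script can fail fast when one of these variables is missing. The sketch below is one possible check, not the project's own code: the missingEnvKeys helper and the SCRAPER_ENV_KEYS list are hypothetical, and treating all seven variables as required is an assumption (some may only matter for specific enrichment scripts).

```typescript
// Variable names copied from the list above; treating all of them as required is an assumption.
const SCRAPER_ENV_KEYS = [
  "DATABASE_URL",
  "ADMIN_API_KEY",
  "CHROMADB_URL",
  "EMBEDDING_API_URL",
  "EMBEDDING_MODEL",
  "GOOGLE_PLACES_API_KEY",
  "FREESEARCH_URL",
] as const;

type Env = Record<string, string | undefined>;

// Return the names of variables that are unset or blank.
function missingEnvKeys(env: Env, keys: readonly string[] = SCRAPER_ENV_KEYS): string[] {
  return keys.filter((key) => (env[key] ?? "").trim() === "");
}

// Example with a partial environment: only DATABASE_URL is usable here,
// because a whitespace-only ADMIN_API_KEY counts as missing.
const missing = missingEnvKeys({
  DATABASE_URL: "postgresql://postgres:postgres@192.168.0.145:5434/nearestmass",
  ADMIN_API_KEY: "  ",
});
```

In a real script the first argument would be process.env, and a non-empty result would be reported before any scraping or enrichment starts.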

NAS Deployment

ScraperControl is deployed on the Synology NAS at /volume1/docker/scraper-control/.

Container Layout

| Container | Purpose | Port |
| --- | --- | --- |
| scraper-control-app-1 | Admin dashboard | 3001 |
| scraper-control-scheduler-1 | Job scheduler | - |
| scraper-control-freesearch-enrichment-1 | FreeSearch daemon | - |

The db container (nearestmass-db-1) is managed by BethelGuide's compose file at /volume1/docker/nearestmass/. ScraperControl joins the same external Docker network, nearestmass_default. Services here cannot declare depends_on for db, because db is defined in a different compose file.

Deploying Updates

# From local machine:
bash scripts/deploy-to-nas.sh

# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/

ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'

Rebuilding Admin Dashboard

ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'

Important Notes

  • DO NOT add depends_on: db to any service — db is in BethelGuide's compose file
  • The .env on NAS uses host IP (192.168.0.145:5434) for scripts run outside Docker
  • The docker-compose.yml environment overrides use db:5432 (Docker DNS via shared network)
  • Docker binary on NAS is at /usr/local/bin/docker
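
The difference between the two database addresses in the notes above can be shown with a small sketch that rewrites the host and port of a connection string. The withHostPort helper is hypothetical and only illustrates the relationship between the two URLs; in the actual setup the override is done by the environment entries in docker-compose.yml, not by code. The regular expression assumes the password contains no "@" or "/" characters.

```typescript
// Replace the host and port that follow the credentials in a connection URL.
// Assumes a URL of the form scheme://user:password@host[:port]/database.
function withHostPort(databaseUrl: string, host: string, port: number): string {
  const pattern = /@([^:/@]+)(:\d+)?\//;
  if (!pattern.test(databaseUrl)) throw new Error("unrecognized database URL");
  return databaseUrl.replace(pattern, `@${host}:${port}/`);
}

// Host-side address, as used by scripts run outside Docker.
const hostUrl = "postgresql://postgres:postgres@192.168.0.145:5434/nearestmass";

// Container-side address: the db service name on the shared network, internal port 5432.
const containerUrl = withHostPort(hostUrl, "db", 5432);
// containerUrl is "postgresql://postgres:postgres@db:5432/nearestmass"
```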

NAS Docker Health

The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See memory/nas-docker-health.md for full inventory.

Scheduler hardening: jobs are spawned with detached: true and stopped with a process-group kill, which prevents orphaned Chromium processes. The container sets init: true for zombie reaping, jobs time out after 24h, and the memory limit is 8GB.
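
The detached-spawn and process-group-kill combination can be sketched with Node's child_process module. This is an illustration of the general technique on POSIX systems, not the scheduler's actual code: startJob, killJobGroup, runWithTimeout, and JOB_TIMEOUT_MS are hypothetical names, and the choice of SIGKILL is an assumption.

```typescript
import { spawn, ChildProcess } from "node:child_process";

const JOB_TIMEOUT_MS = 24 * 60 * 60 * 1000; // 24h, matching the job timeout described above

// With detached: true the child becomes the leader of a new process group (POSIX),
// so anything it launches, such as Chromium, belongs to that group.
function startJob(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: "ignore" });
}

// A negative PID tells process.kill to signal the whole process group.
function killJobGroup(child: ChildProcess, signal: NodeJS.Signals = "SIGKILL"): boolean {
  if (child.pid === undefined) return false; // spawn failed, nothing to kill
  try {
    process.kill(-child.pid, signal);
    return true;
  } catch {
    return false; // the group has already exited
  }
}

// Start a job and kill its whole group if it outlives the timeout.
function runWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number = JOB_TIMEOUT_MS,
): ChildProcess {
  const child = startJob(command, args);
  const timer = setTimeout(() => killJobGroup(child), timeoutMs);
  child.once("exit", () => clearTimeout(timer));
  child.once("error", () => clearTimeout(timer)); // e.g. command not found
  return child;
}
```

Killing only child.pid would leave browser subprocesses running; signalling the group is what removes them. init: true in the compose file complements this by reaping any zombies left inside the container.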

Maintenance: Docker is on /volume1 (15TB free). Run docker builder prune -f occasionally to keep build cache tidy.