Albert 2c51513851 chore: sync with Gitea master and restore local-only files

Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-12 19:11:22 -04:00

11 KiB

Role in Ecosystem

ScraperControl is the data pipeline for the Church project: it handles scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs in Docker on the Synology NAS, not on Vercel.

Schema sync: Handled by npm run sync from the Church/ root directory. No need to manually copy schema files.
Coordinated deployment: Use npm run deploy from Church/ root for full pipeline deployment.
Schema source of truth: BethelGuide — never run prisma migrate in ScraperControl.

Claude Instructions for ScraperControl

Project Overview

ScraperControl is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:

Admin Dashboard (Next.js): Job management UI at port 3001
Web Scrapers: Playwright-based scrapers for extracting mass schedules from church websites
Enrichment Pipelines: Google Places, FreeSearch, reverse geocode enrichment
ChromaDB Integration: Semantic search for deduplication, content classification, and change detection
Scheduler: Database-driven job queue for automated scraping

Shared Database Architecture

ScraperControl and BethelGuide share the same NAS PostgreSQL database (192.168.0.145:5434). BethelGuide is the schema source of truth. Schema sync is normally handled by npm run sync from the Church/ root. If it has to be done by hand after a schema change in BethelGuide, the steps are:

Copy BethelGuide/prisma/schema.prisma → ScraperControl/prisma/schema.prisma
Run npx prisma generate in ScraperControl (NOT prisma migrate)
Rebuild Docker containers if needed

Tech Stack

Layer	Technology
Admin UI	Next.js 16, React 19, Tailwind CSS v4
Database	Shared NAS PostgreSQL (192.168.0.145:5434)
ORM	Prisma 7 (`@prisma/adapter-pg` + `pg` Pool)
Web Scraping	Playwright (headless Chromium)
Vector DB	ChromaDB (192.168.0.145:8000)
Embeddings	Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text
Scheduling	node-cron + database-driven job queue
Containerization	Docker, Docker Compose

Project Structure

src/
├── app/                      # Next.js Admin Dashboard (port 3001)
│   ├── page.tsx              # Main dashboard (Jobs, Scrapes, Search tabs)
│   └── api/admin/            # Admin API routes
│       ├── jobs/             # Job management (GET/POST/PATCH)
│       ├── scrape-log/       # Recently scraped churches log
│       └── freesearch-log/   # FreeSearch results log
│
├── chromadb/                 # ChromaDB integration
│   ├── client.ts             # ChromaDB client singleton
│   ├── embeddings.ts         # OpenAI-compatible embedding helper (Ollama)
│   ├── collections.ts        # Collection definitions (5 collections)
│   └── queries.ts            # Query helpers per use case
│
├── lib/                      # Core business logic
│   ├── db.ts                 # Prisma client singleton
│   ├── admin-auth.ts         # Timing-safe API key auth
│   ├── geo.ts                # Haversine distance (minimal)
│   ├── scraper-service.ts    # Scraper orchestration
│   ├── overpass-client.ts    # OpenStreetMap Overpass API
│   ├── church-matcher.ts     # Church matching/dedup
│   └── masstimes-scraper.ts  # MassTimes.org integration
│
└── scrapers/                 # Web scraping system
    ├── base-scraper.ts       # Base class
    ├── index.ts              # Exports
    ├── registry.ts           # Strategy registry
    ├── url-discovery.ts      # Mass schedule URL finder
    ├── strategies/           # Language-specific scrapers
    │   ├── generic.ts        # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/                 # Internationalization
        ├── day-names.ts      # Day name patterns per language
        └── day-ranges.ts     # Day range parsing ("Monday-Friday")

scripts/                      # CLI scripts
├── scrape-churches.ts        # Scrape churches by language
├── scrape-masstimes.ts       # Scrape from MassTimes.org
├── import-osm-churches.ts    # Import from OpenStreetMap
├── import-osm-region.ts      # Import specific OSM region
├── enrich-with-google-places.ts  # Google Places enrichment
├── enrich-with-freesearch.ts     # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts # Reverse geocode enrichment
├── scheduler.ts              # Background job scheduler
├── dedup-mass-schedules.ts   # Mass schedule deduplication
├── dedup-churches.ts         # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts  # NAS → Neon production sync
├── populate-chromadb.ts      # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts           # Test scraper on a URL
├── test-url-discovery.ts     # Test URL discovery
├── test-edge-cases.ts        # International edge case tests
└── debug/                    # Debug/investigation scripts (~44 files)
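
The tree above lists src/lib/geo.ts as a minimal Haversine distance helper. As a rough illustration of what such a helper computes, here is a small sketch. The names `LatLng` and `haversineKm` and the choice of kilometres are assumptions for this example, not taken from the actual file.

```typescript
// Hypothetical sketch; the real src/lib/geo.ts may use different names and units.
const EARTH_RADIUS_KM = 6371; // mean Earth radius

interface LatLng {
  lat: number; // degrees
  lng: number; // degrees
}

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points using the Haversine formula.
function haversineKm(a: LatLng, b: LatLng): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot near antipodal points.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}
```

A distance like this is typically used as a cheap pre-filter, for example to compare only churches within a few hundred metres of each other before running a more expensive name or embedding comparison.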

Common Commands

# === DEVELOPMENT ===
npm run dev              # Start admin dashboard (localhost:3001)
npm run build            # Build Next.js app

# === SCRAPING ===
npm run scrape:churches  # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes # Scrape from MassTimes.org
npm run test:scraper     # Test scraper on a URL
npm run test:discover    # Test URL discovery

# === ENRICHMENT ===
npm run enrich:places    # Google Places enrichment
npm run enrich:freesearch # FreeSearch website enrichment

# === DATA MANAGEMENT ===
npm run dedup:masses     # Deduplicate mass schedules
npm run import:osm       # Import churches from OpenStreetMap
npm run transfer:neon    # Transfer enriched data to Neon production
npm run scheduler        # Start background job scheduler

# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all                    # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity  # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15            # Find duplicate churches

# === DOCKER (on NAS) ===
docker compose build scraper                                   # Build scraper image
docker compose --profile tools run --rm scraper <command>      # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment           # Start background services

ChromaDB Integration

Collections

Collection	Purpose	Documents
`church_identity`	Deduplication	`{name} {address} {city} {country}`
`search_results`	FreeSearch matching	`{title} {snippet} {url}`
`page_classification`	Content classification	Page text (first 2000 chars)
`schedule_sections`	Schedule detection	Text blocks with mass times
`page_snapshots`	Change detection	Full page text
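
To make the church_identity row and the --threshold 0.15 flag more concrete, here is a minimal sketch of how an identity document could be built and how two embeddings could be compared. It assumes the collection uses cosine distance (ChromaDB's metric is configurable and this document does not say which one is in use) and that the threshold is an upper bound on distance. `ChurchRecord`, `identityDocument`, `cosineDistance`, and `isLikelyDuplicate` are hypothetical names.

```typescript
// Hypothetical sketch; names and the cosine-distance assumption are not from the codebase.
interface ChurchRecord {
  name: string;
  address?: string | null;
  city?: string | null;
  country?: string | null;
}

// Builds the "{name} {address} {city} {country}" document, skipping empty parts.
function identityDocument(church: ChurchRecord): string {
  return [church.name, church.address, church.city, church.country]
    .filter((part): part is string => typeof part === "string" && part.trim().length > 0)
    .map((part) => part.trim())
    .join(" ");
}

// Cosine distance = 1 - cosine similarity; 0 means identical direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat zero vectors as unrelated
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two records are duplicate candidates when their distance is at or below the threshold.
function isLikelyDuplicate(a: number[], b: number[], threshold = 0.15): boolean {
  return cosineDistance(a, b) <= threshold;
}
```

Under these assumptions, lowering the threshold makes matching stricter (fewer, more certain duplicate candidates), while raising it surfaces more candidates for manual review.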

Infrastructure

ChromaDB server: http://192.168.0.145:8000 (on NAS)
Embedding API: http://192.168.0.75:11434/v1 (Ollama on MacBook M1)
Embedding model: nomic-embed-text (~270MB, fast on M1)

Prerequisite

Ollama must be running on the MacBook with LAN access enabled:

OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull nomic-embed-text
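
Once Ollama is reachable, embeddings can be requested through its OpenAI-compatible endpoint under the /v1 base URL listed above. The following is a minimal sketch of such a call, not the contents of src/chromadb/embeddings.ts. The helper names are hypothetical, and the request and response shapes follow the OpenAI embeddings format that Ollama's compatibility layer mirrors.

```typescript
// Hypothetical sketch; helper names are illustrative only.
interface EmbeddingRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Builds the POST request for the OpenAI-style /embeddings endpoint.
function buildEmbeddingRequest(
  texts: string[],
  baseUrl = "http://192.168.0.75:11434/v1",
  model = "nomic-embed-text",
): EmbeddingRequest {
  const root = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${root}/embeddings`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input: texts }),
    },
  };
}

// Returns embeddings in the same order as the input texts.
function parseEmbeddingResponse(payload: EmbeddingResponse): number[][] {
  return [...payload.data].sort((a, b) => a.index - b.index).map((item) => item.embedding);
}

// Performs the network call using the global fetch available in Node 18+.
async function embedTexts(texts: string[]): Promise<number[][]> {
  const { url, init } = buildEmbeddingRequest(texts);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Embedding request failed: HTTP ${res.status}`);
  return parseEmbeddingResponse((await res.json()) as EmbeddingResponse);
}
```

If this request fails with a connection error from the NAS, the first thing to check is that Ollama was started with OLLAMA_HOST=0.0.0.0, since it otherwise listens on localhost only.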

Docker Services

Service	Profile	Purpose
app	(default)	Admin dashboard on port 3001
scraper	tools	Generic scraper (on-demand)
scraper-english	scraper-english	English language scraper
scraper-french	scraper-french	French language scraper
scraper-german	scraper-german	German language scraper
scraper-italian	scraper-italian	Italian language scraper
scraper-spanish	scraper-spanish	Spanish language scraper
scraper-generic	scraper-generic	Generic fallback scraper
scheduler	(default)	Background job scheduler
freesearch-enrichment	(default)	FreeSearch enrichment daemon
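
The per-language services and the generic fallback in this table mirror the strategy layout under src/scrapers/ (registry.ts plus strategies/). As an illustration of that pattern only, here is a minimal registry sketch with a fallback. `ScrapeStrategy` and `StrategyRegistry` are hypothetical names, and the real registry may select strategies differently.

```typescript
// Hypothetical sketch of a strategy registry with a generic fallback.
interface ScrapeStrategy {
  readonly name: string;
}

class StrategyRegistry {
  private readonly byLanguage = new Map<string, ScrapeStrategy>();
  private readonly fallback: ScrapeStrategy;

  constructor(fallback: ScrapeStrategy) {
    this.fallback = fallback;
  }

  // Registers a strategy under a case-insensitive language key.
  register(language: string, strategy: ScrapeStrategy): this {
    this.byLanguage.set(language.trim().toLowerCase(), strategy);
    return this;
  }

  // Returns the language-specific strategy, or the fallback when none matches.
  resolve(language?: string | null): ScrapeStrategy {
    if (!language) return this.fallback;
    return this.byLanguage.get(language.trim().toLowerCase()) ?? this.fallback;
  }
}

const registry = new StrategyRegistry({ name: "generic" });
for (const language of ["english", "french", "german", "italian", "spanish"]) {
  registry.register(language, { name: language });
}
```

With this shape, adding a language means registering one more strategy; anything unregistered falls through to the generic scraper.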

Environment Variables

DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
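
ADMIN_API_KEY is checked by src/lib/admin-auth.ts, described above as timing-safe API key auth. A minimal sketch of a timing-safe comparison using Node's crypto module follows. `isValidApiKey` is a hypothetical name, and hashing both values first is one common way to satisfy timingSafeEqual's equal-length requirement; the actual file may do this differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical sketch; the real admin-auth.ts may be structured differently.
// Hashing both sides yields equal-length buffers, which timingSafeEqual requires,
// and avoids leaking the expected key's length through an early length check.
function isValidApiKey(
  provided: string | null | undefined,
  expected: string | undefined,
): boolean {
  if (!provided || !expected) return false; // reject when either side is missing
  const providedDigest = createHash("sha256").update(provided).digest();
  const expectedDigest = createHash("sha256").update(expected).digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

In a route handler this would be called with the key taken from the incoming request and process.env.ADMIN_API_KEY. How the key is transmitted (for example, which header) is not specified in this document.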

NAS Deployment

ScraperControl is deployed on the Synology NAS at /volume1/docker/scraper-control/.

Container Layout

Container	Purpose	Port
scraper-control-app-1	Admin dashboard	3001
scraper-control-scheduler-1	Job scheduler	-
scraper-control-freesearch-enrichment-1	FreeSearch daemon	-

The db container (nearestmass-db-1) is managed by BethelGuide's compose file at /volume1/docker/nearestmass/. ScraperControl joins the same external Docker network, nearestmass_default. Because db is defined in a different compose file, depends_on cannot reference it.

Deploying Updates

# From local machine:
bash scripts/deploy-to-nas.sh

# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/

ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'

Rebuilding Admin Dashboard

ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'

Important Notes

DO NOT add depends_on: db to any service; db is defined in BethelGuide's compose file
The .env on NAS uses host IP (192.168.0.145:5434) for scripts run outside Docker
The docker-compose.yml environment overrides use db:5432 (Docker DNS via shared network)
Docker binary on NAS is at /usr/local/bin/docker

NAS Docker Health

The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See memory/nas-docker-health.md for full inventory.

Scheduler hardening: Uses detached: true + process group kill to prevent orphaned Chromium processes, init: true for zombie reaping, 24h job timeout, 8GB memory limit.
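
The detached: true plus process group kill technique can be sketched with Node's child_process module as follows. This is an illustration of the general pattern on POSIX systems, not the scheduler's actual code: `spawnInOwnGroup`, `killProcessGroup`, and `runWithTimeout` are hypothetical names.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Hypothetical sketch of the pattern; the scheduler's real code may differ.
// detached: true makes the child the leader of a new process group, so any
// grandchildren it starts (such as Chromium helpers) share that group.
function spawnInOwnGroup(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: "ignore" });
}

// A negative PID targets the whole process group (POSIX only).
// Returns false if the group no longer exists or cannot be signalled.
function killProcessGroup(pid: number, signal: "SIGTERM" | "SIGKILL" = "SIGKILL"): boolean {
  try {
    process.kill(-pid, signal);
    return true;
  } catch {
    return false;
  }
}

// Runs a command and kills its entire process group if it exceeds the timeout.
function runWithTimeout(command: string, args: string[], timeoutMs: number): Promise<number | null> {
  return new Promise((resolve, reject) => {
    const child = spawnInOwnGroup(command, args);
    const timer = setTimeout(() => {
      if (child.pid !== undefined) killProcessGroup(child.pid);
    }, timeoutMs);
    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("exit", (code) => {
      clearTimeout(timer);
      resolve(code); // null when the child was terminated by a signal
    });
  });
}
```

A 24h job timeout would correspond to timeoutMs = 24 * 60 * 60 * 1000. Killing only the direct child's PID would leave any browser processes it spawned running, which is the orphaning problem the group kill avoids.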

Maintenance: Docker is on /volume1 (15TB free). Run docker builder prune -f occasionally to keep build cache tidy.
