Role in Ecosystem
ScraperControl is the data pipeline for the Church project — handling scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs on the Synology NAS (Docker), not Vercel.
- Schema sync: Handled by `npm run sync` from the `Church/` root directory. No need to manually copy schema files.
- Coordinated deployment: Use `npm run deploy` from the `Church/` root for full pipeline deployment.
- Schema source of truth: BethelGuide. Never run `prisma migrate` in ScraperControl.
Claude Instructions for ScraperControl
Project Overview
ScraperControl is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:
- Admin Dashboard (Next.js): Job management UI at port 3001
- Web Scrapers: Playwright-based scrapers for extracting mass schedules from church websites
- Enrichment Pipelines: Google Places, FreeSearch, reverse geocode enrichment
- ChromaDB Integration: Semantic search for deduplication, content classification, and change detection
- Scheduler: Database-driven job queue for automated scraping
Shared Database Architecture
ScraperControl and BethelGuide share the same NAS PostgreSQL database (192.168.0.145:5434). BethelGuide is the schema source of truth. After any schema change in BethelGuide:
- Copy `BethelGuide/prisma/schema.prisma` → `ScraperControl/prisma/schema.prisma`
- Run `npx prisma generate` in ScraperControl (NOT `migrate`)
- Rebuild Docker containers if needed
Tech Stack
| Layer | Technology |
|---|---|
| Admin UI | Next.js 16, React 19, Tailwind CSS v4 |
| Database | Shared NAS PostgreSQL (192.168.0.145:5434) |
| ORM | Prisma 7 (@prisma/adapter-pg + pg Pool) |
| Web Scraping | Playwright (headless Chromium) |
| Vector DB | ChromaDB (192.168.0.145:8000) |
| Embeddings | Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text |
| Scheduling | node-cron + database-driven job queue |
| Containerization | Docker, Docker Compose |
Project Structure
src/
├── app/ # Next.js Admin Dashboard (port 3001)
│ ├── page.tsx # Main dashboard (Jobs, Scrapes, Search tabs)
│ └── api/admin/ # Admin API routes
│ ├── jobs/ # Job management (GET/POST/PATCH)
│ ├── scrape-log/ # Recently scraped churches log
│ └── freesearch-log/ # FreeSearch results log
│
├── chromadb/ # ChromaDB integration
│ ├── client.ts # ChromaDB client singleton
│ ├── embeddings.ts # OpenAI-compatible embedding helper (Ollama)
│ ├── collections.ts # Collection definitions (5 collections)
│ └── queries.ts # Query helpers per use case
│
├── lib/ # Core business logic
│ ├── db.ts # Prisma client singleton
│ ├── admin-auth.ts # Timing-safe API key auth
│ ├── geo.ts # Haversine distance (minimal)
│ ├── scraper-service.ts # Scraper orchestration
│ ├── overpass-client.ts # OpenStreetMap Overpass API
│ ├── church-matcher.ts # Church matching/dedup
│ └── masstimes-scraper.ts # MassTimes.org integration
│
└── scrapers/ # Web scraping system
    ├── base-scraper.ts # Base class
    ├── index.ts # Exports
    ├── registry.ts # Strategy registry
    ├── url-discovery.ts # Mass schedule URL finder
    ├── strategies/ # Language-specific scrapers
    │   ├── generic.ts # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/ # Internationalization
        ├── day-names.ts # Day name patterns per language
        └── day-ranges.ts # Day range parsing ("Monday-Friday")
scripts/ # CLI scripts
├── scrape-churches.ts # Scrape churches by language
├── scrape-masstimes.ts # Scrape from MassTimes.org
├── import-osm-churches.ts # Import from OpenStreetMap
├── import-osm-region.ts # Import specific OSM region
├── enrich-with-google-places.ts # Google Places enrichment
├── enrich-with-freesearch.ts # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts # Reverse geocode enrichment
├── scheduler.ts # Background job scheduler
├── dedup-mass-schedules.ts # Mass schedule deduplication
├── dedup-churches.ts # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts # NAS → Neon production sync
├── populate-chromadb.ts # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts # Test scraper on a URL
├── test-url-discovery.ts # Test URL discovery
├── test-edge-cases.ts # International edge case tests
└── debug/ # Debug/investigation scripts (~44 files)
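The tree above lists `src/lib/geo.ts` as a minimal Haversine distance helper. For readers unfamiliar with the formula, here is a rough sketch of what such a helper typically looks like. The names `LatLng` and `haversineKm` are hypothetical and the actual file may differ.

```typescript
// Hypothetical sketch of a Haversine helper; not copied from src/lib/geo.ts.
const EARTH_RADIUS_KM = 6371; // mean Earth radius

interface LatLng {
  lat: number; // degrees
  lng: number; // degrees
}

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points, in kilometres.
function haversineKm(a: LatLng, b: LatLng): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot near antipodal points.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}
```

With this radius, one degree of latitude along a meridian comes out at roughly 111.19 km, which is a quick sanity check when comparing coordinates from two sources.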
Common Commands
# === DEVELOPMENT ===
npm run dev # Start admin dashboard (localhost:3001)
npm run build # Build Next.js app
# === SCRAPING ===
npm run scrape:churches # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes # Scrape from MassTimes.org
npm run test:scraper # Test scraper on a URL
npm run test:discover # Test URL discovery
# === ENRICHMENT ===
npm run enrich:places # Google Places enrichment
npm run enrich:freesearch # FreeSearch website enrichment
# === DATA MANAGEMENT ===
npm run dedup:masses # Deduplicate mass schedules
npm run import:osm # Import churches from OpenStreetMap
npm run transfer:neon # Transfer enriched data to Neon production
npm run scheduler # Start background job scheduler
# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15 # Find duplicate churches
# === DOCKER (on NAS) ===
docker compose build scraper # Build scraper image
docker compose --profile tools run --rm scraper <command> # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment # Start background services
ChromaDB Integration
Collections
| Collection | Purpose | Documents |
|---|---|---|
| `church_identity` | Deduplication | {name} {address} {city} {country} |
| `search_results` | FreeSearch matching | {title} {snippet} {url} |
| `page_classification` | Content classification | Page text (first 2000 chars) |
| `schedule_sections` | Schedule detection | Text blocks with mass times |
| `page_snapshots` | Change detection | Full page text |
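To make the `church_identity` row more concrete, the sketch below builds an identity document in the `{name} {address} {city} {country}` shape and flags pairs of embeddings that fall under a distance threshold such as the `0.15` passed to `dedup-churches.ts`. It assumes the threshold is a cosine distance, which the source does not state, and every name here (`ChurchRecord`, `identityDocument`, `findDuplicatePairs`) is hypothetical.

```typescript
// Hypothetical sketch; the real dedup script queries ChromaDB rather than comparing in memory.
interface ChurchRecord {
  id: string;
  name: string;
  address: string;
  city: string;
  country: string;
}

// Joins the non-empty fields into the "{name} {address} {city} {country}" document.
function identityDocument(c: ChurchRecord): string {
  return [c.name, c.address, c.city, c.country]
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" ");
}

// Cosine distance = 1 - cosine similarity (0 means same direction).
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat zero vectors as unrelated
  return 1 - dot / Math.sqrt(normA * normB);
}

interface EmbeddedChurch {
  id: string;
  embedding: number[];
}

// Returns [idA, idB, distance] for every pair at or below the threshold.
// Assumption: the --threshold flag is a cosine distance cutoff.
function findDuplicatePairs(
  items: EmbeddedChurch[],
  threshold: number,
): Array<[string, string, number]> {
  const pairs: Array<[string, string, number]> = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const distance = cosineDistance(items[i].embedding, items[j].embedding);
      if (distance <= threshold) pairs.push([items[i].id, items[j].id, distance]);
    }
  }
  return pairs;
}
```

The pairwise loop is quadratic and only suitable for illustration; a vector database answers the same question with nearest-neighbour queries per record.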
Infrastructure
- ChromaDB server: `http://192.168.0.145:8000` (on NAS)
- Embedding API: `http://192.168.0.75:11434/v1` (Ollama on MacBook M1)
- Embedding model: `nomic-embed-text` (~270MB, fast on M1)
Prerequisite
Ollama must be running on the MacBook with LAN access enabled:
OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull nomic-embed-text
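Once Ollama is serving on the LAN, embeddings can be requested through its OpenAI-compatible `/v1/embeddings` route. The following is a minimal sketch of such a call, not the contents of `src/chromadb/embeddings.ts`. The names `EmbeddingConfig`, `FetchLike`, and `embedTexts` are hypothetical, and the defaults simply mirror the addresses listed in this document.

```typescript
// Hypothetical sketch of an OpenAI-compatible embedding call against Ollama.
interface EmbeddingConfig {
  apiUrl: string; // base URL including the /v1 suffix
  model: string;
}

const defaultEmbeddingConfig: EmbeddingConfig = {
  apiUrl: "http://192.168.0.75:11434/v1",
  model: "nomic-embed-text",
};

// Minimal fetch shape so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function embedTexts(
  texts: string[],
  config: EmbeddingConfig = defaultEmbeddingConfig,
  fetchImpl: FetchLike = fetch,
): Promise<number[][]> {
  const response = await fetchImpl(`${config.apiUrl}/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: config.model, input: texts }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with HTTP ${response.status}`);
  }
  // OpenAI-style payload: { data: [{ embedding: number[] }, ...] }
  const payload = (await response.json()) as { data: Array<{ embedding: number[] }> };
  return payload.data.map((item) => item.embedding);
}
```

Passing the fetch implementation as a parameter is only a testing convenience; if the MacBook is asleep or Ollama is bound to localhost, the call fails with a network error rather than an HTTP status.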
Docker Services
| Service | Profile | Purpose |
|---|---|---|
| app | (default) | Admin dashboard on port 3001 |
| scraper | tools | Generic scraper (on-demand) |
| scraper-english | scraper-english | English language scraper |
| scraper-french | scraper-french | French language scraper |
| scraper-german | scraper-german | German language scraper |
| scraper-italian | scraper-italian | Italian language scraper |
| scraper-spanish | scraper-spanish | Spanish language scraper |
| scraper-generic | scraper-generic | Generic fallback scraper |
| scheduler | (default) | Background job scheduler |
| freesearch-enrichment | (default) | FreeSearch enrichment daemon |
Environment Variables
DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
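`ADMIN_API_KEY` is what the timing-safe check in `src/lib/admin-auth.ts` compares against. As an illustration of the technique only, a comparison along these lines avoids leaking how many leading characters matched. The function name `isValidAdminKey` is hypothetical, and how the key reaches the server (header name, query parameter) is not specified in this document.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical sketch of a timing-safe API key check; the real admin-auth.ts may differ.
function isValidAdminKey(
  provided: string | null | undefined,
  expected: string | undefined,
): boolean {
  // Reject when either side is missing so an unset ADMIN_API_KEY never grants access.
  if (!provided || !expected) return false;
  // timingSafeEqual throws on length mismatch, so compare fixed-length SHA-256 digests.
  const providedDigest = createHash("sha256").update(provided).digest();
  const expectedDigest = createHash("sha256").update(expected).digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

Hashing both values first also keeps the comparison time independent of the length of the submitted key.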
NAS Deployment
ScraperControl is deployed on the Synology NAS at /volume1/docker/scraper-control/.
Container Layout
| Container | Purpose | Port |
|---|---|---|
| scraper-control-app-1 | Admin dashboard | 3001 |
| scraper-control-scheduler-1 | Job scheduler | - |
| scraper-control-freesearch-enrichment-1 | FreeSearch daemon | - |
The db container (nearestmass-db-1) is managed by BethelGuide's compose file at /volume1/docker/nearestmass/. ScraperControl joins the same nearestmass_default external Docker network — no depends_on allowed since db is in a different compose file.
Deploying Updates
# From local machine:
bash scripts/deploy-to-nas.sh
# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
/Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'
Rebuilding Admin Dashboard
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'
Important Notes
- DO NOT add `depends_on: db` to any service, because `db` is defined in BethelGuide's compose file
- The `.env` on NAS uses host IP (192.168.0.145:5434) for scripts run outside Docker
- The `docker-compose.yml` environment overrides use `db:5432` (Docker DNS via shared network)
- Docker binary on NAS is at `/usr/local/bin/docker`
NAS Docker Health
The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See memory/nas-docker-health.md for full inventory.
Scheduler hardening: Uses detached: true + process group kill to prevent orphaned Chromium processes, init: true for zombie reaping, 24h job timeout, 8GB memory limit.
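The `detached: true` plus process group kill pattern can be sketched as follows. This is an illustration under assumptions, not the scheduler's actual code: `runJobWithTimeout` and `JobResult` are hypothetical names, and killing a negative PID only works on POSIX systems such as the Linux containers on the NAS.

```typescript
import { spawn } from "node:child_process";

// 24h job timeout, matching the limit described above.
const JOB_TIMEOUT_MS = 24 * 60 * 60 * 1000;

interface JobResult {
  exitCode: number | null;
  signal: string | null;
  timedOut: boolean;
}

// Hypothetical sketch: run a job in its own process group and kill the whole group on timeout.
function runJobWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number = JOB_TIMEOUT_MS,
): Promise<JobResult> {
  return new Promise((resolve, reject) => {
    // detached: true makes the child the leader of a new process group on POSIX,
    // so grandchildren (for example Chromium spawned by Playwright) share that group.
    const child = spawn(command, args, { detached: true, stdio: "ignore" });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      if (child.pid !== undefined) {
        try {
          // A negative PID targets the whole process group, not just the child.
          process.kill(-child.pid, "SIGKILL");
        } catch {
          // The group has already exited; nothing left to kill.
        }
      }
    }, timeoutMs);

    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("exit", (exitCode, signal) => {
      clearTimeout(timer);
      resolve({ exitCode, signal, timedOut });
    });
  });
}
```

Killing only `child.pid` would leave browser processes started by the job running. Signalling the group removes them together, and `init: true` in the container then reaps whatever zombies remain.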
Maintenance: Docker is on /volume1 (15TB free). Run docker builder prune -f occasionally to keep build cache tidy.