# Role in Ecosystem
**ScraperControl** is the data pipeline for the Church project — handling scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs on the Synology NAS (Docker), not Vercel.
- **Schema sync**: Handled by `npm run sync` from the `Church/` root directory. No need to manually copy schema files.
- **Coordinated deployment**: Use `npm run deploy` from `Church/` root for full pipeline deployment.
- **Schema source of truth**: BethelGuide — never run `prisma migrate` in ScraperControl.
---
# Claude Instructions for ScraperControl
## Project Overview
**ScraperControl** is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:
1. **Admin Dashboard** (Next.js): Job management UI at port 3001
2. **Web Scrapers**: Playwright-based scrapers for extracting mass schedules from church websites
3. **Enrichment Pipelines**: Google Places, FreeSearch, reverse geocode enrichment
4. **ChromaDB Integration**: Semantic search for deduplication, content classification, and change detection
5. **Scheduler**: Database-driven job queue for automated scraping
### Shared Database Architecture
ScraperControl and BethelGuide share the **same NAS PostgreSQL database** (192.168.0.145:5434). BethelGuide is the **schema source of truth**. After any schema change in BethelGuide:
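1. Copy `BethelGuide/prisma/schema.prisma` → `ScraperControl/prisma/schema.prisma` (`npm run sync` from the `Church/` root handles this copy, so it rarely needs to be done by hand)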
2. Run `npx prisma generate` in ScraperControl (NOT `migrate`)
3. Rebuild Docker containers if needed
---
## Tech Stack
| Layer | Technology |
|-------|------------|
| Admin UI | Next.js 16, React 19, Tailwind CSS v4 |
| Database | Shared NAS PostgreSQL (192.168.0.145:5434) |
| ORM | Prisma 7 (`@prisma/adapter-pg` + `pg` Pool) |
| Web Scraping | Playwright (headless Chromium) |
| Vector DB | ChromaDB (192.168.0.145:8000) |
| Embeddings | Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text |
| Scheduling | node-cron + database-driven job queue |
| Containerization | Docker, Docker Compose |
---
## Project Structure
```
src/
├── app/                            # Next.js Admin Dashboard (port 3001)
│   ├── page.tsx                    # Main dashboard (Jobs, Scrapes, Search tabs)
│   └── api/admin/                  # Admin API routes
│       ├── jobs/                   # Job management (GET/POST/PATCH)
│       ├── scrape-log/             # Recently scraped churches log
│       └── freesearch-log/         # FreeSearch results log
├── chromadb/                       # ChromaDB integration
│   ├── client.ts                   # ChromaDB client singleton
│   ├── embeddings.ts               # OpenAI-compatible embedding helper (Ollama)
│   ├── collections.ts              # Collection definitions (5 collections)
│   └── queries.ts                  # Query helpers per use case
├── lib/                            # Core business logic
│   ├── db.ts                       # Prisma client singleton
│   ├── admin-auth.ts               # Timing-safe API key auth
│   ├── geo.ts                      # Haversine distance (minimal)
│   ├── scraper-service.ts          # Scraper orchestration
│   ├── overpass-client.ts          # OpenStreetMap Overpass API
│   ├── church-matcher.ts           # Church matching/dedup
│   └── masstimes-scraper.ts        # MassTimes.org integration
└── scrapers/                       # Web scraping system
    ├── base-scraper.ts             # Base class
    ├── index.ts                    # Exports
    ├── registry.ts                 # Strategy registry
    ├── url-discovery.ts            # Mass schedule URL finder
    ├── strategies/                 # Language-specific scrapers
    │   ├── generic.ts              # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/                       # Internationalization
        ├── day-names.ts            # Day name patterns per language
        └── day-ranges.ts           # Day range parsing ("Monday-Friday")
scripts/                            # CLI scripts
├── scrape-churches.ts              # Scrape churches by language
├── scrape-masstimes.ts             # Scrape from MassTimes.org
├── import-osm-churches.ts          # Import from OpenStreetMap
├── import-osm-region.ts            # Import specific OSM region
├── enrich-with-google-places.ts    # Google Places enrichment
├── enrich-with-freesearch.ts       # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts  # Reverse geocode enrichment
├── scheduler.ts                    # Background job scheduler
├── dedup-mass-schedules.ts         # Mass schedule deduplication
├── dedup-churches.ts               # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts    # NAS → Neon production sync
├── populate-chromadb.ts            # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts                 # Test scraper on a URL
├── test-url-discovery.ts           # Test URL discovery
├── test-edge-cases.ts              # International edge case tests
└── debug/                          # Debug/investigation scripts (~44 files)
```
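
To give a feel for what `i18n/day-ranges.ts` is responsible for, here is a minimal English-only sketch of day range expansion. The names `DAYS` and `expandDayRange` are hypothetical and the real module may be structured quite differently (it presumably covers several languages).

```typescript
// Hypothetical sketch; not the actual contents of src/scrapers/i18n/day-ranges.ts.
const DAYS = [
  "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
] as const;
type Day = (typeof DAYS)[number];

/** Expand "Monday-Friday" (or a single day name) into the days it covers. */
function expandDayRange(text: string): Day[] {
  // Accept a hyphen, an en dash, or the word "to" as the range separator.
  const parts = text.trim().split(/\s*(?:-|–|\bto\b)\s*/i);
  const indexOf = (name: string): number =>
    DAYS.findIndex((d) => d.toLowerCase() === name.trim().toLowerCase());

  const start = indexOf(parts[0] ?? "");
  if (start === -1) return [];
  if (parts.length === 1) return [DAYS[start]];

  const end = indexOf(parts[1] ?? "");
  if (end === -1) return [];

  // Walk forward from start to end, wrapping past Saturday ("Saturday-Monday").
  const days: Day[] = [];
  for (let i = start; ; i = (i + 1) % DAYS.length) {
    days.push(DAYS[i]);
    if (i === end) break;
  }
  return days;
}
```

Unrecognized day names yield an empty array here, so a caller can fall back to another strategy instead of throwing.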
---
## Common Commands
```bash
# === DEVELOPMENT ===
npm run dev # Start admin dashboard (localhost:3001)
npm run build # Build Next.js app
# === SCRAPING ===
npm run scrape:churches # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes # Scrape from MassTimes.org
npm run test:scraper # Test scraper on a URL
npm run test:discover # Test URL discovery
# === ENRICHMENT ===
npm run enrich:places # Google Places enrichment
npm run enrich:freesearch # FreeSearch website enrichment
# === DATA MANAGEMENT ===
npm run dedup:masses # Deduplicate mass schedules
npm run import:osm # Import churches from OpenStreetMap
npm run transfer:neon # Transfer enriched data to Neon production
npm run scheduler # Start background job scheduler
# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15 # Find duplicate churches
# === DOCKER (on NAS) ===
docker compose build scraper # Build scraper image
docker compose --profile tools run --rm scraper <command> # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment # Start background services
```
---
## ChromaDB Integration
### Collections
| Collection | Purpose | Documents |
|---|---|---|
| `church_identity` | Deduplication | `{name} {address} {city} {country}` |
| `search_results` | FreeSearch matching | `{title} {snippet} {url}` |
| `page_classification` | Content classification | Page text (first 2000 chars) |
| `schedule_sections` | Schedule detection | Text blocks with mass times |
| `page_snapshots` | Change detection | Full page text |
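
As an illustration of the `church_identity` row, the sketch below builds the `{name} {address} {city} {country}` document and filters nearest-neighbor results by distance. The interfaces and function names are hypothetical, and treating `0.15` (the `--threshold` value used with `dedup-churches.ts`) as a maximum embedding distance is an assumption.

```typescript
// Hypothetical shapes; the real Prisma model and query helpers may differ.
interface ChurchRecord {
  id: string;
  name: string;
  address?: string | null;
  city?: string | null;
  country?: string | null;
}

interface Neighbor {
  id: string;
  distance: number; // assumed: smaller means more similar
}

/** Build the text that gets embedded for the church_identity collection. */
function identityDocument(church: ChurchRecord): string {
  return [church.name, church.address, church.city, church.country]
    .filter((part): part is string => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim())
    .join(" ");
}

/** Keep neighbors within the threshold, excluding the church itself. */
function likelyDuplicates(
  selfId: string,
  neighbors: Neighbor[],
  threshold = 0.15,
): Neighbor[] {
  return neighbors
    .filter((n) => n.id !== selfId && n.distance <= threshold)
    .sort((a, b) => a.distance - b.distance);
}
```

Skipping empty fields keeps missing addresses from adding stray whitespace to the embedded text.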
### Infrastructure
- **ChromaDB server**: `http://192.168.0.145:8000` (on NAS)
- **Embedding API**: `http://192.168.0.75:11434/v1` (Ollama on MacBook M1)
- **Embedding model**: `nomic-embed-text` (~270MB, fast on M1)
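
A rough sketch of how a request to the embedding API could look, assuming Ollama's OpenAI-compatible `/embeddings` route under the `/v1` base URL and the usual OpenAI response shape (`data[].embedding`). The helper names are hypothetical and are not taken from `src/chromadb/embeddings.ts`.

```typescript
const EMBEDDING_API_URL = "http://192.168.0.75:11434/v1";
const EMBEDDING_MODEL = "nomic-embed-text";

// Assumed OpenAI-style response body.
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

/** Build the URL and fetch options for an embeddings call. */
function buildEmbeddingRequest(
  texts: string[],
  baseUrl: string = EMBEDDING_API_URL,
  model: string = EMBEDDING_MODEL,
) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input: texts }),
    },
  };
}

/** Return embeddings in the same order as the input texts. */
function parseEmbeddingResponse(json: EmbeddingResponse): number[][] {
  return [...json.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

/** Embed a batch of texts; requires Ollama to be reachable on the LAN. */
async function embed(texts: string[]): Promise<number[][]> {
  const { url, init } = buildEmbeddingRequest(texts);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  return parseEmbeddingResponse((await res.json()) as EmbeddingResponse);
}
```

Splitting request building and response parsing from the network call keeps both testable without a running Ollama instance.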
### Prerequisite
Ollama must be running on the MacBook with LAN access enabled:
```bash
OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull nomic-embed-text
```
---
## Docker Services
| Service | Profile | Purpose |
|---|---|---|
| app | (default) | Admin dashboard on port 3001 |
| scraper | tools | Generic scraper (on-demand) |
| scraper-english | scraper-english | English language scraper |
| scraper-french | scraper-french | French language scraper |
| scraper-german | scraper-german | German language scraper |
| scraper-italian | scraper-italian | Italian language scraper |
| scraper-spanish | scraper-spanish | Spanish language scraper |
| scraper-generic | scraper-generic | Generic fallback scraper |
| scheduler | (default) | Background job scheduler |
| freesearch-enrichment | (default) | FreeSearch enrichment daemon |
---
## Environment Variables
```env
DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
```
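
One way to fail fast on a misconfigured environment is to validate these variables at startup. The sketch below is illustrative only: `loadConfig` is a hypothetical helper, and treating every variable as mandatory is an assumption (some scripts may only need a subset).

```typescript
// Variable names come from the env block above; requiring all of them is an assumption.
const EXPECTED_VARS = [
  "DATABASE_URL",
  "ADMIN_API_KEY",
  "CHROMADB_URL",
  "EMBEDDING_API_URL",
  "EMBEDDING_MODEL",
  "GOOGLE_PLACES_API_KEY",
  "FREESEARCH_URL",
] as const;

type EnvName = (typeof EXPECTED_VARS)[number];
type Config = Record<EnvName, string>;

/** Return a typed config object, or throw listing every missing variable. */
function loadConfig(env: Record<string, string | undefined>): Config {
  const missing = EXPECTED_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as Config;
  for (const name of EXPECTED_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

A script would typically call `loadConfig(process.env)` once at startup and pass the result down, rather than reading `process.env` in many places.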
---
## NAS Deployment
ScraperControl is deployed on the Synology NAS at `/volume1/docker/scraper-control/`.
### Container Layout
| Container | Purpose | Port |
|-----------|---------|------|
| scraper-control-app-1 | Admin dashboard | 3001 |
| scraper-control-scheduler-1 | Job scheduler | - |
| scraper-control-freesearch-enrichment-1 | FreeSearch daemon | - |
The `db` container (`nearestmass-db-1`) is managed by BethelGuide's compose file at `/volume1/docker/nearestmass/`. ScraperControl joins the same `nearestmass_default` external Docker network — no `depends_on` allowed since `db` is in a different compose file.
### Deploying Updates
```bash
# From local machine:
bash scripts/deploy-to-nas.sh
# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
/Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'
```
### Rebuilding Admin Dashboard
```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'
```
### Important Notes
- **DO NOT** add `depends_on: db` to any service — `db` is in BethelGuide's compose file
- The `.env` on NAS uses host IP (`192.168.0.145:5434`) for scripts run outside Docker
- The `docker-compose.yml` environment overrides use `db:5432` (Docker DNS via shared network)
- Docker binary on NAS is at `/usr/local/bin/docker`
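
The two database addresses in the notes above can be expressed as a small helper. This is only a sketch to make the distinction concrete: `databaseUrlFor` is hypothetical, and in the actual setup the switch is done by the `docker-compose.yml` environment overrides rather than in code.

```typescript
/**
 * Rewrite a PostgreSQL connection string for where the code runs:
 * inside Docker the database is reached as db:5432 (Docker DNS on the
 * shared network); outside Docker it is reached via the NAS host IP.
 */
function databaseUrlFor(baseUrl: string, insideDocker: boolean): string {
  const url = new URL(baseUrl);
  if (insideDocker) {
    url.hostname = "db";
    url.port = "5432";
  } else {
    url.hostname = "192.168.0.145";
    url.port = "5434";
  }
  return url.toString();
}
```

Credentials and the database name are left untouched; only the host and port change.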
### NAS Docker Health
The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See `memory/nas-docker-health.md` for full inventory.
**Scheduler hardening**: Uses `detached: true` + process group kill to prevent orphaned Chromium processes, `init: true` for zombie reaping, 24h job timeout, 8GB memory limit.
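
The `detached: true` plus process group kill pattern can be sketched as follows. This is a minimal illustration rather than the scheduler's actual code: `runJob` and `JOB_TIMEOUT_MS` are hypothetical names, and the `init: true` zombie reaping and the memory limit are container settings that this snippet does not cover.

```typescript
import { spawn } from "node:child_process";

// 24h job timeout, matching the hardening note above.
const JOB_TIMEOUT_MS = 24 * 60 * 60 * 1000;

/** Run a command in its own process group; kill the whole group on timeout. */
function runJob(
  command: string,
  args: string[],
  timeoutMs: number = JOB_TIMEOUT_MS,
): Promise<number | null> {
  return new Promise((resolve, reject) => {
    // On POSIX, detached: true makes the child a process group leader, so
    // anything it launches (such as Chromium) shares its group ID.
    const child = spawn(command, args, { detached: true, stdio: "inherit" });

    const timer = setTimeout(() => {
      if (child.pid !== undefined) {
        try {
          // A negative PID sends the signal to every process in the group.
          process.kill(-child.pid, "SIGKILL");
        } catch {
          // The group has already exited; nothing left to clean up.
        }
      }
    }, timeoutMs);

    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("exit", (code) => {
      clearTimeout(timer);
      resolve(code);
    });
  });
}
```

Killing only `child.pid` would leave browser subprocesses running; signalling the negative PID is what takes down the whole group.
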
**Maintenance**: Docker is on /volume1 (15TB free). Run `docker builder prune -f` occasionally to keep build cache tidy.