# Role in Ecosystem

**ScraperControl** is the data pipeline for the Church project: it handles scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs on the Synology NAS (Docker), not Vercel.

- **Schema sync**: Handled by `npm run sync` from the `Church/` root directory. No need to manually copy schema files.
- **Coordinated deployment**: Use `npm run deploy` from `Church/` root for full pipeline deployment.
- **Schema source of truth**: BethelGuide. Never run `prisma migrate` in ScraperControl.

---

# Claude Instructions for ScraperControl

## Project Overview

**ScraperControl** is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:

1. **Admin Dashboard** (Next.js): Job management UI at port 3001
2. **Web Scrapers**: Playwright-based scrapers for extracting mass schedules from church websites
3. **Enrichment Pipelines**: Google Places, FreeSearch, reverse geocode enrichment
4. **ChromaDB Integration**: Semantic search for deduplication, content classification, and change detection
5. **Scheduler**: Database-driven job queue for automated scraping

### Shared Database Architecture

ScraperControl and BethelGuide share the **same NAS PostgreSQL database** (192.168.0.145:5434). BethelGuide is the **schema source of truth**. After any schema change in BethelGuide, run `npm run sync` from the `Church/` root (see Role in Ecosystem). If the sync ever has to be done by hand, the manual steps are:

1. Copy `BethelGuide/prisma/schema.prisma` → `ScraperControl/prisma/schema.prisma`
2. Run `npx prisma generate` in ScraperControl (NOT `migrate`)
3. Rebuild Docker containers if needed

---

## Tech Stack

| Layer | Technology |
|-------|------------|
| Admin UI | Next.js 16, React 19, Tailwind CSS v4 |
| Database | Shared NAS PostgreSQL (192.168.0.145:5434) |
| ORM | Prisma 7 (`@prisma/adapter-pg` + `pg` Pool) |
| Web Scraping | Playwright (headless Chromium) |
| Vector DB | ChromaDB (192.168.0.145:8000) |
| Embeddings | Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text |
| Scheduling | node-cron + database-driven job queue |
| Containerization | Docker, Docker Compose |

---

## Project Structure

```
src/
├── app/                          # Next.js Admin Dashboard (port 3001)
│   ├── page.tsx                  # Main dashboard (Jobs, Scrapes, Search tabs)
│   └── api/admin/                # Admin API routes
│       ├── jobs/                 # Job management (GET/POST/PATCH)
│       ├── scrape-log/           # Recently scraped churches log
│       └── freesearch-log/       # FreeSearch results log
│
├── chromadb/                     # ChromaDB integration
│   ├── client.ts                 # ChromaDB client singleton
│   ├── embeddings.ts             # OpenAI-compatible embedding helper (Ollama)
│   ├── collections.ts            # Collection definitions (5 collections)
│   └── queries.ts                # Query helpers per use case
│
├── lib/                          # Core business logic
│   ├── db.ts                     # Prisma client singleton
│   ├── admin-auth.ts             # Timing-safe API key auth
│   ├── geo.ts                    # Haversine distance (minimal)
│   ├── scraper-service.ts        # Scraper orchestration
│   ├── overpass-client.ts        # OpenStreetMap Overpass API
│   ├── church-matcher.ts         # Church matching/dedup
│   └── masstimes-scraper.ts      # MassTimes.org integration
│
└── scrapers/                     # Web scraping system
    ├── base-scraper.ts           # Base class
    ├── index.ts                  # Exports
    ├── registry.ts               # Strategy registry
    ├── url-discovery.ts          # Mass schedule URL finder
    ├── strategies/               # Language-specific scrapers
    │   ├── generic.ts            # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/                     # Internationalization
        ├── day-names.ts          # Day name patterns per language
        └── day-ranges.ts         # Day range parsing ("Monday-Friday")

scripts/                          # CLI scripts
├── scrape-churches.ts            # Scrape churches by language
├── scrape-masstimes.ts           # Scrape from MassTimes.org
├── import-osm-churches.ts        # Import from OpenStreetMap
├── import-osm-region.ts          # Import specific OSM region
├── enrich-with-google-places.ts  # Google Places enrichment
├── enrich-with-freesearch.ts     # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts # Reverse geocode enrichment
├── scheduler.ts                  # Background job scheduler
├── dedup-mass-schedules.ts       # Mass schedule deduplication
├── dedup-churches.ts             # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts  # NAS → Neon production sync
├── populate-chromadb.ts          # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts               # Test scraper on a URL
├── test-url-discovery.ts         # Test URL discovery
├── test-edge-cases.ts            # International edge case tests
└── debug/                        # Debug/investigation scripts (~44 files)
```

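The tree above describes `src/lib/geo.ts` as a minimal Haversine distance helper. The following is a sketch of what such a helper might look like, not the project's actual code: the names `LatLng` and `haversineKm`, the kilometre unit, and the sample coordinates are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a Haversine helper; not the real contents of geo.ts.
const EARTH_RADIUS_KM = 6371; // mean Earth radius

interface LatLng {
  lat: number; // degrees
  lng: number; // degrees
}

const toRadians = (degrees: number): number => (degrees * Math.PI) / 180;

/** Great-circle distance between two points, in kilometres. */
function haversineKm(a: LatLng, b: LatLng): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Example: approximate coordinates of two cathedrals (Paris and Cologne).
const paris: LatLng = { lat: 48.853, lng: 2.3499 };
const cologne: LatLng = { lat: 50.9413, lng: 6.9583 };
console.log(`${haversineKm(paris, cologne).toFixed(0)} km`);
```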
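The tree above also lists `src/scrapers/i18n/day-ranges.ts` for parsing ranges such as "Monday-Friday". A minimal sketch of that idea follows, assuming English day names only and a week that starts on Monday. `DAYS` and `expandDayRange` are hypothetical names, and the real module presumably draws its patterns from `day-names.ts` for each language.

```typescript
// Hypothetical sketch; English names only, not the project's day-ranges.ts.
const DAYS = [
  "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
] as const;
type Day = (typeof DAYS)[number];

/**
 * Expands "Monday-Friday" (or "Monday to Friday") into its individual days.
 * Ranges that cross the end of the week wrap around, so "Saturday-Monday"
 * yields saturday, sunday, monday. Unknown input yields an empty array.
 */
function expandDayRange(range: string): Day[] {
  // Split on a hyphen, an en dash (U+2013), or the word "to".
  const parts = range.trim().toLowerCase().split(/\s*[-\u2013]\s*|\s+to\s+/);
  if (parts.length > 2) return [];

  const start = DAYS.indexOf(parts[0] as Day);
  if (start === -1) return [];
  if (parts.length === 1) return [DAYS[start]];

  const end = DAYS.indexOf(parts[1] as Day);
  if (end === -1) return [];

  const length = ((end - start + 7) % 7) + 1;
  return Array.from({ length }, (_, i) => DAYS[(start + i) % 7]);
}

console.log(expandDayRange("Monday-Friday"));
```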

---

## Common Commands

```bash
# === DEVELOPMENT ===
npm run dev                 # Start admin dashboard (localhost:3001)
npm run build               # Build Next.js app

# === SCRAPING ===
npm run scrape:churches     # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes    # Scrape from MassTimes.org
npm run test:scraper        # Test scraper on a URL
npm run test:discover       # Test URL discovery

# === ENRICHMENT ===
npm run enrich:places       # Google Places enrichment
npm run enrich:freesearch   # FreeSearch website enrichment

# === DATA MANAGEMENT ===
npm run dedup:masses        # Deduplicate mass schedules
npm run import:osm          # Import churches from OpenStreetMap
npm run transfer:neon       # Transfer enriched data to Neon production
npm run scheduler           # Start background job scheduler

# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all                         # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity  # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15                 # Find duplicate churches

# === DOCKER (on NAS) ===
docker compose build scraper                               # Build scraper image
docker compose --profile tools run --rm scraper <command>  # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment       # Start background services
```

---

## ChromaDB Integration

### Collections

| Collection | Purpose | Documents |
|---|---|---|
| `church_identity` | Deduplication | `{name} {address} {city} {country}` |
| `search_results` | FreeSearch matching | `{title} {snippet} {url}` |
| `page_classification` | Content classification | Page text (first 2000 chars) |
| `schedule_sections` | Schedule detection | Text blocks with mass times |
| `page_snapshots` | Change detection | Full page text |

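To make the `church_identity` row concrete, here is a small sketch of how a `{name} {address} {city} {country}` document could be assembled and how a distance threshold such as the `--threshold 0.15` passed to `dedup-churches.ts` might be applied. Everything here is an assumption for illustration: the `ChurchRecord` shape, the function names, and in particular the use of cosine distance, since this file does not state which distance metric the collection is configured with.

```typescript
// Hypothetical record shape; the real Prisma model may differ.
interface ChurchRecord {
  name: string;
  address?: string | null;
  city?: string | null;
  country?: string | null;
}

/** Builds the `{name} {address} {city} {country}` document, skipping empty fields. */
function buildIdentityDocument(church: ChurchRecord): string {
  return [church.name, church.address, church.city, church.country]
    .map((part) => part?.trim())
    .filter((part): part is string => Boolean(part))
    .join(" ");
}

/** Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal vectors. */
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat a zero vector as unrelated
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Flags a pair as a likely duplicate when its distance is at or below the threshold. */
const isLikelyDuplicate = (distance: number, threshold = 0.15): boolean =>
  distance <= threshold;

console.log(buildIdentityDocument({ name: "St. Mary", city: "Dublin", country: "IE" }));
```

A lower threshold flags fewer, closer matches. Whatever metric the collection really uses, the comparison direction (smaller means more similar) is the part this sketch is meant to show.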
### Infrastructure

- **ChromaDB server**: `http://192.168.0.145:8000` (on NAS)
- **Embedding API**: `http://192.168.0.75:11434/v1` (Ollama on MacBook M1)
- **Embedding model**: `nomic-embed-text` (~270MB, fast on M1)

### Prerequisite

Ollama must be running on the MacBook with LAN access enabled:
```bash
# Listen on all interfaces so the NAS can reach Ollama (runs in the foreground)
OLLAMA_HOST=0.0.0.0 ollama serve
# In a second terminal: download the embedding model
ollama pull nomic-embed-text
```

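Once Ollama is serving, embeddings can be requested through its OpenAI-compatible `/v1/embeddings` endpoint. The sketch below shows one way a helper like `src/chromadb/embeddings.ts` could do this. It is not the project's implementation: `buildEmbeddingRequest` and `embed` are hypothetical names, and the fallback values simply repeat the addresses listed under Infrastructure.

```typescript
// Minimal shape of an OpenAI-style embeddings response (only the fields used here).
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

/** Builds the URL and fetch options for an OpenAI-compatible embeddings call. */
function buildEmbeddingRequest(baseUrl: string, model: string, texts: string[]) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`, // tolerate a trailing slash
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input: texts }),
    },
  };
}

/** Returns one embedding per input text, in the same order as `texts`. */
async function embed(texts: string[]): Promise<number[][]> {
  const baseUrl = process.env.EMBEDDING_API_URL ?? "http://192.168.0.75:11434/v1";
  const model = process.env.EMBEDDING_MODEL ?? "nomic-embed-text";
  const { url, init } = buildEmbeddingRequest(baseUrl, model, texts);

  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`Embedding request failed: ${res.status} ${res.statusText}`);
  }
  const json = (await res.json()) as EmbeddingResponse;
  // Sort by index so the output order matches the input order.
  return [...json.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

// Usage (requires the Ollama server above to be reachable):
// const [vector] = await embed(["St. Mary 1 Main St Dublin IE"]);
```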

---

## Docker Services

| Service | Profile | Purpose |
|---|---|---|
| app | (default) | Admin dashboard on port 3001 |
| scraper | tools | Generic scraper (on-demand) |
| scraper-english | scraper-english | English language scraper |
| scraper-french | scraper-french | French language scraper |
| scraper-german | scraper-german | German language scraper |
| scraper-italian | scraper-italian | Italian language scraper |
| scraper-spanish | scraper-spanish | Spanish language scraper |
| scraper-generic | scraper-generic | Generic fallback scraper |
| scheduler | (default) | Background job scheduler |
| freesearch-enrichment | (default) | FreeSearch enrichment daemon |

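The per-language services in the table correspond to the language strategies under `src/scrapers/strategies/`, with `generic` as the fallback. A registry with that fallback behaviour could be sketched roughly as follows. The `ScraperStrategy` interface and `StrategyRegistry` class are hypothetical and only illustrate the lookup pattern. They are not the contents of `registry.ts`.

```typescript
// Hypothetical strategy shape; real strategies would also carry parsing logic.
interface ScraperStrategy {
  readonly language: string;
}

/** Looks up a strategy by language and falls back to a generic one. */
class StrategyRegistry {
  private readonly strategies = new Map<string, ScraperStrategy>();
  private readonly fallback: ScraperStrategy;

  constructor(fallback: ScraperStrategy) {
    this.fallback = fallback;
  }

  register(strategy: ScraperStrategy): void {
    this.strategies.set(strategy.language.toLowerCase(), strategy);
  }

  resolve(language?: string): ScraperStrategy {
    if (!language) return this.fallback;
    return this.strategies.get(language.toLowerCase()) ?? this.fallback;
  }
}

const registry = new StrategyRegistry({ language: "generic" });
for (const language of ["english", "french", "german", "italian", "spanish"]) {
  registry.register({ language });
}

console.log(registry.resolve("French").language);
console.log(registry.resolve("polish").language);
```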

---

## Environment Variables

```env
DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
```

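`ADMIN_API_KEY` protects the admin API routes, and the project structure lists `src/lib/admin-auth.ts` as a timing-safe API key check. A minimal sketch of such a check using Node's `crypto.timingSafeEqual` is shown below. The function name `isValidApiKey` and the hashing step are assumptions, and how the key is read from the request (header name, route wiring) is left out because this file does not specify it.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

/**
 * Compares a presented key with the expected key in constant time.
 * Both values are hashed first so the buffers always have the same length;
 * timingSafeEqual throws when given buffers of different lengths.
 */
function isValidApiKey(
  presented: string | null | undefined,
  expected: string | undefined,
): boolean {
  if (!presented || !expected) return false;
  const presentedHash = createHash("sha256").update(presented).digest();
  const expectedHash = createHash("sha256").update(expected).digest();
  return timingSafeEqual(presentedHash, expectedHash);
}

// Example: validate a key against the configured ADMIN_API_KEY.
console.log(isValidApiKey("your-secret-key", process.env.ADMIN_API_KEY));
```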

---

## NAS Deployment

ScraperControl is deployed on the Synology NAS at `/volume1/docker/scraper-control/`.

### Container Layout

| Container | Purpose | Port |
|-----------|---------|------|
| scraper-control-app-1 | Admin dashboard | 3001 |
| scraper-control-scheduler-1 | Job scheduler | - |
| scraper-control-freesearch-enrichment-1 | FreeSearch daemon | - |

The `db` container (`nearestmass-db-1`) is managed by BethelGuide's compose file at `/volume1/docker/nearestmass/`. ScraperControl joins the same `nearestmass_default` external Docker network. `depends_on` cannot reference `db`, because `db` is defined in a different compose file.

### Deploying Updates

```bash
# From local machine:
bash scripts/deploy-to-nas.sh

# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/

ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'
```

### Rebuilding Admin Dashboard

```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'
```

### Important Notes

- **DO NOT** add `depends_on: db` to any service: `db` is defined in BethelGuide's compose file
- The `.env` on NAS uses host IP (`192.168.0.145:5434`) for scripts run outside Docker
- The `docker-compose.yml` environment overrides use `db:5432` (Docker DNS via shared network)
- Docker binary on NAS is at `/usr/local/bin/docker`

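The two database addresses in the notes above differ only in host and port: `192.168.0.145:5434` from the LAN, `db:5432` inside the shared Docker network. As an illustration of that relationship, the sketch below derives the in-container URL from the host URL with the standard `URL` class. `withDockerHost` is a hypothetical helper. The project itself sets the override in `docker-compose.yml` rather than in code.

```typescript
/**
 * Rewrites only the host and port of a connection string,
 * leaving credentials and database name untouched.
 */
function withDockerHost(databaseUrl: string, host = "db", port = 5432): string {
  const url = new URL(databaseUrl);
  url.hostname = host;
  url.port = String(port);
  return url.toString();
}

// Host-side URL as used by scripts run outside Docker.
const hostUrl = "postgresql://postgres:postgres@192.168.0.145:5434/nearestmass";
console.log(withDockerHost(hostUrl));
```

Inside a container on `nearestmass_default`, the name `db` is resolved by Docker's DNS, which is why the compose override can use it without any IP address.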
### NAS Docker Health

The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See `memory/nas-docker-health.md` for full inventory.

**Scheduler hardening**: Uses `detached: true` + process group kill to prevent orphaned Chromium processes, `init: true` for zombie reaping, 24h job timeout, 8GB memory limit.

**Maintenance**: Docker is on /volume1 (15TB free). Run `docker builder prune -f` occasionally to keep build cache tidy.
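
The scheduler hardening described above relies on a POSIX detail: a child spawned with `detached: true` becomes the leader of its own process group, and signalling the negative PID reaches every process in that group, including any Chromium processes the job started. The sketch below illustrates the pattern with Node's `child_process`. The function names and the way the timeout is wired are assumptions, not the scheduler's actual code.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

/** Spawns a command as the leader of its own process group (POSIX). */
function spawnInOwnGroup(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: "ignore" });
}

/** Signals every process in the child's group; a negative PID targets the group. */
function killProcessGroup(child: ChildProcess, signal: NodeJS.Signals = "SIGKILL"): boolean {
  if (child.pid === undefined) return false; // spawn failed, nothing to kill
  try {
    process.kill(-child.pid, signal);
    return true;
  } catch {
    return false; // the group has already exited
  }
}

/** Runs a job and kills its whole process group if it exceeds the timeout. */
function runWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<number | null> {
  return new Promise((resolve, reject) => {
    const child = spawnInOwnGroup(command, args);
    const timer = setTimeout(() => killProcessGroup(child), timeoutMs);
    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("exit", (code) => {
      clearTimeout(timer);
      resolve(code); // null when the job was terminated by a signal
    });
  });
}

// Usage with the 24h limit mentioned above (the script path is illustrative):
// await runWithTimeout("npx", ["tsx", "scripts/scrape-churches.ts"], 24 * 60 * 60 * 1000);
```

Killing only the direct child would leave its descendants running, which is the orphaned-Chromium problem the hardening note refers to.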