Parallel Scrapers with Country Mapping Fix
Problem
The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days; the English scraper alone runs for 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by the appropriate language scraper.
Changes
1. Country Mapping Additions (scraper-service.ts)
Add to COUNTRY_SCRAPER_MAP:
- English: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- French: BE, LU
- German: CH, SI
- Italian: HR, RO
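A minimal sketch of the additions above, assuming the map is a plain object keyed by ISO 3166-1 alpha-2 country code. The names `ScraperName`, `COUNTRY_SCRAPER_ADDITIONS`, and `scraperForCountry` are hypothetical; the real `COUNTRY_SCRAPER_MAP` in scraper-service.ts may be shaped differently.

```typescript
// Hypothetical shape; the real COUNTRY_SCRAPER_MAP in scraper-service.ts may differ.
type ScraperName = "english" | "french" | "german" | "italian" | "generic";

// Only the new entries from this change; existing mappings are omitted.
const COUNTRY_SCRAPER_ADDITIONS: Record<string, ScraperName> = {
  // English
  IN: "english", SG: "english", MY: "english", KE: "english",
  JM: "english", TT: "english", GH: "english", NG: "english",
  ZA: "english", TZ: "english", UG: "english",
  // French
  BE: "french", LU: "french",
  // German
  CH: "german", SI: "german",
  // Italian
  HR: "italian", RO: "italian",
};

// Resolve the scraper for a country code; unmapped codes fall through to generic.
function scraperForCountry(code: string): ScraperName {
  return COUNTRY_SCRAPER_ADDITIONS[code.toUpperCase()] ?? "generic";
}
```

With these entries in place, a church in BE resolves to the French scraper rather than the generic one.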
2. Parallel Pipeline Groups (scheduler.ts)
Replace the sequential PIPELINE_PHASES array with grouped phases:
| Group | Phases | Concurrency |
|---|---|---|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |
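For parallel groups, the scheduler starts all jobs in the group simultaneously, waits for all of them to finish, then advances to the next group. Group 1 runs its phases one after another because they share data.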
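The grouping in the table could be expressed roughly as follows. This is a sketch, not the actual scheduler.ts code: `PhaseGroup`, `PIPELINE_GROUPS`, `RunPhase`, and `runPipeline` are hypothetical names. Recording a failed phase without aborting its group is also an assumption, since the plan does not say how failures are handled.

```typescript
// Hypothetical types; the real scheduler.ts structures may differ.
interface PhaseGroup {
  phases: string[];
  parallel: boolean;
}

const PIPELINE_GROUPS: PhaseGroup[] = [
  { phases: ["osm-import", "gcatholic-import"], parallel: false }, // shared data
  { phases: ["english", "french", "german"], parallel: true },
  { phases: ["polish", "spanish", "italian"], parallel: true },
  { phases: ["portuguese", "czech", "dutch"], parallel: true },
  { phases: ["hungarian", "generic"], parallel: true },
];

// Runs one phase to completion (for example by spawning a child process).
type RunPhase = (phase: string) => Promise<void>;

// Runs groups in order and returns the names of phases that failed.
async function runPipeline(groups: PhaseGroup[], runPhase: RunPhase): Promise<string[]> {
  // Assumption: a failing phase is recorded and does not abort its group.
  const attempt = async (phase: string): Promise<string | null> => {
    try {
      await runPhase(phase);
      return null;
    } catch {
      return phase;
    }
  };

  const failed: string[] = [];
  for (const group of groups) {
    let outcomes: (string | null)[];
    if (group.parallel) {
      // Start every phase at once, then wait for the whole group (the barrier).
      outcomes = await Promise.all(group.phases.map(attempt));
    } else {
      // Sequential group: each phase starts only after the previous one ends.
      outcomes = [];
      for (const phase of group.phases) outcomes.push(await attempt(phase));
    }
    for (const outcome of outcomes) if (outcome !== null) failed.push(outcome);
  }
  return failed;
}
```

Because each group is awaited as a whole, the slowest scraper in a group determines when the next group starts.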
3. Generic Scraper Deprioritized
- Moved to last group
- Pre-check query: skip the phase if the generic queue has no unscraped churches (avoids wasteful re-scrapes)
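One way to apply the pre-check is to wrap the phase runner, as sketched below. The names `CountUnscraped`, `PhaseRunner`, and `skipGenericWhenEmpty` are hypothetical, and the count function stands in for whatever query the project actually uses.

```typescript
// Hypothetical: returns how many unscraped churches are queued for a scraper.
type CountUnscraped = (queue: string) => Promise<number>;
type PhaseRunner = (phase: string) => Promise<void>;

// Wraps a runner so the generic phase becomes a no-op when its queue is empty.
function skipGenericWhenEmpty(run: PhaseRunner, countUnscraped: CountUnscraped): PhaseRunner {
  return async (phase) => {
    if (phase === "generic" && (await countUnscraped("generic")) === 0) {
      console.log("generic: no unscraped churches, skipping");
      return;
    }
    await run(phase);
  };
}
```

Other phases pass through unchanged, so the wrapper can be applied to the whole pipeline without special-casing the last group.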
4. Resource Changes
- Scheduler container memory limit: 4 GB → 10 GB (to fit up to 3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed — existing child process spawning approach is kept
Approach
Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns npx tsx processes — we just allow multiple to run concurrently instead of waiting for each to finish before starting the next.
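A rough sketch of this approach using Node's `child_process.spawn`. The helper names `runChild` and `runScrapersConcurrently` are hypothetical, and how the real scheduler builds its `npx tsx` command lines is not shown in this plan.

```typescript
import { spawn } from "node:child_process";

// Runs one command as a child process and resolves with its exit code.
function runChild(command: string, args: string[]): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: "inherit" });
    child.on("error", reject); // e.g. the command could not be started
    // A null code means the process was terminated by a signal; report it as 1.
    child.on("close", (code) => resolve(code ?? 1));
  });
}

// Starts one `npx tsx <script>` process per script and waits for all of them.
// The script paths are supplied by the caller (assumption: one script per scraper).
function runScrapersConcurrently(scripts: string[]): Promise<number[]> {
  return Promise.all(scripts.map((script) => runChild("npx", ["tsx", script])));
}
```

All processes run inside the same container, which is why the memory limit has to cover the largest group rather than a single scraper.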