# Parallel Scrapers with Country Mapping Fix

## Problem

The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days. The English scraper alone runs 30+ hours.

Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by the appropriate language scrapers.

## Changes

### 1. Country Mapping Additions (scraper-service.ts)

Add to `COUNTRY_SCRAPER_MAP`:

- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO

### 2. Parallel Pipeline Groups (scheduler.ts)

Replace the sequential `PIPELINE_PHASES` array with grouped phases:

| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |

For each parallel group, the scheduler starts all jobs simultaneously, waits for all of them to finish, then advances to the next group. Group 1 is the exception: its two imports share data, so they run one after the other.

### 3. Generic Scraper Deprioritized

- Moved to the last group
- Pre-check query: skip the run if there are no unscraped churches in the generic queue (avoids wasteful re-scrapes)

### 4. Resource Changes

- Scheduler container memory limit: 4GB → 10GB (to fit 3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed; the existing child process spawning approach is kept

## Approach

Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes, so the only change is to let several run concurrently instead of waiting for each to finish before starting the next.
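
The grouped execution could be sketched roughly as follows. This is a minimal illustration, not the actual `scheduler.ts` code: the names `PIPELINE_GROUPS`, `spawnPhase`, and `runGroups`, the `scrapers/<phase>.ts` script path, and the choice to record a failed phase and keep going are all assumptions made for the example. Only the group layout and the use of `npx tsx` child processes come from the description above.

```typescript
import { spawn } from "node:child_process";

// Hypothetical shape for a group of pipeline phases.
type PhaseGroup = { phases: string[]; parallel: boolean };
type RunPhase = (phase: string) => Promise<void>;

// Group layout copied from the table in section 2.
const PIPELINE_GROUPS: PhaseGroup[] = [
  { phases: ["osm-import", "gcatholic-import"], parallel: false },
  { phases: ["english", "french", "german"], parallel: true },
  { phases: ["polish", "spanish", "italian"], parallel: true },
  { phases: ["portuguese", "czech", "dutch"], parallel: true },
  { phases: ["hungarian", "generic"], parallel: true },
];

// Assumed runner: one `npx tsx` child process per phase.
// The script path is a placeholder, not the real project layout.
function spawnPhase(phase: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const child = spawn("npx", ["tsx", `scrapers/${phase}.ts`], { stdio: "inherit" });
    child.on("error", reject);
    child.on("exit", (code) =>
      code === 0 ? resolve() : reject(new Error(`${phase} exited with code ${code}`)),
    );
  });
}

// Runs one phase and returns its name on failure, or null on success,
// so that one failing scraper does not reject the whole group.
async function runPhaseSafely(phase: string, runPhase: RunPhase): Promise<string | null> {
  try {
    await runPhase(phase);
    return null;
  } catch {
    return phase;
  }
}

// Runs groups in order. Parallel groups start every phase at once and wait
// for all of them; sequential groups await each phase before the next.
// Returns the names of the phases that failed.
async function runGroups(
  groups: PhaseGroup[],
  runPhase: RunPhase = spawnPhase,
): Promise<string[]> {
  const failures: string[] = [];
  for (const group of groups) {
    let results: (string | null)[];
    if (group.parallel) {
      results = await Promise.all(group.phases.map((p) => runPhaseSafely(p, runPhase)));
    } else {
      results = [];
      for (const phase of group.phases) {
        results.push(await runPhaseSafely(phase, runPhase));
      }
    }
    for (const failed of results) {
      if (failed !== null) failures.push(failed);
    }
  }
  return failures;
}
```

Passing the runner as a parameter keeps the grouping logic testable without launching Chromium: a stub `runPhase` can stand in for `spawnPhase`. Whether a failed phase should abort the cycle or merely be logged is a policy decision this sketch does not settle.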