ScraperControl/docs/plans/2026-02-25-parallel-scrapers-design.md

# Parallel Scrapers with Country Mapping Fix
## Problem
The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days; the English scraper alone runs for 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by the appropriate language scraper.
## Changes
### 1. Country Mapping Additions (scraper-service.ts)
Add to `COUNTRY_SCRAPER_MAP`:
- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO
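
A minimal sketch of the additions, assuming `COUNTRY_SCRAPER_MAP` is a plain record from ISO 3166-1 alpha-2 country code to scraper name. The real shape of the map in `scraper-service.ts` is not shown here; the `COUNTRY_SCRAPER_MAP_ADDITIONS` name, the `scraperForCountry` helper, and the `"generic"` fallback value are hypothetical.

```typescript
// Assumed shape: ISO 3166-1 alpha-2 country code -> scraper name.
type ScraperName = "english" | "french" | "german" | "italian" | "generic";

// Only the entries added by this design; existing entries are omitted.
const COUNTRY_SCRAPER_MAP_ADDITIONS: Record<string, ScraperName> = {
  // English
  IN: "english", SG: "english", MY: "english", KE: "english",
  JM: "english", TT: "english", GH: "english", NG: "english",
  ZA: "english", TZ: "english", UG: "english",
  // French
  BE: "french", LU: "french",
  // German
  CH: "german", SI: "german",
  // Italian
  HR: "italian", RO: "italian",
};

// Hypothetical lookup: unmapped countries fall through to the generic scraper.
function scraperForCountry(countryCode: string): ScraperName {
  return COUNTRY_SCRAPER_MAP_ADDITIONS[countryCode.toUpperCase()] ?? "generic";
}
```

With these entries in place, a church in BE would resolve to the French scraper rather than falling through to the generic one.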
### 2. Parallel Pipeline Groups (scheduler.ts)
Replace sequential `PIPELINE_PHASES` array with grouped phases:
| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |
The scheduler starts all jobs in a group simultaneously, waits for all of them to finish, then advances to the next group.
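
The group-by-group behaviour can be sketched as follows. This is an illustration under assumptions, not the scheduler's actual code: the `PhaseGroup` type, the `PIPELINE_GROUPS` name, and the injected `runPhase` callback are hypothetical, and recording failed phases while letting the rest of the group finish is one possible policy that this design does not specify.

```typescript
// Hypothetical types; the real scheduler.ts structures may differ.
type PhaseGroup = { phases: string[]; parallel: boolean };
type RunPhase = (phase: string) => Promise<void>;

const PIPELINE_GROUPS: PhaseGroup[] = [
  { phases: ["osm-import", "gcatholic-import"], parallel: false },
  { phases: ["english", "french", "german"], parallel: true },
  { phases: ["polish", "spanish", "italian"], parallel: true },
  { phases: ["portuguese", "czech", "dutch"], parallel: true },
  { phases: ["hungarian", "generic"], parallel: true },
];

// Runs one phase; resolves to null on success or the phase name on failure,
// so a failing job never prevents its siblings from finishing.
async function attempt(phase: string, runPhase: RunPhase): Promise<string | null> {
  try {
    await runPhase(phase);
    return null;
  } catch {
    return phase;
  }
}

// Runs groups in order and returns the names of phases that failed.
async function runPipeline(groups: PhaseGroup[], runPhase: RunPhase): Promise<string[]> {
  const failed: string[] = [];
  for (const group of groups) {
    let outcomes: (string | null)[] = [];
    if (group.parallel) {
      // Start every job in the group at once, then wait for all of them.
      outcomes = await Promise.all(group.phases.map((phase) => attempt(phase, runPhase)));
    } else {
      // Shared-data phases run one after another.
      for (const phase of group.phases) {
        outcomes.push(await attempt(phase, runPhase));
      }
    }
    for (const outcome of outcomes) {
      if (outcome !== null) failed.push(outcome);
    }
  }
  return failed;
}
```

In the scheduler, `runPhase` would presumably wrap the existing child-process spawn; passing it in as a parameter keeps the grouping logic testable without launching a browser.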
### 3. Generic Scraper Deprioritized
- Moved to last group
- Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)
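
The pre-check can be sketched as a guard around the generic run. The actual query is not shown in this document, so the `countUnscraped` callback stands in for it; that callback, `runGeneric`, and the `runGenericIfNeeded` name are all hypothetical.

```typescript
// Hypothetical stand-in for the pre-check query: resolves to the number of
// unscraped churches currently queued for the given scraper.
type CountUnscraped = (scraper: string) => Promise<number>;

// Runs the generic scraper only when its queue is non-empty.
async function runGenericIfNeeded(
  countUnscraped: CountUnscraped,
  runGeneric: () => Promise<void>,
): Promise<"ran" | "skipped"> {
  const pending = await countUnscraped("generic");
  if (pending === 0) {
    // Nothing queued: skip the run instead of re-scraping.
    return "skipped";
  }
  await runGeneric();
  return "ran";
}
```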
### 4. Resource Changes
- Scheduler container memory limit: 4GB → 10GB (3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed: the existing child-process spawning approach is kept
## Approach
Approach B: parallel child processes inside the scheduler container, with no Docker-in-Docker. The scheduler already spawns `npx tsx` processes; the change is to let several run concurrently instead of waiting for each to finish before starting the next.
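
One way to make a spawned process awaitable, so that several can be started together and waited on as a group, is to wrap it in a Promise. This is a sketch using Node's `child_process.spawn`; the `runChild` and `runScraper` helpers and the script path `scripts/run-scraper.ts` are hypothetical and not taken from the scheduler.

```typescript
import { spawn } from "node:child_process";

// Spawns a command and resolves with its exit code once the process closes.
function runChild(command: string, args: string[]): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: "inherit" });
    // Emitted when the process could not be spawned (e.g. command not found).
    child.on("error", reject);
    // `code` is null when the process was terminated by a signal; report 1 then.
    child.on("close", (code) => resolve(code ?? 1));
  });
}

// Hypothetical entry point: the real script path and arguments may differ.
function runScraper(language: string): Promise<number> {
  return runChild("npx", ["tsx", "scripts/run-scraper.ts", language]);
}
```

Because each call returns immediately with a pending Promise, starting three scrapers and awaiting them together needs no extra containers, only enough memory for the concurrent Chromium instances.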