ScraperControl/docs/plans/2026-02-25-parallel-scrapers-design.md

Parallel Scrapers with Country Mapping Fix

Problem

The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days; the English scraper alone runs for more than 30 hours. In addition, 1,414 churches in unmapped countries (BE, CH, IN, and others) fall through to the generic scraper instead of being handled by the appropriate language scraper.

Changes

1. Country Mapping Additions (scraper-service.ts)

Add to COUNTRY_SCRAPER_MAP:

  • English: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
  • French: BE, LU
  • German: CH, SI
  • Italian: HR, RO
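
The additions above could be expressed roughly as follows. This is a minimal sketch only: the actual shape of COUNTRY_SCRAPER_MAP in scraper-service.ts is not shown in this document, so `ADDITIONS`, `countryScraperMap`, and `scraperForCountry` are hypothetical names, and the fallback to the generic scraper is assumed from the Problem section.

```typescript
// Hypothetical names and shapes: the real COUNTRY_SCRAPER_MAP in
// scraper-service.ts may be structured differently.
type ScraperName = "english" | "french" | "german" | "italian" | "generic";

// Only the additions listed above; existing entries are omitted.
const ADDITIONS: ReadonlyArray<readonly [ScraperName, readonly string[]]> = [
  ["english", ["IN", "SG", "MY", "KE", "JM", "TT", "GH", "NG", "ZA", "TZ", "UG"]],
  ["french", ["BE", "LU"]],
  ["german", ["CH", "SI"]],
  ["italian", ["HR", "RO"]],
];

// Flatten the grouped lists into a country-code -> scraper lookup.
const countryScraperMap = new Map<string, ScraperName>();
for (const [scraper, countries] of ADDITIONS) {
  for (const country of countries) {
    countryScraperMap.set(country, scraper);
  }
}

// Unmapped countries fall through to the generic scraper (assumed behavior).
function scraperForCountry(countryCode: string): ScraperName {
  return countryScraperMap.get(countryCode.toUpperCase()) ?? "generic";
}
```

With these entries, `scraperForCountry("BE")` resolves to the French scraper, while a code that is still unmapped continues to resolve to `"generic"`.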

2. Parallel Pipeline Groups (scheduler.ts)

Replace sequential PIPELINE_PHASES array with grouped phases:

Group  Phases                        Concurrency
1      osm-import, gcatholic-import  Sequential (shared data)
2      english, french, german       Parallel (3)
3      polish, spanish, italian      Parallel (3)
4      portuguese, czech, dutch      Parallel (3)
5      hungarian, generic            Parallel (2)

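Within each parallel group, the scheduler starts all jobs simultaneously, waits for all of them to finish, and then advances to the next group. Group 1 is the exception: its two imports run one after the other because they share data.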
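
The group-by-group behavior can be sketched as below. This is an illustration under assumptions, not the code in scheduler.ts: `PhaseGroup`, `PIPELINE_GROUPS`, `RunPhase`, and `runPipeline` are hypothetical names, and the failure handling (record the failed phase and keep going) is a guess, since the plan does not say what happens when a scraper fails.

```typescript
// Hypothetical types and names; the real scheduler.ts may differ.
interface PhaseGroup {
  phases: string[];
  parallel: boolean;
}

const PIPELINE_GROUPS: PhaseGroup[] = [
  { phases: ["osm-import", "gcatholic-import"], parallel: false },
  { phases: ["english", "french", "german"], parallel: true },
  { phases: ["polish", "spanish", "italian"], parallel: true },
  { phases: ["portuguese", "czech", "dutch"], parallel: true },
  { phases: ["hungarian", "generic"], parallel: true },
];

type RunPhase = (phase: string) => Promise<void>;

// Runs groups in order and returns the names of phases that failed.
async function runPipeline(groups: PhaseGroup[], runPhase: RunPhase): Promise<string[]> {
  const failed: string[] = [];
  for (const group of groups) {
    if (group.parallel) {
      // Start every phase in the group, then wait until all have settled.
      const results = await Promise.allSettled(group.phases.map((phase) => runPhase(phase)));
      results.forEach((result, i) => {
        if (result.status === "rejected") failed.push(group.phases[i]);
      });
    } else {
      // Sequential group: each phase finishes before the next starts.
      for (const phase of group.phases) {
        try {
          await runPhase(phase);
        } catch {
          failed.push(phase);
        }
      }
    }
  }
  return failed;
}
```

`Promise.allSettled` is used here rather than `Promise.all` because it waits for every job in the group even when one rejects, which matches "waits for all to finish".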

3. Generic Scraper Deprioritized

  • Moved to last group
  • Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)
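
A rough sketch of the pre-check, assuming the scheduler can ask how many unscraped churches remain in a queue. The actual query is not shown in this document, so `CountUnscraped` and `phasesToRun` are hypothetical and the count function is injected rather than written against a specific database.

```typescript
// Hypothetical: countUnscraped stands in for whatever query the scheduler
// uses to count unscraped churches in a given scraper queue.
type CountUnscraped = (queue: string) => Promise<number>;

// Drops the generic phase from a group when its queue is empty.
async function phasesToRun(
  phases: string[],
  countUnscraped: CountUnscraped,
): Promise<string[]> {
  const runnable: string[] = [];
  for (const phase of phases) {
    if (phase === "generic" && (await countUnscraped("generic")) === 0) {
      continue; // nothing queued, so skip the re-scrape
    }
    runnable.push(phase);
  }
  return runnable;
}
```

For the last group, `phasesToRun(["hungarian", "generic"], countUnscraped)` would yield only `["hungarian"]` when the generic queue is empty.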

4. Resource Changes

  • Scheduler container memory limit: 4GB → 10GB (to fit up to 3 concurrent Playwright/Chromium processes)
  • No new Docker containers or compose changes needed; the existing child process spawning approach is kept

Approach

Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns npx tsx processes; the change is to let several run concurrently instead of waiting for each to finish before starting the next.
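
Concurrent child processes can be sketched with Node's `child_process.spawn` as below. This is a minimal illustration under assumptions: `runChild` and `runGroupConcurrently` are hypothetical helpers, the script paths passed in are placeholders, and the real scheduler may track its children differently.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: spawns one child and resolves with its exit code.
function runChild(command: string, args: string[]): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: "inherit" });
    // "error" fires when the process could not be spawned at all.
    child.once("error", reject);
    // A null exit code means the child was killed by a signal; report 1.
    child.once("close", (code) => resolve(code ?? 1));
  });
}

// Starts one `npx tsx <script>` per entry and waits for all of them to exit.
// The script paths are placeholders supplied by the caller.
function runGroupConcurrently(scripts: string[]): Promise<number[]> {
  return Promise.all(scripts.map((script) => runChild("npx", ["tsx", script])));
}
```

All children are started before any is awaited, so the number of scripts passed in is the group's concurrency. On Windows, spawning `npx` this way may additionally need the `shell` option, which this sketch does not handle.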