chore: sync with Gitea master and restore local-only files

Reset local main to gitea/master (new source of truth) and restored local-only files: web scrapers, admin dashboard, ChromaDB integration, debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes, bohosluzby, kerknet, gottesdienstzeiten, miserend importers, ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/plans/2026-02-25-parallel-scrapers-design.md
@@ -0,0 +1,43 @@
# Parallel Scrapers with Country Mapping Fix

## Problem

The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days; the English scraper alone runs for more than 30 hours. In addition, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being routed to the appropriate language scraper.

## Changes

### 1. Country Mapping Additions (scraper-service.ts)

Add to `COUNTRY_SCRAPER_MAP`:
- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO
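
The additions above could look roughly like the following. This is a minimal sketch, not the actual contents of `scraper-service.ts`: the `ScraperName` type, the `COUNTRY_SCRAPER_ADDITIONS` constant, and the `scraperForCountry` helper are hypothetical names, and the real `COUNTRY_SCRAPER_MAP` may have a different shape.

```typescript
// Hypothetical type; the real scraper names may be modelled differently.
type ScraperName = "english" | "french" | "german" | "italian" | "generic";

// Only the new entries from this plan; existing map entries are omitted.
const COUNTRY_SCRAPER_ADDITIONS: Record<string, ScraperName> = {
  // English
  IN: "english", SG: "english", MY: "english", KE: "english",
  JM: "english", TT: "english", GH: "english", NG: "english",
  ZA: "english", TZ: "english", UG: "english",
  // French
  BE: "french", LU: "french",
  // German
  CH: "german", SI: "german",
  // Italian
  HR: "italian", RO: "italian",
};

// Assumed lookup behaviour: countries without an entry fall through
// to the generic scraper, as described in the Problem section.
function scraperForCountry(countryCode: string): ScraperName {
  return COUNTRY_SCRAPER_ADDITIONS[countryCode.toUpperCase()] ?? "generic";
}
```

With these entries, `scraperForCountry("BE")` resolves to the French scraper, while a code with no entry still resolves to `"generic"`.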

### 2. Parallel Pipeline Groups (scheduler.ts)

Replace the sequential `PIPELINE_PHASES` array with grouped phases:

| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |

The scheduler starts all jobs in a group simultaneously, waits for all of them to finish, then advances to the next group.
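
The group-by-group behaviour can be sketched as below. The `PhaseGroup` interface, `PIPELINE_GROUPS`, `runGroup`, and `runPipeline` are hypothetical names chosen for illustration, and `runPhase` stands in for whatever the scheduler actually does to execute one phase. The sketch assumes that a failed job should not prevent the others in its group from finishing, which is why it uses `Promise.allSettled` rather than `Promise.all`.

```typescript
// Hypothetical model of a pipeline group; scheduler.ts may differ.
interface PhaseGroup {
  phases: string[];
  parallel: boolean; // false = run the phases one after another
}

const PIPELINE_GROUPS: PhaseGroup[] = [
  { phases: ["osm-import", "gcatholic-import"], parallel: false },
  { phases: ["english", "french", "german"], parallel: true },
  { phases: ["polish", "spanish", "italian"], parallel: true },
  { phases: ["portuguese", "czech", "dutch"], parallel: true },
  { phases: ["hungarian", "generic"], parallel: true },
];

// Stand-in for the code that runs a single phase.
type PhaseRunner = (phase: string) => Promise<void>;

async function runGroup(
  group: PhaseGroup,
  runPhase: PhaseRunner,
): Promise<PromiseSettledResult<void>[]> {
  if (group.parallel) {
    // Start every job now; allSettled resolves only when all have finished.
    return Promise.allSettled(group.phases.map((phase) => runPhase(phase)));
  }
  // Sequential group: each phase starts after the previous one has settled.
  const results: PromiseSettledResult<void>[] = [];
  for (const phase of group.phases) {
    results.push(...(await Promise.allSettled([runPhase(phase)])));
  }
  return results;
}

async function runPipeline(groups: PhaseGroup[], runPhase: PhaseRunner): Promise<void> {
  for (const group of groups) {
    const results = await runGroup(group, runPhase);
    const failed = results.filter((r) => r.status === "rejected").length;
    if (failed > 0) {
      console.error(`${failed} job(s) failed in group [${group.phases.join(", ")}]`);
    }
  }
}
```

Awaiting each group before starting the next acts as the barrier between groups, so at most three scraper jobs run at any time.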

### 3. Generic Scraper Deprioritized

- Moved to the last group
- Pre-check query: skip the run if the generic queue has no unscraped churches (avoids wasteful re-scrapes)
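
A minimal sketch of the pre-check, assuming the scheduler can ask for the number of unscraped churches per scraper. `CountUnscraped` and `shouldRunGeneric` are hypothetical names, and the count function is injected because the actual query is not specified in this plan.

```typescript
// Hypothetical: stands in for whatever query returns the number of
// unscraped churches assigned to a given scraper.
type CountUnscraped = (scraper: string) => Promise<number>;

// Returns true only when the generic queue has work to do.
async function shouldRunGeneric(countUnscraped: CountUnscraped): Promise<boolean> {
  const pending = await countUnscraped("generic");
  return pending > 0;
}
```

The scheduler would call this before starting the generic job in the last group and skip the job when it returns `false`.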

### 4. Resource Changes

- Scheduler container memory limit: 4 GB → 10 GB (3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed; the existing child-process spawning approach is kept

## Approach

Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes; the change simply allows several of them to run concurrently instead of waiting for each to finish before starting the next.
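
As an illustration of this approach, the sketch below wraps Node's `child_process.spawn` in a promise so that several child processes can be started together and awaited as a set. `runChild`, `runScraper`, and `runConcurrently` are hypothetical helpers, and the script path and `--language` flag passed to `npx tsx` are placeholders, not the project's real invocation.

```typescript
import { spawn } from "node:child_process";

// Runs one command as a child process and resolves with its exit code.
function runChild(command: string, args: string[]): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: "inherit" });
    child.on("error", reject); // e.g. the command could not be started
    // A null code means the process was terminated by a signal; report it as 1.
    child.on("close", (code) => resolve(code ?? 1));
  });
}

// Placeholder invocation: the script path and flag are assumptions.
function runScraper(language: string): Promise<number> {
  return runChild("npx", ["tsx", "scripts/run-scraper.ts", "--language", language]);
}

// Starts all scrapers immediately and resolves once every child has exited.
function runConcurrently(languages: string[]): Promise<number[]> {
  return Promise.all(languages.map((language) => runScraper(language)));
}
```

Because `runChild` resolves with the exit code instead of rejecting on a non-zero exit, one failing scraper does not cut short the wait for the others; only a failure to start the process rejects.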