# Parallel Scrapers with Country Mapping Fix

## Problem

The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days. The English scraper alone runs 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by appropriate language scrapers.
## Changes

### 1. Country Mapping Additions (scraper-service.ts)

Add to `COUNTRY_SCRAPER_MAP`:

- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO
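
The additions listed above could look roughly like the sketch below. The actual shape of `COUNTRY_SCRAPER_MAP` in `scraper-service.ts` is not shown in this document, so the `ScraperName` type, the `COUNTRY_SCRAPER_MAP_ADDITIONS` constant, and the `scraperForCountry` helper are all hypothetical names used only for illustration.

```typescript
// Hypothetical types and names; the real map in scraper-service.ts may differ.
type ScraperName = "english" | "french" | "german" | "italian" | "generic";

// Only the entries proposed in this document; existing entries are omitted.
const COUNTRY_SCRAPER_MAP_ADDITIONS: { [countryCode: string]: ScraperName | undefined } = {
  // English
  IN: "english", SG: "english", MY: "english", KE: "english",
  JM: "english", TT: "english", GH: "english", NG: "english",
  ZA: "english", TZ: "english", UG: "english",
  // French
  BE: "french", LU: "french",
  // German
  CH: "german", SI: "german",
  // Italian
  HR: "italian", RO: "italian",
};

// Countries without a mapping fall through to the generic scraper.
function scraperForCountry(countryCode: string): ScraperName {
  return COUNTRY_SCRAPER_MAP_ADDITIONS[countryCode.toUpperCase()] ?? "generic";
}
```

With these entries in place, a church in BE would be routed to the French scraper rather than the generic one.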
### 2. Parallel Pipeline Groups (scheduler.ts)

Replace sequential `PIPELINE_PHASES` array with grouped phases:

| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |

Scheduler starts all jobs in a group simultaneously, waits for all to finish, then advances to the next group.
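
A minimal sketch of the group-by-group loop described above, assuming each phase can be started through an async function. `PhaseGroup`, `PIPELINE_GROUPS`, `PhaseRunner`, and `runPipeline` are hypothetical names, and the failure handling (record the failed phase and keep going) is an assumption; the document does not say what should happen when a phase fails.

```typescript
// Hypothetical structure replacing a flat PIPELINE_PHASES array.
type PhaseGroup = { phases: string[]; parallel: boolean };

const PIPELINE_GROUPS: PhaseGroup[] = [
  { phases: ["osm-import", "gcatholic-import"], parallel: false }, // shared data
  { phases: ["english", "french", "german"], parallel: true },
  { phases: ["polish", "spanish", "italian"], parallel: true },
  { phases: ["portuguese", "czech", "dutch"], parallel: true },
  { phases: ["hungarian", "generic"], parallel: true },
];

// Starts one phase and resolves when it has finished (injected by the caller).
type PhaseRunner = (phase: string) => Promise<void>;

// Runs groups in order; returns the names of phases that threw.
async function runPipeline(groups: PhaseGroup[], runPhase: PhaseRunner): Promise<string[]> {
  const failed: string[] = [];

  // Assumption: one failing phase should not abort the rest of the pipeline.
  const safeRun = async (phase: string): Promise<void> => {
    try {
      await runPhase(phase);
    } catch {
      failed.push(phase);
    }
  };

  for (const group of groups) {
    if (group.parallel) {
      // Start every phase in the group at once, then wait for all of them.
      await Promise.all(group.phases.map((phase) => safeRun(phase)));
    } else {
      // Sequential group: finish each phase before starting the next.
      for (const phase of group.phases) {
        await safeRun(phase);
      }
    }
  }
  return failed;
}
```

Passing the runner in as a parameter keeps the grouping logic independent of how a phase is actually started, so it can be exercised with a fake runner before being wired to real scrapers.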
### 3. Generic Scraper Deprioritized

- Moved to last group
- Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)
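
The pre-check could be expressed as a small guard around the generic phase. This is only a sketch: `CountUnscraped`, `runGenericIfNeeded`, and the idea that the count comes from a single query keyed by scraper name are assumptions, since the document does not describe the query itself.

```typescript
// Hypothetical: returns how many unscraped churches are queued for a scraper.
type CountUnscraped = (scraper: string) => Promise<number>;

// Runs the generic scraper only when its queue is non-empty.
// Returns true if the scraper ran, false if it was skipped.
async function runGenericIfNeeded(
  countUnscraped: CountUnscraped,
  runGeneric: () => Promise<void>,
): Promise<boolean> {
  const pending = await countUnscraped("generic");
  if (pending === 0) {
    return false; // nothing queued, so avoid a wasteful re-scrape
  }
  await runGeneric();
  return true;
}
```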
### 4. Resource Changes

- Scheduler container memory limit: 4GB → 10GB (3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed; the existing child-process spawning approach is kept
## Approach

Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes; we just allow multiple to run concurrently instead of waiting for each to finish before starting the next.
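
As an illustration of this approach, the sketch below wraps Node's `child_process.spawn` in a promise and starts several children at once. `runChild` and `runConcurrently` are hypothetical helpers, not the scheduler's existing code, and treating a signal-terminated child as exit code 1 is an assumption.

```typescript
import { spawn } from "node:child_process";

type Job = { command: string; args: string[] };

// Spawns one child process and resolves with its exit code.
function runChild(job: Job): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    const child = spawn(job.command, job.args, { stdio: "inherit" });
    child.on("error", reject); // e.g. the command could not be started
    // code is null when the child was killed by a signal; report that as 1.
    child.on("close", (code) => resolve(code ?? 1));
  });
}

// Starts all jobs immediately and resolves once every child has exited.
function runConcurrently(jobs: Job[]): Promise<number[]> {
  return Promise.all(jobs.map((job) => runChild(job)));
}
```

A group of three scrapers would then be a single `runConcurrently` call with three `npx tsx` jobs; the script paths passed as arguments depend on the repository layout, which this document does not show.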