diff --git a/docs/superpowers/specs/2026-03-12-buscarmisas-network-importer-design.md b/docs/superpowers/specs/2026-03-12-buscarmisas-network-importer-design.md
new file mode 100644
index 0000000..ce878f9
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-12-buscarmisas-network-importer-design.md
@@ -0,0 +1,145 @@
+# Design: BuscarMisas Network Importer
+
+**Date:** 2026-03-12
+**Script:** `scripts/import-buscarmisas-network.ts`
+
+## Overview
+
+A single importer script covering a network of 7 Catholic mass-time directory sites that share an identical WordPress/Yoast structure. All sites use the URL pattern `/{region}/{city}/{slug}/` and the same HTML layout with a mass schedule table. One script with a `--site <code>` flag handles all countries.
+
+## Sites
+
+| Code | Domain | Country | Language | Est. Churches |
+|------|--------|---------|----------|---------------|
+| `br` | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
+| `mx` | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
+| `ar` | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
+| `co` | buscarmisas.co | Colombia | Spanish | ~1,000 |
+| `cl` | horariomisa.cl | Chile | Spanish | ~1,000 |
+| `gb` | masstime.co.uk | United Kingdom | English | ~1,000 |
+| `ch` | horairemesses.ch | Switzerland | French | ~500 |
+
+**Total: ~9,500 new churches across 7 countries**
+
+## Architecture
+
+### Site Registry
+
+```typescript
+const SITES: Record<string, SiteConfig> = {
+  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
+  mx: { baseUrl: 'https://buscarmisas.com.mx', country: 'MX', language: 'es' },
+  ar: { baseUrl: 'https://horariosmisa.com.ar', country: 'AR', language: 'es' },
+  co: { baseUrl: 'https://buscarmisas.co', country: 'CO', language: 'es' },
+  cl: { baseUrl: 'https://horariomisa.cl', country: 'CL', language: 'es' },
+  gb: { baseUrl: 'https://masstime.co.uk', country: 'GB', language: 'en' },
+  ch: { baseUrl: 'https://horairemesses.ch', country: 'CH', language: 'fr' },
+};
+```
+
+Each `SiteConfig` includes a `dayMap: Record<string, number>` mapping localized day names to 0–6 (Sun–Sat).
+
+### Processing Flow
+
+1. **Sitemap discovery**: fetch `{baseUrl}/sitemap_index.xml` → extract `page-sitemap*.xml` URLs
+2. **URL collection**: fetch each page-sitemap → extract church URLs, filtering to exactly 3 path segments (`/{region}/{city}/{slug}/`)
+3. **Dedup skip**: load existing DB churches for that country; skip URLs whose slug is already stored as `source`+`sourceId`
+4. **Per-church fetch**: GET church page, parse HTML:
+   - **Name**: H1 heading
+   - **Address**: contact info table (`Endereço`/`Dirección`/`Address`/`Adresse`)
+   - **City/Region**: from URL path segments or table
+   - **Phone**: table row
+   - **Website**: table row
+   - **Mass schedule**: `<table>` with day column and time column
+5. **Geocoding**: if no coordinates, Nominatim lookup on `{name}, {city}, {country}`
+6. **Upsert**: `findDuplicateChurch()` match → create or update church + mass schedules
+7. **Rate limiting**: 3s between HTTP requests; 1.1s between Nominatim requests
+
+### Data Extraction
+
+Church page HTML structure (identical across all sites):
+```html
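+<h1>Church Name</h1>
+<table>
+  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
+  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
+  <tr><td>Site</td><td>https://...</td></tr>
+</table>
+<table>
+  <tr><th>Dia</th><th>Horário da Missa</th></tr>
+  <tr><td>Segunda-feira</td><td>08:00</td></tr>
+  ...
+</table>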
+```
+
+Multiple times per day are comma- or newline-separated within the time cell.
+
+### Day Maps
+
+- **Portuguese (br)**: Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
+- **Spanish (mx/ar/co/cl)**: Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
+- **English (gb)**: Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
+- **French (ch)**: Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0
+
+### URL Exclusion Patterns
+
+Skip non-church pages matching:
+- `/about*/`, `/contact*/`, `/privacy*/`, `/cookie*/`, `/terms*/`
+- `/blog/`, `/news*/`, `/noticias/`, `/actualidad/`
+- Any page with < 3 path segments (region/city index pages)
+
+## CLI Interface
+
+```
+npx tsx scripts/import-buscarmisas-network.ts --all --site br
+npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
+npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
+npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
+npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}
+```
+
+| Flag | Description |
+|------|-------------|
+| `--site <code>` | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
+| `--all` | Run full import |
+| `--dry-run` | Parse only, no DB writes |
+| `--limit N` | Cap churches processed per run |
+| `--resume-from N` | Skip first N church URLs |
+| `--job-id <uuid>` | Bind to background job record |
+
+## Scheduler Integration
+
+7 new sequential phases appended to the `imports` pipeline group in `scripts/scheduler.ts`:
+
+```typescript
+{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
+{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
+{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
+{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
+{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
+{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
+{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },
+```
+
+New `getJobCommand` case:
+```typescript
+case 'buscarmisas-network-import': {
+  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
+  if (config?.site) args.push('--site', String(config.site));
+  if (config?.limit) args.push('--limit', String(config.limit));
+  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
+  return { command: 'npx', args };
+}
+```
+
+## Error Handling
+
+- Failed fetches: log and skip, continue with next church
+- Parse failures: log and skip (don't crash)
+- DB errors during upsert: log and skip
+- Unhandled exception in `main()`: catch → update job to `failed` → rethrow (same pattern as other importers)
+
+## Files Modified
+
+- **New**: `scripts/import-buscarmisas-network.ts`
+- **Modified**: `scripts/scheduler.ts`: add 7 pipeline phases + `getJobCommand` case
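
For illustration only, the URL collection filter, the exclusion patterns, and the time-cell rule from the spec in this diff could be sketched roughly as below. This is a minimal sketch, not the importer's actual code: the helper names (`isChurchUrl`, `parseTimeCell`, `parseScheduleRow`), the choice to match exclusion patterns against the first path segment only, and the sample church URL are all assumptions introduced here.

```typescript
// Hypothetical helpers; names and matching details are assumptions, not the real importer.

// First-segment prefixes for patterns like /about*/ and exact names like /blog/.
const EXCLUDED_PREFIXES = ['about', 'contact', 'privacy', 'cookie', 'terms', 'news'];
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

// Portuguese day map as listed in the spec (0 = Sunday ... 6 = Saturday).
const PT_DAY_MAP: Record<string, number> = {
  'Segunda-feira': 1, 'Terça-feira': 2, 'Quarta-feira': 3, 'Quinta-feira': 4,
  'Sexta-feira': 5, 'Sábado': 6, 'Domingo': 0,
};

// Split a URL path into its non-empty segments.
function pathSegments(url: string): string[] {
  return new URL(url).pathname.split('/').filter((s) => s.length > 0);
}

// True when the URL has exactly 3 path segments (/{region}/{city}/{slug}/)
// and its first segment does not match an excluded page pattern.
function isChurchUrl(url: string): boolean {
  const segments = pathSegments(url);
  if (segments.length !== 3) return false;
  const first = (segments[0] ?? '').toLowerCase();
  if (EXCLUDED_EXACT.includes(first)) return false;
  return !EXCLUDED_PREFIXES.some((prefix) => first.startsWith(prefix));
}

// Split a comma- or newline-separated time cell into HH:MM strings,
// dropping anything that does not look like a time.
function parseTimeCell(cell: string): string[] {
  return cell
    .split(/[,\n]/)
    .map((t) => t.trim())
    .filter((t) => /^\d{1,2}:\d{2}$/.test(t));
}

// Turn one schedule row into (dayOfWeek, time) pairs; unknown day names yield no entries.
function parseScheduleRow(
  dayName: string,
  timeCell: string,
  dayMap: Record<string, number>,
): { dayOfWeek: number; time: string }[] {
  const dayOfWeek = dayMap[dayName.trim()];
  if (dayOfWeek === undefined) return [];
  return parseTimeCell(timeCell).map((time) => ({ dayOfWeek, time }));
}

// Usage with a made-up church URL and schedule row.
console.log(isChurchUrl('https://horariosmissa.com.br/sp/sao-paulo/catedral-da-se/'));
console.log(parseScheduleRow('Segunda-feira', '08:00, 10:30\n19:00', PT_DAY_MAP));
```

Keeping the segment-count check separate from the exclusion lists means region and city index pages are rejected by length alone, so the lists only need to cover non-church pages that happen to sit three levels deep.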