# Design: BuscarMisas Network Importer

**Date:** 2026-03-12
**Script:** `scripts/import-buscarmisas-network.ts`

## Overview

A single importer script covers a network of 7 Catholic mass-time directory sites that share an identical WordPress/Yoast structure. All sites use the URL pattern `/{region}/{city}/{slug}/` and the same HTML layout with a mass schedule table. One script with a `--site <code>` flag handles all countries.

## Sites

| Code | Domain | Country | Language | Est. Churches |
|------|--------|---------|----------|---------------|
| `br` | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
| `mx` | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
| `ar` | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
| `co` | buscarmisas.co | Colombia | Spanish | ~1,000 |
| `cl` | horariomisa.cl | Chile | Spanish | ~1,000 |
| `gb` | masstime.co.uk | United Kingdom | English | ~1,000 |
| `ch` | horairemesses.ch | Switzerland | French | ~500 |

**Total: ~9,500 new churches across 7 countries**

## Architecture

### Site Registry

```typescript
const SITES: Record<string, SiteConfig> = {
  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
  mx: { baseUrl: 'https://buscarmisas.com.mx', country: 'MX', language: 'es' },
  ar: { baseUrl: 'https://horariosmisa.com.ar', country: 'AR', language: 'es' },
  co: { baseUrl: 'https://buscarmisas.co', country: 'CO', language: 'es' },
  cl: { baseUrl: 'https://horariomisa.cl', country: 'CL', language: 'es' },
  gb: { baseUrl: 'https://masstime.co.uk', country: 'GB', language: 'en' },
  ch: { baseUrl: 'https://horairemesses.ch', country: 'CH', language: 'fr' },
};
```

Each `SiteConfig` includes a `dayMap: Record<string, number>` mapping localized day names to 0–6 (Sun–Sat).

### Processing Flow

1. **Sitemap discovery**: fetch `{baseUrl}/sitemap_index.xml` and extract the `page-sitemap*.xml` URLs
2. **URL collection**: fetch each page-sitemap and extract church URLs, keeping only those with exactly 3 path segments (`/{region}/{city}/{slug}/`)
3. **Dedup skip**: load existing DB churches for that country; skip URLs whose slug is already stored as `source`+`sourceId`
4. **Per-church fetch**: GET the church page and parse the HTML:
   - **Name**: H1 heading
   - **Address**: contact info table (`Endereço`/`Dirección`/`Address`/`Adresse`)
   - **City/Region**: from URL path segments or table
   - **Phone**: table row
   - **Website**: table row
   - **Mass schedule**: `<table>` with day column and time column
5. **Geocoding**: if the page yields no coordinates, run a Nominatim lookup on `{name}, {city}, {country}`
6. **Upsert**: `findDuplicateChurch()` match → create or update church + mass schedules
7. **Rate limiting**: 3s between HTTP requests; 1.1s between Nominatim requests

### Data Extraction

Church page HTML structure (identical across all sites):

```html

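<h1>Church Name</h1>
<table>
  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
  <tr><td>Site</td><td>https://...</td></tr>
</table>
...
<table>
  <tr><th>Dia</th><th>Horário da Missa</th></tr>
  <tr><td>Segunda-feira</td><td>08:00</td></tr>
</table>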
```

Multiple times per day are comma- or newline-separated within the time cell.

### Day Maps

- **Portuguese (br)**: Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
- **Spanish (mx/ar/co/cl)**: Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
- **English (gb)**: Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
- **French (ch)**: Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0

### URL Exclusion Patterns

Skip non-church pages matching:

- `/about*/`, `/contact*/`, `/privacy*/`, `/cookie*/`, `/terms*/`
- `/blog/`, `/news*/`, `/noticias/`, `/actualidad/`
- Any page with fewer than 3 path segments (state/city index pages)

## CLI Interface

```
npx tsx scripts/import-buscarmisas-network.ts --all --site br
npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}
```

| Flag | Description |
|------|-------------|
| `--site <code>` | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
| `--all` | Run full import |
| `--dry-run` | Parse only, no DB writes |
| `--limit N` | Cap churches processed per run |
| `--resume-from N` | Skip first N church URLs |
| `--job-id <uuid>` | Bind to background job record |

## Scheduler Integration

7 new sequential phases are appended to the `imports` pipeline group in `scripts/scheduler.ts`:

```typescript
{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },
```

New `getJobCommand` case:

```typescript
case 'buscarmisas-network-import': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```

## Error Handling

- Failed fetches: log and skip, continue with next church
- Parse failures: log and skip (don't crash)
- DB errors during upsert: log and skip
- Unhandled exception in `main()`: catch → update job to `failed` → rethrow (same pattern as other importers)

## Files Modified

- **New**: `scripts/import-buscarmisas-network.ts`
- **Modified**: `scripts/scheduler.ts`: add 7 pipeline phases + `getJobCommand` case
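
For illustration, the URL filtering, day-name lookup, and time-cell splitting rules described in this design could be sketched as follows. This is a minimal sketch, not the importer's actual code: the helper names (`isChurchUrl`, `dayIndex`, `parseTimeCell`), the lowercase-keyed `ES_DAY_MAP`, and the example URLs are all hypothetical, and treating the exclusion patterns as matches on the first path segment is an assumption.

```typescript
// Hypothetical helpers illustrating the rules in this document (assumed names).

// Patterns written as `/about*/` etc. are treated as prefixes of the first segment.
const EXCLUDED_PREFIXES = ['about', 'contact', 'privacy', 'cookie', 'terms', 'news'];
// Patterns written without a wildcard are treated as exact first-segment matches.
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

/** True when the URL has exactly 3 path segments and is not an excluded page. */
function isChurchUrl(url: string): boolean {
  let segments: string[];
  try {
    segments = new URL(url).pathname.split('/').filter(Boolean);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (segments.length !== 3) return false;
  const first = (segments[0] ?? '').toLowerCase();
  if (EXCLUDED_EXACT.includes(first)) return false;
  return !EXCLUDED_PREFIXES.some((prefix) => first.startsWith(prefix));
}

// Spanish day map from the Day Maps section, keyed in lowercase (an assumption
// made here so that lookups are case-insensitive).
const ES_DAY_MAP: Record<string, number> = {
  lunes: 1,
  martes: 2,
  'miércoles': 3,
  jueves: 4,
  viernes: 5,
  'sábado': 6,
  domingo: 0,
};

/** Maps a localized day label to 0–6 (Sun–Sat), or undefined if unknown. */
function dayIndex(label: string, dayMap: Record<string, number>): number | undefined {
  return dayMap[label.trim().toLowerCase().normalize('NFC')];
}

/** Splits a comma- or newline-separated time cell into H:MM / HH:MM strings. */
function parseTimeCell(cell: string): string[] {
  return cell
    .split(/[,\n]/)
    .map((part) => part.trim())
    .filter((part) => /^\d{1,2}:\d{2}$/.test(part));
}

// Usage with made-up inputs:
console.log(isChurchUrl('https://horariosmissa.com.br/sp/sao-paulo/paroquia-exemplo/')); // true
console.log(isChurchUrl('https://horariosmissa.com.br/sp/sao-paulo/')); // false
console.log(dayIndex('Miércoles', ES_DAY_MAP)); // 3
console.log(parseTimeCell('08:00, 12:00\n19:30')); // [ '08:00', '12:00', '19:30' ]
```

Matching exclusions against the first path segment only keeps the filter cheap, but a prefix such as `news` would also reject a region whose name happens to start with those letters; a real implementation may need per-site allowlists.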