# Design: BuscarMisas Network Importer
**Date:** 2026-03-12
**Script:** `scripts/import-buscarmisas-network.ts`
## Overview
A single importer script covering a network of 7 Catholic mass-time directory sites that share identical WordPress/Yoast structure. All sites use the URL pattern `/{region}/{city}/{slug}/` and the same HTML layout with a mass schedule table. One script with a `--site <code>` flag handles all countries.
## Sites
| Code | Domain | Country | Language | Est. Churches |
|------|--------|---------|----------|---------------|
| `br` | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
| `mx` | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
| `ar` | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
| `co` | buscarmisas.co | Colombia | Spanish | ~1,000 |
| `cl` | horariomisa.cl | Chile | Spanish | ~1,000 |
| `gb` | masstime.co.uk | United Kingdom | English | ~1,000 |
| `ch` | horairemesses.ch | Switzerland | French | ~500 |
**Total: ~9,500 new churches across 7 countries**
## Architecture
### Site Registry
```typescript
const SITES: Record<string, SiteConfig> = {
  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
  mx: { baseUrl: 'https://buscarmisas.com.mx', country: 'MX', language: 'es' },
  ar: { baseUrl: 'https://horariosmisa.com.ar', country: 'AR', language: 'es' },
  co: { baseUrl: 'https://buscarmisas.co', country: 'CO', language: 'es' },
  cl: { baseUrl: 'https://horariomisa.cl', country: 'CL', language: 'es' },
  gb: { baseUrl: 'https://masstime.co.uk', country: 'GB', language: 'en' },
  ch: { baseUrl: 'https://horairemesses.ch', country: 'CH', language: 'fr' },
};
```
Each `SiteConfig` includes a `dayMap: Record<string, number>` mapping localized day names to 0-6 (Sun-Sat).
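The registry entries above leave out `dayMap` for brevity. A minimal sketch of what `SiteConfig` could look like, assuming only the four fields this section names; the comments about code standards are an assumption, and `exampleSite` is purely illustrative:

```typescript
// Sketch of the SiteConfig shape implied by the registry.
// Only baseUrl, country, language and dayMap are named in this spec.
interface SiteConfig {
  baseUrl: string;
  country: string;  // assumed to be an ISO 3166-1 alpha-2 code, e.g. 'GB'
  language: string; // assumed to be an ISO 639-1 code, e.g. 'en'
  dayMap: Record<string, number>; // localized day name -> 0 (Sunday) .. 6 (Saturday)
}

// Illustrative entry with the day map filled in.
const exampleSite: SiteConfig = {
  baseUrl: 'https://masstime.co.uk',
  country: 'GB',
  language: 'en',
  dayMap: {
    Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3,
    Thursday: 4, Friday: 5, Saturday: 6,
  },
};
```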
### Processing Flow
1. **Sitemap discovery** — fetch `{baseUrl}/sitemap_index.xml` → extract `page-sitemap*.xml` URLs
2. **URL collection** — fetch each page-sitemap → extract church URLs filtering to exactly 3 path segments (`/{region}/{city}/{slug}/`)
3. **Dedup skip** — load existing DB churches for that country; skip URLs whose slug is already stored as `source`+`sourceId`
4. **Per-church fetch** — GET church page, parse HTML:
   - **Name**: H1 heading
   - **Address**: contact info table (`Endereço`/`Dirección`/`Address`/`Adresse`)
   - **City/Region**: from URL path segments or table
   - **Phone**: table row
   - **Website**: table row
   - **Mass schedule**: `<table>` with day column and time column
5. **Geocoding** — if no coordinates: Nominatim lookup on `{name}, {city}, {country}`
6. **Upsert**: match via `findDuplicateChurch()` → create or update church + mass schedules
7. **Rate limiting** — 3s between HTTP requests; 1.1s between Nominatim requests
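Steps 1 and 2 of the flow above can be sketched as two pure functions. This is an illustration under assumptions, not the importer's actual code: `extractLocs`, `pathSegments` and `isChurchUrl` are hypothetical names, the regex stands in for a real XML parser, and the sample URLs are made up.

```typescript
// Extract every <loc> value from a sitemap (or sitemap index) XML string.
// A regex is enough for the flat structure of sitemaps; a real XML parser is safer.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    const loc = m[1];
    if (loc) locs.push(loc);
  }
  return locs;
}

// Split a URL path into its non-empty segments.
function pathSegments(url: string): string[] {
  return new URL(url).pathname.split('/').filter((s) => s.length > 0);
}

// A church URL has exactly three segments: /{region}/{city}/{slug}/
function isChurchUrl(url: string): boolean {
  return pathSegments(url).length === 3;
}

// Made-up sample: a home page, a city index page and one church page.
const sampleSitemap = `
<urlset>
  <url><loc>https://horariomisa.cl/</loc></url>
  <url><loc>https://horariomisa.cl/valparaiso/vina-del-mar/</loc></url>
  <url><loc>https://horariomisa.cl/valparaiso/vina-del-mar/parroquia-ejemplo/</loc></url>
</urlset>`;

const churchUrls = extractLocs(sampleSitemap).filter(isChurchUrl);
```

Keeping the filter separate from the fetch makes it easy to test against saved sitemap files without touching the network.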
### Data Extraction
Church page HTML structure (identical across all sites):
```html
<h1>Church Name</h1>
<table>
  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
  <tr><td>Site</td><td>https://...</td></tr>
</table>
<table> <!-- mass schedule -->
  <tr><th>Dia</th><th>Horário da Missa</th></tr>
  <tr><td>Segunda-feira</td><td>08:00</td></tr>
  ...
</table>
```
Multiple times per day are comma- or newline-separated within the time cell.
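One way to split such a cell is sketched below. It assumes the cell has already been reduced to plain text and that times are written as `H:MM` or `HH:MM`; `parseTimeCell` is a hypothetical helper name, and fragments that do not match are silently dropped.

```typescript
// Split a time cell such as "08:00, 10:30\n19:00" into normalized HH:MM strings.
function parseTimeCell(cell: string): string[] {
  const times: string[] = [];
  for (const part of cell.split(/[,\n]/)) {
    const m = /^(\d{1,2}):(\d{2})$/.exec(part.trim());
    if (!m) continue; // skip empty or unrecognized fragments
    const hour = Number(m[1]);
    const minute = Number(m[2]);
    if (hour > 23 || minute > 59) continue; // reject impossible clock times
    times.push(`${String(hour).padStart(2, '0')}:${String(minute).padStart(2, '0')}`);
  }
  return times;
}
```

For example, `parseTimeCell('7:30, 19:00')` yields `['07:30', '19:00']`.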
### Day Maps
- **Portuguese (br)**: Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
- **Spanish (mx/ar/co/cl)**: Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
- **English (gb)**: Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
- **French (ch)**: Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0
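Scraped day cells may differ from these keys in case, accents, or surrounding whitespace. The sketch below shows one possible tolerant lookup; the normalization step is an assumption rather than something this spec requires, and `normalizeDay`, `buildDayLookup` and `dayIndex` are hypothetical names.

```typescript
const SPANISH_DAYS: Record<string, number> = {
  Domingo: 0, Lunes: 1, Martes: 2, 'Miércoles': 3,
  Jueves: 4, Viernes: 5, 'Sábado': 6,
};

// Trim, lowercase and strip combining diacritics ("Sábado" -> "sabado").
function normalizeDay(name: string): string {
  return name.trim().toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Build a lookup keyed by the normalized form of each day name.
function buildDayLookup(dayMap: Record<string, number>): Map<string, number> {
  const lookup = new Map<string, number>();
  for (const [name, index] of Object.entries(dayMap)) {
    lookup.set(normalizeDay(name), index);
  }
  return lookup;
}

// Returns 0..6, or undefined when the cell is not a recognized day name.
function dayIndex(lookup: Map<string, number>, cellText: string): number | undefined {
  return lookup.get(normalizeDay(cellText));
}

const spanishLookup = buildDayLookup(SPANISH_DAYS);
```

Returning `undefined` for unknown labels lets the caller log and skip the row instead of storing a wrong weekday.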
### URL Exclusion Patterns
Skip non-church pages matching:
- `/about*/`, `/contact*/`, `/privacy*/`, `/cookie*/`, `/terms*/`
- `/blog/`, `/news*/`, `/noticias/`, `/actualidad/`
- Any page with fewer than 3 path segments (region/city index pages)
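The exclusion rules above could be applied with a filter like the following. It is a sketch under two assumptions: the patterns are tested against the first path segment only, and a trailing `*` means "starts with". `shouldSkip` and the two constant names are hypothetical.

```typescript
// Patterns written with a trailing * in the list above (prefix match).
const EXCLUDED_PREFIXES = ['about', 'contact', 'privacy', 'cookie', 'terms', 'news'];
// Patterns written without a wildcard (exact match).
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

function shouldSkip(url: string): boolean {
  const segments = new URL(url).pathname.split('/').filter((s) => s.length > 0);
  if (segments.length < 3) return true; // index pages and other shallow pages
  const first = (segments[0] ?? '').toLowerCase();
  if (EXCLUDED_EXACT.includes(first)) return true;
  return EXCLUDED_PREFIXES.some((prefix) => first.startsWith(prefix));
}
```

Prefix matching would also skip a region whose name happens to begin with one of these words, so the lists may need tuning per site.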
## CLI Interface
```
npx tsx scripts/import-buscarmisas-network.ts --all --site br
npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}
```
| Flag | Description |
|------|-------------|
| `--site <code>` | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
| `--all` | Run full import |
| `--dry-run` | Parse only, no DB writes |
| `--limit N` | Cap churches processed per run |
| `--resume-from N` | Skip first N church URLs |
| `--job-id <uuid>` | Bind to background job record |
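A hand-rolled parser for these flags might look like the sketch below. The `CliOptions` shape and the validation rules (unknown flags rejected, counts must be non-negative integers) are assumptions, not behaviour stated in this spec.

```typescript
interface CliOptions {
  site: string;
  all: boolean;
  dryRun: boolean;
  limit?: number;
  resumeFrom?: number;
  jobId?: string;
}

const SITE_CODES = ['br', 'mx', 'ar', 'co', 'cl', 'gb', 'ch'];

function parseArgs(argv: string[]): CliOptions {
  const opts: CliOptions = { site: '', all: false, dryRun: false };
  let i = 0;

  // Consume the value following a flag; fail loudly if it is missing.
  const takeValue = (flag: string): string => {
    const v = argv[i++];
    if (v === undefined || v.startsWith('--')) {
      throw new Error(`${flag} expects a value`);
    }
    return v;
  };

  // Counts must be non-negative integers.
  const takeCount = (flag: string): number => {
    const v = takeValue(flag);
    if (!/^\d+$/.test(v)) throw new Error(`${flag} expects a non-negative integer`);
    return Number(v);
  };

  while (i < argv.length) {
    const flag = argv[i++];
    switch (flag) {
      case '--all': opts.all = true; break;
      case '--dry-run': opts.dryRun = true; break;
      case '--site': opts.site = takeValue('--site'); break;
      case '--limit': opts.limit = takeCount('--limit'); break;
      case '--resume-from': opts.resumeFrom = takeCount('--resume-from'); break;
      case '--job-id': opts.jobId = takeValue('--job-id'); break;
      default: throw new Error(`Unknown flag: ${String(flag)}`);
    }
  }

  if (!SITE_CODES.includes(opts.site)) {
    throw new Error(`--site is required and must be one of ${SITE_CODES.join('/')}`);
  }
  return opts;
}
```

In the script this would typically be called as `parseArgs(process.argv.slice(2))`.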
## Scheduler Integration
Seven new sequential phases are appended to the `imports` pipeline group in `scripts/scheduler.ts`:
```typescript
{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },
```
New `getJobCommand` case:
```typescript
case 'buscarmisas-network-import': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
## Error Handling
- Failed fetches: log and skip, continue with next church
- Parse failures: log and skip (don't crash)
- DB errors during upsert: log and skip
- Unhandled exception in `main()`: catch → update job to `failed` → rethrow (same pattern as other importers)
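The two levels of handling listed above (skip per church, fail the job on anything unexpected) could be structured as in this sketch. `importAll`, `runJob`, `importOne` and `markFailed` are hypothetical names; how the other importers actually update the job record is not shown in this spec.

```typescript
interface ImportStats {
  imported: number;
  skipped: number;
}

// Process each URL independently so one bad page never aborts the run.
async function importAll(
  urls: string[],
  importOne: (url: string) => Promise<void>, // hypothetical: fetch + parse + upsert
  log: (msg: string) => void = console.error,
): Promise<ImportStats> {
  const stats: ImportStats = { imported: 0, skipped: 0 };
  for (const url of urls) {
    try {
      await importOne(url);
      stats.imported++;
    } catch (err) {
      stats.skipped++;
      log(`skip ${url}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return stats;
}

// Top-level wrapper: anything that escapes marks the job failed, then rethrows.
async function runJob(
  run: () => Promise<ImportStats>,
  markFailed: (reason: string) => Promise<void>, // hypothetical job-record updater
): Promise<ImportStats> {
  try {
    return await run();
  } catch (err) {
    await markFailed(err instanceof Error ? err.message : String(err));
    throw err;
  }
}
```

Rethrowing after `markFailed` keeps the process exit code non-zero, so the scheduler can still detect the failure on its own.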
## Files Modified
- **New**: `scripts/import-buscarmisas-network.ts`
- **Modified**: `scripts/scheduler.ts` — add 7 pipeline phases + `getJobCommand` case