ScraperControl/docs/superpowers/specs/2026-03-12-buscarmisas-network-importer-design.md
albertfj114 ef01616ad8 docs: add design spec for buscarmisas network importer
Covers 7-country Latin America + UK + Switzerland mass times
network (horariosmissa.com.br and sister sites), all sharing
identical WordPress structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 00:16:37 -04:00

# Design: BuscarMisas Network Importer
**Date:** 2026-03-12
**Script:** `scripts/import-buscarmisas-network.ts`
## Overview
A single importer script covering a network of 7 Catholic mass-time directory sites that share identical WordPress/Yoast structure. All sites use the URL pattern `/{region}/{city}/{slug}/` and the same HTML layout with a mass schedule table. One script with a `--site <code>` flag handles all countries.
## Sites
| Code | Domain | Country | Language | Est. Churches |
|------|--------|---------|----------|---------------|
| `br` | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
| `mx` | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
| `ar` | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
| `co` | buscarmisas.co | Colombia | Spanish | ~1,000 |
| `cl` | horariomisa.cl | Chile | Spanish | ~1,000 |
| `gb` | masstime.co.uk | United Kingdom | English | ~1,000 |
| `ch` | horairemesses.ch | Switzerland | French | ~500 |

**Total: ~9,500 new churches across 7 countries** (sum of the per-site estimates above)
## Architecture
### Site Registry
```typescript
const SITES: Record<string, SiteConfig> = {
  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
  mx: { baseUrl: 'https://buscarmisas.com.mx', country: 'MX', language: 'es' },
  ar: { baseUrl: 'https://horariosmisa.com.ar', country: 'AR', language: 'es' },
  co: { baseUrl: 'https://buscarmisas.co', country: 'CO', language: 'es' },
  cl: { baseUrl: 'https://horariomisa.cl', country: 'CL', language: 'es' },
  gb: { baseUrl: 'https://masstime.co.uk', country: 'GB', language: 'en' },
  ch: { baseUrl: 'https://horairemesses.ch', country: 'CH', language: 'fr' },
};
```
Each `SiteConfig` includes a `dayMap: Record<string, number>` mapping localized day names to 0 through 6 (Sunday through Saturday).
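
A minimal sketch of what `SiteConfig` might look like, given the registry fields above plus `dayMap`. The exact field layout, the `EXAMPLE_SITES` constant, and the `resolveSite` helper are assumptions for illustration, not the script's actual definitions.

```typescript
// Hypothetical shape of SiteConfig, inferred from the registry fields in this spec.
interface SiteConfig {
  baseUrl: string;
  country: string; // two-letter country code, e.g. 'BR'
  language: string; // two-letter language code, e.g. 'pt'
  dayMap: Record<string, number>; // localized day name -> 0 (Sunday) .. 6 (Saturday)
}

// One example entry only; the real registry would carry all seven sites.
const EXAMPLE_SITES: Record<string, SiteConfig> = {
  ch: {
    baseUrl: 'https://horairemesses.ch',
    country: 'CH',
    language: 'fr',
    dayMap: { Dimanche: 0, Lundi: 1, Mardi: 2, Mercredi: 3, Jeudi: 4, Vendredi: 5, Samedi: 6 },
  },
};

// Hypothetical helper: resolve a --site code, or fail fast with a clear message.
function resolveSite(code: string, sites: Record<string, SiteConfig>): SiteConfig {
  const site = sites[code];
  if (!site) {
    throw new Error(`Unknown site code "${code}". Expected one of: ${Object.keys(sites).join(', ')}`);
  }
  return site;
}
```

Keeping `dayMap` on the config means the parser never needs to branch on language; it only looks up whatever label the page shows.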
### Processing Flow
1. **Sitemap discovery** — fetch `{baseUrl}/sitemap_index.xml` → extract `page-sitemap*.xml` URLs
2. **URL collection** — fetch each page-sitemap → extract church URLs filtering to exactly 3 path segments (`/{region}/{city}/{slug}/`)
3. **Dedup skip** — load existing DB churches for that country; skip URLs whose slug is already stored as `source`+`sourceId`
4. **Per-church fetch** — GET church page, parse HTML:
   - **Name**: H1 heading
   - **Address**: contact info table (`Endereço`/`Dirección`/`Address`/`Adresse`)
   - **City/Region**: from URL path segments or table
   - **Phone**: table row
   - **Website**: table row
   - **Mass schedule**: `<table>` with day column and time column
5. **Geocoding** — if no coordinates: Nominatim lookup on `{name}, {city}, {country}`
6. **Upsert**: `findDuplicateChurch()` match → create or update church + mass schedules
7. **Rate limiting** — 3s between HTTP requests; 1.1s between Nominatim requests
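
Steps 2 and 3 of the flow above could be sketched roughly as follows. This assumes the page sitemap has already been fetched as a string and that the stored `sourceId` is the URL slug; the function names are hypothetical, and a real importer might prefer an XML parser over a regex.

```typescript
// Extract every <loc> value from a sitemap XML string.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Non-empty path segments of an absolute URL, ignoring query string and fragment.
function pathSegments(url: string): string[] {
  const path = url.replace(/^https?:\/\/[^/]+/i, '').split(/[?#]/)[0];
  return path.split('/').filter((segment) => segment.length > 0);
}

// Keep only /{region}/{city}/{slug}/ URLs whose slug is not already in the DB.
// knownSourceIds is assumed to hold the slugs previously imported for this site.
function collectNewChurchUrls(pageSitemapXml: string, knownSourceIds: Set<string>): string[] {
  return extractLocs(pageSitemapXml).filter((url) => {
    const segments = pathSegments(url);
    return segments.length === 3 && !knownSourceIds.has(segments[2]);
  });
}
```

Filtering before any church page is fetched keeps the 3s request delay from being spent on index pages or on churches that are already stored.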
### Data Extraction
Church page HTML structure (identical across all sites):
```html
<h1>Church Name</h1>
<table>
  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
  <tr><td>Site</td><td>https://...</td></tr>
</table>
<table> <!-- mass schedule -->
  <tr><th>Dia</th><th>Horário da Missa</th></tr>
  <tr><td>Segunda-feira</td><td>08:00</td></tr>
  ...
</table>
```
Multiple times per day are comma- or newline-separated within the time cell.
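
As an illustration of splitting a time cell, here is a small sketch. It assumes the cell text has already been extracted from the HTML and that times use the `HH:MM` form shown in the sample above; `parseTimeCell` is a hypothetical name.

```typescript
// Split a schedule time cell such as "08:00, 10:30\n19:00" into normalized HH:MM strings.
// Entries that do not look like a 24-hour time are dropped rather than guessed at.
function parseTimeCell(cell: string): string[] {
  const times: string[] = [];
  for (const part of cell.split(/[,\n]/)) {
    const match = /^(\d{1,2}):(\d{2})$/.exec(part.trim());
    if (!match) continue;
    const hour = Number(match[1]);
    const minute = Number(match[2]);
    if (hour > 23 || minute > 59) continue;
    // Zero-pad single-digit hours so "8:00" and "08:00" are stored identically.
    times.push(`${hour < 10 ? '0' : ''}${hour}:${match[2]}`);
  }
  return times;
}
```

Dropping unparseable entries, instead of throwing, matches the "log and skip" stance taken elsewhere in this spec; a real implementation would probably also log what was dropped.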
### Day Maps
- **Portuguese (br)**: Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
- **Spanish (mx/ar/co/cl)**: Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
- **English (gb)**: Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
- **French (ch)**: Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0
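
Scraped labels often differ from the listed names in case, spacing, or accents, so a tolerant lookup may be useful. The sketch below is one possible approach, not the script's confirmed behavior; `normalizeDayName`, `buildDayLookup`, and `dayOfWeek` are hypothetical helpers.

```typescript
// Lowercase, trim, and strip combining accents so "SÁBADO " and "Sabado" both match.
function normalizeDayName(name: string): string {
  return name.trim().toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Build a lookup keyed by normalized day name from a dayMap as listed above.
function buildDayLookup(dayMap: Record<string, number>): Map<string, number> {
  const lookup = new Map<string, number>();
  for (const name of Object.keys(dayMap)) {
    lookup.set(normalizeDayName(name), dayMap[name]);
  }
  return lookup;
}

// Returns 0 (Sunday) to 6 (Saturday), or undefined for an unrecognized label.
function dayOfWeek(label: string, lookup: Map<string, number>): number | undefined {
  return lookup.get(normalizeDayName(label));
}

// The Portuguese map from the list above.
const PT_DAYS: Record<string, number> = {
  'Segunda-feira': 1, 'Terça-feira': 2, 'Quarta-feira': 3, 'Quinta-feira': 4,
  'Sexta-feira': 5, 'Sábado': 6, 'Domingo': 0,
};
```

Returning `undefined` for unknown labels lets the caller treat the row as a parse failure and skip it.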
### URL Exclusion Patterns
Skip non-church pages matching:
- `/about*/`, `/contact*/`, `/privacy*/`, `/cookie*/`, `/terms*/`
- `/blog/`, `/news*/`, `/noticias/`, `/actualidad/`
- Any page with < 3 path segments (state/city index pages)
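
One way to express these exclusions as a predicate is sketched below. It assumes the patterns apply to the first path segment and that a trailing `*` means "any suffix"; both readings, and the name `isChurchPath`, are assumptions.

```typescript
// Patterns with a trailing * in the list above are treated as prefixes; the rest as exact names.
const EXCLUDED_PREFIXES = ['about', 'contact', 'privacy', 'cookie', 'terms', 'news'];
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

// True when a URL path looks like /{region}/{city}/{slug}/ and is not a known non-church page.
function isChurchPath(pathname: string): boolean {
  const segments = pathname.split('/').filter((segment) => segment.length > 0);
  // Rejects index pages with fewer than 3 segments; anything deeper is also rejected,
  // following the "exactly 3 path segments" rule from the processing flow.
  if (segments.length !== 3) return false;
  const first = segments[0].toLowerCase();
  if (EXCLUDED_EXACT.includes(first)) return false;
  return !EXCLUDED_PREFIXES.some((prefix) => first.startsWith(prefix));
}
```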
## CLI Interface
```bash
npx tsx scripts/import-buscarmisas-network.ts --all --site br
npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}
```
| Flag | Description |
|------|-------------|
| `--site <code>` | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
| `--all` | Run full import |
| `--dry-run` | Parse only, no DB writes |
| `--limit N` | Cap churches processed per run |
| `--resume-from N` | Skip first N church URLs |
| `--job-id <uuid>` | Bind to background job record |
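
The flags above could be parsed along these lines. This is a dependency-free sketch under the assumption that numeric flags must be non-negative integers; `CliOptions`, `parseCli`, and `toCount` are hypothetical names, and the actual script may use a different parser.

```typescript
interface CliOptions {
  site: string;
  all: boolean;
  dryRun: boolean;
  limit?: number;
  resumeFrom?: number;
  jobId?: string;
}

const SITE_CODES = ['br', 'mx', 'ar', 'co', 'cl', 'gb', 'ch'];

// Parse a non-negative integer flag value, rejecting anything else.
function toCount(flag: string, raw: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${flag} expects a non-negative integer, got "${raw}"`);
  }
  return n;
}

function parseCli(argv: string[]): CliOptions {
  let site: string | undefined;
  let all = false;
  let dryRun = false;
  let limit: number | undefined;
  let resumeFrom: number | undefined;
  let jobId: string | undefined;

  let i = 0;
  // Flags that take a value consume the next argument.
  const takeValue = (flag: string): string => {
    const value = argv[i++];
    if (value === undefined) throw new Error(`Missing value for ${flag}`);
    return value;
  };

  while (i < argv.length) {
    const flag = argv[i++];
    switch (flag) {
      case '--site': site = takeValue(flag); break;
      case '--all': all = true; break;
      case '--dry-run': dryRun = true; break;
      case '--limit': limit = toCount(flag, takeValue(flag)); break;
      case '--resume-from': resumeFrom = toCount(flag, takeValue(flag)); break;
      case '--job-id': jobId = takeValue(flag); break;
      default: throw new Error(`Unknown flag: ${flag}`);
    }
  }

  if (site === undefined || !SITE_CODES.includes(site)) {
    throw new Error(`--site is required and must be one of ${SITE_CODES.join('/')}`);
  }
  return { site, all, dryRun, limit, resumeFrom, jobId };
}
```

In the script this would presumably be called as `parseCli(process.argv.slice(2))`.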
## Scheduler Integration
7 new sequential phases appended to the `imports` pipeline group in `scripts/scheduler.ts`:
```typescript
{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },
```
New `getJobCommand` case:
```typescript
case 'buscarmisas-network-import': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
```
## Error Handling
- Failed fetches: log and skip, continue with next church
- Parse failures: log and skip (don't crash)
- DB errors during upsert: log and skip
- Unhandled exception in `main()`: catch → update job to `failed` → rethrow (same pattern as other importers)
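
The two levels of error handling described above might be structured as in this sketch. The `ImportDeps` interface, the `updateJob` callback, and the `'completed'` status are assumptions made so the example is self-contained; the other importers' actual job API is not shown in this spec.

```typescript
type JobStatus = 'completed' | 'failed';

// Hypothetical dependencies, injected so the control flow can be shown without a DB or network.
interface ImportDeps {
  importChurch: (url: string) => Promise<void>; // fetch + parse + upsert one church
  updateJob: (status: JobStatus, note?: string) => Promise<void>;
  log: (message: string) => void;
}

interface ImportSummary {
  imported: number;
  skipped: number;
}

function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Per-church failures are logged and skipped so one bad page cannot stop the run.
async function importAll(urls: string[], deps: ImportDeps): Promise<ImportSummary> {
  const summary: ImportSummary = { imported: 0, skipped: 0 };
  for (const url of urls) {
    try {
      await deps.importChurch(url);
      summary.imported++;
    } catch (err) {
      summary.skipped++;
      deps.log(`skip ${url}: ${describeError(err)}`);
    }
  }
  return summary;
}

// Anything that escapes the loop marks the job as failed, then is rethrown.
async function runWithJobTracking(urls: string[], deps: ImportDeps): Promise<ImportSummary> {
  try {
    const summary = await importAll(urls, deps);
    await deps.updateJob('completed');
    return summary;
  } catch (err) {
    await deps.updateJob('failed', describeError(err));
    throw err;
  }
}
```

Rethrowing after the job update keeps the process exit code non-zero, so the scheduler can still see the failure.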
## Files Modified
- **New**: `scripts/import-buscarmisas-network.ts`
- **Modified**: `scripts/scheduler.ts` — add 7 pipeline phases + `getJobCommand` case