docs: add design spec for buscarmisas network importer
Covers the 7-country mass-times network (5 Latin American countries plus the UK and Switzerland; horariosmissa.com.br and sister sites), all sharing an identical WordPress structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,145 @@
# Design: BuscarMisas Network Importer

**Date:** 2026-03-12
**Script:** `scripts/import-buscarmisas-network.ts`

## Overview

A single importer script covers a network of 7 Catholic mass-time directory sites that share an identical WordPress/Yoast structure. Every site uses the URL pattern `/{region}/{city}/{slug}/` and the same HTML layout, including a mass schedule table. One script with a `--site <code>` flag handles all countries.

## Sites

| Code | Domain | Country | Language | Est. Churches |
|------|--------|---------|----------|---------------|
| `br` | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
| `mx` | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
| `ar` | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
| `co` | buscarmisas.co | Colombia | Spanish | ~1,000 |
| `cl` | horariomisa.cl | Chile | Spanish | ~1,000 |
| `gb` | masstime.co.uk | United Kingdom | English | ~1,000 |
| `ch` | horairemesses.ch | Switzerland | French | ~500 |

**Total: ~9,500 new churches across 7 countries** (sum of the per-site estimates above)

## Architecture

### Site Registry

```typescript
// Keyed by the --site code. Each entry's dayMap is not shown here (see Day Maps).
const SITES: Record<string, SiteConfig> = {
  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
  mx: { baseUrl: 'https://buscarmisas.com.mx', country: 'MX', language: 'es' },
  ar: { baseUrl: 'https://horariosmisa.com.ar', country: 'AR', language: 'es' },
  co: { baseUrl: 'https://buscarmisas.co', country: 'CO', language: 'es' },
  cl: { baseUrl: 'https://horariomisa.cl', country: 'CL', language: 'es' },
  gb: { baseUrl: 'https://masstime.co.uk', country: 'GB', language: 'en' },
  ch: { baseUrl: 'https://horairemesses.ch', country: 'CH', language: 'fr' },
};
```

Each `SiteConfig` also includes a `dayMap: Record<string, number>` that maps localized day names to 0–6 (Sun–Sat).
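
As a rough illustration of how the registry entries and `dayMap` could fit together, here is a minimal sketch. The exact `SiteConfig` shape is an assumption based only on the fields named in this spec, and `ES_DAYS` and `spanishSite` are hypothetical helpers, not part of the planned script.

```typescript
// Assumed shape: only these four fields are named in the spec.
interface SiteConfig {
  baseUrl: string;
  country: string; // two-letter country code, e.g. 'MX'
  language: string; // two-letter language code, e.g. 'es'
  dayMap: Record<string, number>; // localized day name -> 0-6 (Sun-Sat)
}

// Hypothetical shared map: the four Spanish-language sites use the same day names.
const ES_DAYS: Record<string, number> = {
  Domingo: 0,
  Lunes: 1,
  Martes: 2,
  'Miércoles': 3,
  Jueves: 4,
  Viernes: 5,
  'Sábado': 6,
};

// Hypothetical helper that builds a Spanish-language entry around the shared map.
function spanishSite(baseUrl: string, country: string): SiteConfig {
  return { baseUrl, country, language: 'es', dayMap: ES_DAYS };
}

const mx = spanishSite('https://buscarmisas.com.mx', 'MX');
const cl = spanishSite('https://horariomisa.cl', 'CL');
```

Sharing one map object per language keeps the four Spanish entries from drifting apart if a day-name variant is added later.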

### Processing Flow

1. **Sitemap discovery**: fetch `{baseUrl}/sitemap_index.xml` and extract the `page-sitemap*.xml` URLs
2. **URL collection**: fetch each page sitemap and keep only church URLs, i.e. those with exactly 3 path segments (`/{region}/{city}/{slug}/`)
3. **Dedup skip**: load the existing DB churches for that country; skip URLs whose slug is already stored as a `source` + `sourceId` pair
4. **Per-church fetch**: GET the church page and parse the HTML:
   - **Name**: H1 heading
   - **Address**: contact info table row (`Endereço`/`Dirección`/`Address`/`Adresse`)
   - **City/Region**: from the URL path segments or the table
   - **Phone**: table row
   - **Website**: table row
   - **Mass schedule**: `<table>` with a day column and a time column
5. **Geocoding**: if the church has no coordinates, run a Nominatim lookup on `{name}, {city}, {country}`
6. **Upsert**: match with `findDuplicateChurch()`, then create or update the church and its mass schedules
7. **Rate limiting**: applies throughout the run; 3s between HTTP requests and 1.1s between Nominatim requests
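
Step 1 could be sketched roughly as follows. This assumes the sitemap index is a flat list of `<loc>` elements that a regular expression can pull out; `extractLocs` and `pageSitemaps` are hypothetical names, and a real implementation might prefer an XML parser.

```typescript
// Collect the text of every <loc> element in a sitemap document.
// Assumption: URLs contain no whitespace and no XML entities that need decoding.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(xml)) !== null) {
    const loc = match[1];
    if (loc) locs.push(loc);
  }
  return locs;
}

// Keep only the page sitemaps (page-sitemap.xml, page-sitemap2.xml, ...).
function pageSitemaps(indexXml: string): string[] {
  return extractLocs(indexXml).filter((url) => /\/page-sitemap\d*\.xml$/.test(url));
}

const sampleIndex = `
<sitemapindex>
  <sitemap><loc>https://masstime.co.uk/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://masstime.co.uk/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://masstime.co.uk/page-sitemap2.xml</loc></sitemap>
</sitemapindex>`;

const found = pageSitemaps(sampleIndex);
```

The same `extractLocs` helper could be reused in step 2, since each page sitemap lists its URLs in `<loc>` elements as well.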

### Data Extraction

Church page HTML structure (identical across all sites):

```html
<h1>Church Name</h1>
<table>
  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
  <tr><td>Site</td><td>https://...</td></tr>
</table>
<table> <!-- mass schedule -->
  <tr><th>Dia</th><th>Horário da Missa</th></tr>
  <tr><td>Segunda-feira</td><td>08:00</td></tr>
  ...
</table>
```

Multiple times per day are comma- or newline-separated within the time cell.
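
Splitting a time cell could look something like the sketch below. It assumes the cell text has already been extracted from the HTML (with any line breaks turned into `\n`) and that times are written as `H:MM` or `HH:MM`; `parseTimes` is a hypothetical helper name.

```typescript
// Split a schedule cell on commas or newlines and keep only valid 24-hour times,
// normalized to HH:MM. Anything that does not look like a time is dropped.
function parseTimes(cell: string): string[] {
  const times: string[] = [];
  for (const part of cell.split(/[,\n]/)) {
    const match = /^(\d{1,2}):(\d{2})$/.exec(part.trim());
    if (!match) continue;
    const hours = Number(match[1]);
    const minutes = Number(match[2]);
    if (hours > 23 || minutes > 59) continue;
    times.push(`${String(hours).padStart(2, '0')}:${String(minutes).padStart(2, '0')}`);
  }
  return times;
}

const parsed = parseTimes('08:00, 19:30\n7:15');
// parsed is ['08:00', '19:30', '07:15']
```

Dropping unparseable fragments instead of throwing fits the "log and skip" approach in the error handling section, although a real importer would probably also log what it dropped.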

### Day Maps

- **Portuguese (br)**: Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
- **Spanish (mx/ar/co/cl)**: Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
- **English (gb)**: Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
- **French (ch)**: Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0
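
Day cells scraped from live pages may differ from these keys in case, spacing, or accents. The spec does not say how lookups are normalized, so the following is only one possible approach: fold both the map keys and the cell text to a lowercase, accent-free form before comparing. `normalizeDay` and `buildDayLookup` are hypothetical names.

```typescript
// Lowercase, trim, and strip combining accents so 'SÁBADO ' and 'sabado' compare equal.
function normalizeDay(text: string): string {
  return text
    .trim()
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}

// Build a lookup function from a dayMap; unknown labels return undefined.
function buildDayLookup(dayMap: Record<string, number>): (cell: string) => number | undefined {
  const normalized = new Map<string, number>();
  for (const [name, index] of Object.entries(dayMap)) {
    normalized.set(normalizeDay(name), index);
  }
  return (cell: string) => normalized.get(normalizeDay(cell));
}

const PT_DAYS: Record<string, number> = {
  'Segunda-feira': 1,
  'Terça-feira': 2,
  'Quarta-feira': 3,
  'Quinta-feira': 4,
  'Sexta-feira': 5,
  'Sábado': 6,
  Domingo: 0,
};

const lookupPt = buildDayLookup(PT_DAYS);
```

With this sketch, `lookupPt(' SÁBADO ')` and `lookupPt('terca-feira')` resolve to 6 and 2, while a non-day label such as `'Feriado'` yields `undefined` and can be skipped.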

### URL Exclusion Patterns

Skip non-church pages matching:

- `/about*/`, `/contact*/`, `/privacy*/`, `/cookie*/`, `/terms*/`
- `/blog/`, `/news*/`, `/noticias/`, `/actualidad/`
- Any page with fewer than 3 path segments (region/city index pages)
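
Combined with the "exactly 3 path segments" rule from the processing flow, the filter could be sketched as below. Applying the exclusion patterns to the first path segment only is an assumption (the spec does not say which segment they match), and `isChurchUrl` is a hypothetical name.

```typescript
// Prefix patterns (/about*/, /contact*/, ...) and exact names (/blog/, ...).
const EXCLUDED_PREFIX = /^(about|contact|privacy|cookie|terms|news)/;
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

// True when the URL belongs to the site and looks like /{region}/{city}/{slug}/.
function isChurchUrl(url: string, baseUrl: string): boolean {
  if (!url.startsWith(baseUrl + '/')) return false;
  // Drop any query string or fragment before counting segments.
  const path = url.slice(baseUrl.length).split(/[?#]/)[0] ?? '';
  const segments = path.split('/').filter((segment) => segment.length > 0);
  if (segments.length !== 3) return false;
  const first = segments[0] ?? '';
  // Assumption: exclusion patterns are checked against the first segment only.
  if (EXCLUDED_PREFIX.test(first) || EXCLUDED_EXACT.indexOf(first) !== -1) return false;
  return true;
}

const base = 'https://horariosmissa.com.br';
const keepsChurch = isChurchUrl(`${base}/sao-paulo/campinas/paroquia-exemplo/`, base);
const dropsCityIndex = isChurchUrl(`${base}/sao-paulo/campinas/`, base);
const dropsBlogPost = isChurchUrl(`${base}/blog/2024/algum-post/`, base);
```

The blog example shows why the patterns are still needed on top of the segment count: a dated post can have three segments and would otherwise pass as a church page.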

## CLI Interface

```
npx tsx scripts/import-buscarmisas-network.ts --all --site br
npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}
```

| Flag | Description |
|------|-------------|
| `--site <code>` | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
| `--all` | Run full import |
| `--dry-run` | Parse only, no DB writes |
| `--limit N` | Cap churches processed per run |
| `--resume-from N` | Skip first N church URLs |
| `--job-id <uuid>` | Bind to background job record |
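
One way the flags above might be parsed is sketched here with a small hand-rolled loop. The `CliOptions` shape, the error messages, and the choice to reject unknown flags are all assumptions; the real script may well use an argument-parsing library instead.

```typescript
interface CliOptions {
  site: string;
  all: boolean;
  dryRun: boolean;
  limit?: number;
  resumeFrom?: number;
  jobId?: string;
}

const SITE_CODES = ['br', 'mx', 'ar', 'co', 'cl', 'gb', 'ch'];

// Parse a non-negative integer flag value or fail with a readable message.
function toCount(value: string, flag: string): number {
  const n = Number(value);
  if (!Number.isInteger(n) || n < 0) throw new Error(`${flag} expects a non-negative integer`);
  return n;
}

function parseCli(argv: string[]): CliOptions {
  let site: string | undefined;
  const rest: Omit<CliOptions, 'site'> = { all: false, dryRun: false };
  let i = 0;
  // Read the value that follows a flag, failing if it is missing.
  const take = (flag: string): string => {
    const value = argv[i++];
    if (value === undefined) throw new Error(`${flag} requires a value`);
    return value;
  };
  while (i < argv.length) {
    const flag = argv[i++];
    switch (flag) {
      case '--all': rest.all = true; break;
      case '--dry-run': rest.dryRun = true; break;
      case '--site': site = take(flag); break;
      case '--limit': rest.limit = toCount(take(flag), flag); break;
      case '--resume-from': rest.resumeFrom = toCount(take(flag), flag); break;
      case '--job-id': rest.jobId = take(flag); break;
      default: throw new Error(`Unknown flag: ${flag}`);
    }
  }
  if (site === undefined || SITE_CODES.indexOf(site) === -1) {
    throw new Error(`--site is required and must be one of ${SITE_CODES.join('/')}`);
  }
  return { site, ...rest };
}

const options = parseCli(['--all', '--site', 'gb', '--limit', '300']);
```

Validating `--site` up front means a typo in the site code fails before any network request is made.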

## Scheduler Integration

Seven new sequential phases, one per site, are appended to the `imports` pipeline group in `scripts/scheduler.ts`:

```typescript
{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },
```

New `getJobCommand` case:
```typescript
case 'buscarmisas-network-import': {
  // Always run the full import; the optional flags mirror the phase config.
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
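
To make the mapping from phase config to command line concrete, the same logic is shown below as a standalone function. `buildImportArgs` and `PhaseConfig` are hypothetical names used only for this illustration; in the scheduler the logic lives inside the `getJobCommand` switch.

```typescript
// Hypothetical config shape covering the keys read by the case above.
interface PhaseConfig {
  site?: string;
  limit?: number;
  resumeFrom?: number;
}

// Standalone copy of the argument-building logic from the switch case.
function buildImportArgs(config?: PhaseConfig): string[] {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return args;
}

// The buscarmisas-br phase, expressed as the command the scheduler would spawn.
const commandLine = ['npx', ...buildImportArgs({ site: 'br', limit: 300 })].join(' ');
// commandLine is 'npx tsx scripts/import-buscarmisas-network.ts --all --site br --limit 300'
```

Because the checks are truthiness-based, a `limit` or `resumeFrom` of `0` is simply omitted. For `resumeFrom` that matches the default of skipping nothing; whether an omitted `--limit` means "no cap" depends on the script.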

## Error Handling

- Failed fetches: log and skip, continue with next church
- Parse failures: log and skip (don't crash)
- DB errors during upsert: log and skip
- Unhandled exception in `main()`: catch → update job to `failed` → rethrow (same pattern as other importers)
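
The "log and skip" behaviour for individual churches could be sketched as a loop that isolates each church in its own `try`/`catch`. `processAll`, `RunStats`, and the injected `handle` and `log` callbacks are hypothetical; the real importer presumably wires in its own fetch, parse, and upsert code.

```typescript
interface RunStats {
  processed: number;
  skipped: number;
}

// Run the per-church handler for every URL; a failure on one church is logged
// and counted, and the loop moves on to the next URL.
async function processAll(
  urls: string[],
  handle: (url: string) => Promise<void>,
  log: (message: string) => void,
): Promise<RunStats> {
  const stats: RunStats = { processed: 0, skipped: 0 };
  for (const url of urls) {
    try {
      await handle(url); // fetch + parse + upsert for a single church
      stats.processed++;
    } catch (err) {
      stats.skipped++;
      const reason = err instanceof Error ? err.message : String(err);
      log(`skipping ${url}: ${reason}`);
    }
  }
  return stats;
}
```

Errors thrown outside this loop, for example while loading sitemaps, would still propagate to `main()`, where the job is marked `failed` and the error is rethrown.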

## Files Modified

- **New**: `scripts/import-buscarmisas-network.ts`
- **Modified**: `scripts/scheduler.ts` (add 7 pipeline phases + `getJobCommand` case)