# BuscarMisas Network Importer — Design Spec
**Date:** 2026-03-16
**Status:** Approved

---
## Overview
Add a single importer `scripts/import-buscarmisas-network.ts` that scrapes church data and mass schedules from a network of 5 identical WordPress-based Catholic mass-time directories covering 5 Latin American countries (~15,294 churches total).

---
## Network Sites
| Domain | Country | Churches | Language | Sitemap Type |
|--------|---------|----------|----------|--------------|
| `horariosmissa.com.br` | BR (Brazil) | ~4,732 | Portuguese | `page-sitemap*.xml` |
| `buscarmisas.com.mx` | MX (Mexico) | ~3,950 | Spanish | `page-sitemap*.xml` |
| `horariosmisa.com.ar` | AR (Argentina) | ~3,012 | Spanish | `page-sitemap*.xml` |
| `buscarmisas.co` | CO (Colombia) | ~2,665 | Spanish | `page-sitemap*.xml` |
| `horariomisa.cl` | CL (Chile) | ~935 | Spanish | `post-sitemap.xml` |
---
## Architecture
### Config Map
```ts
interface SiteConfig {
  country: string; // ISO 3166-1 alpha-2
  language: 'pt' | 'es';
  sitemapType: 'page' | 'post';
}

const NETWORK_SITES: Record<string, SiteConfig> = {
  'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
  'buscarmisas.com.mx': { country: 'MX', language: 'es', sitemapType: 'page' },
  'horariosmisa.com.ar': { country: 'AR', language: 'es', sitemapType: 'page' },
  'buscarmisas.co': { country: 'CO', language: 'es', sitemapType: 'page' },
  'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
};
```
### CLI Interface
```bash
# Single domain
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br
# All domains sequentially
npx tsx scripts/import-buscarmisas-network.ts --all
# Resume after interruption
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 1200
# Dry run (no DB writes)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
```
Source slug stored in DB: domain with dots replaced by dashes, e.g. `horariosmissa-com-br`.
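
A minimal sketch of how the flags above and the source slug could be handled. The names `CliOptions`, `parseCliArgs`, and `toSourceSlug` are hypothetical, and the rule that exactly one of `--domain` or `--all` must be passed is an assumption, not something this spec states.

```ts
// Illustrative sketch only; the importer's real argument handling is not specified here.
interface CliOptions {
  domain: string | undefined;
  all: boolean;
  resumeFrom: number;
  dryRun: boolean;
}

// Parses the four flags from an argv slice (e.g. process.argv.slice(2)).
function parseCliArgs(argv: string[]): CliOptions {
  const opts: CliOptions = { domain: undefined, all: false, resumeFrom: 0, dryRun: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--all') {
      opts.all = true;
    } else if (arg === '--dry-run') {
      opts.dryRun = true;
    } else if (arg === '--domain') {
      opts.domain = argv[++i];
    } else if (arg === '--resume-from') {
      const index = Number(argv[++i]);
      if (!Number.isInteger(index) || index < 0) {
        throw new Error('--resume-from expects a non-negative integer');
      }
      opts.resumeFrom = index;
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  // Assumption: --domain and --all are mutually exclusive and one of them is required.
  if (opts.all === (opts.domain !== undefined)) {
    throw new Error('Pass exactly one of --domain or --all');
  }
  return opts;
}

// Source slug: every dot in the domain replaced by a dash.
function toSourceSlug(domain: string): string {
  return domain.replace(/\./g, '-');
}
```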

---
## Data Flow
### 1. Sitemap Discovery
- Fetch `https://{domain}/sitemap_index.xml`
- Extract child sitemap URLs
- For `sitemapType: 'page'`: collect all `page-sitemap*.xml` URLs
- For `sitemapType: 'post'`: collect `post-sitemap.xml`
- Fetch each child sitemap, filter to 3-segment church URLs (path depth = `/{region}/{city}/{church-slug}/`)
- Collect deduplicated list of church page URLs
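
The selection and filtering steps above can be sketched as pure functions over already-fetched sitemap XML. All function names here are hypothetical. The sketch assumes plain `<loc>` elements (no CDATA), so a regular expression is enough and no XML parser is needed, and it reads `page-sitemap*.xml` as "any file name starting with `page-sitemap`".

```ts
type SitemapType = 'page' | 'post';

// Pulls every <loc> value out of a sitemap or sitemap index document.
function extractLocs(xml: string): string[] {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

// Keeps only the child sitemaps relevant to the site's sitemapType.
function selectChildSitemaps(urls: string[], type: SitemapType): string[] {
  const pattern = type === 'page' ? /\/page-sitemap[^/]*\.xml$/ : /\/post-sitemap\.xml$/;
  return urls.filter((u) => pattern.test(new URL(u).pathname));
}

// A church URL has exactly three path segments: /{region}/{city}/{church-slug}/
function isChurchUrl(url: string): boolean {
  return new URL(url).pathname.split('/').filter(Boolean).length === 3;
}

// Merges the church URLs of all child sitemaps, dropping duplicates
// while preserving first-seen order.
function collectChurchUrls(childSitemapXmls: string[]): string[] {
  const seen = new Set<string>();
  for (const xml of childSitemapXmls) {
    for (const loc of extractLocs(xml)) {
      if (isChurchUrl(loc)) seen.add(loc);
    }
  }
  return [...seen];
}
```

Keeping these steps free of network calls means they can be unit-tested against saved sitemap fixtures before the importer is pointed at the live sites.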
### 2. Church Page Parsing
For each church URL, fetch the HTML and extract:
| Field | Source |
|-------|--------|
| Name | Table cell after `<strong>Nome</strong>` (PT) or `<strong>Nombre</strong>` (ES) |
| Address | Table cell after `<strong>Endereço</strong>` (PT) or `<strong>Dirección</strong>` (ES) |
| Phone | `href="tel:..."` anchor |
| Latitude/Longitude | `center={lat}%2C{lng}` parameter of the Google Maps iframe `src` |
| Country | From `SiteConfig.country` |
| State/Region | 1st path segment of church URL |
| City | 2nd path segment of church URL |
| Mass schedule | Mon–Sun table rows: day name → time string (skip `-` entries) |
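
As an illustration of the URL-derived and attribute-derived fields in the table above, here is a regex-based sketch. The function names are hypothetical, and accepting a literal comma in addition to `%2C` is an added assumption. Note that region and city come back as URL slugs, not display names. Name and address extraction is left out because it depends on the exact table markup.

```ts
interface Coordinates {
  lat: number;
  lng: number;
}

// Reads lat/lng from the `center=` parameter of the first iframe src that has one.
function extractCoordinates(html: string): Coordinates | null {
  const iframeSrcs = [...html.matchAll(/<iframe[^>]*\bsrc=["']([^"']+)["']/gi)].map((m) => m[1]);
  for (const src of iframeSrcs) {
    const match = src.match(/center=(-?\d+(?:\.\d+)?)(?:%2C|,)(-?\d+(?:\.\d+)?)/i);
    if (!match) continue;
    const lat = Number(match[1]);
    const lng = Number(match[2]);
    // Reject values outside the valid coordinate range.
    if (Math.abs(lat) <= 90 && Math.abs(lng) <= 180) return { lat, lng };
  }
  return null;
}

// Reads the phone number from the first `tel:` link on the page.
function extractPhone(html: string): string | null {
  const match = html.match(/href=["']tel:([^"']+)["']/i);
  return match ? match[1].trim() : null;
}

// Region and city are the 1st and 2nd path segments of a 3-segment church URL.
function extractLocation(churchUrl: string): { region: string; city: string } | null {
  const segments = new URL(churchUrl).pathname.split('/').filter(Boolean);
  if (segments.length !== 3) return null;
  return { region: segments[0], city: segments[1] };
}
```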
### 3. Schedule Parsing
- Day names resolved via existing `src/scrapers/i18n/day-names.ts` patterns
- Portuguese: `Segunda-feira`, `Terça-feira`, `Quarta-feira`, `Quinta-feira`, `Sexta-feira`, `Sábado`, `Domingo`
- Spanish: `Lunes`, `Martes`, `Miércoles`, `Jueves`, `Viernes`, `Sábado`, `Domingo`
- Times are comma-separated within a single cell (e.g. `10:00, 18:00`) — split and create one schedule entry per time
- `-` entries indicate no mass that day — skip
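
The row-level rules above can be sketched as follows. `DAY_INDEX` is a local stand-in for the patterns in `src/scrapers/i18n/day-names.ts`, whose actual API is not shown in this spec. The 0 = Sunday numbering, the `ScheduleEntry` shape, and silently dropping values that do not look like `HH:MM` are all assumptions.

```ts
interface ScheduleEntry {
  dayOfWeek: number; // assumption: 0 = Sunday ... 6 = Saturday
  time: string; // as scraped, e.g. "10:00"
}

// Stand-in lookup for Portuguese and Spanish day names (lowercased).
const DAY_INDEX: Record<string, number> = {
  domingo: 0,
  'segunda-feira': 1, 'terça-feira': 2, 'quarta-feira': 3, 'quinta-feira': 4, 'sexta-feira': 5,
  lunes: 1, martes: 2, 'miércoles': 3, jueves: 4, viernes: 5,
  'sábado': 6,
};

// Turns one table row (day name + time cell) into zero or more schedule entries.
function parseScheduleRow(dayName: string, cell: string): ScheduleEntry[] {
  const dayOfWeek = DAY_INDEX[dayName.trim().toLowerCase().normalize('NFC')];
  if (dayOfWeek === undefined) return []; // unknown day label

  const text = cell.trim();
  if (text === '' || text === '-') return []; // no mass that day

  return text
    .split(',')
    .map((t) => t.trim())
    .filter((t) => /^\d{1,2}:\d{2}$/.test(t)) // keep only HH:MM-shaped values
    .map((time) => ({ dayOfWeek, time }));
}
```

For example, `parseScheduleRow('Domingo', '10:00, 18:00')` yields two entries for day 0, while `parseScheduleRow('Lunes', '-')` yields none.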
### 4. Upsert to Database
- Use `findDuplicateChurch` (existing church-matcher) to check for duplicates before insert
- Upsert church with `source = {domain-slug}`
- Upsert mass schedules linked to church
- Track progress: log every 100 churches, support `--resume-from {index}`
### 5. Rate Limiting
- 2-second delay between church page fetches (same as `import-discovermass.ts`)
- 5-second delay between domains when running `--all`
- Respect HTTP 429 / 503 with exponential backoff (up to 3 retries)
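
One way the 429/503 retry rule could look, as a sketch rather than the importer's actual code. `fetchWithBackoff`, `FetchLike`, and `BackoffOptions` are hypothetical names, and the 2-second base delay with doubling is an assumption, since the spec fixes only the retry count. The fetch function and the wait function are injected so the logic can be exercised without network access or real delays.

```ts
// Minimal shape of the response this sketch needs; the global fetch satisfies it.
type FetchLike = (url: string) => Promise<{ status: number; ok: boolean; text(): Promise<string> }>;

interface BackoffOptions {
  maxRetries?: number; // retries after the first attempt (spec: up to 3)
  baseDelayMs?: number; // assumption: first retry waits this long, then doubles
  wait?: (ms: number) => Promise<void>;
}

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithBackoff(
  url: string,
  doFetch: FetchLike,
  options: BackoffOptions = {},
): Promise<string> {
  const { maxRetries = 3, baseDelayMs = 2000, wait = sleep } = options;
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch(url);
    if (res.ok) return res.text();

    // Only 429 and 503 are retried; any other failure is reported immediately.
    const retryable = res.status === 429 || res.status === 503;
    if (!retryable || attempt >= maxRetries) {
      throw new Error(`GET ${url} failed with HTTP ${res.status}`);
    }
    await wait(baseDelayMs * 2 ** attempt); // e.g. 2 s, 4 s, 8 s with the defaults
  }
}
```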
---
## Error Handling
- Skip churches where lat/lng cannot be extracted (log warning, continue)
- Skip churches where name is empty
- On fetch error: log and continue to next URL (don't abort the run)
- On DB error: log and continue
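
The log-and-continue policy, combined with `--resume-from` and the every-100 progress log from the Data Flow section, could be structured like this. `runImport` and `RunStats` are hypothetical names, and `processChurch` stands in for the fetch, parse, and upsert steps, which are assumed to throw on any failure (including a missing name or missing coordinates).

```ts
interface RunStats {
  processed: number;
  failed: number;
}

// Iterates church URLs from `resumeFrom`, never aborting on a single failure.
async function runImport(
  urls: string[],
  resumeFrom: number,
  processChurch: (url: string) => Promise<void>,
  log: (msg: string) => void = console.log,
): Promise<RunStats> {
  const stats: RunStats = { processed: 0, failed: 0 };
  for (let i = resumeFrom; i < urls.length; i++) {
    try {
      await processChurch(urls[i]);
      stats.processed++;
    } catch (err) {
      // Fetch, parse, and DB errors all land here: log and move on.
      stats.failed++;
      const reason = err instanceof Error ? err.message : String(err);
      log(`[${i}] ${urls[i]} skipped: ${reason}`);
    }
    // The logged position is an index that can be passed back as --resume-from.
    if ((i + 1) % 100 === 0) log(`Progress: ${i + 1}/${urls.length}`);
  }
  return stats;
}
```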
---
## Integration
- Add `import:buscarmisas-network` script to `package.json`
- Add to scheduler pipeline alongside other importers
- No new dependencies required
---
## Out of Scope
- The `horairemesses.ch` (Switzerland), `gottesdienstheute.de` (Germany), and `masstime.co.uk` (UK) network sites are excluded — those countries already have dedicated importers
- Chile's `page-sitemap.xml` contains only city pages (not churches) — only `post-sitemap.xml` is used for Chile