# BuscarMisas Network Importer — Design Spec
**Date:** 2026-03-16
**Status:** Approved
---
## Overview
Add a single importer `scripts/import-buscarmisas-network.ts` that scrapes church data and mass schedules from a network of 5 identical WordPress-based Catholic mass-time directories covering 5 Latin American countries (~15,294 churches total).
---
## Network Sites
| Domain | Country | Churches | Language | Sitemap Type |
|--------|---------|----------|----------|--------------|
| `horariosmissa.com.br` | BR (Brazil) | ~4,732 | Portuguese | `page-sitemap*.xml` |
| `buscarmisas.com.mx` | MX (Mexico) | ~3,950 | Spanish | `page-sitemap*.xml` |
| `horariosmisa.com.ar` | AR (Argentina) | ~3,012 | Spanish | `page-sitemap*.xml` |
| `buscarmisas.co` | CO (Colombia) | ~2,665 | Spanish | `page-sitemap*.xml` |
| `horariomisa.cl` | CL (Chile) | ~935 | Spanish | `post-sitemap.xml` |
---
## Schema Migration (prerequisite)
A new column must be added in **BethelGuide** (schema source of truth) before implementation:
```prisma
buscarmisasNetworkId String? @unique @map("buscarmisas_network_id")
@@index([buscarmisasNetworkId])
```
After merging the migration in BethelGuide, copy the updated `schema.prisma` to ScraperControl and run `npx prisma generate`.
The external ID format is `{domain-slug}/{church-slug}`, e.g.:
`horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios`
where `domain-slug` replaces `.` with `-`, and `church-slug` is the final path segment of the church URL.
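The ID rule above can be sketched as a small pure function. `buildExternalId` is a hypothetical helper name, not something the spec defines, and the region and city segments in the usage note are made up for illustration.

```ts
// Hypothetical helper (not part of the spec): derives `{domain-slug}/{church-slug}`.
function buildExternalId(domain: string, churchUrl: string): string {
  // `domain-slug`: every `.` in the domain becomes `-`.
  const domainSlug = domain.replace(/\./g, '-');
  // `church-slug`: final non-empty path segment, so a trailing slash is ignored.
  const segments = new URL(churchUrl).pathname.split('/').filter(Boolean);
  const churchSlug = segments[segments.length - 1];
  if (!churchSlug) {
    throw new Error(`Church URL has no path segments: ${churchUrl}`);
  }
  return `${domainSlug}/${churchSlug}`;
}
```

For example, calling it with `horariosmissa.com.br` and a URL ending in `/some-region/some-city/paroquia-nossa-senhora-dos-remedios/` would yield the ID shown above.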
`church-matcher.ts` and the `ExistingChurch` / `ChurchCandidate` interfaces must be updated to include `buscarmisasNetworkId` alongside the existing external ID fields, with a corresponding ID-match pass in `findDuplicateChurch()`.
---
## Architecture
### Config Map
```ts
interface SiteConfig {
country: string; // ISO 3166-1 alpha-2
language: 'pt' | 'es';
sitemapType: 'page' | 'post';
}
const NETWORK_SITES: Record<string, SiteConfig> = {
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
'buscarmisas.com.mx': { country: 'MX', language: 'es', sitemapType: 'page' },
'horariosmisa.com.ar': { country: 'AR', language: 'es', sitemapType: 'page' },
'buscarmisas.co': { country: 'CO', language: 'es', sitemapType: 'page' },
'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
};
```
### CLI Interface
```bash
# Single domain
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br
# Single domain with resume (--resume-from only valid with --domain)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 1200
# All domains sequentially (no --resume-from; use --domain for resuming individual runs)
npx tsx scripts/import-buscarmisas-network.ts --all
# Dry run (no DB writes)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
```
**Validation:** If `--domain` is provided but not present in `NETWORK_SITES`, exit immediately with a clear error message listing valid domains. `--resume-from` combined with `--all` is also an error — exit with usage message.
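A minimal sketch of these validation rules, using `parseArgs` from Node's built-in `node:util`. The function name `parseCli`, the `CliOptions` shape, the requirement that `--domain` be present when `--all` is absent, and the treatment of `--resume-from` as a non-negative integer are assumptions for illustration; the spec only fixes the two error cases described above.

```ts
import { parseArgs } from 'node:util';

// Mirrors the keys of NETWORK_SITES from the Config Map section.
const VALID_DOMAINS: string[] = [
  'horariosmissa.com.br',
  'buscarmisas.com.mx',
  'horariosmisa.com.ar',
  'buscarmisas.co',
  'horariomisa.cl',
];

// Hypothetical result shape; the real script may structure this differently.
interface CliOptions {
  domains: string[];
  resumeFrom: number;
  dryRun: boolean;
}

function parseCli(argv: string[]): CliOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      domain: { type: 'string' },
      all: { type: 'boolean' },
      'resume-from': { type: 'string' },
      'dry-run': { type: 'boolean' },
    },
  });
  const dryRun = values['dry-run'] ?? false;

  if (values.all) {
    // --resume-from is only meaningful for a single domain.
    if (values['resume-from'] !== undefined) {
      throw new Error('--resume-from cannot be combined with --all; use --domain to resume one site');
    }
    return { domains: [...VALID_DOMAINS], resumeFrom: 0, dryRun };
  }

  const domain = values.domain;
  if (domain === undefined || !VALID_DOMAINS.includes(domain)) {
    throw new Error(`Unknown or missing --domain. Valid domains: ${VALID_DOMAINS.join(', ')}`);
  }

  // Assumption: the resume index is a non-negative integer.
  const resumeFrom = Number(values['resume-from'] ?? '0');
  if (!Number.isInteger(resumeFrom) || resumeFrom < 0) {
    throw new Error('--resume-from must be a non-negative integer');
  }
  return { domains: [domain], resumeFrom, dryRun };
}
```

A caller would typically pass `process.argv.slice(2)`, catch the error, print its message, and exit with a non-zero status.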
Source slug stored in DB `source` field: `buscarmisas-network` (same value for all domains — the `buscarmisasNetworkId` distinguishes per-church).
---
## Data Flow
### 1. Sitemap Discovery
- Fetch `https://{domain}/sitemap_index.xml`
- Extract child sitemap URLs
- For `sitemapType: 'page'`: collect all `page-sitemap*.xml` URLs and ignore `post-sitemap*.xml`
- For `sitemapType: 'post'` (Chile only): collect `post-sitemap.xml` and ignore `page-sitemap.xml`, which lists city pages rather than churches
- Fetch each child sitemap, filter to 3-segment church URLs (path segments = `/{region}/{city}/{church-slug}/`)
- Collect deduplicated list of church page URLs
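The selection and filtering steps above can be sketched as pure functions over URL lists, leaving fetching and XML parsing aside. All function names here are hypothetical, and the regular expressions are one reading of the `page-sitemap*.xml` glob.

```ts
type SitemapType = 'page' | 'post';

// Picks the child sitemaps relevant to a site, based on its sitemapType.
function selectChildSitemaps(childUrls: string[], sitemapType: SitemapType): string[] {
  const pattern =
    sitemapType === 'page'
      ? /\/page-sitemap[^/]*\.xml$/ // page-sitemap.xml, page-sitemap2.xml, ...
      : /\/post-sitemap\.xml$/; // exactly post-sitemap.xml
  return childUrls.filter((url) => pattern.test(new URL(url).pathname));
}

// A church page has exactly three path segments: /{region}/{city}/{church-slug}/
function isChurchUrl(url: string): boolean {
  return new URL(url).pathname.split('/').filter(Boolean).length === 3;
}

// Keeps only church URLs and removes duplicates, preserving first-seen order.
function collectChurchUrls(urls: string[]): string[] {
  return Array.from(new Set(urls.filter(isChurchUrl)));
}
```

Keeping these steps free of I/O makes them easy to unit test against saved sitemap fixtures.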
### 2. Church Page Parsing
For each church URL, fetch the HTML and extract:
| Field | Source |
|-------|--------|
| Name | Table cell after `<strong>Nome</strong>` (PT) or `<strong>Nombre</strong>` (ES) |
| Address | Table cell after `<strong>Endereço</strong>` (PT) or `<strong>Dirección</strong>` (ES) |
| Phone | `href="tel:..."` anchor |
| Latitude/Longitude | Google Maps iframe `src`: `center={lat}%2C{lng}` parameter (confirmed present on all 5 network sites; same API key `AIzaSyCNTEOso0tZG6YMSJFoaJEY5Th1stEWrJI` used across the network) |
| Country | From `SiteConfig.country` |
| State/Region | 1st path segment of church URL (URL-decoded) |
| City | 2nd path segment of church URL (URL-decoded) |
| Mass schedule | Monday-to-Sunday table rows: day name → time string (skip `-` entries) |
| External ID | `{domain-slug}/{church-slug}` |
If `center=` is absent from the iframe src, skip the church with a warning log. Do not fall back to the `q=` parameter (it contains a search query, not coordinates).
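As a sketch of the coordinate rule, the function below reads only the `center=` parameter and returns `null` when it is absent, so the caller can log a warning and skip the church. The name `extractCenter`, the tolerance for a literal comma or an HTML-escaped `&amp;`, and the range check are assumptions added for robustness.

```ts
interface Coordinates {
  lat: number;
  lng: number;
}

// Extracts lat/lng from the `center={lat}%2C{lng}` parameter of a Maps iframe src.
// Returns null when `center=` is missing; the `q=` parameter is never consulted.
function extractCenter(iframeSrc: string): Coordinates | null {
  const match = /(?:[?&]|&amp;)center=(-?\d+(?:\.\d+)?)(?:%2C|,)(-?\d+(?:\.\d+)?)/i.exec(iframeSrc);
  if (!match) return null;
  const lat = Number(match[1]);
  const lng = Number(match[2]);
  // Assumption: reject values outside valid latitude/longitude ranges.
  if (Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
  return { lat, lng };
}
```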
### 3. Schedule Parsing
- Use `getDayNamesForCountry(config.country)` from `src/scrapers/i18n/day-names.ts` to get the day-name map keyed by country code (`'BR'`, `'MX'`, etc.)
- Build patterns with `buildDayPatterns(dayNames)` and match against the table's day-name cell text
- Times are comma-separated within a single cell (e.g. `10:00, 18:00`) — split on `,` and create one schedule entry per time
- `-` entries indicate no mass that day — skip
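The cell-splitting rule can be illustrated with a short helper. `parseTimesCell` is a hypothetical name, and dropping fragments that do not look like `H:MM` or `HH:MM` is an assumption beyond what the spec states.

```ts
// Splits a schedule cell such as "10:00, 18:00" into individual time strings.
// A "-" (or empty) cell means no mass that day and yields an empty list.
function parseTimesCell(cellText: string): string[] {
  const trimmed = cellText.trim();
  if (trimmed === '' || trimmed === '-') return [];
  return trimmed
    .split(',')
    .map((part) => part.trim())
    // Assumption: keep only fragments shaped like H:MM or HH:MM.
    .filter((part) => /^\d{1,2}:\d{2}$/.test(part));
}
```

Each returned time would then become one schedule entry for the day matched by the row's day-name pattern.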
### 4. Upsert to Database
- Load all existing `buscarmisasNetworkId` values from DB into a `Set` at startup — skip already-imported churches (same pattern as `import-discovermass.ts`)
- Use `findDuplicateChurch()` for new churches to detect cross-source duplicates
- Upsert church with `source = 'buscarmisas-network'` and `buscarmisasNetworkId = externalId`
- Upsert mass schedules linked to church
- Log progress every 100 churches; support `--resume-from {index}` (single-domain mode only)
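The skip-set and resume logic above might be sketched as follows. The names `planTasks` and `shouldLogProgress` are hypothetical, and interpreting `--resume-from` as a zero-based index into the deduplicated URL list is an assumption; database access is deliberately left out.

```ts
// Hypothetical unit of work for one church page.
interface ChurchTask {
  index: number; // position in the deduplicated URL list
  url: string;
  externalId: string;
}

// Builds the list of churches still to import: drops everything before the
// resume index and everything whose external ID is already in the database.
function planTasks(
  urls: string[],
  toExternalId: (url: string) => string,
  existingIds: Set<string>,
  resumeFrom = 0,
): ChurchTask[] {
  const tasks: ChurchTask[] = [];
  urls.forEach((url, index) => {
    if (index < resumeFrom) return;
    const externalId = toExternalId(url);
    if (existingIds.has(externalId)) return;
    tasks.push({ index, url, externalId });
  });
  return tasks;
}

// True on every 100th processed church, matching the progress-log cadence.
function shouldLogProgress(processed: number): boolean {
  return processed > 0 && processed % 100 === 0;
}
```

Keeping the original `index` on each task lets the progress log print a value that can be passed straight back to `--resume-from`.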
### 5. Rate Limiting
- 2-second delay between church page fetches
- 5-second delay between domains when running `--all`
- On HTTP 429 or 503: exponential backoff, up to 3 retries, then skip with warning
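One way to sketch the retry rule is shown below. The 2-second base delay, the doubling factor, and the names `backoffDelayMs` and `fetchWithRetry` are assumptions; the spec fixes only the status codes, the retry limit of 3, and skip-with-warning behaviour. The fetcher and sleep function are injectable so the logic can be tested without network access.

```ts
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 2000; // assumption: start at the normal inter-request delay

// Delay before retry number `attempt` (0-based): 2s, 4s, 8s.
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** attempt;
}

type Fetcher = (url: string) => Promise<{ status: number; text(): Promise<string> }>;
type Sleeper = (ms: number) => Promise<void>;

const defaultSleep: Sleeper = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Returns the page body, or null when the page should be skipped.
async function fetchWithRetry(
  url: string,
  fetcher: Fetcher = (u) => fetch(u),
  sleep: Sleeper = defaultSleep,
): Promise<string | null> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const res = await fetcher(url);
    if (res.status !== 429 && res.status !== 503) {
      // Other non-2xx statuses are not retried; the caller logs and continues.
      return res.status >= 200 && res.status < 300 ? res.text() : null;
    }
    if (attempt < MAX_RETRIES) await sleep(backoffDelayMs(attempt));
  }
  console.warn(`Skipping ${url}: still rate-limited after ${MAX_RETRIES} retries`);
  return null;
}
```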
---
## Error Handling
- Skip churches where `center=` lat/lng is absent (log warning, continue)
- Skip churches where name is empty after parsing
- On fetch error: log and continue to next URL
- On DB error: log and continue
---
## Integration
### package.json
```json
"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts"
```
### Scheduler (`scripts/scheduler.ts`)
Add one `PipelinePhase` per domain (5 total) so each country can be scheduled and monitored independently. Each phase's `type` string must match exactly between `PIPELINE_GROUPS` and the `case` label in `getJobCommand()`. A mismatch is not caught at build time and only surfaces at runtime as an "Unknown job type" error. The `type` field in the DB `BackgroundJob` model is a plain `String`, consistent with existing values like `'discovermass-import'`.
All 5 phases and their corresponding `case` blocks:
| Phase `type` | `--domain` |
|---|---|
| `buscarmisas-network-BR` | `horariosmissa.com.br` |
| `buscarmisas-network-MX` | `buscarmisas.com.mx` |
| `buscarmisas-network-AR` | `horariosmisa.com.ar` |
| `buscarmisas-network-CO` | `buscarmisas.co` |
| `buscarmisas-network-CL` | `horariomisa.cl` |
```ts
case 'buscarmisas-network-BR':
return `npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br`;
case 'buscarmisas-network-MX':
return `npx tsx scripts/import-buscarmisas-network.ts --domain buscarmisas.com.mx`;
// ... etc for AR, CO, CL
```
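As an alternative sketch, the five `case` blocks could be driven from a single lookup so the type strings live in one place. `NETWORK_PHASE_DOMAINS` and `getNetworkJobCommand` are hypothetical names, and how this would plug into the existing `getJobCommand()` is not specified here.

```ts
// Phase type -> domain, taken from the table above.
const NETWORK_PHASE_DOMAINS: Record<string, string> = {
  'buscarmisas-network-BR': 'horariosmissa.com.br',
  'buscarmisas-network-MX': 'buscarmisas.com.mx',
  'buscarmisas-network-AR': 'horariosmisa.com.ar',
  'buscarmisas-network-CO': 'buscarmisas.co',
  'buscarmisas-network-CL': 'horariomisa.cl',
};

// Hypothetical helper: resolves a phase type to its import command.
function getNetworkJobCommand(type: string): string {
  const domain = NETWORK_PHASE_DOMAINS[type];
  if (!domain) {
    throw new Error(`Unknown job type: ${type}`);
  }
  return `npx tsx scripts/import-buscarmisas-network.ts --domain ${domain}`;
}
```

The same map could also be used to generate the `PipelinePhase` entries, which would rule out a mismatch between the two lists.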
---
## Out of Scope
- The `horairemesses.ch` (Switzerland), `gottesdienstheute.de` (Germany), and `masstime.co.uk` (UK) network sites are excluded — those countries already have dedicated importers
- Chile's `page-sitemap.xml` contains only city pages (not churches) — only `post-sitemap.xml` is used for Chile