# Spain Church Importer (horariosmisas.com) — Design
## Overview

Import ~10,000 Spanish churches with their mass schedules from horariosmisas.com. The site is a static WordPress site with a fully permissive robots.txt and XML sitemaps, so no Playwright is needed: plain HTTP requests and HTML parsing are enough.
## Data Source

- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed; ~10,000 listed in the sitemaps, across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **Coordinates:** None; the site provides addresses only. Forward geocoding via Nominatim runs as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: *` / `Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps
## Architecture

### Two-Pass Approach

**Pass 1: Import.** Fetch all church pages listed in the sitemaps, parse the HTML, match each church against the existing Spanish churches from OSM, and upsert it with its mass schedules. Unmatched churches are created with an address but no coordinates.

**Pass 2: Geocode.** Forward-geocode the unmatched churches via the Nominatim public API (`address → lat/lng`), rate-limited to 1 request per second.
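The geocode pass can be sketched as two small pure helpers around Nominatim's `/search` endpoint. This is a minimal sketch, not the importer's actual code: the function names `buildGeocodeUrl` and `parseFirstHit` and the `LatLng` type are hypothetical, and the choice of `format=jsonv2`, `countrycodes=es`, and `limit=1` is an assumption about what the importer would want.

```typescript
// Sketch only: helper names and parameter choices are assumptions, not the importer's code.
const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

interface LatLng {
  lat: number;
  lng: number;
}

// Build the forward-geocoding URL for one free-form church address.
function buildGeocodeUrl(address: string): string {
  const params = new URLSearchParams({
    q: address,          // free-form query
    format: "jsonv2",    // JSON response
    countrycodes: "es",  // restrict results to Spain
    limit: "1",          // only the best hit is needed
  });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

// Nominatim returns `lat` and `lon` as strings; convert the first hit, if any.
function parseFirstHit(results: Array<{ lat: string; lon: string }>): LatLng | null {
  if (results.length === 0) return null;
  const lat = Number(results[0].lat);
  const lng = Number(results[0].lon);
  return Number.isFinite(lat) && Number.isFinite(lng) ? { lat, lng } : null;
}
```

The actual request would be a `fetch` of that URL with an identifying `User-Agent` header, as the Nominatim usage policy asks, followed by the 1-second pause described under Rate Limiting.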
### Schema Change

Add `horariosMisasId String? @unique` to the Church model, following the same pattern as `philmassId` and `massSchedulesPhId`. Update the church matcher and all existing importers to account for the new field.
### URL Structure

- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.
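The URL rules above can be sketched as a sitemap `<loc>` extractor plus a path filter. This is a minimal sketch under stated assumptions: `extractLocs`, `parseChurchUrl`, and `NON_CHURCH_SECTIONS` are hypothetical names, the exclusion set contains only the three sections named above (the "etc." is left unfilled), and a regex stands in for a real XML parser.

```typescript
// Sketch only: names are hypothetical; the exclusion list covers just the sections named in this doc.
const NON_CHURCH_SECTIONS = new Set(["misas-diarias", "santos-del-dia", "oraciones"]);

interface ChurchPath {
  province: string;
  city: string;
  slug: string;
}

// Pull every <loc> URL out of a sitemap (or sitemap index) document.
function extractLocs(sitemapXml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(sitemapXml)) !== null) {
    locs.push(m[1]);
  }
  return locs;
}

// Accept only /{province}/{city}/{church-slug}/ paths; everything else is treated as a non-church post.
function parseChurchUrl(url: string): ChurchPath | null {
  const segments = new URL(url).pathname.split("/").filter(Boolean);
  if (segments.length !== 3) return null;
  const [province, city, slug] = segments;
  if (NON_CHURCH_SECTIONS.has(province)) return null;
  return { province, city, slug };
}
```

Requiring exactly three path segments already rejects single-segment posts such as `/misas-diarias/`; the explicit set is a second guard in case such a section has nested pages.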
### HTML Parsing

- **Name:** `<h1>Church Name (City)</h1>`; strip the `(City)` suffix
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import the seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
  - Annotations stripped: `(familias)`, etc.
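The text-level parsing rules above can be sketched as four pure functions, independent of whichever HTML parser extracts the `<h1>` and table cells. All names here (`stripCitySuffix`, `seasonFor`, `expandDays`, `parseTimes`) are hypothetical, and mapping "Domingos y Festivos" to Sunday only is an assumption; how holidays are stored is not specified in this design.

```typescript
// Sketch only: function names and the day-index convention are assumptions.
type Season = "summer" | "winter";

// Day indices follow Date.getDay(): 0 = Sunday ... 6 = Saturday.
const DAY_INDEX: Record<string, number> = {
  "lunes": 1, "martes": 2, "miércoles": 3, "jueves": 4, "viernes": 5, "sábado": 6,
  "domingos y festivos": 0, // assumption: holidays are folded into Sunday
};

// "Church Name (City)" -> "Church Name"
function stripCitySuffix(h1: string): string {
  return h1.replace(/\s*\([^)]*\)\s*$/, "").trim();
}

// Jun-Sep = summer, Oct-May = winter.
function seasonFor(date: Date): Season {
  const month = date.getMonth() + 1; // 1-12
  return month >= 6 && month <= 9 ? "summer" : "winter";
}

// "Lunes a Viernes" -> [1, 2, 3, 4, 5]; a single day name -> [n]; unknown labels -> [].
function expandDays(label: string): number[] {
  const text = label.trim().toLowerCase();
  const range = text.match(/^(.+?)\s+a\s+(.+)$/);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    // Ranges that wrap past Saturday are not handled in this sketch.
    if (start === undefined || end === undefined || start > end) return [];
    const days: number[] = [];
    for (let d = start; d <= end; d++) days.push(d);
    return days;
  }
  const single = DAY_INDEX[text];
  return single === undefined ? [] : [single];
}

// "10:00h<br>12:30h (familias)" -> ["10:00", "12:30"]
function parseTimes(cellHtml: string): string[] {
  return cellHtml
    .split(/<br\s*\/?>/i)
    .map((part) => part.replace(/\([^)]*\)/g, "")) // drop annotations
    .map((part) => part.match(/(\d{1,2}):(\d{2})\s*h/i))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => `${m[1].padStart(2, "0")}:${m[2]}`);
}
```

Keeping these helpers free of DOM access makes them easy to unit-test against sample cell strings before wiring them to the real pages.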
### Matching Strategy

1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates
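The three-step cascade above might look like the following in-memory sketch. The types, `normalizeName`, and `matchChurch` are hypothetical and do not describe the existing church matcher. Because this design does not say how proximity is evaluated for address-only records, the proximity check is passed in as a predicate rather than implemented.

```typescript
// Sketch only: types and helpers are hypothetical, not the project's church matcher.
interface ImportedChurch {
  horariosMisasId: string;
  name: string;
  address: string;
}

interface ExistingChurch {
  id: number;
  horariosMisasId: string | null;
  name: string;
}

type MatchResult =
  | { kind: "id"; church: ExistingChurch }
  | { kind: "name"; church: ExistingChurch }
  | { kind: "new" };

// Assumed normalisation: strip accents, lower-case, collapse punctuation to spaces.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function matchChurch(
  imported: ImportedChurch,
  existing: ExistingChurch[],
  isNearby: (candidate: ExistingChurch) => boolean, // proximity rule supplied by the caller
): MatchResult {
  // 1. Exact ID match (re-imports)
  const byId = existing.find((c) => c.horariosMisasId === imported.horariosMisasId);
  if (byId) return { kind: "id", church: byId };

  // 2. Name + proximity
  const wanted = normalizeName(imported.name);
  const byName = existing.find((c) => normalizeName(c.name) === wanted && isNearby(c));
  if (byName) return { kind: "name", church: byName };

  // 3. Unmatched: the caller creates a new church (address, country=ES, no coordinates)
  return { kind: "new" };
}
```

Returning a tagged result instead of performing the upsert keeps the matching decision testable and leaves database writes to the caller, which also fits a `--dry-run` mode.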
### CLI

```
# Import every province
npx tsx scripts/import-horariosmisas.ts --all
# Same, without writing to the database
npx tsx scripts/import-horariosmisas.ts --all --dry-run
# Import a single province
npx tsx scripts/import-horariosmisas.ts --province madrid
# Import, then run the geocoding pass
npx tsx scripts/import-horariosmisas.ts --all --geocode
# Run only the geocoding pass
npx tsx scripts/import-horariosmisas.ts --geocode-only
# Resume a full import from position 5000
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```
### Rate Limiting

- Import: 1.5s between requests (~10,000 × 1.5s ≈ 4.2 hours, not counting request time)
- Geocode: 1s between requests (Nominatim public API limit)
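The delays above can be sketched as a fixed-pause sequential loop, with the run-time estimate as plain arithmetic. The constant and function names are hypothetical, and the estimate counts only the pauses, not the time each request takes.

```typescript
// Sketch only: constant and function names are assumptions.
const IMPORT_DELAY_MS = 1500;  // pause between page fetches
const GEOCODE_DELAY_MS = 1000; // pause between Nominatim requests

// Time spent waiting between requests, in hours (excludes request latency).
function estimateHours(requests: number, delayMs: number): number {
  return (requests * delayMs) / 3600000;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Process items one at a time, pausing between consecutive items.
async function forEachThrottled<T>(
  items: T[],
  delayMs: number,
  work: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i++) {
    await work(items[i]);
    if (i < items.length - 1) await sleep(delayMs);
  }
}
```

For example, `estimateHours(10000, IMPORT_DELAY_MS)` gives about 4.17, which matches the ~4.2 hour figure for the import pass.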
### Scheduler Integration

Add the importer to the imports group of `PIPELINE_GROUPS` (sequential, after `philmass-import`).