# Spain Church Importer (horariosmisas.com): Design

## Overview

Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. The site is a static WordPress site with a fully permissive robots.txt and sitemaps, so no Playwright is needed: plain HTTP requests plus HTML parsing are enough.

## Data Source

- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **No coordinates:** addresses only. Forward geocoding via Nominatim runs as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps

## Architecture

### Two-Pass Approach

**Pass 1: Import.** Fetch all churches from the sitemaps, parse the HTML, match against existing Spanish OSM churches, and upsert with mass schedules. Unmatched churches are created with an address but no coordinates.

**Pass 2: Geocode.** Forward-geocode unmatched churches via the Nominatim public API (`address → lat/lng`), rate-limited to 1 req/sec.

### Schema Change

Add `horariosMisasId String? @unique` to the Church model (same pattern as `philmassId`, `massSchedulesPhId`). Update the church matcher and all existing importers.

### URL Structure

- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.

### HTML Parsing

- **Name:** `Church Name (City)`; strip the `(City)` suffix
- **Address:** `📌 Street, PostalCode City (Province)`
- **Phone:** `Teléfono: ...`
- **Website:** `Página Web: ...`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import the seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
  - Annotations stripped: `(familias)`, etc.

### Matching Strategy

1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create a new church with address, country=ES, no coordinates

### CLI

```
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```

### Rate Limiting

- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
- Geocode: 1s between requests (Nominatim public API limit)

### Scheduler Integration

Add to the `PIPELINE_GROUPS` imports group (sequential, after philmass-import).
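
As a sketch of the filtering described under URL Structure, the following helper keeps only sitemap URLs shaped like `/{province}/{city}/{church-slug}/` and drops the listed non-church sections. The function name, the `ChurchUrlParts` type, and the example slug are hypothetical, and the exclusion set only contains the three sections named above; the real list is longer ("etc.").

```typescript
// Hypothetical helper: names are illustrative and not taken from the importer.
// The exclusion set holds only the sections listed in this document.
const NON_CHURCH_SECTIONS: readonly string[] = [
  "misas-diarias",
  "santos-del-dia",
  "oraciones",
];

interface ChurchUrlParts {
  province: string;
  city: string;
  slug: string;
}

// Returns the path parts for a church page, or null for anything else.
function parseChurchUrl(url: string): ChurchUrlParts | null {
  // Drop scheme and host, then any query string or fragment.
  const path = url.replace(/^https?:\/\/[^/]+/, "").split(/[?#]/)[0] ?? "";
  const segments = path.split("/").filter((s) => s.length > 0);

  // Church pages have exactly three segments: province, city, slug.
  if (segments.length !== 3) return null;
  const [province, city, slug] = segments as [string, string, string];

  // A known non-church section in first position is never a province.
  if (NON_CHURCH_SECTIONS.indexOf(province) !== -1) return null;
  return { province, city, slug };
}
```

For instance, `parseChurchUrl("https://horariosmisas.com/madrid/madrid/parroquia-ejemplo/")` yields the three parts, while `parseChurchUrl("https://horariosmisas.com/misas-diarias/")` yields `null`. The slug could serve as the `horariosMisasId` value, though that choice is an assumption.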
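
The name and address rules under HTML Parsing could be handled roughly as follows once the text has been extracted from the page. This is a minimal sketch: the function names and the `ParsedAddress` type are hypothetical, it assumes the address always follows the `Street, PostalCode City (Province)` shape with a five-digit postal code, and the sample strings are invented.

```typescript
// Hypothetical helpers operating on already-extracted text (no DOM access).
interface ParsedAddress {
  street: string;
  postalCode: string;
  city: string;
  province: string;
}

// "Church Name (City)" -> "Church Name": drop one trailing parenthesised suffix.
function stripCitySuffix(name: string): string {
  return name.replace(/\s*\([^)]*\)\s*$/, "").trim();
}

// "📌 Street, PostalCode City (Province)" -> structured fields, or null.
// Assumption: the postal code is five digits and follows the last comma.
function parseAddress(line: string): ParsedAddress | null {
  const text = line.replace("📌", "").trim();
  const m = /^(.+),\s*(\d{5})\s+(.+?)\s*\(([^)]+)\)$/.exec(text);
  if (m === null) return null;
  return {
    street: (m[1] ?? "").trim(),
    postalCode: m[2] ?? "",
    city: (m[3] ?? "").trim(),
    province: (m[4] ?? "").trim(),
  };
}
```

Returning `null` on any line that does not fit the pattern lets the importer log the page for review rather than store a half-parsed address.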
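
The schedule rules under HTML Parsing (season choice, day ranges, `HH:MMh` times separated by `<br>`, stripped annotations) can be illustrated with a few pure functions. This is a sketch under assumptions: all names are hypothetical, the cell contents are assumed to arrive as raw strings, and only ranges of the form "X a Y" over Lunes to Sábado are expanded.

```typescript
type Season = "summer" | "winter";

// Jun-Sep = summer, Oct-May = winter. `month` is 1-12.
function seasonForMonth(month: number): Season {
  return month >= 6 && month <= 9 ? "summer" : "winter";
}

const WEEKDAYS: readonly string[] = [
  "Lunes", "Martes", "Miércoles", "Jueves", "Viernes", "Sábado",
];
const SUNDAY_LABEL = "Domingos y Festivos";

// Expand a DÍA cell into individual day labels.
// Handles a single day, "Domingos y Festivos", or a range like "Lunes a Viernes".
// Unrecognised text yields an empty list so the caller can flag it.
function expandDays(cell: string): string[] {
  const text = cell.trim();
  if (text === SUNDAY_LABEL) return [SUNDAY_LABEL];

  const range = /^(\S+) a (\S+)$/.exec(text);
  if (range !== null) {
    const start = WEEKDAYS.indexOf(range[1] ?? "");
    const end = WEEKDAYS.indexOf(range[2] ?? "");
    if (start === -1 || end === -1 || start > end) return [];
    return WEEKDAYS.slice(start, end + 1);
  }
  return WEEKDAYS.indexOf(text) !== -1 ? [text] : [];
}

// Extract times from a HORARIO cell such as "10:00h<br>12:00h (familias)".
// Parenthesised annotations are removed first; results are zero-padded "HH:MM".
function parseTimes(cellHtml: string): string[] {
  const withoutNotes = cellHtml.replace(/\([^)]*\)/g, "");
  const matches = withoutNotes.match(/\d{1,2}:\d{2}(?=\s*h)/g) ?? [];
  return matches.map((t) => (t.length === 4 ? "0" + t : t));
}
```

With these, a row whose cells are `Lunes a Viernes` and `10:00h<br>19:30h (familias)` expands to five days, each with the times `10:00` and `19:30`.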
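
The run-time estimate under Rate Limiting is simple arithmetic: 10,000 requests × 1.5 s = 15,000 s, or about 4.2 hours. The sketch below generalises it to a resumed run. The function name is hypothetical, and it assumes `--resume-from N` skips the first N churches, which this document does not spell out.

```typescript
// Hypothetical estimator: remaining requests times the fixed delay, in hours.
// Assumption: resumeFrom is a count of already-processed churches to skip.
function estimateRemainingHours(
  totalChurches: number,
  delaySeconds: number,
  resumeFrom: number = 0,
): number {
  const remaining = Math.max(0, totalChurches - resumeFrom);
  return (remaining * delaySeconds) / 3600;
}

// Full import at 1.5 s per request: roughly 4.17 hours.
const fullRunHours = estimateRemainingHours(10000, 1.5);
// Resuming from church 5000 halves that: roughly 2.08 hours.
const resumedRunHours = estimateRemainingHours(10000, 1.5, 5000);
```

The estimate covers only the inter-request delay; network latency and parsing time add to it.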