ScraperControl/docs/plans/2026-02-26-horariosmisas-spain-design.md
Albert 2c51513851 chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-12 19:11:22 -04:00

# Spain Church Importer (horariosmisas.com) — Design
## Overview
Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. Static WordPress site with fully permissive robots.txt and sitemaps. No Playwright needed — simple HTTP + HTML parsing.
## Data Source
- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **No coordinates** — addresses only. Forward geocoding via Nominatim as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps
## Architecture
### Two-Pass Approach
**Pass 1: Import.** Fetch all church pages listed in the sitemaps, parse the HTML, match against existing Spanish OSM churches, and upsert with mass schedules. Unmatched churches are created with an address but no coordinates.

**Pass 2: Geocode.** Forward-geocode unmatched churches via the Nominatim public API (`address → lat/lng`), rate-limited to 1 req/sec.
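
The geocoding pass could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: `buildGeocodeUrl`, `geocodeAll`, and the `User-Agent` value are hypothetical, and it assumes Node 18+ for the global `fetch`. The `q`, `format`, `limit`, and `countrycodes` query parameters belong to Nominatim's public search endpoint.

```typescript
// Hypothetical helper names; only the endpoint and its query parameters are Nominatim's.
const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

interface LatLng {
  lat: number;
  lng: number;
}

// Build a forward-geocoding URL restricted to Spain, returning at most one hit.
function buildGeocodeUrl(address: string): string {
  const params = new URLSearchParams({
    q: address,
    format: "jsonv2",
    limit: "1",
    countrycodes: "es",
  });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Geocode addresses one at a time, pausing 1 second after each request.
async function geocodeAll(addresses: string[]): Promise<Map<string, LatLng | null>> {
  const results = new Map<string, LatLng | null>();
  for (const address of addresses) {
    const res = await fetch(buildGeocodeUrl(address), {
      // Assumed identifier: the usage policy asks clients to identify themselves.
      headers: { "User-Agent": "church-importer-sketch/0.1" },
    });
    // Nominatim returns lat/lon as strings inside a JSON array.
    const hits = res.ok ? ((await res.json()) as Array<{ lat: string; lon: string }>) : [];
    const first = hits[0];
    results.set(address, first ? { lat: Number(first.lat), lng: Number(first.lon) } : null);
    await sleep(1000); // stay at or below 1 request per second
  }
  return results;
}
```

Addresses that return no hit are stored as `null`, so a later run can retry only those.
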
### Schema Change
Add `horariosMisasId String? @unique` to the `Church` model, following the same pattern as `philmassId` and `massSchedulesPhId`. Update the church matcher and all existing importers to account for the new field.
### URL Structure
- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.
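
The URL rules above can be sketched as a small filter. This is an illustration under assumptions: `extractLocs` and `parseChurchUrl` are hypothetical names, the prefix list holds only the three examples named above (the real list is longer), and the sitemap is assumed to use plain `<loc>` elements.

```typescript
// Only the prefixes named in this document; the full list is not known here.
const NON_CHURCH_PREFIXES = ["misas-diarias", "santos-del-dia", "oraciones"];

interface ChurchPath {
  province: string;
  city: string;
  slug: string;
}

// Collect every <loc> value from a sitemap XML document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Return the path parts when the URL has the shape /{province}/{city}/{church-slug}/.
function parseChurchUrl(url: string): ChurchPath | null {
  const segments = new URL(url).pathname.split("/").filter((s) => s.length > 0);
  if (segments.length !== 3) return null;
  const [province, city, slug] = segments;
  if (NON_CHURCH_PREFIXES.indexOf(province) !== -1) return null;
  return { province, city, slug };
}
```

Requiring exactly three path segments already rejects single-segment posts such as `/oraciones/`; the prefix check covers deeper non-church paths.
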
### HTML Parsing
- **Name:** `<h1>Church Name (City)</h1>` — strip `(City)` suffix
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import the seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
  - Annotations stripped: `(familias)`, etc.
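
The text-level rules in this list (name suffix, season choice, day labels, time cells) can be sketched with plain string handling. All function names below are hypothetical, and two choices are assumptions rather than part of the design: days use the `Date#getDay()` numbering (0 = Sunday), and "Domingos y Festivos" is mapped to Sunday only, with holidays not modelled.

```typescript
type Season = "summer" | "winter";

// Day indices follow Date#getDay(): 0 = Sunday ... 6 = Saturday.
const DAY_INDEX: Record<string, number> = {
  lunes: 1, martes: 2, miercoles: 3, jueves: 4, viernes: 5, sabado: 6,
  domingo: 0, domingos: 0,
};

// Lowercase and strip accents so "Miércoles" and "miercoles" compare equal.
function normalizeLabel(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

// "Church Name (City)" -> "Church Name"
function stripCitySuffix(heading: string): string {
  return heading.replace(/\s*\([^()]*\)\s*$/, "").trim();
}

// Jun-Sep selects the summer table, Oct-May the winter one.
function seasonFor(date: Date): Season {
  const month = date.getMonth(); // 0 = January
  return month >= 5 && month <= 8 ? "summer" : "winter";
}

// "Lunes a Viernes" -> [1, 2, 3, 4, 5]; unknown labels -> [].
function expandDays(label: string): number[] {
  const text = normalizeLabel(label);
  const range = text.match(/^(\S+) a (\S+)$/);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    if (start === undefined || end === undefined) return [];
    const days: number[] = [];
    // Walk forward through the week, wrapping from Saturday to Sunday.
    for (let d = start; ; d = (d + 1) % 7) {
      days.push(d);
      if (d === end) break;
    }
    return days;
  }
  // Single labels, including "Domingos y Festivos": look at the first word only.
  const single = DAY_INDEX[text.split(/\s+/)[0]];
  return single === undefined ? [] : [single];
}

// "10:00h<br>12:30h (familias)" -> ["10:00", "12:30"]
function parseTimes(cellHtml: string): string[] {
  // Drop parenthesised annotations first, then any remaining tags such as <br>.
  const text = cellHtml.replace(/\([^()]*\)/g, " ").replace(/<[^>]+>/g, " ");
  const times: string[] = [];
  const pattern = /\b(\d{1,2}):(\d{2})\s*h\b/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    times.push(`${match[1].padStart(2, "0")}:${match[2]}`);
  }
  return times;
}
```

Locating the `<h1>`, address paragraph, and tables in the page is left to whichever HTML parser the importer uses; these helpers only operate on the extracted strings.
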
### Matching Strategy
1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates
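
A minimal in-memory sketch of this cascade follows. The types and `matchChurch` are hypothetical, not the project's matcher. One substitution is made and should be read as an assumption: because scraped records carry no coordinates before the geocoding pass, step 2 here approximates "proximity" by requiring the same city name.

```typescript
interface ExistingChurch {
  id: number;
  name: string;
  city: string | null;
  horariosMisasId: string | null;
}

interface ScrapedChurch {
  horariosMisasId: string;
  name: string;
  city: string;
}

type MatchResult =
  | { kind: "id" | "name"; church: ExistingChurch }
  | { kind: "new" };

// Lowercase and strip accents before comparing names.
function normalizeName(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

function matchChurch(scraped: ScrapedChurch, existing: ExistingChurch[]): MatchResult {
  // Step 1: exact ID match, which makes re-imports idempotent.
  const byId = existing.find((c) => c.horariosMisasId === scraped.horariosMisasId);
  if (byId) return { kind: "id", church: byId };

  // Step 2 (assumed stand-in for proximity): same normalized name in the same city,
  // considering only churches not already linked to another horariosmisas record.
  const byName = existing.find(
    (c) =>
      c.horariosMisasId === null &&
      c.city !== null &&
      normalizeName(c.city) === normalizeName(scraped.city) &&
      normalizeName(c.name) === normalizeName(scraped.name),
  );
  if (byName) return { kind: "name", church: byName };

  // Step 3: no match, so the caller creates a new church without coordinates.
  return { kind: "new" };
}
```
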
### CLI
```
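# Import every province
npx tsx scripts/import-horariosmisas.ts --all
# Same run in dry-run mode
npx tsx scripts/import-horariosmisas.ts --all --dry-run
# Import a single province
npx tsx scripts/import-horariosmisas.ts --province madrid
# Import everything, then run the geocoding pass
npx tsx scripts/import-horariosmisas.ts --all --geocode
# Run only the geocoding pass
npx tsx scripts/import-horariosmisas.ts --geocode-only
# Resume a full import from entry 5000
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000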
```
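
One way the script might read these flags is sketched below, with no CLI library. `ImportOptions` and `parseImportArgs` are hypothetical names, and the final validation rule (at least one of `--all`, `--province`, or `--geocode-only`) is an assumption, not something the design states.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  province?: string;
  resumeFrom?: number;
}

// Parse the flags shown above; pass it the arguments that follow the script name.
function parseImportArgs(argv: string[]): ImportOptions {
  const options: ImportOptions = { all: false, dryRun: false, geocode: false, geocodeOnly: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--all": options.all = true; break;
      case "--dry-run": options.dryRun = true; break;
      case "--geocode": options.geocode = true; break;
      case "--geocode-only": options.geocodeOnly = true; break;
      case "--province": {
        const value = argv[++i];
        if (value === undefined) throw new Error("--province requires a value");
        options.province = value;
        break;
      }
      case "--resume-from": {
        const value = Number(argv[++i]);
        if (!Number.isInteger(value) || value < 0) {
          throw new Error("--resume-from requires a non-negative integer");
        }
        options.resumeFrom = value;
        break;
      }
      default:
        throw new Error(`Unknown argument: ${arg}`);
    }
  }
  // Assumed rule: the run needs some scope to work on.
  if (!options.all && options.province === undefined && !options.geocodeOnly) {
    throw new Error("Specify --all, --province <name>, or --geocode-only");
  }
  return options;
}
```

In the script this would typically be called as `parseImportArgs(process.argv.slice(2))`.
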
### Rate Limiting
- Import: 1.5s between requests (~10,000 × 1.5s = 15,000s, about 4.2 hours of delay alone, excluding request latency)
- Geocode: 1s between requests (Nominatim public API limit)
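
The delay arithmetic and a fixed-delay loop can be sketched as follows. `estimateHours`, `pause`, and `runThrottled` are hypothetical helpers; the estimate counts only the pauses, so the real run time would be somewhat longer.

```typescript
// Lower bound on run time in hours: pauses only, ignoring request latency.
function estimateHours(requests: number, delaySeconds: number): number {
  return (requests * delaySeconds) / 3600;
}

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run one task per item, sequentially, with a fixed pause between tasks.
async function runThrottled<T, R>(
  items: T[],
  delayMs: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await pause(delayMs); // no pause before the first request
    results.push(await task(items[i]));
  }
  return results;
}
```

With the figures above, `estimateHours(10000, 1.5)` evaluates to 15,000 / 3,600, or roughly 4.17. The same loop would serve both passes, with `delayMs` set to 1500 for the import and 1000 for geocoding.
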
### Scheduler Integration
Add the importer to the imports group in `PIPELINE_GROUPS`, where it runs sequentially after `philmass-import`.