Spain Church Importer (horariosmisas.com) — Design
Overview
Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. Static WordPress site with fully permissive robots.txt and sitemaps. No Playwright needed — simple HTTP + HTML parsing.
Data Source
- Site: https://horariosmisas.com
- Coverage: 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- Data: Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- No coordinates — addresses only. Forward geocoding via Nominatim as a separate pass.
- robots.txt: Fully permissive (User-agent: * / Disallow:)
- Sitemaps: 20 post sitemaps + 7 category sitemaps
Architecture
Two-Pass Approach
Pass 1: Import — Fetch all churches from sitemaps, parse HTML, match against existing Spanish OSM churches, upsert with mass schedules. Unmatched churches created with address but no coordinates.
Pass 2: Geocode — Forward-geocode unmatched churches via Nominatim public API (address → lat/lng). 1 req/sec rate limit.
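As a rough illustration of Pass 2, the request and response handling might look like the sketch below. The function names are hypothetical, and the choice of the `jsonv2` format and the `countrycodes=es` filter is an assumption rather than something this design specifies. Nominatim's usage policy also expects an identifying User-Agent on the actual HTTP call, which is omitted here.

```typescript
// Hypothetical helpers for Pass 2; none of these names come from the codebase.
interface GeocodeResult {
  lat: number;
  lng: number;
}

const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

// Build a free-form search URL restricted to Spain, asking for one best hit.
function buildGeocodeUrl(address: string): string {
  const params = new URLSearchParams({
    q: address,
    format: "jsonv2",
    limit: "1",
    countrycodes: "es", // assumption: restrict results to Spain
  });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

// Nominatim answers with a JSON array; lat and lon arrive as strings.
function parseGeocodeResponse(body: unknown): GeocodeResult | null {
  if (!Array.isArray(body) || body.length === 0) return null;
  const first = body[0] as { lat?: string; lon?: string };
  const lat = Number(first.lat);
  const lng = Number(first.lon);
  if (!Number.isFinite(lat) || !Number.isFinite(lng)) return null;
  return { lat, lng };
}
```

A church whose address returns an empty array simply stays without coordinates and can be retried on a later geocode run.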
Schema Change
Add horariosMisasId String? @unique to Church model (same pattern as philmassId, massSchedulesPhId). Update church matcher and all existing importers.
URL Structure
- Sitemap index: /sitemap_index.xml → 20 post sitemaps
- Church pages: /{province}/{city}/{church-slug}/
- Non-church posts (filtered out): /misas-diarias/, /santos-del-dia/, /oraciones/, etc.
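One way to apply the URL rules above when walking the sitemaps is sketched here. `parseChurchUrl` and `NON_CHURCH_PREFIXES` are hypothetical names, and the prefix list contains only the three examples this document names. The real filter would need whatever the "etc." covers.

```typescript
// Hypothetical exclusion list: only the prefixes named in this design doc.
const NON_CHURCH_PREFIXES = ["misas-diarias", "santos-del-dia", "oraciones"];

interface ChurchPath {
  province: string;
  city: string;
  slug: string;
}

// Accept only /{province}/{city}/{church-slug}/ paths; everything else is skipped.
function parseChurchUrl(url: string): ChurchPath | null {
  let pathname: string;
  try {
    pathname = new URL(url).pathname;
  } catch {
    return null; // not an absolute URL
  }
  const segments = pathname.split("/").filter((s) => s.length > 0);
  if (segments.length !== 3) return null;
  if (NON_CHURCH_PREFIXES.includes(segments[0])) return null;
  const [province, city, slug] = segments;
  return { province, city, slug };
}
```

Returning the parsed segments rather than a boolean lets the same helper serve a `--province` filter without re-parsing the URL.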
HTML Parsing
- Name: <h1>Church Name (City)</h1>, with the (City) suffix stripped
- Address: <p>📌 <strong>Street, PostalCode City (Province)</strong></p>
- Phone: <strong>Teléfono:</strong> <a href="tel:...">...</a>
- Website: <strong>Página Web:</strong> <a href="...">...</a>
- Schedule: <table> with DÍA / HORARIO columns
  - Two seasonal tables: ☀️ Horario de verano and ⛄ Misas en invierno
  - Import seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: HH:MMh (24-hour), multiple per cell via <br>
  - Annotations stripped: (familias), etc.
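The schedule rules above could be implemented roughly as follows. This is a minimal sketch with hypothetical names (`seasonForMonth`, `expandDays`, `parseTimes`). It assumes table cells have already been extracted as strings, and it treats "Domingos y Festivos" as Sunday only, which may not match how holidays are eventually modelled.

```typescript
type Season = "summer" | "winter";
type Day = "mon" | "tue" | "wed" | "thu" | "fri" | "sat" | "sun";

const WEEK: Day[] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];

// Accent-free, lower-case Spanish day names mapped to weekday keys.
const DAY_NAMES: Record<string, Day> = {
  lunes: "mon", martes: "tue", miercoles: "wed", jueves: "thu",
  viernes: "fri", sabado: "sat", sabados: "sat", domingo: "sun", domingos: "sun",
};

// month is 1-12; Jun-Sep counts as summer, Oct-May as winter.
function seasonForMonth(month: number): Season {
  return month >= 6 && month <= 9 ? "summer" : "winter";
}

// Lower-case and strip diacritics so "Miércoles" matches "miercoles".
function normalizeLabel(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

// Expand "Lunes a Viernes" to mon..fri; single labels map to one day.
function expandDays(label: string): Day[] {
  const words = normalizeLabel(label).split(/\s+/);
  if (words.length === 3 && words[1] === "a") {
    const start = WEEK.indexOf(DAY_NAMES[words[0]]);
    const end = WEEK.indexOf(DAY_NAMES[words[2]]);
    return start >= 0 && end >= start ? WEEK.slice(start, end + 1) : [];
  }
  const day = DAY_NAMES[words[0]];
  return day ? [day] : [];
}

// Drop "(familias)"-style annotations, then collect every HH:MM in the cell.
function parseTimes(cellHtml: string): string[] {
  const cleaned = cellHtml.replace(/\([^)]*\)/g, "");
  const pattern = /(\d{1,2}):(\d{2})/g;
  const times: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(cleaned)) !== null) {
    times.push(`${m[1].padStart(2, "0")}:${m[2]}`);
  }
  return times;
}
```

Scanning the whole cell for times, rather than splitting on `<br>` first, keeps the parser tolerant of cells that separate times with commas or spaces instead.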
Matching Strategy
- horariosMisasId exact match (for re-imports)
- Name + proximity against existing Spanish churches (from OSM)
- Unmatched: create new church with address, country=ES, no coordinates
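The cascade above can be pictured as a single function that tries each rule in order. The sketch below is illustrative only. All type and function names are hypothetical, and because scraped churches carry no coordinates, it substitutes a same-city comparison for the proximity check. The existing church matcher may work quite differently.

```typescript
// Hypothetical shapes; the real Church model has many more fields.
interface ExistingChurch {
  id: string;
  name: string;
  city: string | null;
  horariosMisasId: string | null;
}

interface ScrapedChurch {
  horariosMisasId: string;
  name: string;
  city: string;
}

type MatchResult =
  | { kind: "id" | "name"; church: ExistingChurch }
  | { kind: "create" };

// Lower-case, strip diacritics, and collapse punctuation for loose comparison.
function normalizeName(text: string): string {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function matchChurch(scraped: ScrapedChurch, existing: ExistingChurch[]): MatchResult {
  // 1. Exact ID match: this church was imported on an earlier run.
  const byId = existing.find((c) => c.horariosMisasId === scraped.horariosMisasId);
  if (byId) return { kind: "id", church: byId };

  // 2. Name match among unclaimed churches in the same city
  //    (assumption: city equality stands in for the proximity check).
  const name = normalizeName(scraped.name);
  const city = normalizeName(scraped.city);
  const byName = existing.find(
    (c) =>
      c.horariosMisasId === null &&
      c.city !== null &&
      normalizeName(c.city) === city &&
      normalizeName(c.name) === name,
  );
  if (byName) return { kind: "name", church: byName };

  // 3. No match: the caller creates a new church with country=ES and no coordinates.
  return { kind: "create" };
}
```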
CLI
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
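For reference, the flags listed above could be parsed along these lines. This is a dependency-free sketch, not the script's actual argument handling. `CliOptions` and `parseCliArgs` are hypothetical names, and the reading of `--resume-from` as a zero-based offset into the church list is an assumption.

```typescript
interface CliOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  province: string | null;
  resumeFrom: number; // assumption: offset into the ordered list of church URLs
}

// Parse the arguments that follow the script name (process.argv.slice(2)).
function parseCliArgs(argv: string[]): CliOptions {
  const opts: CliOptions = {
    all: false, dryRun: false, geocode: false,
    geocodeOnly: false, province: null, resumeFrom: 0,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--all": opts.all = true; break;
      case "--dry-run": opts.dryRun = true; break;
      case "--geocode": opts.geocode = true; break;
      case "--geocode-only": opts.geocodeOnly = true; break;
      case "--province": {
        const value = argv[++i];
        if (value === undefined) throw new Error("--province requires a value");
        opts.province = value;
        break;
      }
      case "--resume-from": {
        const value = Number(argv[++i]);
        if (!Number.isInteger(value) || value < 0) {
          throw new Error("--resume-from requires a non-negative integer");
        }
        opts.resumeFrom = value;
        break;
      }
      default:
        throw new Error(`Unknown flag: ${arg}`);
    }
  }
  // At least one mode must be selected.
  if (!opts.all && opts.province === null && !opts.geocodeOnly) {
    throw new Error("Specify --all, --province <name>, or --geocode-only");
  }
  return opts;
}
```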
Rate Limiting
- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
- Geocode: 1s between requests (Nominatim public API limit)
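The runtime estimate and the fixed delay between requests can be expressed as a pair of small helpers. This is a sketch with hypothetical names (`estimateHours`, `runThrottled`). It counts only the sleep time, so actual runs will take longer by the sum of the request latencies.

```typescript
// Hours of pure waiting for a given request count and per-request delay.
function estimateHours(requestCount: number, delaySeconds: number): number {
  return (requestCount * delaySeconds) / 3600;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run tasks one at a time, pausing delayMs between consecutive tasks.
async function runThrottled<T, R>(
  items: T[],
  delayMs: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs); // no delay before the first request
    results.push(await task(items[i]));
  }
  return results;
}
```

With these, `estimateHours(10000, 1.5)` reproduces the import figure above (15,000 s, about 4.2 hours), and the geocode pass would use `runThrottled(addresses, 1000, geocodeOne)` with some per-address function `geocodeOne` (also hypothetical).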
Scheduler Integration
Add to PIPELINE_GROUPS imports group (sequential, after philmass-import).