ScraperControl/docs/plans/2026-02-26-horariosmisas-spain-design.md
Albert 2c51513851 chore: sync with Gitea master and restore local-only files
2026-04-12 19:11:22 -04:00

Spain Church Importer (horariosmisas.com) — Design

Overview

Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, a static WordPress site with a fully permissive robots.txt and XML sitemaps. No Playwright is needed: plain HTTP requests plus HTML parsing are enough.

Data Source

  • Site: https://horariosmisas.com
  • Coverage: 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
  • Data: Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
  • No coordinates — addresses only. Forward geocoding via Nominatim as a separate pass.
  • robots.txt: Fully permissive (User-agent: * / Disallow:)
  • Sitemaps: 20 post sitemaps + 7 category sitemaps

Architecture

Two-Pass Approach

Pass 1: Import. Fetch all church pages listed in the sitemaps, parse the HTML, match each church against existing Spanish OSM churches, and upsert it with its mass schedules. Unmatched churches are created with an address but no coordinates.

Pass 2: Geocode. Forward-geocode the unmatched churches via the Nominatim public API (address → lat/lng), rate-limited to 1 request per second.
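
A minimal sketch of the forward-geocoding step in pass 2, assuming Nominatim's public /search endpoint with its q, format, limit, and countrycodes parameters. The function names, the LatLng type, and the User-Agent string are hypothetical and not taken from the project.

```typescript
// Hypothetical helpers; only the Nominatim /search endpoint and its
// q, format, limit, and countrycodes query parameters are real.
const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

interface LatLng {
  lat: number;
  lng: number;
}

// Build the forward-geocoding URL for one address, restricted to Spain.
function buildGeocodeUrl(address: string): string {
  const params = new URLSearchParams({
    q: address,
    format: "jsonv2",
    limit: "1",
    countrycodes: "es",
  });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

// Nominatim returns lat/lon as strings; take the first hit, or null if none.
function parseGeocodeResponse(body: unknown): LatLng | null {
  if (!Array.isArray(body) || body.length === 0) return null;
  const first = body[0] as { lat?: string; lon?: string };
  const lat = Number(first.lat);
  const lng = Number(first.lon);
  return Number.isFinite(lat) && Number.isFinite(lng) ? { lat, lng } : null;
}

// One geocoding request. The caller is responsible for the 1 req/sec spacing.
async function geocode(address: string): Promise<LatLng | null> {
  const res = await fetch(buildGeocodeUrl(address), {
    // Placeholder value: the public API expects an identifying User-Agent.
    headers: { "User-Agent": "church-importer-example/0.1" },
  });
  if (!res.ok) return null;
  return parseGeocodeResponse(await res.json());
}
```

Keeping URL construction and response parsing separate from the network call lets both be unit-tested without touching the public API.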

Schema Change

Add horariosMisasId String? @unique to Church model (same pattern as philmassId, massSchedulesPhId). Update church matcher and all existing importers.

URL Structure

  • Sitemap index: /sitemap_index.xml → 20 post sitemaps
  • Church pages: /{province}/{city}/{church-slug}/
  • Non-church posts (filtered out): /misas-diarias/, /santos-del-dia/, /oraciones/, etc.
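
The URL rules above could be applied roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the exclusion list contains only the three prefixes named above (the real list is longer, per the "etc."), and a church page is assumed to have exactly three path segments.

```typescript
// Prefixes named in the design; the full list of non-church sections is not known here.
const NON_CHURCH_PREFIXES = ["misas-diarias", "santos-del-dia", "oraciones"];

interface ChurchUrlParts {
  province: string;
  city: string;
  slug: string;
}

// Pull every <loc> URL out of a sitemap XML string.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) locs.push(m[1]);
  return locs;
}

// Return the province/city/slug parts for a church URL, or null for anything else.
function parseChurchUrl(url: string): ChurchUrlParts | null {
  const segments = new URL(url).pathname.split("/").filter(Boolean);
  if (segments.length !== 3) return null; // assumed shape: /{province}/{city}/{church-slug}/
  if (NON_CHURCH_PREFIXES.includes(segments[0])) return null;
  const [province, city, slug] = segments;
  return { province, city, slug };
}
```

The province part also gives a natural hook for a per-province filter such as --province madrid.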

HTML Parsing

  • Name: <h1>Church Name (City)</h1> — strip (City) suffix
  • Address: <p>📌 <strong>Street, PostalCode City (Province)</strong></p>
  • Phone: <strong>Teléfono:</strong> <a href="tel:...">...</a>
  • Website: <strong>Página Web:</strong> <a href="...">...</a>
  • Schedule: <table> with DÍA/HORARIO columns
    • Two seasonal tables: ☀️ Horario de verano and ⛄ Misas en invierno
    • Import only the seasonally appropriate table (Oct-May = winter, Jun-Sep = summer)
    • Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
    • Day ranges: "Lunes a Viernes" (Monday-Friday)
    • Time format: HH:MMh (24-hour), multiple per cell via <br>
    • Parenthetical annotations such as (familias) are stripped
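
The text-level parsing rules above can be sketched as small pure functions. All names here are hypothetical, the functions work on strings already extracted from the page (no HTML parser is assumed), and the season is chosen from a month number supplied by the caller.

```typescript
type Season = "summer" | "winter";

// Day labels as listed in the design, in week order.
const WEEK = ["Lunes", "Martes", "Miércoles", "Jueves", "Viernes", "Sábado", "Domingos y Festivos"];

// "Church Name (City)" -> "Church Name"
function stripCitySuffix(h1Text: string): string {
  return h1Text.replace(/\s*\([^()]*\)\s*$/, "").trim();
}

// month is 1-12; Jun-Sep is summer, Oct-May is winter.
function seasonForMonth(month: number): Season {
  return month >= 6 && month <= 9 ? "summer" : "winter";
}

// Extract HH:MM times from one table cell: split on <br>, drop "(...)" annotations.
function parseTimes(cellHtml: string): string[] {
  const text = cellHtml.replace(/<br\s*\/?>/gi, " ").replace(/\([^)]*\)/g, " ");
  const times: string[] = [];
  const re = /(\d{1,2}):(\d{2})\s*h/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) times.push(`${m[1].padStart(2, "0")}:${m[2]}`);
  return times;
}

// "Lunes a Viernes" -> the five weekdays; a single known label -> itself; otherwise [].
function expandDays(label: string): string[] {
  const find = (s: string) => WEEK.findIndex((d) => d.toLowerCase() === s.trim().toLowerCase());
  const range = /^(.+?)\s+a\s+(.+)$/i.exec(label.trim());
  if (range) {
    const start = find(range[1]);
    const end = find(range[2]);
    return start >= 0 && end >= start ? WEEK.slice(start, end + 1) : [];
  }
  const single = find(label);
  return single >= 0 ? [WEEK[single]] : [];
}
```

Labels that do not match the listed day names return an empty array, so the caller can log and skip them instead of guessing.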

Matching Strategy

  1. horariosMisasId exact match (for re-imports)
  2. Name + proximity against existing Spanish churches (from OSM)
  3. Unmatched: create new church with address, country=ES, no coordinates
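
The three-step cascade might look like the sketch below. It runs over in-memory arrays with hypothetical types, not the project's actual matcher or database layer. Because scraped pages carry no coordinates, the sketch compares normalized city names in place of a real distance check; that substitution is an assumption made for illustration, not the design's proximity logic.

```typescript
interface ExistingChurch {
  id: number;
  name: string;
  city: string;
  horariosMisasId: string | null;
}

interface ScrapedChurch {
  horariosMisasId: string;
  name: string;
  city: string;
}

type MatchResult =
  | { kind: "id"; church: ExistingChurch }
  | { kind: "name"; church: ExistingChurch }
  | { kind: "new" };

// Lowercase and strip diacritics and punctuation, so "San Ginés" equals "san gines".
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function matchChurch(scraped: ScrapedChurch, existing: ExistingChurch[]): MatchResult {
  // 1. Exact ID match, which makes re-imports idempotent.
  const byId = existing.find((c) => c.horariosMisasId === scraped.horariosMisasId);
  if (byId) return { kind: "id", church: byId };

  // 2. Name match; same city is used here as a stand-in for proximity (assumption).
  const name = normalizeName(scraped.name);
  const city = normalizeName(scraped.city);
  const byName = existing.find(
    (c) => normalizeName(c.name) === name && normalizeName(c.city) === city,
  );
  if (byName) return { kind: "name", church: byName };

  // 3. No match: the caller creates a new church with country=ES and no coordinates.
  return { kind: "new" };
}
```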

CLI

npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
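
A hand-rolled sketch of how these flags could be parsed. The ImportOptions shape and the function name are hypothetical, and the meaning of each flag is inferred from its name, not taken from the actual script.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  province?: string;
  resumeFrom: number; // assumed to be an index into the list of church URLs
}

function parseImportArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = {
    all: false,
    dryRun: false,
    geocode: false,
    geocodeOnly: false,
    resumeFrom: 0,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--all":
        opts.all = true;
        break;
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--geocode":
        opts.geocode = true;
        break;
      case "--geocode-only":
        opts.geocodeOnly = true;
        break;
      case "--province": {
        const value = argv[++i];
        if (!value) throw new Error("--province requires a value");
        opts.province = value;
        break;
      }
      case "--resume-from": {
        const n = Number(argv[++i]);
        if (!Number.isInteger(n) || n < 0) throw new Error("--resume-from requires a non-negative integer");
        opts.resumeFrom = n;
        break;
      }
      default:
        throw new Error(`Unknown flag: ${arg}`);
    }
  }
  return opts;
}
```

In a script this would be called as parseImportArgs(process.argv.slice(2)). Rejecting unknown flags means a typo fails immediately, before a multi-hour run starts.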

Rate Limiting

  • Import: 1.5s between requests (~10,000 × 1.5s = 15,000s ≈ 4.2 hours of delay alone, excluding request time)
  • Geocode: 1s between requests (Nominatim public API limit)
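
A small sketch of the delay arithmetic and a sequential throttled loop. The helper names are hypothetical, and the startIndex parameter reflects an assumed reading of --resume-from as an index into the URL list.

```typescript
// Lower bound on run time, in hours, from inter-request delays alone.
function estimateHours(requests: number, delaySeconds: number): number {
  return (requests * delaySeconds) / 3600;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run task over items one at a time, pausing delayMs between calls.
// startIndex lets an interrupted run pick up where it stopped.
async function runThrottled<T>(
  items: T[],
  delayMs: number,
  task: (item: T, index: number) => Promise<void>,
  startIndex = 0,
): Promise<void> {
  for (let i = startIndex; i < items.length; i++) {
    await task(items[i], i);
    if (i < items.length - 1) await sleep(delayMs); // no pause after the last item
  }
}
```

With these helpers, estimateHours(10000, 1.5) gives about 4.17, which matches the import figure above; the same loop with delayMs = 1000 would serve the geocode pass.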

Scheduler Integration

Add the importer to the imports group in PIPELINE_GROUPS (runs sequentially, after philmass-import).