chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (the new source of truth) and restore local-only files that aren't tracked in Gitea: web scrapers, admin dashboard, ChromaDB integration, debug scripts, and utility libraries.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes, bohosluzby, kerknet, gottesdienstzeiten, and miserend importers; ClaimRequest model; forward geocoding; heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
72
docs/plans/2026-02-26-horariosmisas-spain-design.md
Normal file
@@ -0,0 +1,72 @@
# Spain Church Importer (horariosmisas.com) — Design
## Overview

Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. It is a static WordPress site with a fully permissive robots.txt and sitemaps, so Playwright is not needed; plain HTTP requests and HTML parsing are enough.

## Data Source

- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed by the site; ~10,000 appear in the sitemaps, across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **Coordinates:** None; the site provides addresses only. Forward geocoding via Nominatim runs as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps

## Architecture

### Two-Pass Approach

**Pass 1: Import.** Fetch all church pages listed in the sitemaps, parse the HTML, match each church against existing Spanish OSM churches, and upsert it with its mass schedules. Unmatched churches are created with an address but no coordinates.

**Pass 2: Geocode.** Forward-geocode the unmatched churches via the Nominatim public API (`address → lat/lng`), rate-limited to 1 request per second.

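As a rough illustration of Pass 2, the sketch below builds a Nominatim search request for one address and extracts the first hit. It assumes Node 18+ (global `fetch`); the function names, the `countrycodes=es` restriction, and the choice of the `jsonv2` format are illustrative assumptions, not part of this design. The Nominatim usage policy asks for an identifying `User-Agent`, so the caller supplies one.

```typescript
// Hypothetical helpers for Pass 2; names and option choices are assumptions.

// Only the fields used here; Nominatim returns lat/lon as strings.
interface NominatimResult {
  lat: string;
  lon: string;
  display_name: string;
}

// Build a free-text search URL restricted to Spain, returning at most one hit.
function buildGeocodeUrl(address: string): string {
  const url = new URL("https://nominatim.openstreetmap.org/search");
  url.searchParams.set("q", address);
  url.searchParams.set("format", "jsonv2");
  url.searchParams.set("countrycodes", "es");
  url.searchParams.set("limit", "1");
  return url.toString();
}

// Convert the first result to numeric coordinates, or null if there is none.
function firstCoordinates(
  results: NominatimResult[],
): { lat: number; lng: number } | null {
  if (results.length === 0) return null;
  const lat = Number(results[0].lat);
  const lng = Number(results[0].lon);
  return Number.isFinite(lat) && Number.isFinite(lng) ? { lat, lng } : null;
}

// One geocoding call; the caller is responsible for the 1 req/sec spacing.
async function geocode(address: string, userAgent: string) {
  const response = await fetch(buildGeocodeUrl(address), {
    headers: { "User-Agent": userAgent },
  });
  if (!response.ok) return null;
  return firstCoordinates((await response.json()) as NominatimResult[]);
}
```

Keeping URL construction and result parsing separate from the network call makes both easy to check in a dry run without touching the public API.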
### Schema Change

Add `horariosMisasId String? @unique` to the `Church` model, following the same pattern as `philmassId` and `massSchedulesPhId`. Update the church matcher and all existing importers accordingly.

### URL Structure

- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.

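The URL rules above could be applied to sitemap entries roughly as follows. This is a minimal sketch: `parseChurchUrl` and `NON_CHURCH_SECTIONS` are hypothetical names, the three-segment test is an assumption drawn from the `/{province}/{city}/{church-slug}/` pattern, and the exclusion set is incomplete because the list above ends in "etc.".

```typescript
// Sections named in the design as non-church posts. The source list is
// open-ended ("etc."), so this set is known to be incomplete.
const NON_CHURCH_SECTIONS = new Set([
  "misas-diarias",
  "santos-del-dia",
  "oraciones",
]);

interface ChurchUrlParts {
  province: string;
  city: string;
  slug: string;
}

// Returns the path parts for a church page, or null for anything else.
// Assumption: church pages have exactly three path segments.
function parseChurchUrl(rawUrl: string): ChurchUrlParts | null {
  const segments = new URL(rawUrl).pathname.split("/").filter(Boolean);
  if (segments.length !== 3) return null;
  const [province, city, slug] = segments;
  if (NON_CHURCH_SECTIONS.has(province)) return null;
  return { province, city, slug };
}

// Example with a made-up church URL (not taken from the real sitemap).
console.log(parseChurchUrl("https://horariosmisas.com/madrid/madrid/parroquia-ejemplo/"));
console.log(parseChurchUrl("https://horariosmisas.com/oraciones/"));
```

Returning the parsed parts rather than a boolean means the `province` value can also drive a per-province filter such as the `--province` flag.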
### HTML Parsing

- **Name:** `<h1>Church Name (City)</h1>`; strip the `(City)` suffix
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import the seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple times per cell separated by `<br>`
  - Annotations such as `(familias)` are stripped

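The schedule rules above can be sketched as three small pure functions. All names here are hypothetical, and the day numbering (0 = Sunday through 6 = Saturday, with "Domingos y Festivos" mapped to Sunday) is an assumption rather than something this design specifies.

```typescript
type Season = "summer" | "winter";

// Jun-Sep is summer, Oct-May is winter. getMonth() is 0-based (June = 5).
function seasonFor(date: Date): Season {
  const month = date.getMonth();
  return month >= 5 && month <= 8 ? "summer" : "winter";
}

// Assumed numbering: 0 = Sunday ... 6 = Saturday.
const DAY_INDEX: Record<string, number> = {
  lunes: 1,
  martes: 2,
  "miércoles": 3,
  jueves: 4,
  viernes: 5,
  "sábado": 6,
  domingos: 0,
};

// "Lunes a Viernes" expands to a range; other labels are looked up by their
// first word, so "Domingos y Festivos" resolves to Sunday.
function expandDays(label: string): number[] {
  const text = label.trim().toLowerCase();
  const range = text.match(/^(\S+)\s+a\s+(\S+)$/);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    if (start === undefined || end === undefined || start > end) return [];
    const days: number[] = [];
    for (let d = start; d <= end; d++) days.push(d);
    return days;
  }
  const single = DAY_INDEX[text.split(/\s+/)[0]];
  return single === undefined ? [] : [single];
}

// Split a table cell on <br>, drop parenthesised annotations, and
// normalise each "HH:MMh" entry to zero-padded "HH:MM".
function parseTimes(cellHtml: string): string[] {
  return cellHtml
    .split(/<br\s*\/?>/i)
    .map((part) => part.replace(/\([^)]*\)/g, "").match(/(\d{1,2}):(\d{2})\s*h?/))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => `${m[1].padStart(2, "0")}:${m[2]}`);
}
```

In this sketch a range that wraps past Saturday (for example one ending on Sunday) returns an empty list, so such labels would need explicit handling if they occur on the site.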
### Matching Strategy

1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates

### CLI

```
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```

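One way the flags above might be parsed is with Node's built-in `parseArgs` from `node:util` (available since Node 18.3). This is only a sketch: the `ImportOptions` shape and the function name are hypothetical, and treating `--resume-from` as a zero-based position in the list of church URLs is an assumption, since the design does not define its unit.

```typescript
import { parseArgs } from "node:util";

// Hypothetical options object; field names are illustrative.
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  province?: string;
  geocode: boolean;
  geocodeOnly: boolean;
  resumeFrom: number; // assumed to be an index into the URL list
}

function parseImportArgs(argv: string[]): ImportOptions {
  // parseArgs is strict by default, so unknown flags throw.
  const { values } = parseArgs({
    args: argv,
    options: {
      all: { type: "boolean" },
      "dry-run": { type: "boolean" },
      province: { type: "string" },
      geocode: { type: "boolean" },
      "geocode-only": { type: "boolean" },
      "resume-from": { type: "string" },
    },
  });

  const resumeFrom = Number(values["resume-from"] ?? "0");
  if (!Number.isInteger(resumeFrom) || resumeFrom < 0) {
    throw new Error("--resume-from expects a non-negative integer");
  }

  return {
    all: values.all ?? false,
    dryRun: values["dry-run"] ?? false,
    province: values.province,
    geocode: values.geocode ?? false,
    geocodeOnly: values["geocode-only"] ?? false,
    resumeFrom,
  };
}

// In a script this would typically be called as:
// const options = parseImportArgs(process.argv.slice(2));
```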
### Rate Limiting

- Import: 1.5 s between requests (~10,000 × 1.5 s ≈ 4.2 hours)
- Geocode: 1 s between requests (Nominatim public API limit)

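The fixed delays above amount to a sequential loop with a pause between calls. A minimal sketch, with hypothetical helper names, that also reproduces the duration estimate:

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `task` over `items` one at a time, waiting `delayMs` between
// consecutive calls (no wait before the first one).
async function runThrottled<T, R>(
  items: readonly T[],
  delayMs: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs);
    results.push(await task(items[i]));
  }
  return results;
}

// Approximate wall-clock time from delays alone, ignoring request latency.
function estimateHours(count: number, delayMs: number): number {
  return (count * delayMs) / 3_600_000;
}

// 10,000 requests at 1.5 s each come to roughly 4.17 hours.
console.log(estimateHours(10_000, 1_500).toFixed(2));
```

Because the estimate ignores the time each request itself takes, real runs would be somewhat longer than the figure above.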
### Scheduler Integration

Add the importer to the imports group in `PIPELINE_GROUPS`, where it runs sequentially after `philmass-import`.