chore: sync with Gitea master and restore local-only files

Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Albert
2026-04-12 19:11:22 -04:00
parent 76cca3ba75
commit 2c51513851
133 changed files with 30381 additions and 0 deletions

@@ -0,0 +1,72 @@
# Spain Church Importer (horariosmisas.com) — Design
## Overview
Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. The site is a static WordPress site with a fully permissive robots.txt and XML sitemaps, so plain HTTP requests plus HTML parsing are enough; no Playwright is needed.
## Data Source
- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **No coordinates** — addresses only. Forward geocoding via Nominatim as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps
## Architecture
### Two-Pass Approach
**Pass 1: Import.** Fetch every church URL from the sitemaps, parse the HTML, match against existing Spanish OSM churches, and upsert with mass schedules. Unmatched churches are created with an address but no coordinates.
**Pass 2: Geocode.** Forward-geocode the unmatched churches via the Nominatim public API (`address → lat/lng`), rate-limited to 1 request per second.
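A minimal sketch of the Pass 2 request and response handling, assuming the standard Nominatim `/search` endpoint with its `q`, `format`, `limit`, and `countrycodes` parameters. The function names and the `LatLng` shape are hypothetical and are not taken from the importer.

```typescript
// Sketch only: helper names and the LatLng shape are assumptions for illustration.
const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

interface LatLng {
  lat: number;
  lng: number;
}

// Builds the forward-geocoding URL for one free-text address.
function buildGeocodeUrl(address: string): string {
  const url = new URL(NOMINATIM_SEARCH);
  url.searchParams.set("q", address);
  url.searchParams.set("format", "jsonv2");
  url.searchParams.set("limit", "1");
  url.searchParams.set("countrycodes", "es"); // restrict results to Spain
  return url.toString();
}

// Nominatim returns an array of hits whose `lat` and `lon` fields are strings.
function parseGeocodeResponse(body: unknown): LatLng | null {
  if (!Array.isArray(body) || body.length === 0) return null;
  const first = body[0] as { lat?: unknown; lon?: unknown };
  if (typeof first.lat !== "string" || typeof first.lon !== "string") return null;
  const lat = parseFloat(first.lat);
  const lng = parseFloat(first.lon);
  if (isNaN(lat) || isNaN(lng)) return null;
  return { lat, lng };
}
```

The public Nominatim usage policy also asks clients to send an identifying `User-Agent`, so the actual fetch call would likely need to set one.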
### Schema Change
Add `horariosMisasId String? @unique` to Church model (same pattern as `philmassId`, `massSchedulesPhId`). Update church matcher and all existing importers.
### URL Structure
- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.
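The URL rules above could be applied to sitemap entries roughly as follows. This is a sketch: `parseChurchUrl` and `NON_CHURCH_PREFIXES` are hypothetical names, the prefix list contains only the three examples named here (the design says "etc."), and treating a church page as exactly three path segments is an assumption drawn from the `/{province}/{city}/{church-slug}/` pattern.

```typescript
// Sketch only: names are hypothetical; the prefix list is incomplete by design.
const NON_CHURCH_PREFIXES = ["misas-diarias", "santos-del-dia", "oraciones"];

interface ChurchUrlParts {
  province: string;
  city: string;
  slug: string;
}

// Returns the URL parts for a church page, or null for anything else.
function parseChurchUrl(url: string): ChurchUrlParts | null {
  const segments = new URL(url).pathname.split("/").filter((s) => s.length > 0);
  // Assumption: church pages have exactly three segments (province, city, slug).
  if (segments.length !== 3) return null;
  // Reject known non-church sections even if they happen to be three levels deep.
  if (NON_CHURCH_PREFIXES.some((prefix) => prefix === segments[0])) return null;
  return { province: segments[0], city: segments[1], slug: segments[2] };
}
```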
### HTML Parsing
- **Name:** `<h1>Church Name (City)</h1>` — strip `(City)` suffix
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import the seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
  - Annotations such as `(familias)` are stripped
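The parsing rules above can be sketched as a few pure helpers. All function names are hypothetical. The sketch assumes day indices follow the JavaScript `Date.getDay()` convention (0 = Sunday) and maps "Domingos y Festivos" to Sunday only, since holidays have no fixed weekday; the real importer may handle both differently.

```typescript
// Sketch only: helper names and the day-index convention are assumptions.
const DAY_INDEX: Record<string, number> = {
  lunes: 1,
  martes: 2,
  "miércoles": 3,
  jueves: 4,
  viernes: 5,
  "sábado": 6,
};

// "Church Name (City)" -> "Church Name"
function stripCitySuffix(heading: string): string {
  return heading.replace(/\s*\([^()]*\)\s*$/, "").trim();
}

// Jun-Sep selects the summer table, Oct-May the winter table.
function pickSeason(date: Date): "summer" | "winter" {
  const month = date.getMonth() + 1; // 1 = January
  return month >= 6 && month <= 9 ? "summer" : "winter";
}

// Expands a DÍA cell into day indices; unknown labels yield an empty list.
function expandDays(label: string): number[] {
  const text = label.trim().toLowerCase();
  if (text.indexOf("domingo") === 0) return [0]; // "Domingos y Festivos"
  const range = text.split(/\s+a\s+/); // "lunes a viernes"
  if (range.length === 2) {
    const start = DAY_INDEX[range[0]];
    const end = DAY_INDEX[range[1]];
    if (start === undefined || end === undefined || start > end) return [];
    const days: number[] = [];
    for (let d = start; d <= end; d++) days.push(d);
    return days;
  }
  const single = DAY_INDEX[text];
  return single === undefined ? [] : [single];
}

// Splits a HORARIO cell on <br>, drops annotations, and returns "HH:MM" strings.
function parseTimes(cellHtml: string): string[] {
  const times: string[] = [];
  for (const part of cellHtml.split(/<br\s*\/?>/i)) {
    const cleaned = part.replace(/\([^)]*\)/g, ""); // e.g. "(familias)"
    const match = /(\d{1,2}):(\d{2})\s*h/i.exec(cleaned);
    if (match === null) continue;
    const hour = match[1].length === 1 ? "0" + match[1] : match[1];
    times.push(hour + ":" + match[2]);
  }
  return times;
}
```

For example, a cell containing `10:00h<br>12:30h (familias)` would yield `["10:00", "12:30"]`, and "Lunes a Viernes" would expand to `[1, 2, 3, 4, 5]`.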
### Matching Strategy
1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates
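The three-step cascade might look like the sketch below. Apart from `horariosMisasId`, every type and field name is hypothetical. Step 2 is left as a callback because this design does not spell out how name and proximity are compared.

```typescript
// Sketch only: types and field names other than horariosMisasId are assumptions.
interface ExistingChurch {
  id: number;
  name: string;
  horariosMisasId: string | null;
}

interface ScrapedChurch {
  horariosMisasId: string;
  name: string;
  address: string;
}

interface NewChurchData {
  horariosMisasId: string;
  name: string;
  address: string;
  country: "ES";
  latitude: null;
  longitude: null;
}

type MatchResult =
  | { kind: "id"; church: ExistingChurch }
  | { kind: "name"; church: ExistingChurch }
  | { kind: "create"; data: NewChurchData };

type NameMatcher = (scraped: ScrapedChurch, candidates: ExistingChurch[]) => ExistingChurch | null;

function matchChurch(scraped: ScrapedChurch, existing: ExistingChurch[], byName: NameMatcher): MatchResult {
  // 1. Exact ID match, which makes re-imports idempotent.
  for (const church of existing) {
    if (church.horariosMisasId === scraped.horariosMisasId) return { kind: "id", church };
  }
  // 2. Name + proximity, delegated to the caller-supplied matcher.
  const named = byName(scraped, existing);
  if (named !== null) return { kind: "name", church: named };
  // 3. No match: create a new record with an address but no coordinates.
  return {
    kind: "create",
    data: {
      horariosMisasId: scraped.horariosMisasId,
      name: scraped.name,
      address: scraped.address,
      country: "ES",
      latitude: null,
      longitude: null,
    },
  };
}
```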
### CLI
```
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```
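A sketch of how the script could read these flags. The `ImportOptions` shape and `parseImportArgs` are hypothetical, and the sketch only records which flags were passed; what each flag does is up to the importer.

```typescript
// Sketch only: option names mirror the flags above; the parser itself is an assumption.
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  province: string | null;
  resumeFrom: number;
}

function parseImportArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = {
    all: false,
    dryRun: false,
    geocode: false,
    geocodeOnly: false,
    province: null,
    resumeFrom: 0,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--all") opts.all = true;
    else if (arg === "--dry-run") opts.dryRun = true;
    else if (arg === "--geocode") opts.geocode = true;
    else if (arg === "--geocode-only") opts.geocodeOnly = true;
    else if (arg === "--province") {
      const value = argv[++i];
      if (value === undefined) throw new Error("--province requires a value");
      opts.province = value;
    } else if (arg === "--resume-from") {
      const value = Number(argv[++i]);
      // Reject missing, negative, or fractional offsets.
      if (isNaN(value) || value < 0 || Math.floor(value) !== value) {
        throw new Error("--resume-from requires a non-negative integer");
      }
      opts.resumeFrom = value;
    } else {
      throw new Error("Unknown argument: " + arg);
    }
  }
  return opts;
}
```

In a Node script this would typically be called as `parseImportArgs(process.argv.slice(2))`.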
### Rate Limiting
- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
- Geocode: 1s between requests (Nominatim public API limit)
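The runtime estimate and the fixed delay between requests can be sketched as below. `estimateHours`, `sleep`, and `forEachThrottled` are hypothetical helpers; the sketch assumes strictly sequential requests and ignores the time each request itself takes.

```typescript
// Sketch only: helper names are assumptions; request latency is not counted.
function estimateHours(requests: number, delaySeconds: number): number {
  return (requests * delaySeconds) / 3600;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` on each item in order, pausing `delayMs` between consecutive items.
async function forEachThrottled<T>(
  items: T[],
  delayMs: number,
  task: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i++) {
    await task(items[i]);
    if (i < items.length - 1) await sleep(delayMs);
  }
}
```

With the import figures above, `estimateHours(10000, 1.5)` gives 15,000 s / 3,600, or about 4.17 hours. The geocode pass would use `delayMs = 1000`, and its duration depends on how many churches remain unmatched after Pass 1.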
### Scheduler Integration
Add the importer to the imports group in `PIPELINE_GROUPS` (sequential, after `philmass-import`).