ScraperControl/docs/plans/2026-02-26-horariosmisas-spain.md
Albert 2c51513851 chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-12 19:11:22 -04:00

# Spain Church Importer (horariosmisas.com) — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.
**Architecture:** Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.
**Tech Stack:** TypeScript, Prisma, HTML parsing (regex — no Playwright), Nominatim geocoding API.
---
## Task 1: Add `horariosMisasId` to Prisma Schema
**Files:**
- Modify: `prisma/schema.prisma`
**Step 1: Add field and index**
After the `philmassId` line (around line 38), add:
```prisma
horariosMisasId String? @unique @map("horarios_misas_id") // horariosmisas.com URL slug
```
Then add an index in the `@@index` block (around line 78). Note that `@unique` already implies a unique index, so this entry is optional:
```prisma
@@index([horariosMisasId])
```
**Step 2: Push schema to NAS database**
```bash
npx prisma db push --accept-data-loss
```
Expected: `Your database is now in sync with your Prisma schema.`
**Step 3: Regenerate Prisma client**
```bash
npx prisma generate
```
**Step 4: Push schema to Neon production**
```bash
npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss
```
**Step 5: Commit**
```bash
git add prisma/schema.prisma
git commit -m "feat: add horariosMisasId to Church model for Spain import"
```
---
## Task 2: Extend Church Matcher and Existing Importers
**Files:**
- Modify: `src/lib/church-matcher.ts`
- Modify: `scripts/import-osm-churches.ts`
- Modify: `scripts/import-gcatholic.ts`
- Modify: `scripts/import-baidu-churches.ts`
- Modify: `scripts/import-osm-region.ts`
- Modify: `scripts/import-orarimesse.ts`
- Modify: `scripts/import-mass-schedules-ph.ts`
- Modify: `scripts/import-philmass.ts`
### Step 1: Update church-matcher.ts
In `ExistingChurch` interface (line ~11-26), add after `philmassId`:
```typescript
horariosMisasId: string | null;
```
In `ChurchCandidate` type (line ~113-122), add after `philmassId`:
```typescript
horariosMisasId?: string;
```
In `findDuplicateChurch()`, add a new pass immediately after the fifth pass (the `philmassId` match, line ~169-175) and before the proximity+name pass:
```typescript
// Sixth pass: exact horariosMisasId match
if (candidate.horariosMisasId) {
  const horariosMisasMatch = existingChurches.find(
    (church) => church.horariosMisasId === candidate.horariosMisasId
  );
  if (horariosMisasMatch) return horariosMisasMatch;
}
```
Update the comment on the proximity pass to say "Seventh pass".
### Step 2: Update all existing importers
In every importer that queries churches with a `select` clause containing `philmassId: true`, add:
```typescript
horariosMisasId: true,
```
In every importer that creates/pushes churches with `philmassId: null`, add:
```typescript
horariosMisasId: null,
```
**Files to update:** `import-osm-churches.ts`, `import-gcatholic.ts`, `import-baidu-churches.ts`, `import-osm-region.ts`, `import-orarimesse.ts`, `import-mass-schedules-ph.ts`, `import-philmass.ts`
### Step 3: Verify build
```bash
npx tsc --noEmit
```
Expected: No errors.
### Step 4: Commit
```bash
git add src/lib/church-matcher.ts scripts/import-*.ts
git commit -m "feat: add horariosMisasId to church matcher and all importers"
```
---
## Task 3: Create `import-horariosmisas.ts`
**Files:**
- Create: `scripts/import-horariosmisas.ts`
### Architecture
This importer follows the same overall structure as `scripts/import-mass-schedules-ph.ts`. Key differences:
- **Sitemap:** Fetches 20 post sitemaps from sitemap index (not a single sitemap)
- **URL filtering:** Church URLs have 3 path segments (`/{province}/{city}/{slug}/`). Non-church URLs (blog posts, daily readings) are filtered out.
- **Schedule parsing:** Each page has two seasonal tables (summer/winter). Import the seasonally appropriate one based on the current month (Oct-May = winter, Jun-Sep = summer).
- **Day names:** Spanish (`Lunes`, `Martes`, etc.) with range support (`Lunes a Viernes`)
- **Times:** 24-hour `HH:MMh` format (e.g., `08:00h`, `20:30h`)
- **No coordinates:** Churches created with `latitude: 0, longitude: 0` — geocoded separately
- **Geocoding:** Optional `--geocode` flag uses Nominatim public API (1 req/sec)
### Constants
```typescript
const SITE_BASE = 'https://horariosmisas.com';
const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
```
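A minimal sketch of how the two delay constants might be enforced, assuming the importer issues requests sequentially. `msUntilNextRequest` and `createPoliteFetcher` are hypothetical names, not functions from the existing codebase; the sketch relies on the global `fetch` available in Node 18+.

```typescript
// Hypothetical helper: milliseconds to wait so that consecutive requests
// are at least `delayMs` apart. Pure function, so it is easy to unit test.
function msUntilNextRequest(
  lastRequestAt: number | null,
  delayMs: number,
  now: number
): number {
  if (lastRequestAt === null) return 0; // first request goes out immediately
  return Math.max(0, lastRequestAt + delayMs - now);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: one fetcher per host, e.g. one created with
// REQUEST_DELAY_MS for the site and one with NOMINATIM_DELAY_MS for geocoding.
// Assumes calls are awaited one at a time (no concurrent requests).
function createPoliteFetcher(delayMs: number, userAgent: string) {
  let lastRequestAt: number | null = null;

  return async function politeFetch(url: string): Promise<string> {
    await sleep(msUntilNextRequest(lastRequestAt, delayMs, Date.now()));
    lastRequestAt = Date.now();
    const response = await fetch(url, { headers: { 'User-Agent': userAgent } });
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
    return response.text();
  };
}
```

Keeping the wait calculation separate from the network call means the spacing logic can be tested without hitting either site.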
### Spanish Day Mapping
```typescript
const DAY_MAP: Record<string, number[]> = {
  'domingos y festivos': [0],
  'domingos': [0],
  'domingo': [0],
  'lunes': [1],
  'martes': [2],
  'miércoles': [3],
  'miercoles': [3],
  'jueves': [4],
  'viernes': [5],
  'sábado': [6],
  'sabado': [6],
  'sábados': [6],
  'sabados': [6],
};
```
### Sitemap Fetching
1. Fetch sitemap index → extract `post-sitemap*.xml` URLs
2. Fetch each post sitemap → extract URLs with exactly 3 path segments
3. Filter out non-church URLs (patterns: `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, `/noticias/`, `/blog/`, `/contacto/`, `/aviso-legal/`, `/politica-de-privacidad/`, `/politica-de-cookies/`)
4. Deduplicate by slug
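The four steps above can be sketched as pure functions over already-fetched XML. This is a sketch under assumptions: the function names are hypothetical, the slug is taken to be the third path segment, and `<loc>` values are read with a regex in keeping with the plan's no-parser approach.

```typescript
const EXCLUDED_PATTERNS = [
  '/misas-diarias/', '/santos-del-dia/', '/oraciones/', '/noticias/', '/blog/',
  '/contacto/', '/aviso-legal/', '/politica-de-privacidad/', '/politica-de-cookies/',
];

// Extract every <loc> value from a sitemap or sitemap index document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) locs.push(match[1]);
  return locs;
}

// Step 1: keep only the post sitemaps listed in the sitemap index.
function postSitemapUrls(indexXml: string): string[] {
  return extractLocs(indexXml).filter((loc) => /\/post-sitemap\d*\.xml$/.test(loc));
}

// Steps 2-3: return the slug for a church URL, or null for anything else.
function churchSlug(url: string): string | null {
  let pathname: string;
  try {
    pathname = new URL(url).pathname;
  } catch {
    return null; // not an absolute URL
  }
  const path = pathname.endsWith('/') ? pathname : `${pathname}/`;
  if (EXCLUDED_PATTERNS.some((pattern) => path.includes(pattern))) return null;
  const segments = path.split('/').filter(Boolean);
  return segments.length === 3 ? segments[2] : null;
}

// Step 4: deduplicate by slug, keeping the first URL seen for each slug.
function collectChurchUrls(urls: string[]): Map<string, string> {
  const bySlug = new Map<string, string>();
  for (const url of urls) {
    const slug = churchSlug(url);
    if (slug !== null && !bySlug.has(slug)) bySlug.set(slug, url);
  }
  return bySlug;
}
```

If two churches in different cities can share a slug, keying on the slug alone would drop one of them; keying on the full path would be a safer variant in that case.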
### HTML Parsing
**Church name:** `<h1>Church Name (City)</h1>` → strip `(City)` suffix
**Address:** `📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong>` → extract street, postal code (5-digit `\b\d{5}\b`), city (text after postal code), strip `(Province)` suffix
**Phone:** `<strong>Teléfono:</strong> <a href="tel:...">number</a>`
**Website:** `<strong>Página Web:</strong> <a href="url">...</a>`
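A minimal sketch of the name and address rules, assuming the inner text of the `<h1>` and address `<strong>` elements has already been extracted. `parseChurchName`, `parseAddress`, and `ParsedAddress` are hypothetical names.

```typescript
interface ParsedAddress {
  street: string;
  postalCode: string | null;
  city: string | null;
}

// Matches a trailing parenthesised suffix such as "(Madrid)".
const TRAILING_PARENS = /\s*\([^()]*\)\s*$/;

// "Church Name (City)" -> "Church Name"
function parseChurchName(h1Text: string): string {
  return h1Text.replace(TRAILING_PARENS, '').trim();
}

// "Calle Goya, 26 28001 Madrid (Madrid)" -> street, postal code, city
function parseAddress(raw: string): ParsedAddress {
  const withoutProvince = raw.replace(TRAILING_PARENS, '').trim();
  const postal = /\b\d{5}\b/.exec(withoutProvince);
  if (postal === null) {
    // No postal code found: keep the whole string as the street.
    return { street: withoutProvince, postalCode: null, city: null };
  }
  const street = withoutProvince.slice(0, postal.index).replace(/[,\s]+$/, '');
  const city = withoutProvince.slice(postal.index + postal[0].length).trim();
  return { street, postalCode: postal[0], city: city === '' ? null : city };
}
```

The sketch takes the first 5-digit run as the postal code, so an address containing another 5-digit number before it would be split in the wrong place.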
**Schedule tables:** Find `<table>` elements with DÍA/HORARIO headers. Split by seasonal headings (☀️ verano / ⛄ invierno). Pick seasonally appropriate section (Oct-May = winter, Jun-Sep = summer). Parse `<td>` cells: first cell = day name(s), second cell = times. Times in `HH:MMh` format extracted via regex `(\d{1,2}):(\d{2})\s*h?`.
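The season rule and the time regex above can be sketched as follows. The function names are hypothetical, and normalizing to zero-padded `HH:MM` is an assumption about the stored format.

```typescript
type Season = 'summer' | 'winter';

// month is 1-12. Jun-Sep = summer, Oct-May = winter.
function seasonForMonth(month: number): Season {
  return month >= 6 && month <= 9 ? 'summer' : 'winter';
}

// Extract all times from a schedule cell such as "08:00h, 12:30h y 20:30h".
function extractTimes(cellText: string): string[] {
  const times: string[] = [];
  const pattern = /(\d{1,2}):(\d{2})\s*h?/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(cellText)) !== null) {
    const hour = Number(match[1]);
    const minute = Number(match[2]);
    if (hour > 23 || minute > 59) continue; // skip values that are not clock times
    times.push(`${String(hour).padStart(2, '0')}:${match[2]}`);
  }
  return times;
}
```

Usage: `seasonForMonth(new Date().getMonth() + 1)` selects the section, since `getMonth()` is zero-based.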
### Day Range Resolution
Support ranges like `Lunes a Viernes` → [1,2,3,4,5] and compound entries like `Lunes, Miércoles y Viernes` → [1,3,5].
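One way to resolve ranges and compound entries, sketched with a hypothetical `resolveDays` function. It strips accents and a trailing plural `s` instead of looking up `DAY_MAP` directly, which is an assumption made to keep the example self-contained; unknown tokens such as `festivos` are ignored.

```typescript
const DAY_INDEX = new Map<string, number>([
  ['domingo', 0], ['lunes', 1], ['martes', 2], ['miercoles', 3],
  ['jueves', 4], ['viernes', 5], ['sabado', 6],
]);

// Look up a single day name, ignoring case, accents, and plural forms.
function dayIndex(name: string): number | undefined {
  const normalized = name
    .trim()
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, ''); // strip combining accents
  return DAY_INDEX.get(normalized) ?? DAY_INDEX.get(normalized.replace(/s$/, ''));
}

// "Lunes a Viernes" -> [1,2,3,4,5]; "Lunes, Miércoles y Viernes" -> [1,3,5]
function resolveDays(label: string): number[] {
  const days = new Set<number>();
  // Split compound entries on commas and the conjunction "y".
  for (const part of label.split(/\s*,\s*|\s+y\s+/i)) {
    const range = part.split(/\s+a\s+/i);
    if (range.length === 2) {
      const start = dayIndex(range[0]);
      const end = dayIndex(range[1]);
      if (start === undefined || end === undefined) continue;
      // Walk forward from start to end, wrapping from Saturday to Sunday.
      for (let day = start; ; day = (day + 1) % 7) {
        days.add(day);
        if (day === end) break;
      }
    } else {
      const day = dayIndex(part);
      if (day !== undefined) days.add(day);
    }
  }
  return Array.from(days).sort((a, b) => a - b);
}
```

Because Sunday is 0, a range like `Lunes a Domingo` wraps around the week and resolves to all seven days.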
### Geocoding (--geocode / --geocode-only)
Query Nominatim with: `{address}, Spain` → fallback to `{postalCode} {city}, Spain` → fallback to `{city}, Spain`. Use `countrycodes=es` parameter. Max 1 req/sec.
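The fallback chain can be sketched as an ordered list of query strings, each tried until one returns a result. `buildGeocodeQueries`, `buildNominatimUrl`, and `GeocodeInput` are hypothetical names; `q`, `format`, `limit`, and `countrycodes` are standard Nominatim search parameters.

```typescript
interface GeocodeInput {
  address?: string | null;
  postalCode?: string | null;
  city?: string | null;
}

// Build the queries in fallback order, most specific first.
function buildGeocodeQueries(input: GeocodeInput): string[] {
  const queries: string[] = [];
  if (input.address) queries.push(`${input.address}, Spain`);
  if (input.postalCode && input.city) {
    queries.push(`${input.postalCode} ${input.city}, Spain`);
  }
  if (input.city) queries.push(`${input.city}, Spain`);
  return Array.from(new Set(queries)); // drop duplicates, keep order
}

// Build the request URL for one query, restricted to Spain.
function buildNominatimUrl(query: string): string {
  const params = new URLSearchParams({
    q: query,
    format: 'jsonv2',
    limit: '1',
    countrycodes: 'es',
  });
  return `https://nominatim.openstreetmap.org/search?${params.toString()}`;
}
```

Each fallback is a separate request, so a church that only resolves at city level costs three requests against the 1 req/sec limit.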
### Matching Strategy
1. `horariosMisasId` exact match (primary — for re-imports)
2. Name + proximity against existing Spanish OSM churches (secondary)
3. Unmatched: create new church with `latitude: 0, longitude: 0`, country=ES
### CLI
```
--all               Import all churches from sitemaps
--province <name>   Import only churches from this province
--dry-run           No database writes
--geocode           After import, geocode unmatched churches
--geocode-only      Only geocode (skip import)
--resume-from <n>   Skip first N churches
--job-id <uuid>     Background job tracking
```
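A sketch of parsing these flags with Node's built-in `parseArgs` from `node:util`. This is an assumption for illustration; the template script may parse `process.argv` by hand, in which case follow its existing pattern. `ImporterOptions` and `parseCliOptions` are hypothetical names.

```typescript
import { parseArgs } from 'node:util';

interface ImporterOptions {
  all: boolean;
  province?: string;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  resumeFrom: number;
  jobId?: string;
}

function parseCliOptions(argv: string[]): ImporterOptions {
  // Unknown flags throw, because parseArgs is strict by default.
  const { values } = parseArgs({
    args: argv,
    options: {
      all: { type: 'boolean' },
      province: { type: 'string' },
      'dry-run': { type: 'boolean' },
      geocode: { type: 'boolean' },
      'geocode-only': { type: 'boolean' },
      'resume-from': { type: 'string' },
      'job-id': { type: 'string' },
    },
  });

  const resumeFrom = Number.parseInt(values['resume-from'] ?? '0', 10);
  if (Number.isNaN(resumeFrom) || resumeFrom < 0) {
    throw new Error('--resume-from must be a non-negative integer');
  }

  return {
    all: values.all ?? false,
    province: values.province,
    dryRun: values['dry-run'] ?? false,
    geocode: values.geocode ?? false,
    geocodeOnly: values['geocode-only'] ?? false,
    resumeFrom,
    jobId: values['job-id'],
  };
}
```

Usage: `const options = parseCliOptions(process.argv.slice(2));`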
### Mass Schedule Language
Set `language: 'Spanish'` on all created mass schedules.
### Step 1: Create the file
Use `scripts/import-mass-schedules-ph.ts` as the structural template. Implement all functions described above.
### Step 2: Verify build
```bash
npx tsc --noEmit
```
### Step 3: Dry-run test
```bash
npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
```
### Step 4: Commit
```bash
git add scripts/import-horariosmisas.ts
git commit -m "feat: add horariosmisas.com Spain church importer"
```
---
## Task 4: Add to Scheduler Pipeline and npm Scripts
**Files:**
- Modify: `scripts/scheduler.ts`
- Modify: `package.json`
### Step 1: Add to PIPELINE_GROUPS
In `scripts/scheduler.ts`, in the `imports` group (line ~40-51), add after the `philmass-import` entry:
```typescript
{ name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },
```
### Step 2: Add getJobCommand case
In the `getJobCommand` function (line ~182), before the `default:` case, add:
```typescript
case 'horariosmisas-import': {
  const args = ['tsx', 'scripts/import-horariosmisas.ts', '--all', '--geocode'];
  if (config?.province) args.push('--province', String(config.province));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
### Step 3: Add npm scripts
In `package.json`, add after the `"import:philmass"` line:
```json
"import:horariosmisas": "tsx scripts/import-horariosmisas.ts",
```
### Step 4: Verify build
```bash
npx tsc --noEmit
```
### Step 5: Commit
```bash
git add scripts/scheduler.ts package.json
git commit -m "feat: add horariosmisas import to scheduler pipeline"
```
---
## Verification
1. **Dry run on single province**: `npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run`
   - Verify: church names parsed correctly, schedules extracted, matches found
2. **Dry run on Madrid**: `npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run`
   - Verify: larger province, summer/winter schedule selection, address parsing
3. **Single province real import**: `npx tsx scripts/import-horariosmisas.ts --province navarra`
   - Verify: churches created/updated, mass schedules in database
4. **Geocode test**: `npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run`
   - Verify: finds churches needing geocoding, Nominatim returns coordinates
5. **Full import**: `npx tsx scripts/import-horariosmisas.ts --all --geocode`
## Runtime Estimate
- Sitemap fetch: 20 sitemaps x 1.5s = ~30s
- Import: ~10,000 churches x 1.5s = ~4.2 hours
- Geocode: depends on unmatched count x 1.1s
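
The arithmetic behind these estimates, as a small sketch. It counts only the enforced delays and ignores network and database time, so real runs will take longer. The unmatched count of 3,000 is a made-up figure for illustration, and the function names are hypothetical.

```typescript
// Seconds spent waiting between requests, excluding network latency.
function delayOnlySeconds(requests: number, delayMs: number): number {
  return (requests * delayMs) / 1000;
}

function formatDuration(totalSeconds: number): string {
  if (totalSeconds < 60) return `${Math.round(totalSeconds)}s`;
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  return hours > 0 ? `${hours}h ${minutes}m` : `${minutes}m`;
}

console.log(formatDuration(delayOnlySeconds(20, 1500)));     // 30s
console.log(formatDuration(delayOnlySeconds(10_000, 1500))); // 4h 10m
// Hypothetical: 3,000 unmatched churches, one Nominatim query each.
console.log(formatDuration(delayOnlySeconds(3_000, 1100)));  // 55m
```

If geocoding falls back through all three query forms for a church, that church costs up to three Nominatim requests, so the geocode estimate is a lower bound.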