ScraperControl/docs/plans/2026-02-26-horariosmisas-spain.md
Albert 2c51513851 chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-12 19:11:22 -04:00


Spain Church Importer (horariosmisas.com) — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.

Architecture: Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.

Tech Stack: TypeScript, Prisma, HTML parsing (regex — no Playwright), Nominatim geocoding API.


Task 1: Add horariosMisasId to Prisma Schema

Files:

  • Modify: prisma/schema.prisma

Step 1: Add field and index

After the philmassId line (around line 38), add:

horariosMisasId       String?   @unique @map("horarios_misas_id") // horariosmisas.com URL slug

And add an index in the @@index block (around line 78):

@@index([horariosMisasId])

Step 2: Push schema to NAS database

npx prisma db push --accept-data-loss

Expected: Your database is now in sync with your Prisma schema.

Step 3: Regenerate Prisma client

npx prisma generate

Step 4: Push schema to Neon production

npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss

Step 5: Commit

git add prisma/schema.prisma
git commit -m "feat: add horariosMisasId to Church model for Spain import"

Task 2: Extend Church Matcher and Existing Importers

Files:

  • Modify: src/lib/church-matcher.ts
  • Modify: scripts/import-osm-churches.ts
  • Modify: scripts/import-gcatholic.ts
  • Modify: scripts/import-baidu-churches.ts
  • Modify: scripts/import-osm-region.ts
  • Modify: scripts/import-orarimesse.ts
  • Modify: scripts/import-mass-schedules-ph.ts
  • Modify: scripts/import-philmass.ts

Step 1: Update church-matcher.ts

In ExistingChurch interface (line ~11-26), add after philmassId:

horariosMisasId: string | null;

In ChurchCandidate type (line ~113-122), add after philmassId:

horariosMisasId?: string;

In findDuplicateChurch(), add a new pass after the fifth pass (philmassId match, line ~169-175) and before the proximity+name pass:

// Sixth pass: exact horariosMisasId match
if (candidate.horariosMisasId) {
  const horariosMisasMatch = existingChurches.find(
    (church) => church.horariosMisasId === candidate.horariosMisasId
  );
  if (horariosMisasMatch) return horariosMisasMatch;
}

Update the comment on the proximity pass to say "Seventh pass".

Step 2: Update all existing importers

In every importer that queries churches with a select clause containing philmassId: true, add:

horariosMisasId: true,

In every importer that creates/pushes churches with philmassId: null, add:

horariosMisasId: null,

Files to update: import-osm-churches.ts, import-gcatholic.ts, import-baidu-churches.ts, import-osm-region.ts, import-orarimesse.ts, import-mass-schedules-ph.ts, import-philmass.ts

Step 3: Verify build

npx tsc --noEmit

Expected: No errors.

Step 4: Commit

git add src/lib/church-matcher.ts scripts/import-*.ts
git commit -m "feat: add horariosMisasId to church matcher and all importers"

Task 3: Create import-horariosmisas.ts

Files:

  • Create: scripts/import-horariosmisas.ts

Architecture

This importer follows the exact same structure as scripts/import-mass-schedules-ph.ts. Key differences:

  • Sitemap: Fetches 20 post sitemaps from sitemap index (not a single sitemap)
  • URL filtering: Church URLs have 3 path segments (/{province}/{city}/{slug}/). Non-church URLs (blog posts, daily readings) are filtered out.
  • Schedule parsing: Two seasonal tables (summer/winter). Import the seasonally appropriate one based on the current month.
  • Day names: Spanish (Lunes, Martes, etc.) with range support (Lunes a Viernes)
  • Times: 24-hour HH:MMh format (e.g., 08:00h, 20:30h)
  • No coordinates: Churches created with latitude: 0, longitude: 0 — geocoded separately
  • Geocoding: Optional --geocode flag uses Nominatim public API (1 req/sec)

Constants

const SITE_BASE = 'https://horariosmisas.com';
const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';

Spanish Day Mapping

const DAY_MAP: Record<string, number[]> = {
  'domingos y festivos': [0],
  'domingos': [0],
  'domingo': [0],
  'lunes': [1],
  'martes': [2],
  'miércoles': [3],
  'miercoles': [3],
  'jueves': [4],
  'viernes': [5],
  'sábado': [6],
  'sabado': [6],
  'sábados': [6],
  'sabados': [6],
};

Sitemap Fetching

  1. Fetch sitemap index → extract post-sitemap*.xml URLs
  2. Fetch each post sitemap → extract URLs with exactly 3 path segments
  3. Filter out non-church URLs (patterns: /misas-diarias/, /santos-del-dia/, /oraciones/, /noticias/, /blog/, /contacto/, /aviso-legal/, /politica-de-privacidad/, /politica-de-cookies/)
  4. Deduplicate by slug

HTML Parsing

Church name: <h1>Church Name (City)</h1> → strip (City) suffix

Address: 📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong> → extract street, postal code (5-digit \b\d{5}\b), city (text after postal code), strip (Province) suffix
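
The name and address rules can be sketched as follows. This is an illustration under assumptions: parseChurchName, parseAddress, and ParsedAddress are hypothetical names, and the input is assumed to be the element's text with HTML tags already stripped and entities decoded.

```typescript
// Hypothetical result shape for a parsed address line.
interface ParsedAddress {
  street: string;
  postalCode: string | null;
  city: string | null;
}

/** "Church Name (City)" -> "Church Name" */
function parseChurchName(h1Text: string): string {
  return h1Text.replace(/\s*\([^()]*\)\s*$/, '').trim();
}

/** "Calle Goya, 26 28001 Madrid (Madrid)" -> street, postal code, city */
function parseAddress(raw: string): ParsedAddress {
  // Strip the trailing "(Province)" suffix first.
  const text = raw.replace(/\s*\([^()]*\)\s*$/, '').trim();
  // The first standalone 5-digit number is treated as the postal code.
  const match = text.match(/\b\d{5}\b/);
  if (!match || match.index === undefined) {
    return { street: text, postalCode: null, city: null };
  }
  const street = text.slice(0, match.index).replace(/[,\s]+$/, '');
  const city = text.slice(match.index + match[0].length).replace(/^[,\s]+/, '') || null;
  return { street, postalCode: match[0], city };
}
```

When no postal code is found, the whole string is kept as the street so the geocoding pass still has something to query with.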

Phone: <strong>Teléfono:</strong> <a href="tel:...">number</a>

Website: <strong>Página Web:</strong> <a href="url">...</a>

Schedule tables:

  • Find <table> elements with DÍA/HORARIO headers.
  • Split by seasonal headings (☀️ verano / invierno) and pick the seasonally appropriate section (Oct-May = winter, Jun-Sep = summer).
  • Parse <td> cells: first cell = day name(s), second cell = times.
  • Extract times in HH:MMh format via the regex (\d{1,2}):(\d{2})\s*h?.
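
A minimal sketch of the season rule and the time regex described above. The names currentSeason, extractTimes, and Season are hypothetical, and normalizing to zero-padded HH:MM is an assumption about the stored format.

```typescript
type Season = 'summer' | 'winter';

/** Jun-Sep = summer, Oct-May = winter, based on the local month. */
function currentSeason(date: Date = new Date()): Season {
  const month = date.getMonth() + 1; // getMonth() is 0-based
  return month >= 6 && month <= 9 ? 'summer' : 'winter';
}

/** Extract zero-padded "HH:MM" strings from a cell such as "08:00h, 20:30h". */
function extractTimes(cell: string): string[] {
  const times: string[] = [];
  for (const m of Array.from(cell.matchAll(/(\d{1,2}):(\d{2})\s*h?/g))) {
    const hour = Number(m[1]);
    const minute = Number(m[2]);
    if (hour > 23 || minute > 59) continue; // skip values that are not clock times
    times.push(`${String(hour).padStart(2, '0')}:${m[2]}`);
  }
  return times;
}
```

Passing the date as a parameter makes the seasonal choice testable without depending on when the importer runs.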

Day Range Resolution

Support ranges like Lunes a Viernes → [1,2,3,4,5] and compound entries like Lunes, Miércoles y Viernes → [1,3,5].
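
One way to resolve ranges and compound entries is sketched below. The names DAY_INDEX, normalizeDay, and resolveDays are hypothetical. The sketch strips diacritics before lookup so it needs only one key per day, which differs from DAY_MAP above (that map lists accented and unaccented spellings separately). Unknown tokens such as "festivos" are ignored.

```typescript
// Single-day lookup on normalized (lowercase, unaccented) names; 0 = Sunday.
const DAY_INDEX: Record<string, number> = {
  domingo: 0, domingos: 0, lunes: 1, martes: 2, miercoles: 3,
  jueves: 4, viernes: 5, sabado: 6, sabados: 6,
};

/** Lowercase and strip diacritics so "Miércoles" and "miercoles" compare equal. */
function normalizeDay(text: string): string {
  return text.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '').trim();
}

/** "Lunes a Viernes" -> [1,2,3,4,5]; "Lunes, Miércoles y Viernes" -> [1,3,5] */
function resolveDays(label: string): number[] {
  const days = new Set<number>();
  // Split compound labels on commas and on the conjunction "y".
  for (const part of normalizeDay(label).split(/\s*,\s*|\s+y\s+/)) {
    const range = part.match(/^(?:de\s+)?([a-z]+)\s+a\s+([a-z]+)$/);
    if (range) {
      const start: number | undefined = DAY_INDEX[range[1]];
      const end: number | undefined = DAY_INDEX[range[2]];
      if (start === undefined || end === undefined) continue;
      // Walk forward through the week, wrapping from Saturday to Sunday.
      for (let d = start; ; d = (d + 1) % 7) {
        days.add(d);
        if (d === end) break;
      }
    } else {
      const day: number | undefined = DAY_INDEX[part];
      if (day !== undefined) days.add(day);
    }
  }
  return Array.from(days).sort((a, b) => a - b);
}
```

The wrap-around loop handles a range like Sábado a Domingo, where the end index is lower than the start index.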

Geocoding (--geocode / --geocode-only)

Query Nominatim with: {address}, Spain → fallback to {postalCode} {city}, Spain → fallback to {city}, Spain. Use countrycodes=es parameter. Max 1 req/sec.
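
The fallback chain could look roughly like this. It is a sketch under assumptions: GeocodeInput, buildGeocodeQueries, buildNominatimUrl, and geocode are hypothetical names, and format=jsonv2 with limit=1 is one reasonable choice of Nominatim search parameters, not something the plan specifies.

```typescript
const NOMINATIM_SEARCH = 'https://nominatim.openstreetmap.org/search';

// Hypothetical input shape; fields come from the address parsing step.
interface GeocodeInput {
  address?: string | null;
  postalCode?: string | null;
  city?: string | null;
}

/** Ordered free-text queries, most specific first, without duplicates. */
function buildGeocodeQueries({ address, postalCode, city }: GeocodeInput): string[] {
  const queries: string[] = [];
  if (address) queries.push(`${address}, Spain`);
  if (postalCode && city) queries.push(`${postalCode} ${city}, Spain`);
  if (city) queries.push(`${city}, Spain`);
  return Array.from(new Set(queries));
}

/** Build a search URL restricted to Spain via countrycodes=es. */
function buildNominatimUrl(query: string): string {
  const params = new URLSearchParams({ q: query, format: 'jsonv2', countrycodes: 'es', limit: '1' });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Try each query in order; return the first hit or null. */
async function geocode(
  input: GeocodeInput,
  userAgent: string,
): Promise<{ lat: number; lon: number } | null> {
  for (const query of buildGeocodeQueries(input)) {
    const res = await fetch(buildNominatimUrl(query), { headers: { 'User-Agent': userAgent } });
    await sleep(1100); // wait after every request to stay under 1 req/sec
    if (!res.ok) continue;
    // Nominatim returns lat/lon as strings in its JSON results.
    const results = (await res.json()) as Array<{ lat: string; lon: string }>;
    if (results.length > 0) {
      return { lat: Number(results[0].lat), lon: Number(results[0].lon) };
    }
  }
  return null;
}
```

Because the delay follows every request, including failed ones, a church that falls through all three queries still respects the rate limit.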

Matching Strategy

  1. horariosMisasId exact match (primary — for re-imports)
  2. Name + proximity against existing Spanish OSM churches (secondary)
  3. Unmatched: create new church with latitude: 0, longitude: 0, country=ES

CLI

--all                  Import all churches from sitemaps
--province <name>      Import only churches from this province
--dry-run              No database writes
--geocode              After import, geocode unmatched churches
--geocode-only         Only geocode (skip import)
--resume-from <n>      Skip first N churches
--job-id <uuid>        Background job tracking
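
A hand-rolled parser for the flags above might look like this. It is only a sketch: CliOptions and parseCliArgs are hypothetical names, and the real script may instead reuse whatever argument handling scripts/import-mass-schedules-ph.ts already has.

```typescript
// Hypothetical options object mirroring the flags listed above.
interface CliOptions {
  all: boolean;
  province?: string;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  resumeFrom: number;
  jobId?: string;
}

function parseCliArgs(argv: string[]): CliOptions {
  const options: CliOptions = {
    all: false, dryRun: false, geocode: false, geocodeOnly: false, resumeFrom: 0,
  };
  let i = 0;
  // Read the value that follows a flag such as --province.
  const takeValue = (flag: string): string => {
    const value = argv[i++];
    if (value === undefined || value.startsWith('--')) {
      throw new Error(`${flag} requires a value`);
    }
    return value;
  };
  while (i < argv.length) {
    const arg = argv[i++];
    switch (arg) {
      case '--all': options.all = true; break;
      case '--dry-run': options.dryRun = true; break;
      case '--geocode': options.geocode = true; break;
      case '--geocode-only': options.geocodeOnly = true; break;
      case '--province': options.province = takeValue(arg); break;
      case '--job-id': options.jobId = takeValue(arg); break;
      case '--resume-from': {
        const n = Number.parseInt(takeValue(arg), 10);
        if (!Number.isInteger(n) || n < 0) {
          throw new Error('--resume-from expects a non-negative integer');
        }
        options.resumeFrom = n;
        break;
      }
      default:
        throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return options;
}
```

In the script this would be called as parseCliArgs(process.argv.slice(2)). Rejecting unknown flags up front avoids starting a multi-hour import with a mistyped option.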

Mass Schedule Language

Set language: 'Spanish' on all created mass schedules.

Step 1: Create the file

Use scripts/import-mass-schedules-ph.ts as the structural template. Implement all functions described above.

Step 2: Verify build

npx tsc --noEmit

Step 3: Dry-run test

npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run

Step 4: Commit

git add scripts/import-horariosmisas.ts
git commit -m "feat: add horariosmisas.com Spain church importer"

Task 4: Add to Scheduler Pipeline and npm Scripts

Files:

  • Modify: scripts/scheduler.ts
  • Modify: package.json

Step 1: Add to PIPELINE_GROUPS

In scripts/scheduler.ts, in the imports group (line ~40-51), add after the philmass-import entry:

{ name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },

Step 2: Add getJobCommand case

In the getJobCommand function (around line 182), before the default: case, add:

case 'horariosmisas-import': {
  const args = ['tsx', 'scripts/import-horariosmisas.ts', '--all', '--geocode'];
  if (config?.province) args.push('--province', String(config.province));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

Step 3: Add npm scripts

In package.json, add after the "import:philmass" line:

"import:horariosmisas": "tsx scripts/import-horariosmisas.ts",

Step 4: Verify build

npx tsc --noEmit

Step 5: Commit

git add scripts/scheduler.ts package.json
git commit -m "feat: add horariosmisas import to scheduler pipeline"

Verification

  1. Dry run on single province: npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
    • Verify: church names parsed correctly, schedules extracted, matches found
  2. Dry run on Madrid: npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run
    • Verify: larger province, summer/winter schedule selection, address parsing
  3. Single province real import: npx tsx scripts/import-horariosmisas.ts --province navarra
    • Verify: churches created/updated, mass schedules in database
  4. Geocode test: npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run
    • Verify: finds churches needing geocoding, Nominatim returns coordinates
  5. Full import: npx tsx scripts/import-horariosmisas.ts --all --geocode

Runtime Estimate

  • Sitemap fetch: 20 sitemaps x 1.5s = ~30s
  • Import: ~10,000 churches x 1.5s = ~4.2 hours
  • Geocode: unmatched count x 1.1s per Nominatim request; a church that needs the fallback queries costs up to three requests
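
The arithmetic behind these figures can be reproduced with a small helper. This is a sketch: estimateRuntimeMs and toHours are hypothetical names, the delays are the constants from Task 3, and the result counts only the enforced delays, so it is a lower bound that ignores network latency and parsing time.

```typescript
const REQUEST_DELAY_MS = 1500;   // delay between horariosmisas.com requests
const NOMINATIM_DELAY_MS = 1100; // delay between Nominatim requests

/** Lower-bound runtime from the enforced delays alone, in milliseconds. */
function estimateRuntimeMs(sitemapCount: number, churchCount: number, geocodeRequests: number) {
  const sitemapMs = sitemapCount * REQUEST_DELAY_MS;
  const importMs = churchCount * REQUEST_DELAY_MS;
  const geocodeMs = geocodeRequests * NOMINATIM_DELAY_MS;
  return { sitemapMs, importMs, geocodeMs, totalMs: sitemapMs + importMs + geocodeMs };
}

const toHours = (ms: number): number => ms / 3_600_000;

// 20 sitemaps and ~10,000 churches as in the plan; geocode count left at 0
// because the number of unmatched churches is not known in advance.
const estimate = estimateRuntimeMs(20, 10_000, 0);
console.log(`sitemaps: ${estimate.sitemapMs / 1000}s`);
console.log(`import: ${toHours(estimate.importMs).toFixed(1)}h`);
```

With these inputs the sitemap phase is 30 seconds and the import phase is 15,000 seconds, which rounds to the 4.2 hours quoted above.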