chore: sync with Gitea master and restore local-only files

Reset local main to gitea/master (new source of truth) and restored local-only files: web scrapers, admin dashboard, ChromaDB integration, debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes, bohosluzby, kerknet, gottesdienstzeiten, miserend importers, ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-12 19:11:22 -04:00
parent 76cca3ba75
commit 2c51513851
133 changed files with 30381 additions and 0 deletions
--- a/docs/plans/2026-02-25-parallel-scrapers-design.md
+++ b/docs/plans/2026-02-25-parallel-scrapers-design.md
@@ -0,0 +1,43 @@
+# Parallel Scrapers with Country Mapping Fix
+
+## Problem
+
+The scheduler runs scrapers sequentially — one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days. The English scraper alone runs 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by appropriate language scrapers.
+
+## Changes
+
+### 1. Country Mapping Additions (scraper-service.ts)
+
+Add to `COUNTRY_SCRAPER_MAP`:
+- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
+- **French**: BE, LU
+- **German**: CH, SI
+- **Italian**: HR, RO
+
+### 2. Parallel Pipeline Groups (scheduler.ts)
+
+Replace sequential `PIPELINE_PHASES` array with grouped phases:
+
+| Group | Phases | Concurrency |
+|-------|--------|-------------|
+| 1 | osm-import, gcatholic-import | Sequential (shared data) |
+| 2 | english, french, german | Parallel (3) |
+| 3 | polish, spanish, italian | Parallel (3) |
+| 4 | portuguese, czech, dutch | Parallel (3) |
+| 5 | hungarian, generic | Parallel (2) |
+
+Scheduler starts all jobs in a group simultaneously, waits for all to finish, then advances to the next group.
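The group-advance rule can be sketched as follows. This is a minimal illustration, assuming each job is a function returning a promise; `Job`, `Group`, and `runGroups` are hypothetical names, and the real scheduler tracks spawned `npx tsx` child processes rather than promises.

```typescript
// Hypothetical types for illustration; not the scheduler's real interfaces.
type Job = { name: string; run: () => Promise<void> };
type Group = { name: string; mode: 'sequential' | 'parallel'; jobs: Job[] };

// Runs groups in order. Parallel groups start every job at once and wait
// for all of them to settle; sequential groups run one job at a time.
async function runGroups(groups: Group[], log: (msg: string) => void): Promise<void> {
  for (const group of groups) {
    log(`start ${group.name}`);
    if (group.mode === 'parallel') {
      // allSettled so one failed scraper does not abort its siblings
      await Promise.allSettled(group.jobs.map((job) => job.run()));
    } else {
      for (const job of group.jobs) {
        try {
          await job.run();
        } catch {
          // a failed import should not block the rest of the cycle
        }
      }
    }
    log(`done ${group.name}`);
  }
}
```

Using `Promise.allSettled` rather than `Promise.all` is a design choice of this sketch: the next group starts only after every job in the current group has finished, whether it succeeded or failed.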
+
+### 3. Generic Scraper Deprioritized
+
+- Moved to last group
+- Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)
+
+### 4. Resource Changes
+
+- Scheduler container memory limit: 4GB → 10GB (3 concurrent Playwright/Chromium processes)
+- No new Docker containers or compose changes needed — existing child process spawning approach is kept
+
+## Approach
+
+Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes — we just allow multiple to run concurrently instead of waiting for each to finish before starting the next.
--- a/docs/plans/2026-02-25-parallel-scrapers.md
+++ b/docs/plans/2026-02-25-parallel-scrapers.md
@@ -0,0 +1,423 @@
+# Parallel Scrapers Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** Run language scrapers in parallel groups of 3, add missing country mappings, and deprioritize the generic scraper.
+
+**Architecture:** Replace sequential pipeline phases with grouped phases. Groups run their jobs concurrently (max 3), then wait for all to complete before advancing. Import phases stay sequential. The scheduler tracks an `activeGroupJobs` counter for the current group and advances only when it reaches zero, instead of advancing on every job completion.
+
+**Tech Stack:** TypeScript, node child_process spawn, Prisma, Docker Compose
+
+---
+
+### Task 1: Add Missing Country Mappings
+
+**Files:**
+- Modify: `src/lib/scraper-service.ts:29-45`
+
+**Step 1: Update COUNTRY_SCRAPER_MAP**
+
+Add these entries to the existing `COUNTRY_SCRAPER_MAP` object at `src/lib/scraper-service.ts:29`:
+
+```typescript
+const COUNTRY_SCRAPER_MAP: Record<string, string> = {
+  US: 'english', CA: 'english', GB: 'english',
+  AU: 'english', NZ: 'english', IE: 'english', PH: 'english',
+  IN: 'english', SG: 'english', MY: 'english', KE: 'english',
+  JM: 'english', TT: 'english', GH: 'english', NG: 'english',
+  ZA: 'english', TZ: 'english', UG: 'english',
+  FR: 'french', BE: 'french', LU: 'french',
+  ES: 'spanish', MX: 'spanish', AR: 'spanish', CO: 'spanish',
+  CL: 'spanish', PE: 'spanish', EC: 'spanish', VE: 'spanish',
+  CR: 'spanish', PA: 'spanish', GT: 'spanish', CU: 'spanish',
+  HN: 'spanish', SV: 'spanish', NI: 'spanish', BO: 'spanish',
+  PY: 'spanish', UY: 'spanish', DO: 'spanish',
+  IT: 'italian', SM: 'italian', VA: 'italian',
+  HR: 'italian', RO: 'italian',
+  DE: 'german', AT: 'german', LI: 'german',
+  CH: 'german', SI: 'german',
+  PL: 'polish',
+  PT: 'portuguese', BR: 'portuguese',
+  NL: 'dutch',
+  CZ: 'czech', SK: 'czech',
+  HU: 'hungarian',
+};
+```
+
+Also update `buildLanguageFilter` at `src/lib/scraper-service.ts:346-463` to include the new countries in each language filter's country list:
+
+- `english` filter (line 356): add `'IN', 'SG', 'MY', 'KE', 'JM', 'TT', 'GH', 'NG', 'ZA', 'TZ', 'UG'`
+- `french` filter (line 366): add `'BE', 'LU'` → `{ in: ['FR', 'BE', 'LU'] }`
+- `spanish` filter: already has all needed countries
+- `italian` filter (line 387): add `'HR', 'RO'` → `{ in: ['IT', 'SM', 'VA', 'HR', 'RO'] }`
+- `german` filter (line 397): add `'CH', 'SI'` → `{ in: ['DE', 'AT', 'LI', 'CH', 'SI'] }`
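Because the same country codes now appear both in `COUNTRY_SCRAPER_MAP` and in the filter lists, one way to keep them from drifting apart is to derive the lists from the map. The sketch below is only an illustration: `countriesFor` and `countryFilter` are hypothetical helpers, the map is abbreviated, and the real `buildLanguageFilter` is not shown in this plan.

```typescript
// Abbreviated copy of the mapping, for illustration only.
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  FR: 'french', BE: 'french', LU: 'french',
  DE: 'german', AT: 'german', LI: 'german', CH: 'german', SI: 'german',
  IT: 'italian', SM: 'italian', VA: 'italian', HR: 'italian', RO: 'italian',
};

// Hypothetical helper: list every country code routed to a given scraper,
// in the order the codes are declared in the map.
function countriesFor(language: string): string[] {
  return Object.entries(COUNTRY_SCRAPER_MAP)
    .filter(([, scraper]) => scraper === language)
    .map(([country]) => country);
}

// Produces the filter fragment shape quoted above, e.g. { in: [...] }.
function countryFilter(language: string): { in: string[] } {
  return { in: countriesFor(language) };
}
```

With this approach, adding a country to the map would be enough for the matching language filter to pick it up.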
+
+**Step 2: Verify build**
+
+Run: `npm run build`
+Expected: Build succeeds with no errors
+
+**Step 3: Commit**
+
+```bash
+git add src/lib/scraper-service.ts
+git commit -m "feat: add missing country mappings to language scrapers
+
+Add BE/LU→french, CH/SI→german, HR/RO→italian, IN/SG/MY/KE/JM/TT/GH/NG/ZA/TZ/UG→english.
+~1,400 previously unmapped churches now routed to proper language scrapers."
+```
+
+---
+
+### Task 2: Rewrite Scheduler for Parallel Groups
+
+**Files:**
+- Modify: `scripts/scheduler.ts`
+
+**Step 1: Replace pipeline data structure**
+
+Replace the `PipelinePhase` interface and `PIPELINE_PHASES` array (lines 27-49) with the following. The `CycleState` interface (lines 53-69) is replaced in Step 2.
+
+```typescript
+interface PipelinePhase {
+  name: string;
+  type: string;
+  language?: string;
+  config: Record<string, unknown>;
+}
+
+interface PipelineGroup {
+  name: string;
+  phases: PipelinePhase[];
+  mode: 'sequential' | 'parallel';
+}
+
+const PIPELINE_GROUPS: PipelineGroup[] = [
+  {
+    name: 'imports',
+    mode: 'sequential',
+    phases: [
+      { name: 'osm-import-p1', type: 'osm-import', config: { priority: 1 } },
+      { name: 'gcatholic-import', type: 'gcatholic-import', config: { delay: 2000 } },
+    ],
+  },
+  {
+    name: 'scrapers-batch-1',
+    mode: 'parallel',
+    phases: [
+      { name: 'scraper-english', type: 'scraper', language: 'english', config: { allMode: true, maxFailures: 10, language: 'english' } },
+      { name: 'scraper-french', type: 'scraper', language: 'french', config: { allMode: true, maxFailures: 10, language: 'french' } },
+      { name: 'scraper-german', type: 'scraper', language: 'german', config: { allMode: true, maxFailures: 10, language: 'german' } },
+    ],
+  },
+  {
+    name: 'scrapers-batch-2',
+    mode: 'parallel',
+    phases: [
+      { name: 'scraper-polish', type: 'scraper', language: 'polish', config: { allMode: true, maxFailures: 10, language: 'polish' } },
+      { name: 'scraper-spanish', type: 'scraper', language: 'spanish', config: { allMode: true, maxFailures: 10, language: 'spanish' } },
+      { name: 'scraper-italian', type: 'scraper', language: 'italian', config: { allMode: true, maxFailures: 10, language: 'italian' } },
+    ],
+  },
+  {
+    name: 'scrapers-batch-3',
+    mode: 'parallel',
+    phases: [
+      { name: 'scraper-portuguese', type: 'scraper', language: 'portuguese', config: { allMode: true, maxFailures: 10, language: 'portuguese' } },
+      { name: 'scraper-czech', type: 'scraper', language: 'czech', config: { allMode: true, maxFailures: 10, language: 'czech' } },
+      { name: 'scraper-dutch', type: 'scraper', language: 'dutch', config: { allMode: true, maxFailures: 10, language: 'dutch' } },
+    ],
+  },
+  {
+    name: 'scrapers-batch-4',
+    mode: 'parallel',
+    phases: [
+      { name: 'scraper-hungarian', type: 'scraper', language: 'hungarian', config: { allMode: true, maxFailures: 10, language: 'hungarian' } },
+      { name: 'scraper-generic', type: 'scraper', language: 'generic', config: { allMode: true, maxFailures: 10, language: 'generic' } },
+    ],
+  },
+];
+```
+
+**Step 2: Replace CycleState**
+
+```typescript
+interface CycleState {
+  currentGroupIndex: number;
+  currentSequentialPhaseIndex: number; // for sequential groups, tracks which phase within the group
+  cycleNumber: number;
+  cycleStartedAt: Date | null;
+  lastCycleCompletedAt: Date | null;
+  waitingForCooldown: boolean;
+  activeGroupJobs: number; // how many jobs still running in the current group
+}
+
+const cycleState: CycleState = {
+  currentGroupIndex: 0,
+  currentSequentialPhaseIndex: 0,
+  cycleNumber: 0,
+  cycleStartedAt: null,
+  lastCycleCompletedAt: null,
+  waitingForCooldown: false,
+  activeGroupJobs: 0,
+};
+```
+
+**Step 3: Rewrite pollAndAdvancePipeline**
+
+Replace the entire `pollAndAdvancePipeline` function (lines 306-385) and `advancePipelinePhase` function (lines 387-390) with:
+
+```typescript
+async function pollAndAdvancePipeline(): Promise<void> {
+  try {
+    // 1. Check for manual pending jobs from admin API (priority over pipeline)
+    if (runningJobs.size === 0) {
+      const manualJob = await prisma.backgroundJob.findFirst({
+        where: {
+          status: 'pending',
+          NOT: { config: { path: ['pipelineManaged'], equals: true } },
+        },
+        orderBy: { createdAt: 'asc' },
+      });
+
+      if (manualJob) {
+        log(`Found manual job: ${manualJob.type}${manualJob.language ? `:${manualJob.language}` : ''} (${manualJob.id})`);
+        await startJobProcess(
+          manualJob.id,
+          manualJob.type,
+          manualJob.language,
+          manualJob.config as Record<string, unknown> | null
+        );
+        return;
+      }
+    }
+
+    // 2. If jobs are still running for the current group, wait
+    if (cycleState.activeGroupJobs > 0) {
+      return;
+    }
+
+    // 3. If in cooldown, check if expired
+    if (cycleState.waitingForCooldown) {
+      if (cycleState.lastCycleCompletedAt) {
+        const elapsed = Date.now() - cycleState.lastCycleCompletedAt.getTime();
+        if (elapsed < CYCLE_COOLDOWN_MS) {
+          const remaining = Math.round((CYCLE_COOLDOWN_MS - elapsed) / 60_000);
+          if (remaining % 30 === 0 || remaining <= 5) {
+            log(`Cooldown: ${remaining} minutes remaining before next cycle`);
+          }
+          return;
+        }
+      }
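+      cycleState.waitingForCooldown = false;
+      cycleState.currentGroupIndex = 0;
+      cycleState.currentSequentialPhaseIndex = 0;
+      cycleState.cycleStartedAt = null; // reset so the "Starting cycle" log in step 5 fires for the new cycle
+      log('Cooldown expired, starting new cycle');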
+    }
+
+    // 4. If past the last group, complete the cycle
+    if (cycleState.currentGroupIndex >= PIPELINE_GROUPS.length) {
+      cycleState.cycleNumber++;
+      cycleState.lastCycleCompletedAt = new Date();
+      cycleState.waitingForCooldown = true;
+      const cooldownHours = CYCLE_COOLDOWN_MS / (60 * 60 * 1000);
+      log(`=== Cycle ${cycleState.cycleNumber} complete! Entering ${cooldownHours}h cooldown ===`);
+      return;
+    }
+
+    // 5. Start the current group
+    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];
+
+    if (cycleState.currentGroupIndex === 0 && cycleState.currentSequentialPhaseIndex === 0 && !cycleState.cycleStartedAt) {
+      cycleState.cycleStartedAt = new Date();
+      log(`=== Starting cycle ${cycleState.cycleNumber + 1} ===`);
+    }
+
+    if (group.mode === 'parallel') {
+      // Launch all phases in the group concurrently
+      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (parallel, ${group.phases.length} jobs)`);
+      cycleState.activeGroupJobs = group.phases.length;
+
+      for (const phase of group.phases) {
+        const jobId = await createPendingJob(
+          phase.type,
+          phase.language,
+          { ...phase.config, pipelineManaged: true }
+        );
+        await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
+      }
+    } else {
+      // Sequential: run one phase at a time within the group
+      const phaseIndex = cycleState.currentSequentialPhaseIndex;
+      if (phaseIndex >= group.phases.length) {
+        // All phases in this sequential group are done
+        cycleState.currentGroupIndex++;
+        cycleState.currentSequentialPhaseIndex = 0;
+        return; // Will pick up next group on next poll
+      }
+
+      const phase = group.phases[phaseIndex];
+      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (sequential ${phaseIndex + 1}/${group.phases.length}: ${phase.name})`);
+      cycleState.activeGroupJobs = 1;
+
+      const jobId = await createPendingJob(
+        phase.type,
+        phase.language,
+        { ...phase.config, pipelineManaged: true }
+      );
+      await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
+    }
+  } catch (err) {
+    logError(`Error in pipeline: ${err}`);
+  }
+}
+
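+function onJobCompleted(): void {
+  // Guard: a completion with no active group jobs (e.g. a manual job that ran
+  // while the pipeline was idle) must not advance the pipeline.
+  if (cycleState.activeGroupJobs <= 0) return;
+
+  cycleState.activeGroupJobs--;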
+
+  if (cycleState.activeGroupJobs <= 0) {
+    cycleState.activeGroupJobs = 0;
+    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];
+
+    if (group?.mode === 'sequential') {
+      cycleState.currentSequentialPhaseIndex++;
+      // Check if there are more phases in this sequential group
+      if (cycleState.currentSequentialPhaseIndex < group.phases.length) {
+        return; // Don't advance group yet
+      }
+    }
+
+    // Advance to next group
+    cycleState.currentGroupIndex++;
+    cycleState.currentSequentialPhaseIndex = 0;
+    log(`Group "${group?.name}" complete, advancing to group ${cycleState.currentGroupIndex + 1}`);
+  }
+}
+```
+
+**Step 4: Update startJobProcess callbacks**
+
+In the `child.on('close')` callback (line 442) and `child.on('error')` callback (line 472), replace `advancePipelinePhase()` with `onJobCompleted()`.
+
+**Step 5: Update crash recovery**
+
+In `recoverFromCrash` (lines 259-268), replace the `PIPELINE_PHASES.findIndex` logic with a search through `PIPELINE_GROUPS`:
+
+```typescript
+  if (lastRunningPipelineJob) {
+    for (let gi = 0; gi < PIPELINE_GROUPS.length; gi++) {
+      const group = PIPELINE_GROUPS[gi];
+      const phaseIdx = group.phases.findIndex(
+        p => p.type === lastRunningPipelineJob.type &&
+          (p.language || null) === (lastRunningPipelineJob.language || null)
+      );
+      if (phaseIdx >= 0) {
+        cycleState.currentGroupIndex = gi;
+        cycleState.currentSequentialPhaseIndex = group.mode === 'sequential' ? phaseIdx : 0;
+        log(`Resuming pipeline from group ${gi + 1}: ${group.name}`);
+        break;
+      }
+    }
+  }
+```
+
+**Step 6: Update heartbeat log in main()**
+
+Replace the heartbeat cron (lines 551-562) and the startup log (lines 574-580) to reference groups instead of phases:
+
+```typescript
+  cron.schedule('0 * * * *', () => {
+    const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
+      ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
+      : 'none';
+    const jobs = runningJobs.size > 0
+      ? `Running: ${[...runningJobs.keys()].join(', ')}`
+      : 'No jobs running';
+    const state = cycleState.waitingForCooldown
+      ? 'cooldown'
+      : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
+    log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
+  }, { timezone: 'UTC' });
+```
+
+For the startup log:
+
+```typescript
+  log('=== Scheduler running (parallel grouped pipeline) ===');
+  log(`Pipeline groups (${PIPELINE_GROUPS.length}):`);
+  for (let i = 0; i < PIPELINE_GROUPS.length; i++) {
+    const g = PIPELINE_GROUPS[i];
+    const phaseNames = g.phases.map(p => p.name).join(', ');
+    log(`  ${i + 1}. ${g.name} [${g.mode}]: ${phaseNames}`);
+  }
+```
+
+**Step 7: Remove dead Google Places env log**
+
+Delete lines 167-169 (the `GOOGLE_PLACES_API_KEY` log in `validateEnvironment`).
+
+**Step 8: Verify build**
+
+Run: `npm run build`
+Expected: Build succeeds
+
+**Step 9: Commit**
+
+```bash
+git add scripts/scheduler.ts
+git commit -m "feat: parallel grouped pipeline scheduler
+
+Replace sequential pipeline with grouped phases. Import phases run
+sequentially, scraper phases run in parallel groups of 3. This reduces
+cycle time from days to hours. Generic scraper moved to last group."
+```
+
+---
+
+### Task 3: Increase Scheduler Memory Limit
+
+**Files:**
+- Modify: `docker-compose.yml:217-220`
+
+**Step 1: Increase memory limit**
+
+Change the scheduler service's `deploy.resources.limits.memory` from `4G` to `10G`:
+
+```yaml
+    deploy:
+      resources:
+        limits:
+          memory: 10G
+```
+
+**Step 2: Commit**
+
+```bash
+git add docker-compose.yml
+git commit -m "chore: increase scheduler memory to 10G for parallel scrapers"
+```
+
+---
+
+### Task 4: Deploy and Verify
+
+**Step 1: Deploy to NAS**
+
+```bash
+rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
+  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
+```
+
+**Step 2: Rebuild and restart scheduler**
+
+```bash
+ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scheduler && /usr/local/bin/docker compose up -d scheduler'
+```
+
+**Step 3: Verify logs show parallel groups**
+
+```bash
+ssh albert@192.168.0.145 '/usr/local/bin/docker logs --tail 30 scraper-control-scheduler-1'
+```
+
+Expected: Logs show "parallel grouped pipeline", group listings with `[parallel]` and `[sequential]` tags, and eventually multiple concurrent `Running:` entries in heartbeat.
--- a/docs/plans/2026-02-26-horariosmisas-spain-design.md
+++ b/docs/plans/2026-02-26-horariosmisas-spain-design.md
@@ -0,0 +1,72 @@
+# Spain Church Importer (horariosmisas.com) — Design
+
+## Overview
+
+Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. Static WordPress site with fully permissive robots.txt and sitemaps. No Playwright needed — simple HTTP + HTML parsing.
+
+## Data Source
+
+- **Site:** https://horariosmisas.com
+- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
+- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
+- **No coordinates** — addresses only. Forward geocoding via Nominatim as a separate pass.
+- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
+- **Sitemaps:** 20 post sitemaps + 7 category sitemaps
+
+## Architecture
+
+### Two-Pass Approach
+
+**Pass 1: Import** — Fetch all churches from sitemaps, parse HTML, match against existing Spanish OSM churches, upsert with mass schedules. Unmatched churches are created with an address but no real coordinates (stored as `latitude: 0, longitude: 0` until geocoded).
+
+**Pass 2: Geocode** — Forward-geocode unmatched churches via Nominatim public API (`address → lat/lng`). 1 req/sec rate limit.
+
+### Schema Change
+
+Add `horariosMisasId String? @unique` to Church model (same pattern as `philmassId`, `massSchedulesPhId`). Update church matcher and all existing importers.
+
+### URL Structure
+
+- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
+- Church pages: `/{province}/{city}/{church-slug}/`
+- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.
+
+### HTML Parsing
+
+- **Name:** `<h1>Church Name (City)</h1>` — strip `(City)` suffix
+- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
+- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
+- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
+- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
+  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
+  - Import seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
+  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
+  - Day ranges: "Lunes a Viernes" (Monday-Friday)
+  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
+  - Annotations stripped: `(familias)`, etc.
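The seasonal rule (Oct-May = winter, Jun-Sep = summer) can be sketched as a small function. `Season` and `seasonFor` are hypothetical names, and the sketch assumes the importer decides by the local calendar month on the day it runs.

```typescript
type Season = 'summer' | 'winter';

// June through September counts as summer; every other month is winter.
function seasonFor(date: Date): Season {
  const month = date.getMonth() + 1; // getMonth() is zero-based
  return month >= 6 && month <= 9 ? 'summer' : 'winter';
}
```

A caller would pass `new Date()` and then import the table under the matching seasonal heading.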
+
+### Matching Strategy
+
+1. `horariosMisasId` exact match (for re-imports)
+2. Name + proximity against existing Spanish churches (from OSM)
+3. Unmatched: create new church with address, country=ES, no coordinates
+
+### CLI
+
+```
+npx tsx scripts/import-horariosmisas.ts --all
+npx tsx scripts/import-horariosmisas.ts --all --dry-run
+npx tsx scripts/import-horariosmisas.ts --province madrid
+npx tsx scripts/import-horariosmisas.ts --all --geocode
+npx tsx scripts/import-horariosmisas.ts --geocode-only
+npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
+```
+
+### Rate Limiting
+
+- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
+- Geocode: 1s between requests (Nominatim public API limit)
+
+### Scheduler Integration
+
+Add to PIPELINE_GROUPS imports group (sequential, after philmass-import).
--- a/docs/plans/2026-02-26-horariosmisas-spain.md
+++ b/docs/plans/2026-02-26-horariosmisas-spain.md
@@ -0,0 +1,322 @@
+# Spain Church Importer (horariosmisas.com) — Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.
+
+**Architecture:** Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.
+
+**Tech Stack:** TypeScript, Prisma, HTML parsing (regex — no Playwright), Nominatim geocoding API.
+
+---
+
+## Task 1: Add `horariosMisasId` to Prisma Schema
+
+**Files:**
+- Modify: `prisma/schema.prisma`
+
+**Step 1: Add field and index**
+
+After the `philmassId` line (around line 38), add:
+
+```prisma
+horariosMisasId       String?   @unique @map("horarios_misas_id") // horariosmisas.com URL slug
+```
+
+And add an index in the `@@index` block (around line 78):
+
+```prisma
+@@index([horariosMisasId])
+```
+
+**Step 2: Push schema to NAS database**
+
+```bash
+npx prisma db push --accept-data-loss
+```
+
+Expected: `Your database is now in sync with your Prisma schema.`
+
+**Step 3: Regenerate Prisma client**
+
+```bash
+npx prisma generate
+```
+
+**Step 4: Push schema to Neon production**
+
+```bash
+npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss
+```
+
+**Step 5: Commit**
+
+```bash
+git add prisma/schema.prisma
+git commit -m "feat: add horariosMisasId to Church model for Spain import"
+```
+
+---
+
+## Task 2: Extend Church Matcher and Existing Importers
+
+**Files:**
+- Modify: `src/lib/church-matcher.ts`
+- Modify: `scripts/import-osm-churches.ts`
+- Modify: `scripts/import-gcatholic.ts`
+- Modify: `scripts/import-baidu-churches.ts`
+- Modify: `scripts/import-osm-region.ts`
+- Modify: `scripts/import-orarimesse.ts`
+- Modify: `scripts/import-mass-schedules-ph.ts`
+- Modify: `scripts/import-philmass.ts`
+
+### Step 1: Update church-matcher.ts
+
+In `ExistingChurch` interface (line ~11-26), add after `philmassId`:
+
+```typescript
+horariosMisasId: string | null;
+```
+
+In `ChurchCandidate` type (line ~113-122), add after `philmassId`:
+
+```typescript
+horariosMisasId?: string;
+```
+
+In `findDuplicateChurch()`, add a new pass after the fifth pass (philmassId match, line ~169-175) and before the proximity+name pass:
+
+```typescript
+// Sixth pass: exact horariosMisasId match
+if (candidate.horariosMisasId) {
+  const horariosMisasMatch = existingChurches.find(
+    (church) => church.horariosMisasId === candidate.horariosMisasId
+  );
+  if (horariosMisasMatch) return horariosMisasMatch;
+}
+```
+
+Update the comment on the proximity pass to say "Seventh pass".
+
+### Step 2: Update all existing importers
+
+In every importer that queries churches with a `select` clause containing `philmassId: true`, add:
+
+```typescript
+horariosMisasId: true,
+```
+
+In every importer that creates/pushes churches with `philmassId: null`, add:
+
+```typescript
+horariosMisasId: null,
+```
+
+**Files to update:** `import-osm-churches.ts`, `import-gcatholic.ts`, `import-baidu-churches.ts`, `import-osm-region.ts`, `import-orarimesse.ts`, `import-mass-schedules-ph.ts`, `import-philmass.ts`
+
+### Step 3: Verify build
+
+```bash
+npx tsc --noEmit
+```
+
+Expected: No errors.
+
+### Step 4: Commit
+
+```bash
+git add src/lib/church-matcher.ts scripts/import-*.ts
+git commit -m "feat: add horariosMisasId to church matcher and all importers"
+```
+
+---
+
+## Task 3: Create `import-horariosmisas.ts`
+
+**Files:**
+- Create: `scripts/import-horariosmisas.ts`
+
+### Architecture
+
+This importer follows the exact same structure as `scripts/import-mass-schedules-ph.ts`. Key differences:
+
+- **Sitemap:** Fetches 20 post sitemaps from sitemap index (not a single sitemap)
+- **URL filtering:** Church URLs have 3 path segments (`/{province}/{city}/{slug}/`). Non-church URLs (blog posts, daily readings) are filtered out.
+- **Schedule parsing:** Two seasonal tables (summer/winter). Import seasonally appropriate one based on current month.
+- **Day names:** Spanish (`Lunes`, `Martes`, etc.) with range support (`Lunes a Viernes`)
+- **Times:** 24-hour `HH:MMh` format (e.g., `08:00h`, `20:30h`)
+- **No coordinates:** Churches created with `latitude: 0, longitude: 0` — geocoded separately
+- **Geocoding:** Optional `--geocode` flag uses Nominatim public API (1 req/sec)
+
+### Constants
+
+```typescript
+const SITE_BASE = 'https://horariosmisas.com';
+const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
+const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
+const REQUEST_DELAY_MS = 1500;
+const NOMINATIM_DELAY_MS = 1100;
+const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
+```
+
+### Spanish Day Mapping
+
+```typescript
+const DAY_MAP: Record<string, number[]> = {
+  'domingos y festivos': [0],
+  'domingos': [0],
+  'domingo': [0],
+  'lunes': [1],
+  'martes': [2],
+  'miércoles': [3],
+  'miercoles': [3],
+  'jueves': [4],
+  'viernes': [5],
+  'sábado': [6],
+  'sabado': [6],
+  'sábados': [6],
+  'sabados': [6],
+};
+```
+
+### Sitemap Fetching
+
+1. Fetch sitemap index → extract `post-sitemap*.xml` URLs
+2. Fetch each post sitemap → extract URLs with exactly 3 path segments
+3. Filter out non-church URLs (patterns: `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, `/noticias/`, `/blog/`, `/contacto/`, `/aviso-legal/`, `/politica-de-privacidad/`, `/politica-de-cookies/`)
+4. Deduplicate by slug
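Steps 2 to 4 can be sketched as a pure function over the URLs extracted from the sitemaps. `ChurchUrl`, `parseChurchUrl`, and `dedupeBySlug` are hypothetical names; the excluded segments are the ones listed in step 3, and the sketch assumes sitemap entries are absolute URLs.

```typescript
// Path segments the plan lists as non-church content.
const EXCLUDED_SEGMENTS = new Set([
  'misas-diarias', 'santos-del-dia', 'oraciones', 'noticias', 'blog',
  'contacto', 'aviso-legal', 'politica-de-privacidad', 'politica-de-cookies',
]);

interface ChurchUrl { province: string; city: string; slug: string; url: string }

// Returns the parsed parts when the URL looks like /{province}/{city}/{slug}/,
// otherwise null.
function parseChurchUrl(raw: string): ChurchUrl | null {
  let pathname: string;
  try {
    pathname = new URL(raw).pathname;
  } catch {
    return null; // not an absolute URL
  }
  const segments = pathname.split('/').filter((s) => s.length > 0);
  if (segments.length !== 3) return null;
  if (segments.some((s) => EXCLUDED_SEGMENTS.has(s))) return null;
  const [province, city, slug] = segments;
  return { province, city, slug, url: raw };
}

// Keeps the first URL seen for each slug.
function dedupeBySlug(urls: string[]): ChurchUrl[] {
  const seen = new Map<string, ChurchUrl>();
  for (const raw of urls) {
    const parsed = parseChurchUrl(raw);
    if (parsed && !seen.has(parsed.slug)) seen.set(parsed.slug, parsed);
  }
  return [...seen.values()];
}
```

Keeping `province` in the parsed result also makes the `--province` filter a simple comparison on the same structure.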
+
+### HTML Parsing
+
+**Church name:** `<h1>Church Name (City)</h1>` → strip `(City)` suffix
+
+**Address:** `📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong>` → extract street, postal code (5-digit `\b\d{5}\b`), city (text after postal code), strip `(Province)` suffix
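A minimal sketch of the address split, assuming the text inside `<strong>` has already been extracted. `ParsedAddress` and `parseAddress` are hypothetical names. The sketch treats the first standalone 5-digit number as the postal code, so a 5-digit street number would be misread.

```typescript
interface ParsedAddress { street: string; postalCode: string | null; city: string | null }

function parseAddress(raw: string): ParsedAddress {
  // Drop a trailing "(Province)" suffix, if any.
  const text = raw.replace(/\s*\([^)]*\)\s*$/, '').trim();
  const match = /\b\d{5}\b/.exec(text);
  if (!match) return { street: text, postalCode: null, city: null };
  // Everything before the postal code is the street; everything after is the city.
  const street = text.slice(0, match.index).replace(/[\s,]+$/, '');
  const city = text.slice(match.index + 5).replace(/^[\s,]+/, '').trim();
  return { street, postalCode: match[0], city: city.length > 0 ? city : null };
}
```

For the example above, this yields street `Calle Goya, 26`, postal code `28001`, and city `Madrid`.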
+
+**Phone:** `<strong>Teléfono:</strong> <a href="tel:...">number</a>`
+
+**Website:** `<strong>Página Web:</strong> <a href="url">...</a>`
+
+**Schedule tables:** Find `<table>` elements with DÍA/HORARIO headers. Split by seasonal headings (☀️ verano / ⛄ invierno). Pick seasonally appropriate section (Oct-May = winter, Jun-Sep = summer). Parse `<td>` cells: first cell = day name(s), second cell = times. Times in `HH:MMh` format extracted via regex `(\d{1,2}):(\d{2})\s*h?`.
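The time extraction for a single schedule cell might look like the sketch below. `extractTimes` is a hypothetical name; the regex is the one quoted above, and the range check on hours and minutes is an added assumption to skip malformed values.

```typescript
// Extracts "HH:MM" strings from a schedule cell such as
// "08:00h<br>20:30h (familias)".
function extractTimes(cellHtml: string): string[] {
  const text = cellHtml
    .replace(/<br\s*\/?>/gi, ' ') // line breaks separate multiple times
    .replace(/<[^>]+>/g, '')      // drop any remaining tags
    .replace(/\([^)]*\)/g, '');   // strip annotations like "(familias)"
  const times: string[] = [];
  const pattern = /(\d{1,2}):(\d{2})\s*h?/g;
  for (const m of text.matchAll(pattern)) {
    const hour = Number(m[1]);
    const minute = Number(m[2]);
    if (hour > 23 || minute > 59) continue; // ignore impossible values
    times.push(`${String(hour).padStart(2, '0')}:${m[2]}`);
  }
  return times;
}
```

Zero-padding the hour keeps the output in a uniform `HH:MM` form even when the site writes `9:30h`.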
+
+### Day Range Resolution
+
+Support ranges like `Lunes a Viernes` → [1,2,3,4,5] and compound entries like `Lunes, Miércoles y Viernes` → [1,3,5].
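One way to resolve both forms is sketched below. `DAY_INDEX`, `normalize`, and `resolveDays` are hypothetical names. The sketch strips accents up front, so it needs only the unaccented keys rather than the accented duplicates in `DAY_MAP`. Ranges that wrap past Saturday are an assumption about how labels such as `Sábado a Lunes` should be read.

```typescript
// Unaccented day names; accents are removed by normalize() before lookup.
const DAY_INDEX: Record<string, number> = {
  domingo: 0, domingos: 0, lunes: 1, martes: 2, miercoles: 3,
  jueves: 4, viernes: 5, sabado: 6, sabados: 6,
};

// Lowercases, strips accents, and drops the "y festivos" suffix.
function normalize(label: string): string {
  return label
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/\s+y\s+festivos/, '')
    .trim();
}

// Resolves a day label to day numbers (0 = Sunday). Unknown names are skipped.
function resolveDays(label: string): number[] {
  const text = normalize(label);
  const range = /^(\S+)\s+a\s+(\S+)$/.exec(text);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    if (start === undefined || end === undefined) return [];
    const days: number[] = [];
    // Walk forward through the week, wrapping after Saturday (6) to Sunday (0).
    for (let d = start; ; d = (d + 1) % 7) {
      days.push(d);
      if (d === end) break;
    }
    return days;
  }
  const days = text
    .split(/\s*,\s*|\s+y\s+/)
    .map((name) => DAY_INDEX[name])
    .filter((d): d is number => d !== undefined);
  return [...new Set(days)];
}
```

Removing `y festivos` before splitting matters because the compound branch would otherwise treat `festivos` as an unknown day name.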
+
+### Geocoding (--geocode / --geocode-only)
+
+Query Nominatim with: `{address}, Spain` → fallback to `{postalCode} {city}, Spain` → fallback to `{city}, Spain`. Use `countrycodes=es` parameter. Max 1 req/sec.
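The fallback chain and request URL can be sketched without any network access. `AddressParts`, `geocodeQueries`, and `nominatimSearchUrl` are hypothetical names; `q`, `format`, `countrycodes`, and `limit` are standard Nominatim search parameters, and `limit: '1'` is an assumption of this sketch.

```typescript
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';

interface AddressParts { address?: string; postalCode?: string; city?: string }

// Builds the fallback chain from most to least specific, skipping
// candidates whose inputs are missing.
function geocodeQueries(parts: AddressParts): string[] {
  const queries: string[] = [];
  if (parts.address) queries.push(`${parts.address}, Spain`);
  if (parts.postalCode && parts.city) queries.push(`${parts.postalCode} ${parts.city}, Spain`);
  if (parts.city) queries.push(`${parts.city}, Spain`);
  return queries;
}

// Builds a search URL restricted to Spain, asking for a single JSON result.
function nominatimSearchUrl(query: string): string {
  const params = new URLSearchParams({
    q: query,
    format: 'json',
    countrycodes: 'es',
    limit: '1',
  });
  return `${NOMINATIM_URL}?${params.toString()}`;
}
```

A caller would try each query in order, stop at the first non-empty result, wait `NOMINATIM_DELAY_MS` between requests, and send the `USER_AGENT` header defined in the constants.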
+
+### Matching Strategy
+
+1. `horariosMisasId` exact match (primary — for re-imports)
+2. Name + proximity against existing Spanish OSM churches (secondary)
+3. Unmatched: create new church with `latitude: 0, longitude: 0`, country=ES
+
+### CLI
+
+```
+--all                  Import all churches from sitemaps
+--province <name>      Import only churches from this province
+--dry-run              No database writes
+--geocode              After import, geocode unmatched churches
+--geocode-only         Only geocode (skip import)
+--resume-from <n>      Skip first N churches
+--job-id <uuid>        Background job tracking
+```
+
+### Mass Schedule Language
+
+Set `language: 'Spanish'` on all created mass schedules.
+
+### Step 1: Create the file
+
+Use `scripts/import-mass-schedules-ph.ts` as the structural template. Implement all functions described above.
+
+### Step 2: Verify build
+
+```bash
+npx tsc --noEmit
+```
+
+### Step 3: Dry-run test
+
+```bash
+npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
+```
+
+### Step 4: Commit
+
+```bash
+git add scripts/import-horariosmisas.ts
+git commit -m "feat: add horariosmisas.com Spain church importer"
+```
+
+---
+
+## Task 4: Add to Scheduler Pipeline and npm Scripts
+
+**Files:**
+- Modify: `scripts/scheduler.ts`
+- Modify: `package.json`
+
+### Step 1: Add to PIPELINE_GROUPS
+
+In `scripts/scheduler.ts`, in the `imports` group (line ~40-51), add after the `philmass-import` entry:
+
+```typescript
+{ name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },
+```
+
+### Step 2: Add getJobCommand case
+
+In the `getJobCommand` function (around line ~182), before the `default:` case, add:
+
+```typescript
+case 'horariosmisas-import': {
+  // --province narrows the import, so only pass --all when no province is configured
+  const args = ['tsx', 'scripts/import-horariosmisas.ts'];
+  if (config?.province) args.push('--province', String(config.province));
+  else args.push('--all');
+  args.push('--geocode');
+  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
+  return { command: 'npx', args };
+}
+```
+
+### Step 3: Add npm scripts
+
+In `package.json`, add after the `"import:philmass"` line:
+
+```json
+"import:horariosmisas": "tsx scripts/import-horariosmisas.ts",
+```
+
+### Step 4: Verify build
+
+```bash
+npx tsc --noEmit
+```
+
+### Step 5: Commit
+
+```bash
+git add scripts/scheduler.ts package.json
+git commit -m "feat: add horariosmisas import to scheduler pipeline"
+```
+
+---
+
+## Verification
+
+1. **Dry run on single province**: `npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run`
+   - Verify: church names parsed correctly, schedules extracted, matches found
+2. **Dry run on Madrid**: `npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run`
+   - Verify: larger province, summer/winter schedule selection, address parsing
+3. **Single province real import**: `npx tsx scripts/import-horariosmisas.ts --province navarra`
+   - Verify: churches created/updated, mass schedules in database
+4. **Geocode test**: `npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run`
+   - Verify: finds churches needing geocoding, Nominatim returns coordinates
+5. **Full import**: `npx tsx scripts/import-horariosmisas.ts --all --geocode`
+
+## Runtime Estimate
+
+- Sitemap fetch: 20 sitemaps x 1.5s = ~30s
+- Import: ~10,000 churches x 1.5s = ~4.2 hours
+- Geocode: depends on unmatched count x 1.1s
--- a/docs/plans/2026-03-01-weekdaymasses-importer-design.md
+++ b/docs/plans/2026-03-01-weekdaymasses-importer-design.md
@@ -0,0 +1,103 @@
+# weekdaymasses.org.uk Global Importer
+
+## Context
+
+weekdaymasses.org.uk is a UK-based Catholic directory covering ~3,500-4,000 churches globally with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ other countries (India, Sri Lanka, South Korea, Japan, and more). Each area's data is served on a single HTML page, so no pagination or API is needed.
+
+## Data Source
+
+Three area pages cover the entire site:
+
+| Page | URL | Est. Churches |
+|------|-----|---------------|
+| GB | `/en/area/gb/churches` | ~3,000+ |
+| Ireland | `/en/area/ireland/churches` | ~300+ |
+| Outside GB | `/en/area/outside-gb/churches` | ~152+ |
+
+Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.
+
+### Data per church
+
+- **Name**: h3 heading, format "Church Name (Location)"
+- **Address**: plain text after mass times, with postal/zip code
+- **Coordinates**: in map link query params `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
+- **Mass times**: format `Day: H.MMam/pm(Language), H.MMam/pm(Language)`
+- **Phone**: `Tel: +XX XXXX XXXXXX`
+- **Website**: occasional links
+- **church_id**: unique numeric identifier in map links
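Reading the coordinates and `church_id` out of a map link can be sketched with the standard `URL` API. `MapLinkData` and `parseMapLink` are hypothetical names, the default base URL is an assumption used only to resolve relative links, and the `&amp;` replacement assumes the href comes straight from raw HTML.

```typescript
interface MapLinkData { latitude: number; longitude: number; churchId: string }

// Reads lat, lon and church_id from a map link's query string.
function parseMapLink(href: string, base = 'https://weekdaymasses.org.uk'): MapLinkData | null {
  let url: URL;
  try {
    // Raw HTML attributes may still carry entity-encoded ampersands.
    url = new URL(href.replace(/&amp;/g, '&'), base);
  } catch {
    return null;
  }
  const latRaw = url.searchParams.get('lat');
  const lonRaw = url.searchParams.get('lon');
  const churchId = url.searchParams.get('church_id');
  if (!latRaw || !lonRaw || !churchId) return null;
  const latitude = Number(latRaw);
  const longitude = Number(lonRaw);
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null;
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return null;
  return { latitude, longitude, churchId };
}
```

Returning `null` for missing or out-of-range values keeps a malformed link from silently producing a church at `0, 0`.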
+
+### Mass time format
+
+```
+Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
+Mon Tue Wed Thu Fri: 6.30am(Tamil)
+Saturday: 6.30am(Tamil), 5.30pm(English)
+```
+
+Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or combinations like `Mon Tue Wed Thu Fri`. Also `Holy Day` entries.
+
+Time format: `H.MMam/pm` — needs conversion to 24h `HH:MM`.
+
+Language in parentheses maps to our `language` field on mass_schedules.
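The conversion from `H.MMam/pm(Language)` tokens to 24-hour times can be sketched as follows. `MassTime`, `parseMassTime`, and `parseMassTimes` are hypothetical names, and the sketch assumes the day label and colon have already been split off the line.

```typescript
interface MassTime { time: string; language: string | null }

// Converts "6.30am(Tamil)" to { time: "06:30", language: "Tamil" }.
// Returns null when the token does not match the expected shape.
function parseMassTime(token: string): MassTime | null {
  const m = /^(\d{1,2})\.(\d{2})\s*(am|pm)(?:\(([^)]*)\))?$/i.exec(token.trim());
  if (!m) return null;
  let hour = Number(m[1]);
  const minute = Number(m[2]);
  if (hour < 1 || hour > 12 || minute > 59) return null;
  const isPm = m[3].toLowerCase() === 'pm';
  if (isPm && hour !== 12) hour += 12; // 5.30pm -> 17:30
  if (!isPm && hour === 12) hour = 0;  // 12.15am -> 00:15
  return {
    time: `${String(hour).padStart(2, '0')}:${m[2]}`,
    language: m[4] ? m[4].trim() : null,
  };
}

// Parses the part of a schedule line after "Day:", e.g.
// "6.30am(Tamil), 5.30pm(English)". Unparseable tokens are dropped.
function parseMassTimes(list: string): MassTime[] {
  return list
    .split(',')
    .map((token) => parseMassTime(token))
    .filter((t): t is MassTime => t !== null);
}
```

The two 12 o'clock cases are handled separately because 12pm stays at 12 while 12am becomes 00.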
+
+### Country detection
+
+The address is the last line of each church entry. Country can be detected by:
+- GB: UK postal code pattern (e.g. `SW1A 1AA`)
+- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in address
+- India: 6-digit postal code (e.g. `600088`)
+- Others: country name at end of address, or fallback to the area page being scraped
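The detection order above can be sketched with simplified patterns. `detectCountry` is a hypothetical name, and the regexes are rough approximations, not full validators: some Eircode routing keys such as `D6W` are not covered, and other countries also use 6-digit postal codes, so the area fallback matters.

```typescript
// Returns an ISO 3166-1 alpha-2 code, or the area fallback when nothing matches.
function detectCountry(address: string, areaFallback: string | null): string | null {
  // UK postcode, e.g. "SW1A 1AA" (simplified pattern)
  if (/\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/.test(address)) return 'GB';
  // Irish Eircode, e.g. "D01 F5P2", or the country name itself
  if (/\b[A-Z]\d{2}\s?[A-Z\d]{4}\b/.test(address) || /\bIreland\b/i.test(address)) return 'IE';
  // Indian PIN code: six digits, first digit non-zero
  if (/\b[1-9]\d{5}\b/.test(address)) return 'IN';
  return areaFallback;
}
```

For the GB and Ireland area pages a caller could pass `'GB'` or `'IE'` as the fallback; the outside-GB page has no single country, so `null` is the honest fallback there.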
+
+## Design
+
+### Schema
+
+Add to Church model in both BethelGuide and ScraperControl:
+
+```prisma
+weekdayMassesId String? @unique @map("weekday_masses_id")
+@@index([weekdayMassesId])
+```
+
+### Script: `scripts/import-weekdaymasses.ts`
+
+Single script that:
+
+1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
+2. Parses HTML into structured church entries
+3. Converts mass times from `H.MMam/pm` to `HH:MM` 24h format
+4. Detects country from address patterns
+5. Matches against existing churches by `weekdayMassesId` (exact) then proximity+name
+6. Upserts churches and replaces mass schedules
+
+### HTML parsing strategy
+
+Each church is a block between consecutive h3 headings. Within each block:
+- h3 content = church name
+- Lines with day labels + times = mass schedule
+- Map link = coordinates + church_id
+- Last text block before next h3 = address
+- `Tel:` prefix = phone
+
+### CLI flags
+
+- `--all` — import all 3 area pages
+- `--area <name>` — import specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
+- `--dry-run` — no database writes
+- `--resume-from <n>` — skip first N churches
+- `--job-id <uuid>` — background job tracking
+
+### Church matcher integration
+
+Add `weekdayMassesId` to `ExistingChurch`, `ChurchCandidate`, and a new match pass in `findDuplicateChurch()`.
+
+### Scheduler integration
+
+Add `weekdaymasses-import` to the sequential imports group in the pipeline, with `getJobCommand()` case and npm script.
+
+## Scope
+
+- ~3,500-4,000 churches with mass schedules
+- Most GB/Ireland churches already in DB from OSM (will match and add schedules)
+- India/Sri Lanka/international churches partially in DB from OSM/gcatholic
+- Value: mass schedule data for thousands of churches that currently have none