chore: sync with Gitea master and restore local-only files

Reset local main to gitea/master (new source of truth) and restored
local-only files: web scrapers, admin dashboard, ChromaDB integration,
debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes,
bohosluzby, kerknet, gottesdienstzeiten, miserend importers,
ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Albert
2026-04-12 19:11:22 -04:00
parent 76cca3ba75
commit 2c51513851
133 changed files with 30381 additions and 0 deletions


@@ -0,0 +1,43 @@
# Parallel Scrapers with Country Mapping Fix
## Problem
The scheduler runs scrapers sequentially — one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days. The English scraper alone runs 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by appropriate language scrapers.
## Changes
### 1. Country Mapping Additions (scraper-service.ts)
Add to `COUNTRY_SCRAPER_MAP`:
- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO
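The routing rule behind these additions can be sketched as a lookup with a fallback. This is a minimal illustration, assuming the map is consulted by ISO country code and that anything unmapped lands in the generic queue; `scraperForCountry` is a hypothetical helper name, and the map below is only an excerpt.

```typescript
// Excerpt only: the full map lives in scraper-service.ts.
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  FR: 'french', BE: 'french', LU: 'french',
  DE: 'german', CH: 'german', SI: 'german',
  IT: 'italian', HR: 'italian', RO: 'italian',
};

// Hypothetical helper: resolve the scraper for a church's country code.
// Unknown or missing countries fall back to the generic scraper.
function scraperForCountry(country: string | null | undefined): string {
  if (!country) return 'generic';
  return COUNTRY_SCRAPER_MAP[country.toUpperCase()] ?? 'generic';
}
```

With the new entries, `scraperForCountry('BE')` resolves to the French scraper, while a country that is still unmapped (for example `JP`) keeps going to `generic`.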
### 2. Parallel Pipeline Groups (scheduler.ts)
Replace sequential `PIPELINE_PHASES` array with grouped phases:
| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |
Scheduler starts all jobs in a group simultaneously, waits for all to finish, then advances to the next group.
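The group semantics can be illustrated with promises. This is only a sketch of the ordering rule, not the scheduler's mechanism: the real scheduler spawns child processes and polls a counter, whereas `runGroups`, `Job`, and `JobGroup` here are hypothetical names used for illustration.

```typescript
type Job = () => Promise<void>;

interface JobGroup {
  name: string;
  mode: 'sequential' | 'parallel';
  jobs: Job[];
}

// A failed job still counts as finished, so the pipeline keeps moving.
const settle = (job: Job): Promise<void> => job().catch(() => undefined);

async function runGroups(groups: JobGroup[]): Promise<void> {
  for (const group of groups) {
    if (group.mode === 'parallel') {
      // Start every job in the group at once, then wait for all of them.
      await Promise.all(group.jobs.map(settle));
    } else {
      // Sequential groups run one job at a time, in declaration order.
      for (const job of group.jobs) {
        await settle(job);
      }
    }
  }
}
```

The next group never starts until the slowest job in the current group has finished, so overall cycle time is bounded by the sum of each group's longest job.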
### 3. Generic Scraper Deprioritized
- Moved to last group
- Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)
### 4. Resource Changes
- Scheduler container memory limit: 4GB → 10GB (3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed — existing child process spawning approach is kept
## Approach
Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes — we just allow multiple to run concurrently instead of waiting for each to finish before starting the next.


@@ -0,0 +1,423 @@
# Parallel Scrapers Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Run language scrapers in parallel groups of 3, add missing country mappings, and deprioritize the generic scraper.
**Architecture:** Replace sequential pipeline phases with grouped phases. Parallel groups launch all their jobs concurrently (max 3), then wait for every job to complete before advancing. Import phases stay sequential. The scheduler tracks an `activeGroupJobs` counter for the current group instead of advancing on every job completion.
**Tech Stack:** TypeScript, node child_process spawn, Prisma, Docker Compose
---
### Task 1: Add Missing Country Mappings
**Files:**
- Modify: `src/lib/scraper-service.ts:29-45`
**Step 1: Update COUNTRY_SCRAPER_MAP**
Add these entries to the existing `COUNTRY_SCRAPER_MAP` object at `src/lib/scraper-service.ts:29`:
```typescript
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  US: 'english', CA: 'english', GB: 'english',
  AU: 'english', NZ: 'english', IE: 'english', PH: 'english',
  IN: 'english', SG: 'english', MY: 'english', KE: 'english',
  JM: 'english', TT: 'english', GH: 'english', NG: 'english',
  ZA: 'english', TZ: 'english', UG: 'english',
  FR: 'french', BE: 'french', LU: 'french',
  ES: 'spanish', MX: 'spanish', AR: 'spanish', CO: 'spanish',
  CL: 'spanish', PE: 'spanish', EC: 'spanish', VE: 'spanish',
  CR: 'spanish', PA: 'spanish', GT: 'spanish', CU: 'spanish',
  HN: 'spanish', SV: 'spanish', NI: 'spanish', BO: 'spanish',
  PY: 'spanish', UY: 'spanish', DO: 'spanish',
  IT: 'italian', SM: 'italian', VA: 'italian',
  HR: 'italian', RO: 'italian',
  DE: 'german', AT: 'german', LI: 'german',
  CH: 'german', SI: 'german',
  PL: 'polish',
  PT: 'portuguese', BR: 'portuguese',
  NL: 'dutch',
  CZ: 'czech', SK: 'czech',
  HU: 'hungarian',
};
```
Also update `buildLanguageFilter` at `src/lib/scraper-service.ts:346-463` to include the new countries in each language filter's country list:
- `english` filter (line 356): add `'IN', 'SG', 'MY', 'KE', 'JM', 'TT', 'GH', 'NG', 'ZA', 'TZ', 'UG'`
- `french` filter (line 366): add `'BE', 'LU'` → `{ in: ['FR', 'BE', 'LU'] }`
- `spanish` filter: already has all needed countries
- `italian` filter (line 387): add `'HR', 'RO'` → `{ in: ['IT', 'SM', 'VA', 'HR', 'RO'] }`
- `german` filter (line 397): add `'CH', 'SI'` → `{ in: ['DE', 'AT', 'LI', 'CH', 'SI'] }`
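Because the same country lists appear in both the map and the filters, one way to keep them in sync is to derive each filter's list from the map. This is a sketch of that idea only; `countriesForLanguage` is a hypothetical helper, the map is an excerpt, and the plan itself edits the lists by hand.

```typescript
// Excerpt only: the full map is shown in Step 1.
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  FR: 'french', BE: 'french', LU: 'french',
  DE: 'german', AT: 'german', LI: 'german', CH: 'german', SI: 'german',
};

// Hypothetical helper: every country code routed to the given scraper,
// in the order the codes are declared in the map.
function countriesForLanguage(language: string): string[] {
  return Object.keys(COUNTRY_SCRAPER_MAP).filter(
    (country) => COUNTRY_SCRAPER_MAP[country] === language
  );
}

// The result could feed a Prisma-style filter such as { in: [...] }.
const frenchFilter = { in: countriesForLanguage('french') };
```

Deriving the lists this way means a country added to the map cannot be forgotten in the corresponding filter.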
**Step 2: Verify build**
Run: `npm run build`
Expected: Build succeeds with no errors
**Step 3: Commit**
```bash
git add src/lib/scraper-service.ts
git commit -m "feat: add missing country mappings to language scrapers

Add BE/LU→french, CH/SI→german, HR/RO→italian, IN/SG/MY/KE/JM/TT/GH/NG/ZA/TZ/UG→english.
~1,400 previously unmapped churches now routed to proper language scrapers."
```
---
### Task 2: Rewrite Scheduler for Parallel Groups
**Files:**
- Modify: `scripts/scheduler.ts`
**Step 1: Replace pipeline data structure**
Replace the `PipelinePhase` interface, `PIPELINE_PHASES` array (lines 27-49), and `CycleState` interface (lines 53-69) with:
```typescript
interface PipelinePhase {
  name: string;
  type: string;
  language?: string;
  config: Record<string, unknown>;
}

interface PipelineGroup {
  name: string;
  phases: PipelinePhase[];
  mode: 'sequential' | 'parallel';
}

const PIPELINE_GROUPS: PipelineGroup[] = [
  {
    name: 'imports',
    mode: 'sequential',
    phases: [
      { name: 'osm-import-p1', type: 'osm-import', config: { priority: 1 } },
      { name: 'gcatholic-import', type: 'gcatholic-import', config: { delay: 2000 } },
    ],
  },
  {
    name: 'scrapers-batch-1',
    mode: 'parallel',
    phases: [
      { name: 'scraper-english', type: 'scraper', language: 'english', config: { allMode: true, maxFailures: 10, language: 'english' } },
      { name: 'scraper-french', type: 'scraper', language: 'french', config: { allMode: true, maxFailures: 10, language: 'french' } },
      { name: 'scraper-german', type: 'scraper', language: 'german', config: { allMode: true, maxFailures: 10, language: 'german' } },
    ],
  },
  {
    name: 'scrapers-batch-2',
    mode: 'parallel',
    phases: [
      { name: 'scraper-polish', type: 'scraper', language: 'polish', config: { allMode: true, maxFailures: 10, language: 'polish' } },
      { name: 'scraper-spanish', type: 'scraper', language: 'spanish', config: { allMode: true, maxFailures: 10, language: 'spanish' } },
      { name: 'scraper-italian', type: 'scraper', language: 'italian', config: { allMode: true, maxFailures: 10, language: 'italian' } },
    ],
  },
  {
    name: 'scrapers-batch-3',
    mode: 'parallel',
    phases: [
      { name: 'scraper-portuguese', type: 'scraper', language: 'portuguese', config: { allMode: true, maxFailures: 10, language: 'portuguese' } },
      { name: 'scraper-czech', type: 'scraper', language: 'czech', config: { allMode: true, maxFailures: 10, language: 'czech' } },
      { name: 'scraper-dutch', type: 'scraper', language: 'dutch', config: { allMode: true, maxFailures: 10, language: 'dutch' } },
    ],
  },
  {
    name: 'scrapers-batch-4',
    mode: 'parallel',
    phases: [
      { name: 'scraper-hungarian', type: 'scraper', language: 'hungarian', config: { allMode: true, maxFailures: 10, language: 'hungarian' } },
      { name: 'scraper-generic', type: 'scraper', language: 'generic', config: { allMode: true, maxFailures: 10, language: 'generic' } },
    ],
  },
];
```
**Step 2: Replace CycleState**
```typescript
interface CycleState {
  currentGroupIndex: number;
  currentSequentialPhaseIndex: number; // for sequential groups, tracks which phase within the group
  cycleNumber: number;
  cycleStartedAt: Date | null;
  lastCycleCompletedAt: Date | null;
  waitingForCooldown: boolean;
  activeGroupJobs: number; // how many jobs still running in the current group
}

const cycleState: CycleState = {
  currentGroupIndex: 0,
  currentSequentialPhaseIndex: 0,
  cycleNumber: 0,
  cycleStartedAt: null,
  lastCycleCompletedAt: null,
  waitingForCooldown: false,
  activeGroupJobs: 0,
};
```
**Step 3: Rewrite pollAndAdvancePipeline**
Replace the entire `pollAndAdvancePipeline` function (lines 306-385) and `advancePipelinePhase` function (lines 387-390) with:
```typescript
async function pollAndAdvancePipeline(): Promise<void> {
  try {
    // 1. Check for manual pending jobs from admin API (priority over pipeline)
    if (runningJobs.size === 0) {
      const manualJob = await prisma.backgroundJob.findFirst({
        where: {
          status: 'pending',
          NOT: { config: { path: ['pipelineManaged'], equals: true } },
        },
        orderBy: { createdAt: 'asc' },
      });
      if (manualJob) {
        log(`Found manual job: ${manualJob.type}${manualJob.language ? `:${manualJob.language}` : ''} (${manualJob.id})`);
        await startJobProcess(
          manualJob.id,
          manualJob.type,
          manualJob.language,
          manualJob.config as Record<string, unknown> | null
        );
        return;
      }
    }

    // 2. If jobs are still running for the current group, wait
    if (cycleState.activeGroupJobs > 0) {
      return;
    }

    // 3. If in cooldown, check if expired
    if (cycleState.waitingForCooldown) {
      if (cycleState.lastCycleCompletedAt) {
        const elapsed = Date.now() - cycleState.lastCycleCompletedAt.getTime();
        if (elapsed < CYCLE_COOLDOWN_MS) {
          const remaining = Math.round((CYCLE_COOLDOWN_MS - elapsed) / 60_000);
          if (remaining % 30 === 0 || remaining <= 5) {
            log(`Cooldown: ${remaining} minutes remaining before next cycle`);
          }
          return;
        }
      }
      cycleState.waitingForCooldown = false;
      cycleState.currentGroupIndex = 0;
      cycleState.currentSequentialPhaseIndex = 0;
      // Reset so step 5 logs and timestamps the new cycle, not only the first one
      cycleState.cycleStartedAt = null;
      log('Cooldown expired, starting new cycle');
    }

    // 4. If past the last group, complete the cycle
    if (cycleState.currentGroupIndex >= PIPELINE_GROUPS.length) {
      cycleState.cycleNumber++;
      cycleState.lastCycleCompletedAt = new Date();
      cycleState.waitingForCooldown = true;
      const cooldownHours = CYCLE_COOLDOWN_MS / (60 * 60 * 1000);
      log(`=== Cycle ${cycleState.cycleNumber} complete! Entering ${cooldownHours}h cooldown ===`);
      return;
    }

    // 5. Start the current group
    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];
    if (cycleState.currentGroupIndex === 0 && cycleState.currentSequentialPhaseIndex === 0 && !cycleState.cycleStartedAt) {
      cycleState.cycleStartedAt = new Date();
      log(`=== Starting cycle ${cycleState.cycleNumber + 1} ===`);
    }

    if (group.mode === 'parallel') {
      // Launch all phases in the group concurrently
      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (parallel, ${group.phases.length} jobs)`);
      cycleState.activeGroupJobs = group.phases.length;
      for (const phase of group.phases) {
        const jobId = await createPendingJob(
          phase.type,
          phase.language,
          { ...phase.config, pipelineManaged: true }
        );
        await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
      }
    } else {
      // Sequential: run one phase at a time within the group
      const phaseIndex = cycleState.currentSequentialPhaseIndex;
      if (phaseIndex >= group.phases.length) {
        // All phases in this sequential group are done
        cycleState.currentGroupIndex++;
        cycleState.currentSequentialPhaseIndex = 0;
        return; // Will pick up next group on next poll
      }
      const phase = group.phases[phaseIndex];
      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (sequential ${phaseIndex + 1}/${group.phases.length}: ${phase.name})`);
      cycleState.activeGroupJobs = 1;
      const jobId = await createPendingJob(
        phase.type,
        phase.language,
        { ...phase.config, pipelineManaged: true }
      );
      await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
    }
  } catch (err) {
    logError(`Error in pipeline: ${err}`);
  }
}

function onJobCompleted(): void {
  // Ignore completions when no pipeline job is outstanding (e.g. a manual job
  // started from the admin API), so they cannot advance the pipeline.
  if (cycleState.activeGroupJobs <= 0) return;

  cycleState.activeGroupJobs--;
  if (cycleState.activeGroupJobs <= 0) {
    cycleState.activeGroupJobs = 0;
    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];
    if (group?.mode === 'sequential') {
      cycleState.currentSequentialPhaseIndex++;
      // Check if there are more phases in this sequential group
      if (cycleState.currentSequentialPhaseIndex < group.phases.length) {
        return; // Don't advance group yet
      }
    }
    // Advance to next group
    cycleState.currentGroupIndex++;
    cycleState.currentSequentialPhaseIndex = 0;
    log(`Group "${group?.name}" complete, advancing to group ${cycleState.currentGroupIndex + 1}`);
  }
}
```
**Step 4: Update startJobProcess callbacks**
In the `child.on('close')` callback (line 442) and `child.on('error')` callback (line 472), replace `advancePipelinePhase()` with `onJobCompleted()`.
**Step 5: Update crash recovery**
In `recoverFromCrash` (lines 259-268), replace the `PIPELINE_PHASES.findIndex` logic with a search through `PIPELINE_GROUPS`:
```typescript
if (lastRunningPipelineJob) {
  for (let gi = 0; gi < PIPELINE_GROUPS.length; gi++) {
    const group = PIPELINE_GROUPS[gi];
    const phaseIdx = group.phases.findIndex(
      p => p.type === lastRunningPipelineJob.type &&
        (p.language || null) === (lastRunningPipelineJob.language || null)
    );
    if (phaseIdx >= 0) {
      cycleState.currentGroupIndex = gi;
      cycleState.currentSequentialPhaseIndex = group.mode === 'sequential' ? phaseIdx : 0;
      log(`Resuming pipeline from group ${gi + 1}: ${group.name}`);
      break;
    }
  }
}
```
**Step 6: Update heartbeat log in main()**
Replace the heartbeat cron (lines 551-562) and the startup log (lines 574-580) to reference groups instead of phases:
```typescript
cron.schedule('0 * * * *', () => {
  const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
    ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
    : 'none';
  const jobs = runningJobs.size > 0
    ? `Running: ${[...runningJobs.keys()].join(', ')}`
    : 'No jobs running';
  const state = cycleState.waitingForCooldown
    ? 'cooldown'
    : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
  log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
}, { timezone: 'UTC' });
```
For the startup log:
```typescript
log('=== Scheduler running (parallel grouped pipeline) ===');
log(`Pipeline groups (${PIPELINE_GROUPS.length}):`);
for (let i = 0; i < PIPELINE_GROUPS.length; i++) {
  const g = PIPELINE_GROUPS[i];
  const phaseNames = g.phases.map(p => p.name).join(', ');
  log(`  ${i + 1}. ${g.name} [${g.mode}]: ${phaseNames}`);
}
```
**Step 7: Remove dead Google Places env log**
Delete lines 167-169 (the `GOOGLE_PLACES_API_KEY` log in `validateEnvironment`).
**Step 8: Verify build**
Run: `npm run build`
Expected: Build succeeds
**Step 9: Commit**
```bash
git add scripts/scheduler.ts
git commit -m "feat: parallel grouped pipeline scheduler

Replace sequential pipeline with grouped phases. Import phases run
sequentially, scraper phases run in parallel groups of 3. This reduces
cycle time from days to hours. Generic scraper moved to last group."
```
---
### Task 3: Increase Scheduler Memory Limit
**Files:**
- Modify: `docker-compose.yml:217-220`
**Step 1: Increase memory limit**
Change the scheduler service's `deploy.resources.limits.memory` from `4G` to `10G`:
```yaml
deploy:
  resources:
    limits:
      memory: 10G
```
**Step 2: Commit**
```bash
git add docker-compose.yml
git commit -m "chore: increase scheduler memory to 10G for parallel scrapers"
```
---
### Task 4: Deploy and Verify
**Step 1: Deploy to NAS**
```bash
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
/Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
```
**Step 2: Rebuild and restart scheduler**
```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scheduler && /usr/local/bin/docker compose up -d scheduler'
```
**Step 3: Verify logs show parallel groups**
```bash
ssh albert@192.168.0.145 '/usr/local/bin/docker logs --tail 30 scraper-control-scheduler-1'
```
Expected: Logs show "parallel grouped pipeline", group listings with `[parallel]` and `[sequential]` tags, and eventually multiple concurrent `Running:` entries in heartbeat.


@@ -0,0 +1,72 @@
# Spain Church Importer (horariosmisas.com) — Design
## Overview
Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. Static WordPress site with fully permissive robots.txt and sitemaps. No Playwright needed — simple HTTP + HTML parsing.
## Data Source
- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **No coordinates** — addresses only. Forward geocoding via Nominatim as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps
## Architecture
### Two-Pass Approach
**Pass 1: Import** — Fetch all churches from sitemaps, parse HTML, match against existing Spanish OSM churches, upsert with mass schedules. Unmatched churches created with address but no coordinates.
**Pass 2: Geocode** — Forward-geocode unmatched churches via Nominatim public API (`address → lat/lng`). 1 req/sec rate limit.
### Schema Change
Add `horariosMisasId String? @unique` to Church model (same pattern as `philmassId`, `massSchedulesPhId`). Update church matcher and all existing importers.
### URL Structure
- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.
### HTML Parsing
- **Name:** `<h1>Church Name (City)</h1>` — strip `(City)` suffix
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
- Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
- Import seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
- Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
- Day ranges: "Lunes a Viernes" (Monday-Friday)
- Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
- Annotations stripped: `(familias)`, etc.
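The seasonal rule above (Oct-May = winter, Jun-Sep = summer) reduces to a month check. A minimal sketch, assuming the importer decides by the local calendar month at run time; `seasonForDate` is a hypothetical name.

```typescript
type Season = 'summer' | 'winter';

// June through September count as summer; every other month is winter.
function seasonForDate(date: Date): Season {
  const month = date.getMonth() + 1; // getMonth() is 0-based
  return month >= 6 && month <= 9 ? 'summer' : 'winter';
}
```

An import run in October therefore picks the `⛄ Misas en invierno` table, so schedules imported near a season boundary may need a re-import once the season changes.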
### Matching Strategy
1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates
### CLI
```
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```
### Rate Limiting
- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
- Geocode: 1s between requests (Nominatim public API limit)
### Scheduler Integration
Add to PIPELINE_GROUPS imports group (sequential, after philmass-import).


@@ -0,0 +1,322 @@
# Spain Church Importer (horariosmisas.com) — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.
**Architecture:** Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.
**Tech Stack:** TypeScript, Prisma, HTML parsing (regex — no Playwright), Nominatim geocoding API.
---
## Task 1: Add `horariosMisasId` to Prisma Schema
**Files:**
- Modify: `prisma/schema.prisma`
**Step 1: Add field and index**
After the `philmassId` line (around line 38), add:
```prisma
horariosMisasId String? @unique @map("horarios_misas_id") // horariosmisas.com URL slug
```
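And add an index in the `@@index` block (around line 78). Note that `@unique` already creates a unique index on this column, so the explicit `@@index` is kept only for consistency with the existing ID fields: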
```prisma
@@index([horariosMisasId])
```
**Step 2: Push schema to NAS database**
```bash
npx prisma db push --accept-data-loss
```
Expected: `Your database is now in sync with your Prisma schema.`
**Step 3: Regenerate Prisma client**
```bash
npx prisma generate
```
**Step 4: Push schema to Neon production**
```bash
npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss
```
**Step 5: Commit**
```bash
git add prisma/schema.prisma
git commit -m "feat: add horariosMisasId to Church model for Spain import"
```
---
## Task 2: Extend Church Matcher and Existing Importers
**Files:**
- Modify: `src/lib/church-matcher.ts`
- Modify: `scripts/import-osm-churches.ts`
- Modify: `scripts/import-gcatholic.ts`
- Modify: `scripts/import-baidu-churches.ts`
- Modify: `scripts/import-osm-region.ts`
- Modify: `scripts/import-orarimesse.ts`
- Modify: `scripts/import-mass-schedules-ph.ts`
- Modify: `scripts/import-philmass.ts`
### Step 1: Update church-matcher.ts
In `ExistingChurch` interface (line ~11-26), add after `philmassId`:
```typescript
horariosMisasId: string | null;
```
In `ChurchCandidate` type (line ~113-122), add after `philmassId`:
```typescript
horariosMisasId?: string;
```
In `findDuplicateChurch()`, add a new pass after the fifth pass (philmassId match, line ~169-175). Before the proximity+name pass:
```typescript
// Sixth pass: exact horariosMisasId match
if (candidate.horariosMisasId) {
  const horariosMisasMatch = existingChurches.find(
    (church) => church.horariosMisasId === candidate.horariosMisasId
  );
  if (horariosMisasMatch) return horariosMisasMatch;
}
```
Update the comment on the proximity pass to say "Seventh pass".
### Step 2: Update all existing importers
In every importer that queries churches with a `select` clause containing `philmassId: true`, add:
```typescript
horariosMisasId: true,
```
In every importer that creates/pushes churches with `philmassId: null`, add:
```typescript
horariosMisasId: null,
```
**Files to update:** `import-osm-churches.ts`, `import-gcatholic.ts`, `import-baidu-churches.ts`, `import-osm-region.ts`, `import-orarimesse.ts`, `import-mass-schedules-ph.ts`, `import-philmass.ts`
### Step 3: Verify build
```bash
npx tsc --noEmit
```
Expected: No errors.
### Step 4: Commit
```bash
git add src/lib/church-matcher.ts scripts/import-*.ts
git commit -m "feat: add horariosMisasId to church matcher and all importers"
```
---
## Task 3: Create `import-horariosmisas.ts`
**Files:**
- Create: `scripts/import-horariosmisas.ts`
### Architecture
This importer follows the exact same structure as `scripts/import-mass-schedules-ph.ts`. Key differences:
- **Sitemap:** Fetches 20 post sitemaps from sitemap index (not a single sitemap)
- **URL filtering:** Church URLs have 3 path segments (`/{province}/{city}/{slug}/`). Non-church URLs (blog posts, daily readings) are filtered out.
- **Schedule parsing:** Two seasonal tables (summer/winter). Import seasonally appropriate one based on current month.
- **Day names:** Spanish (`Lunes`, `Martes`, etc.) with range support (`Lunes a Viernes`)
- **Times:** 24-hour `HH:MMh` format (e.g., `08:00h`, `20:30h`)
- **No coordinates:** Churches created with `latitude: 0, longitude: 0` — geocoded separately
- **Geocoding:** Optional `--geocode` flag uses Nominatim public API (1 req/sec)
### Constants
```typescript
const SITE_BASE = 'https://horariosmisas.com';
const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
```
### Spanish Day Mapping
```typescript
const DAY_MAP: Record<string, number[]> = {
  'domingos y festivos': [0],
  'domingos': [0],
  'domingo': [0],
  'lunes': [1],
  'martes': [2],
  'miércoles': [3],
  'miercoles': [3],
  'jueves': [4],
  'viernes': [5],
  'sábado': [6],
  'sabado': [6],
  'sábados': [6],
  'sabados': [6],
};
```
### Sitemap Fetching
1. Fetch sitemap index → extract `post-sitemap*.xml` URLs
2. Fetch each post sitemap → extract URLs with exactly 3 path segments
3. Filter out non-church URLs (patterns: `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, `/noticias/`, `/blog/`, `/contacto/`, `/aviso-legal/`, `/politica-de-privacidad/`, `/politica-de-cookies/`)
4. Deduplicate by slug
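Steps 2 to 4 can be sketched as pure functions over the URLs extracted from the sitemaps. This is an illustration under the assumptions stated in the list above; the function names are hypothetical, and the explicit exclusion set mostly acts as a safety net because those pages normally have fewer than three path segments.

```typescript
const NON_CHURCH_SEGMENTS = new Set<string>([
  'misas-diarias', 'santos-del-dia', 'oraciones', 'noticias', 'blog',
  'contacto', 'aviso-legal', 'politica-de-privacidad', 'politica-de-cookies',
]);

// Non-empty path segments, e.g. ['madrid', 'madrid', 'parroquia-x'].
function pathSegments(url: string): string[] {
  return new URL(url).pathname.split('/').filter((s) => s.length > 0);
}

// Church pages have exactly three segments: /{province}/{city}/{slug}/
function isChurchUrl(url: string): boolean {
  const segments = pathSegments(url);
  return segments.length === 3 && !segments.some((s) => NON_CHURCH_SEGMENTS.has(s));
}

// Keep the first URL seen for each slug (the third path segment).
function uniqueChurchUrls(urls: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const url of urls) {
    if (!isChurchUrl(url)) continue;
    const slug = pathSegments(url)[2];
    if (seen.has(slug)) continue;
    seen.add(slug);
    result.push(url);
  }
  return result;
}
```

Deduplicating on the slug alone follows the plan as written; if two cities ever share a slug, keying on the full path would be the safer variant.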
### HTML Parsing
**Church name:** `<h1>Church Name (City)</h1>` → strip `(City)` suffix
**Address:** `📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong>` → extract street, postal code (5-digit `\b\d{5}\b`), city (text after postal code), strip `(Province)` suffix
**Phone:** `<strong>Teléfono:</strong> <a href="tel:...">number</a>`
**Website:** `<strong>Página Web:</strong> <a href="url">...</a>`
**Schedule tables:** Find `<table>` elements with DÍA/HORARIO headers. Split by seasonal headings (☀️ verano / ⛄ invierno). Pick seasonally appropriate section (Oct-May = winter, Jun-Sep = summer). Parse `<td>` cells: first cell = day name(s), second cell = times. Times in `HH:MMh` format extracted via regex `(\d{1,2}):(\d{2})\s*h?`.
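The address and time rules above can be sketched as two small parsers. This is a minimal sketch, assuming the tag-stripped address text and the raw HTML of a time cell are already available; `parseAddress` and `parseTimes` are hypothetical names, and a street number that happens to have five digits would be mistaken for the postal code.

```typescript
interface ParsedAddress {
  street: string;
  postalCode: string | null;
  city: string | null;
  province: string | null;
}

function parseAddress(raw: string): ParsedAddress {
  let text = raw.trim();
  let province: string | null = null;
  // Strip a trailing "(Province)" suffix first.
  const provinceMatch = text.match(/\s*\(([^)]*)\)\s*$/);
  if (provinceMatch) {
    province = provinceMatch[1].trim();
    text = text.slice(0, provinceMatch.index).trim();
  }
  // The first standalone 5-digit number is treated as the postal code.
  const postal = text.match(/\b\d{5}\b/);
  if (!postal || postal.index === undefined) {
    return { street: text, postalCode: null, city: null, province };
  }
  return {
    street: text.slice(0, postal.index).trim().replace(/,\s*$/, ''),
    postalCode: postal[0],
    city: text.slice(postal.index + 5).trim() || null,
    province,
  };
}

// Extract zero-padded HH:MM times from a cell such as "8:00h<br>20:30h (familias)".
function parseTimes(cellHtml: string): string[] {
  const text = cellHtml
    .replace(/<br\s*\/?>/gi, ' ')   // multiple times are separated by <br>
    .replace(/\([^)]*\)/g, ' ');    // drop annotations like (familias)
  const times: string[] = [];
  const re = /(\d{1,2}):(\d{2})\s*h?/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const hour = Number(m[1]);
    const minute = Number(m[2]);
    if (hour > 23 || minute > 59) continue; // skip impossible times
    times.push(`${hour < 10 ? '0' : ''}${hour}:${m[2]}`);
  }
  return times;
}
```

For the sample address in this section, the sketch yields street `Calle Goya, 26`, postal code `28001`, city `Madrid`, and province `Madrid`.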
### Day Range Resolution
Support ranges like `Lunes a Viernes` → [1,2,3,4,5] and compound entries like `Lunes, Miércoles y Viernes` → [1,3,5].
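One way to resolve both forms is to normalize accents, try the `X a Y` range pattern, and otherwise split on commas and `y`. This is a sketch with hypothetical names (`DAY_INDEX`, `resolveDays`), using the same 0 = Sunday numbering as `DAY_MAP`; unknown words such as `festivos` are simply ignored.

```typescript
const DAY_INDEX: Record<string, number> = {
  domingo: 0, domingos: 0, lunes: 1, martes: 2, miercoles: 3,
  jueves: 4, viernes: 5, sabado: 6, sabados: 6,
};

// Lowercase and strip accents so "Miércoles" and "miercoles" compare equal.
function normalizeLabel(label: string): string {
  return label.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '').trim();
}

function resolveDays(label: string): number[] {
  const text = normalizeLabel(label);
  const range = text.match(/^(\S+)\s+a\s+(\S+)$/);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    if (start === undefined || end === undefined) return [];
    const days: number[] = [];
    // Walk forward through the week, wrapping from Saturday (6) to Sunday (0).
    for (let d = start; ; d = (d + 1) % 7) {
      days.push(d);
      if (d === end) break;
    }
    return days;
  }
  // Compound entries: "Lunes, Miércoles y Viernes"
  const days = text
    .split(/\s*,\s*|\s+y\s+/)
    .map((part) => DAY_INDEX[part])
    .filter((d): d is number => d !== undefined);
  return Array.from(new Set(days));
}
```

Wrapping ranges such as `Lunes a Domingo` resolve to `[1, 2, 3, 4, 5, 6, 0]` under this sketch.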
### Geocoding (--geocode / --geocode-only)
Query Nominatim with: `{address}, Spain` → fallback to `{postalCode} {city}, Spain` → fallback to `{city}, Spain`. Use `countrycodes=es` parameter. Max 1 req/sec.
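The fallback chain can be expressed as an ordered list of query strings, tried one at a time until Nominatim returns a result. The sketch below only builds the queries and request URLs; the helper names and the `AddressParts` shape are hypothetical, and the `format=jsonv2` and `limit=1` parameters are assumptions beyond what the plan specifies.

```typescript
interface AddressParts {
  address?: string | null;
  postalCode?: string | null;
  city?: string | null;
}

// Most specific query first, least specific last.
function buildGeocodeQueries(parts: AddressParts): string[] {
  const candidates: Array<string | null> = [
    parts.address ? `${parts.address}, Spain` : null,
    parts.postalCode && parts.city ? `${parts.postalCode} ${parts.city}, Spain` : null,
    parts.city ? `${parts.city}, Spain` : null,
  ];
  return candidates.filter((q): q is string => q !== null);
}

function nominatimUrl(query: string): string {
  const params = new URLSearchParams({
    q: query,
    format: 'jsonv2',     // assumed response format
    countrycodes: 'es',   // restrict results to Spain, as the plan requires
    limit: '1',           // assumed: only the best match is needed
  });
  return `https://nominatim.openstreetmap.org/search?${params.toString()}`;
}
```

Each fallback attempt counts against the 1 req/sec limit, so a church that needs all three queries takes at least three delay intervals.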
### Matching Strategy
1. `horariosMisasId` exact match (primary — for re-imports)
2. Name + proximity against existing Spanish OSM churches (secondary)
3. Unmatched: create new church with `latitude: 0, longitude: 0`, country=ES
### CLI
```
--all Import all churches from sitemaps
--province <name> Import only churches from this province
--dry-run No database writes
--geocode After import, geocode unmatched churches
--geocode-only Only geocode (skip import)
--resume-from <n> Skip first N churches
--job-id <uuid> Background job tracking
```
### Mass Schedule Language
Set `language: 'Spanish'` on all created mass schedules.
### Step 1: Create the file
Use `scripts/import-mass-schedules-ph.ts` as the structural template. Implement all functions described above.
### Step 2: Verify build
```bash
npx tsc --noEmit
```
### Step 3: Dry-run test
```bash
npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
```
### Step 4: Commit
```bash
git add scripts/import-horariosmisas.ts
git commit -m "feat: add horariosmisas.com Spain church importer"
```
---
## Task 4: Add to Scheduler Pipeline and npm Scripts
**Files:**
- Modify: `scripts/scheduler.ts`
- Modify: `package.json`
### Step 1: Add to PIPELINE_GROUPS
In `scripts/scheduler.ts`, in the `imports` group (line ~40-51), add after the `philmass-import` entry:
```typescript
{ name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },
```
### Step 2: Add getJobCommand case
In the `getJobCommand` function (around line ~182), before the `default:` case, add:
```typescript
case 'horariosmisas-import': {
  const args = ['tsx', 'scripts/import-horariosmisas.ts', '--all', '--geocode'];
  if (config?.province) args.push('--province', String(config.province));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
### Step 3: Add npm scripts
In `package.json`, add after the `"import:philmass"` line:
```json
"import:horariosmisas": "tsx scripts/import-horariosmisas.ts",
```
### Step 4: Verify build
```bash
npx tsc --noEmit
```
### Step 5: Commit
```bash
git add scripts/scheduler.ts package.json
git commit -m "feat: add horariosmisas import to scheduler pipeline"
```
---
## Verification
1. **Dry run on single province**: `npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run`
- Verify: church names parsed correctly, schedules extracted, matches found
2. **Dry run on Madrid**: `npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run`
- Verify: larger province, summer/winter schedule selection, address parsing
3. **Single province real import**: `npx tsx scripts/import-horariosmisas.ts --province navarra`
- Verify: churches created/updated, mass schedules in database
4. **Geocode test**: `npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run`
- Verify: finds churches needing geocoding, Nominatim returns coordinates
5. **Full import**: `npx tsx scripts/import-horariosmisas.ts --all --geocode`
## Runtime Estimate
- Sitemap fetch: 20 sitemaps x 1.5s = ~30s
- Import: ~10,000 churches x 1.5s = ~4.2 hours
- Geocode: depends on unmatched count x 1.1s


@@ -0,0 +1,103 @@
# weekdaymasses.org.uk Global Importer
## Context
weekdaymasses.org.uk is a UK-based Catholic directory covering ~3,500-4,000 churches globally with mass schedules, coordinates, addresses, and phone numbers. Covers GB, Ireland, and 49+ international countries (India, Sri Lanka, South Korea, Japan, and more). All data served on single HTML pages per area — no pagination or API needed.
## Data Source
Three area pages cover the entire site:
| Page | URL | Est. Churches |
|------|-----|---------------|
| GB | `/en/area/gb/churches` | ~3,000+ |
| Ireland | `/en/area/ireland/churches` | ~300+ |
| Outside GB | `/en/area/outside-gb/churches` | ~152+ |
Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.
### Data per church
- **Name**: h3 heading, format "Church Name (Location)"
- **Address**: plain text after mass times, with postal/zip code
- **Coordinates**: in map link query params `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
- **Mass times**: format `Day: HH.MMam/pm(Language), HH.MMam/pm(Language)`
- **Phone**: `Tel: +XX XXXX XXXXXX`
- **Website**: occasional links
- **church_id**: unique numeric identifier in map links
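Coordinates and the identifier can be read straight from the map link's query string. A minimal sketch, assuming the link's `href` has already been extracted from the HTML; `parseMapLink` and `MapLinkData` are hypothetical names.

```typescript
interface MapLinkData {
  latitude: number;
  longitude: number;
  churchId: string;
}

function parseMapLink(href: string): MapLinkData | null {
  const queryStart = href.indexOf('?');
  const query = queryStart >= 0 ? href.slice(queryStart + 1) : href;
  // Links copied from raw HTML may still contain &amp; entities.
  const params = new URLSearchParams(query.replace(/&amp;/g, '&'));
  const latitude = parseFloat(params.get('lat') ?? '');
  const longitude = parseFloat(params.get('lon') ?? '');
  const churchId = params.get('church_id');
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude) || !churchId) {
    return null; // incomplete link: skip rather than store bad coordinates
  }
  return { latitude, longitude, churchId };
}
```

Returning `null` for incomplete links keeps a missing `lat` from silently becoming `0`.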
### Mass time format
```
Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)
```
Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or combinations like `Mon Tue Wed Thu Fri`. Also `Holy Day` entries.
Time format: `H.MMam/pm` — needs conversion to 24h `HH:MM`.
Language in parentheses maps to our `language` field on mass_schedules.
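The conversion and line parsing described above might look like the following sketch. The names (`to24Hour`, `parseMassLine`, `DAY_TOKENS`) are hypothetical, days use 0 = Sunday, and labels that are not weekdays (such as `Holy Day`) come back with an empty `days` array so the caller can decide how to store them.

```typescript
const DAY_TOKENS: Record<string, number> = {
  sun: 0, mon: 1, tue: 2, wed: 3, thu: 4, fri: 5, sat: 6,
};

interface MassTime { time: string; language: string | null }
interface MassLine { label: string; days: number[]; times: MassTime[] }

// "5.30pm" -> "17:30", "12.15am" -> "00:15"
function to24Hour(time: string): string | null {
  const m = time.trim().match(/^(\d{1,2})\.(\d{2})\s*(am|pm)$/i);
  if (!m) return null;
  let hour = Number(m[1]) % 12; // 12am -> 0, 12pm -> 0 (+12 below)
  if (m[3].toLowerCase() === 'pm') hour += 12;
  return `${hour < 10 ? '0' : ''}${hour}:${m[2]}`;
}

function parseMassLine(line: string): MassLine | null {
  const sep = line.indexOf(':');
  if (sep < 0) return null;
  const label = line.slice(0, sep).trim();
  // Match on the first three letters so "Sunday" and "Mon" both resolve.
  const days = label
    .split(/\s+/)
    .map((token) => DAY_TOKENS[token.slice(0, 3).toLowerCase()])
    .filter((d): d is number => d !== undefined);
  const rest = line.slice(sep + 1);
  const times: MassTime[] = [];
  const re = /(\d{1,2}\.\d{2}\s*(?:am|pm))\s*(?:\(([^)]*)\))?/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(rest)) !== null) {
    const time = to24Hour(m[1]);
    if (time === null) continue;
    times.push({ time, language: m[2] ? m[2].trim() : null });
  }
  return { label, days, times };
}
```

Applied to the first sample line, this yields Sunday with `06:30` (Tamil), `08:30` (Tamil), and `17:30` (English).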
### Country detection
The address is the last line of each church entry. Country can be detected by:
- GB: UK postal code pattern (e.g. `SW1A 1AA`)
- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in address
- India: 6-digit postal code (e.g. `600088`)
- Others: country name at end of address, or fallback to the area page being scraped
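These heuristics can be sketched as an ordered series of pattern checks. The regular expressions below are simplified approximations of the UK postcode and Eircode formats, not full validators, and `detectCountry` is a hypothetical name. Six-digit postal codes are not unique to India, so when the area page is known, its country is probably the more reliable signal for non-GB, non-Irish entries.

```typescript
// Returns an ISO 3166-1 alpha-2 code, or the fallback derived from the area page.
function detectCountry(address: string, areaFallback: string | null): string | null {
  const text = address.trim();
  // Simplified UK postcode, e.g. "SW1A 1AA"
  if (/\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/i.test(text)) return 'GB';
  // Simplified Eircode, e.g. "D01 F5P2", or an explicit country name
  if (/\b[A-Z]\d{2}\s?[A-Z\d]{4}\b/i.test(text) || /\bIreland\b/i.test(text)) return 'IE';
  // Six-digit postal code, e.g. "600088"
  if (/\b\d{6}\b/.test(text)) return 'IN';
  return areaFallback;
}
```

The "country name at end of address" rule is left out of this sketch because it needs a country-name lookup table that the design does not specify.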
## Design
### Schema
Add to Church model in both BethelGuide and ScraperControl:
```prisma
weekdayMassesId String? @unique @map("weekday_masses_id")
@@index([weekdayMassesId])
```
### Script: `scripts/import-weekdaymasses.ts`
Single script that:
1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
2. Parses HTML into structured church entries
3. Converts mass times from `H.MMam/pm` to `HH:MM` 24h format
4. Detects country from address patterns
5. Matches against existing churches by `weekdayMassesId` (exact) then proximity+name
6. Upserts churches and replaces mass schedules
### HTML parsing strategy
Each church is a block between consecutive h3 headings. Within each block:
- h3 content = church name
- Lines with day labels + times = mass schedule
- Map link = coordinates + church_id
- Last text block before next h3 = address
- `Tel:` prefix = phone
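Splitting the page into per-church blocks could be done with a plain string split on `<h3>` tags, as sketched below. The site's exact markup is not shown in this design, so the tag handling is an assumption, and `splitChurchBlocks` and `ChurchBlock` are hypothetical names; the schedule, map link, address, and phone would then be extracted from each block's `body`.

```typescript
interface ChurchBlock {
  name: string;
  location: string | null;
  body: string; // raw HTML between this heading and the next one
}

function splitChurchBlocks(html: string): ChurchBlock[] {
  const blocks: ChurchBlock[] = [];
  // Everything before the first <h3> is page chrome, so drop element 0.
  const parts = html.split(/<h3\b[^>]*>/i).slice(1);
  for (const part of parts) {
    const close = part.search(/<\/h3\s*>/i);
    if (close < 0) continue; // malformed heading: skip the block
    const heading = part.slice(0, close).replace(/<[^>]+>/g, '').trim();
    const body = part.slice(close).replace(/^<\/h3\s*>/i, '');
    // Headings look like "Church Name (Location)"; the suffix is optional.
    const m = heading.match(/^(.*?)\s*\(([^()]*)\)\s*$/);
    blocks.push({
      name: m ? m[1] : heading,
      location: m ? m[2].trim() : null,
      body,
    });
  }
  return blocks;
}
```

A real HTML parser would be more robust against nested or unusual markup; the split-based approach is shown only because the structure described here is flat.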
### CLI flags
- `--all` — import all 3 area pages
- `--area <name>` — import specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
- `--dry-run` — no database writes
- `--resume-from <n>` — skip first N churches
- `--job-id <uuid>` — background job tracking
### Church matcher integration
Add `weekdayMassesId` to `ExistingChurch`, `ChurchCandidate`, and a new match pass in `findDuplicateChurch()`.
### Scheduler integration
Add `weekdaymasses-import` to the sequential imports group in the pipeline, with `getJobCommand()` case and npm script.
## Scope
- ~3,500-4,000 churches with mass schedules
- Most GB/Ireland churches already in DB from OSM (will match and add schedules)
- India/Sri Lanka/international churches partially in DB from OSM/gcatholic
- Value: mass schedule data for thousands of churches that currently have none