chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (the new source of truth) and restore local-only files that aren't tracked in Gitea: web scrapers, admin dashboard, ChromaDB integration, debug scripts, and utility libraries.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes, bohosluzby, kerknet, gottesdienstzeiten, and miserend importers; ClaimRequest model; forward geocoding; heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
43
docs/plans/2026-02-25-parallel-scrapers-design.md
Normal file
@@ -0,0 +1,43 @@
# Parallel Scrapers with Country Mapping Fix

## Problem

The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days. The English scraper alone runs 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by appropriate language scrapers.

## Changes

### 1. Country Mapping Additions (scraper-service.ts)

Add to `COUNTRY_SCRAPER_MAP`:
- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO

### 2. Parallel Pipeline Groups (scheduler.ts)

Replace the sequential `PIPELINE_PHASES` array with grouped phases:

| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |

For a parallel group, the scheduler starts all jobs simultaneously, waits for all of them to finish, then advances to the next group. A sequential group runs its jobs one at a time before the scheduler advances.
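
The group behaviour can be modelled in a few lines. The following is only a simplified sketch: the real scheduler spawns child processes and polls for completion, whereas here `runJob` is a caller-supplied stand-in and the names `JobGroup` and `runGroups` are illustrative assumptions.

```typescript
// Simplified model of the grouped pipeline (not the scheduler's actual code).
// `runJob` is a hypothetical stand-in for spawning and awaiting one scraper job.
type GroupMode = 'sequential' | 'parallel';

interface JobGroup {
  name: string;
  mode: GroupMode;
  jobs: string[];
}

async function runGroups(
  groups: JobGroup[],
  runJob: (job: string) => Promise<void>,
): Promise<string[]> {
  const finishedGroups: string[] = [];
  for (const group of groups) {
    if (group.mode === 'parallel') {
      // Start every job at once, then wait until all of them have settled.
      // A failed job is swallowed here so it cannot block the rest of the group.
      await Promise.all(group.jobs.map((job) => runJob(job).catch(() => undefined)));
    } else {
      // Sequential groups run one job at a time, in order.
      for (const job of group.jobs) {
        await runJob(job).catch(() => undefined);
      }
    }
    finishedGroups.push(group.name);
  }
  return finishedGroups;
}
```

The next group starts only after the slowest job of the current group has finished, so a cycle takes roughly the sum of each group's longest job.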

### 3. Generic Scraper Deprioritized

- Moved to last group
- Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)
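
A minimal sketch of that pre-check, assuming the scheduler can count unscraped churches per scraper queue. The `countUnscraped` callback and the `shouldRunGeneric` name are hypothetical; the actual query is not shown in this plan.

```typescript
// Hypothetical pre-check. `countUnscraped` stands in for whatever query the
// scheduler uses to count unscraped churches in a given scraper's queue.
type CountUnscraped = (language: string) => Promise<number>;

async function shouldRunGeneric(countUnscraped: CountUnscraped): Promise<boolean> {
  const pending = await countUnscraped('generic');
  // Skip the phase entirely when the generic queue is empty.
  return pending > 0;
}
```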

### 4. Resource Changes

- Scheduler container memory limit: 4GB → 10GB (3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed; the existing child process spawning approach is kept

## Approach

Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes; we just allow multiple to run concurrently instead of waiting for each to finish before starting the next.
423
docs/plans/2026-02-25-parallel-scrapers.md
Normal file
@@ -0,0 +1,423 @@
# Parallel Scrapers Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Run language scrapers in parallel groups of 3, add missing country mappings, and deprioritize the generic scraper.

**Architecture:** Replace sequential pipeline phases with grouped phases. Groups run their jobs concurrently (max 3), then wait for all to complete before advancing. Import phases stay sequential. The scheduler tracks an `activeGroupJobs` counter for the current group instead of advancing on every job completion.

**Tech Stack:** TypeScript, node child_process spawn, Prisma, Docker Compose

---

### Task 1: Add Missing Country Mappings

**Files:**
- Modify: `src/lib/scraper-service.ts:29-45`

**Step 1: Update COUNTRY_SCRAPER_MAP**

Add these entries to the existing `COUNTRY_SCRAPER_MAP` object at `src/lib/scraper-service.ts:29`:

```typescript
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  US: 'english', CA: 'english', GB: 'english',
  AU: 'english', NZ: 'english', IE: 'english', PH: 'english',
  IN: 'english', SG: 'english', MY: 'english', KE: 'english',
  JM: 'english', TT: 'english', GH: 'english', NG: 'english',
  ZA: 'english', TZ: 'english', UG: 'english',
  FR: 'french', BE: 'french', LU: 'french',
  ES: 'spanish', MX: 'spanish', AR: 'spanish', CO: 'spanish',
  CL: 'spanish', PE: 'spanish', EC: 'spanish', VE: 'spanish',
  CR: 'spanish', PA: 'spanish', GT: 'spanish', CU: 'spanish',
  HN: 'spanish', SV: 'spanish', NI: 'spanish', BO: 'spanish',
  PY: 'spanish', UY: 'spanish', DO: 'spanish',
  IT: 'italian', SM: 'italian', VA: 'italian',
  HR: 'italian', RO: 'italian',
  DE: 'german', AT: 'german', LI: 'german',
  CH: 'german', SI: 'german',
  PL: 'polish',
  PT: 'portuguese', BR: 'portuguese',
  NL: 'dutch',
  CZ: 'czech', SK: 'czech',
  HU: 'hungarian',
};
```

Also update `buildLanguageFilter` at `src/lib/scraper-service.ts:346-463` to include the new countries in each language filter's country list:

- `english` filter (line 356): add `'IN', 'SG', 'MY', 'KE', 'JM', 'TT', 'GH', 'NG', 'ZA', 'TZ', 'UG'`
- `french` filter (line 366): add `'BE', 'LU'` → `{ in: ['FR', 'BE', 'LU'] }`
- `spanish` filter: already has all needed countries
- `italian` filter (line 387): add `'HR', 'RO'` → `{ in: ['IT', 'SM', 'VA', 'HR', 'RO'] }`
- `german` filter (line 397): add `'CH', 'SI'` → `{ in: ['DE', 'AT', 'LI', 'CH', 'SI'] }`
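
Because each country code now lives in both `COUNTRY_SCRAPER_MAP` and the matching `buildLanguageFilter` list, the two can drift apart. One possible way to keep them in sync is to derive the lists from the map. This is only a sketch: `countriesForLanguage` and the trimmed-down `SAMPLE_MAP` are illustrative names, not part of the existing code.

```typescript
// SAMPLE_MAP is a small excerpt of the mapping, used for illustration only.
const SAMPLE_MAP: Record<string, string> = {
  FR: 'french', BE: 'french', LU: 'french',
  DE: 'german', AT: 'german', LI: 'german', CH: 'german', SI: 'german',
  IT: 'italian', SM: 'italian', VA: 'italian', HR: 'italian', RO: 'italian',
};

// Hypothetical helper: collect every country code mapped to one language,
// in the order the codes appear in the map.
function countriesForLanguage(map: Record<string, string>, language: string): string[] {
  return Object.entries(map)
    .filter(([, scraper]) => scraper === language)
    .map(([country]) => country);
}

// Same shape as the `{ in: [...] }` country clauses listed above.
const frenchCountryFilter = { in: countriesForLanguage(SAMPLE_MAP, 'french') };
```

With this approach a new mapping entry would reach the filter automatically, at the cost of the filter lists no longer being readable in place.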

**Step 2: Verify build**

Run: `npm run build`
Expected: Build succeeds with no errors

**Step 3: Commit**

```bash
git add src/lib/scraper-service.ts
git commit -m "feat: add missing country mappings to language scrapers

Add BE/LU→french, CH/SI→german, HR/RO→italian, IN/SG/MY/KE/JM/TT/GH/NG/ZA/TZ/UG→english.
~1,400 previously unmapped churches now routed to proper language scrapers."
```

---

### Task 2: Rewrite Scheduler for Parallel Groups

**Files:**
- Modify: `scripts/scheduler.ts`

**Step 1: Replace pipeline data structure**

Replace the `PipelinePhase` interface, `PIPELINE_PHASES` array (lines 27-49), and `CycleState` interface (lines 53-69) with:

```typescript
interface PipelinePhase {
  name: string;
  type: string;
  language?: string;
  config: Record<string, unknown>;
}

interface PipelineGroup {
  name: string;
  phases: PipelinePhase[];
  mode: 'sequential' | 'parallel';
}

const PIPELINE_GROUPS: PipelineGroup[] = [
  {
    name: 'imports',
    mode: 'sequential',
    phases: [
      { name: 'osm-import-p1', type: 'osm-import', config: { priority: 1 } },
      { name: 'gcatholic-import', type: 'gcatholic-import', config: { delay: 2000 } },
    ],
  },
  {
    name: 'scrapers-batch-1',
    mode: 'parallel',
    phases: [
      { name: 'scraper-english', type: 'scraper', language: 'english', config: { allMode: true, maxFailures: 10, language: 'english' } },
      { name: 'scraper-french', type: 'scraper', language: 'french', config: { allMode: true, maxFailures: 10, language: 'french' } },
      { name: 'scraper-german', type: 'scraper', language: 'german', config: { allMode: true, maxFailures: 10, language: 'german' } },
    ],
  },
  {
    name: 'scrapers-batch-2',
    mode: 'parallel',
    phases: [
      { name: 'scraper-polish', type: 'scraper', language: 'polish', config: { allMode: true, maxFailures: 10, language: 'polish' } },
      { name: 'scraper-spanish', type: 'scraper', language: 'spanish', config: { allMode: true, maxFailures: 10, language: 'spanish' } },
      { name: 'scraper-italian', type: 'scraper', language: 'italian', config: { allMode: true, maxFailures: 10, language: 'italian' } },
    ],
  },
  {
    name: 'scrapers-batch-3',
    mode: 'parallel',
    phases: [
      { name: 'scraper-portuguese', type: 'scraper', language: 'portuguese', config: { allMode: true, maxFailures: 10, language: 'portuguese' } },
      { name: 'scraper-czech', type: 'scraper', language: 'czech', config: { allMode: true, maxFailures: 10, language: 'czech' } },
      { name: 'scraper-dutch', type: 'scraper', language: 'dutch', config: { allMode: true, maxFailures: 10, language: 'dutch' } },
    ],
  },
  {
    name: 'scrapers-batch-4',
    mode: 'parallel',
    phases: [
      { name: 'scraper-hungarian', type: 'scraper', language: 'hungarian', config: { allMode: true, maxFailures: 10, language: 'hungarian' } },
      { name: 'scraper-generic', type: 'scraper', language: 'generic', config: { allMode: true, maxFailures: 10, language: 'generic' } },
    ],
  },
];
```

**Step 2: Replace CycleState**

```typescript
interface CycleState {
  currentGroupIndex: number;
  currentSequentialPhaseIndex: number;  // for sequential groups, tracks which phase within the group
  cycleNumber: number;
  cycleStartedAt: Date | null;
  lastCycleCompletedAt: Date | null;
  waitingForCooldown: boolean;
  activeGroupJobs: number;  // how many jobs still running in the current group
}

const cycleState: CycleState = {
  currentGroupIndex: 0,
  currentSequentialPhaseIndex: 0,
  cycleNumber: 0,
  cycleStartedAt: null,
  lastCycleCompletedAt: null,
  waitingForCooldown: false,
  activeGroupJobs: 0,
};
```

**Step 3: Rewrite pollAndAdvancePipeline**

Replace the entire `pollAndAdvancePipeline` function (lines 306-385) and `advancePipelinePhase` function (lines 387-390) with:

```typescript
async function pollAndAdvancePipeline(): Promise<void> {
  try {
    // 1. Check for manual pending jobs from admin API (priority over pipeline)
    if (runningJobs.size === 0) {
      const manualJob = await prisma.backgroundJob.findFirst({
        where: {
          status: 'pending',
          NOT: { config: { path: ['pipelineManaged'], equals: true } },
        },
        orderBy: { createdAt: 'asc' },
      });

      if (manualJob) {
        log(`Found manual job: ${manualJob.type}${manualJob.language ? `:${manualJob.language}` : ''} (${manualJob.id})`);
        await startJobProcess(
          manualJob.id,
          manualJob.type,
          manualJob.language,
          manualJob.config as Record<string, unknown> | null
        );
        return;
      }
    }

    // 2. If jobs are still running for the current group, wait
    if (cycleState.activeGroupJobs > 0) {
      return;
    }

    // 3. If in cooldown, check if expired
    if (cycleState.waitingForCooldown) {
      if (cycleState.lastCycleCompletedAt) {
        const elapsed = Date.now() - cycleState.lastCycleCompletedAt.getTime();
        if (elapsed < CYCLE_COOLDOWN_MS) {
          const remaining = Math.round((CYCLE_COOLDOWN_MS - elapsed) / 60_000);
          if (remaining % 30 === 0 || remaining <= 5) {
            log(`Cooldown: ${remaining} minutes remaining before next cycle`);
          }
          return;
        }
      }
      cycleState.waitingForCooldown = false;
      cycleState.currentGroupIndex = 0;
      cycleState.currentSequentialPhaseIndex = 0;
      log('Cooldown expired, starting new cycle');
    }

    // 4. If past the last group, complete the cycle
    if (cycleState.currentGroupIndex >= PIPELINE_GROUPS.length) {
      cycleState.cycleNumber++;
      cycleState.lastCycleCompletedAt = new Date();
      cycleState.waitingForCooldown = true;
      const cooldownHours = CYCLE_COOLDOWN_MS / (60 * 60 * 1000);
      log(`=== Cycle ${cycleState.cycleNumber} complete! Entering ${cooldownHours}h cooldown ===`);
      return;
    }

    // 5. Start the current group
    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];

    if (cycleState.currentGroupIndex === 0 && cycleState.currentSequentialPhaseIndex === 0 && !cycleState.cycleStartedAt) {
      cycleState.cycleStartedAt = new Date();
      log(`=== Starting cycle ${cycleState.cycleNumber + 1} ===`);
    }

    if (group.mode === 'parallel') {
      // Launch all phases in the group concurrently
      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (parallel, ${group.phases.length} jobs)`);
      cycleState.activeGroupJobs = group.phases.length;

      for (const phase of group.phases) {
        const jobId = await createPendingJob(
          phase.type,
          phase.language,
          { ...phase.config, pipelineManaged: true }
        );
        await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
      }
    } else {
      // Sequential: run one phase at a time within the group
      const phaseIndex = cycleState.currentSequentialPhaseIndex;
      if (phaseIndex >= group.phases.length) {
        // All phases in this sequential group are done
        cycleState.currentGroupIndex++;
        cycleState.currentSequentialPhaseIndex = 0;
        return; // Will pick up next group on next poll
      }

      const phase = group.phases[phaseIndex];
      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (sequential ${phaseIndex + 1}/${group.phases.length}: ${phase.name})`);
      cycleState.activeGroupJobs = 1;

      const jobId = await createPendingJob(
        phase.type,
        phase.language,
        { ...phase.config, pipelineManaged: true }
      );
      await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
    }
  } catch (err) {
    logError(`Error in pipeline: ${err}`);
  }
}

function onJobCompleted(): void {
  cycleState.activeGroupJobs--;

  if (cycleState.activeGroupJobs <= 0) {
    cycleState.activeGroupJobs = 0;
    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];

    if (group?.mode === 'sequential') {
      cycleState.currentSequentialPhaseIndex++;
      // Check if there are more phases in this sequential group
      if (cycleState.currentSequentialPhaseIndex < group.phases.length) {
        return; // Don't advance group yet
      }
    }

    // Advance to next group
    cycleState.currentGroupIndex++;
    cycleState.currentSequentialPhaseIndex = 0;
    log(`Group "${group?.name}" complete, advancing to group ${cycleState.currentGroupIndex + 1}`);
  }
}
```
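
One thing to watch when wiring this up: if `onJobCompleted()` is called for every finished child process, including the manual jobs started in step 1, a manual job that finishes while `activeGroupJobs` is 0 would advance the pipeline by one group. A possible guard is sketched below. It assumes the completion callback can be told whether the job was pipeline-managed; the `completeJob` function, its `pipelineManaged` parameter, and the reduced `GroupCounter` state are hypothetical, and the sequential-phase handling is left out for brevity.

```typescript
// Hypothetical guard: only pipeline-managed jobs count toward group progress.
// GroupCounter is a reduced stand-in for the scheduler's cycle state.
interface GroupCounter {
  activeGroupJobs: number;
  currentGroupIndex: number;
}

function completeJob(state: GroupCounter, pipelineManaged: boolean): void {
  // Manual jobs never touch the counter, so they cannot advance the pipeline.
  if (!pipelineManaged) return;

  state.activeGroupJobs = Math.max(0, state.activeGroupJobs - 1);
  if (state.activeGroupJobs === 0) {
    // The last job of the group has finished: move on to the next group.
    state.currentGroupIndex++;
  }
}
```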

**Step 4: Update startJobProcess callbacks**

In the `child.on('close')` callback (line 442) and `child.on('error')` callback (line 472), replace `advancePipelinePhase()` with `onJobCompleted()`.

**Step 5: Update crash recovery**

In `recoverFromCrash` (lines 259-268), replace the `PIPELINE_PHASES.findIndex` logic with a search through `PIPELINE_GROUPS`:

```typescript
if (lastRunningPipelineJob) {
  for (let gi = 0; gi < PIPELINE_GROUPS.length; gi++) {
    const group = PIPELINE_GROUPS[gi];
    const phaseIdx = group.phases.findIndex(
      p => p.type === lastRunningPipelineJob.type &&
        (p.language || null) === (lastRunningPipelineJob.language || null)
    );
    if (phaseIdx >= 0) {
      cycleState.currentGroupIndex = gi;
      cycleState.currentSequentialPhaseIndex = group.mode === 'sequential' ? phaseIdx : 0;
      log(`Resuming pipeline from group ${gi + 1}: ${group.name}`);
      break;
    }
  }
}
```
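
If the resume lookup needs to be unit-tested without a database or global state, it could be pulled out into a pure function. The sketch below assumes that refactoring is acceptable; `findResumePoint`, `PhaseRef`, and `GroupRef` are illustrative names that do not exist in the scheduler.

```typescript
// Hypothetical pure version of the resume lookup shown above.
interface PhaseRef {
  type: string;
  language?: string | null;
}

interface GroupRef {
  mode: 'sequential' | 'parallel';
  phases: PhaseRef[];
}

function findResumePoint(
  groups: GroupRef[],
  job: PhaseRef,
): { groupIndex: number; phaseIndex: number } | null {
  for (let gi = 0; gi < groups.length; gi++) {
    const group = groups[gi];
    const idx = group.phases.findIndex(
      (p) => p.type === job.type && (p.language || null) === (job.language || null),
    );
    if (idx >= 0) {
      // Sequential groups resume at the interrupted phase;
      // parallel groups restart the whole group from its first phase.
      return { groupIndex: gi, phaseIndex: group.mode === 'sequential' ? idx : 0 };
    }
  }
  return null; // job does not belong to any pipeline group
}
```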

**Step 6: Update heartbeat log in main()**

Update the heartbeat cron (lines 551-562) and the startup log (lines 574-580) to reference groups instead of phases:

```typescript
cron.schedule('0 * * * *', () => {
  const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
    ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
    : 'none';
  const jobs = runningJobs.size > 0
    ? `Running: ${[...runningJobs.keys()].join(', ')}`
    : 'No jobs running';
  const state = cycleState.waitingForCooldown
    ? 'cooldown'
    : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
  log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
}, { timezone: 'UTC' });
```

For the startup log:

```typescript
log('=== Scheduler running (parallel grouped pipeline) ===');
log(`Pipeline groups (${PIPELINE_GROUPS.length}):`);
for (let i = 0; i < PIPELINE_GROUPS.length; i++) {
  const g = PIPELINE_GROUPS[i];
  const phaseNames = g.phases.map(p => p.name).join(', ');
  log(`  ${i + 1}. ${g.name} [${g.mode}]: ${phaseNames}`);
}
```

**Step 7: Remove dead Google Places env log**

Delete lines 167-169 (the `GOOGLE_PLACES_API_KEY` log in `validateEnvironment`).

**Step 8: Verify build**

Run: `npm run build`
Expected: Build succeeds

**Step 9: Commit**

```bash
git add scripts/scheduler.ts
git commit -m "feat: parallel grouped pipeline scheduler

Replace sequential pipeline with grouped phases. Import phases run
sequentially, scraper phases run in parallel groups of 3. This reduces
cycle time from days to hours. Generic scraper moved to last group."
```

---

### Task 3: Increase Scheduler Memory Limit

**Files:**
- Modify: `docker-compose.yml:217-220`

**Step 1: Increase memory limit**

Change the scheduler service's `deploy.resources.limits.memory` from `4G` to `10G`:

```yaml
    deploy:
      resources:
        limits:
          memory: 10G
```

**Step 2: Commit**

```bash
git add docker-compose.yml
git commit -m "chore: increase scheduler memory to 10G for parallel scrapers"
```

---

### Task 4: Deploy and Verify

**Step 1: Deploy to NAS**

```bash
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
```

**Step 2: Rebuild and restart scheduler**

```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scheduler && /usr/local/bin/docker compose up -d scheduler'
```

**Step 3: Verify logs show parallel groups**

```bash
ssh albert@192.168.0.145 '/usr/local/bin/docker logs --tail 30 scraper-control-scheduler-1'
```

Expected: Logs show "parallel grouped pipeline", group listings with `[parallel]` and `[sequential]` tags, and eventually multiple concurrent `Running:` entries in heartbeat.
72
docs/plans/2026-02-26-horariosmisas-spain-design.md
Normal file
@@ -0,0 +1,72 @@
# Spain Church Importer (horariosmisas.com): Design

## Overview

Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. The site is a static WordPress site with a fully permissive robots.txt and sitemaps, so no Playwright is needed: simple HTTP requests plus HTML parsing are enough.

## Data Source

- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **No coordinates:** addresses only. Forward geocoding via Nominatim runs as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps

## Architecture

### Two-Pass Approach

**Pass 1: Import.** Fetch all churches from sitemaps, parse HTML, match against existing Spanish OSM churches, and upsert with mass schedules. Unmatched churches are created with an address but no coordinates.

**Pass 2: Geocode.** Forward-geocode unmatched churches via the Nominatim public API (`address → lat/lng`), rate-limited to 1 req/sec.

### Schema Change

Add `horariosMisasId String? @unique` to the Church model (same pattern as `philmassId`, `massSchedulesPhId`). Update the church matcher and all existing importers.

### URL Structure

- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.

### HTML Parsing

- **Name:** `<h1>Church Name (City)</h1>`, with the `(City)` suffix stripped
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
  - Annotations stripped: `(familias)`, etc.

### Matching Strategy

1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates

### CLI

```
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```

### Rate Limiting

- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
- Geocode: 1s between requests (Nominatim public API limit)

### Scheduler Integration

Add to the `imports` group of `PIPELINE_GROUPS` (sequential, after `philmass-import`).
322
docs/plans/2026-02-26-horariosmisas-spain.md
Normal file
@@ -0,0 +1,322 @@
# Spain Church Importer (horariosmisas.com): Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.

**Architecture:** Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.

**Tech Stack:** TypeScript, Prisma, HTML parsing (regex, no Playwright), Nominatim geocoding API.

---

## Task 1: Add `horariosMisasId` to Prisma Schema

**Files:**
- Modify: `prisma/schema.prisma`

**Step 1: Add field and index**

After the `philmassId` line (around line 38), add:

```prisma
  horariosMisasId String? @unique @map("horarios_misas_id") // horariosmisas.com URL slug
```

And add an index in the `@@index` block (around line 78):

```prisma
  @@index([horariosMisasId])
```

**Step 2: Push schema to NAS database**

```bash
npx prisma db push --accept-data-loss
```

Expected: `Your database is now in sync with your Prisma schema.`

**Step 3: Regenerate Prisma client**

```bash
npx prisma generate
```

**Step 4: Push schema to Neon production**

```bash
npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss
```

**Step 5: Commit**

```bash
git add prisma/schema.prisma
git commit -m "feat: add horariosMisasId to Church model for Spain import"
```

---

## Task 2: Extend Church Matcher and Existing Importers

**Files:**
- Modify: `src/lib/church-matcher.ts`
- Modify: `scripts/import-osm-churches.ts`
- Modify: `scripts/import-gcatholic.ts`
- Modify: `scripts/import-baidu-churches.ts`
- Modify: `scripts/import-osm-region.ts`
- Modify: `scripts/import-orarimesse.ts`
- Modify: `scripts/import-mass-schedules-ph.ts`
- Modify: `scripts/import-philmass.ts`

### Step 1: Update church-matcher.ts

In `ExistingChurch` interface (line ~11-26), add after `philmassId`:

```typescript
  horariosMisasId: string | null;
```

In `ChurchCandidate` type (line ~113-122), add after `philmassId`:

```typescript
  horariosMisasId?: string;
```

In `findDuplicateChurch()`, add a new pass after the fifth pass (philmassId match, line ~169-175). Before the proximity+name pass:

```typescript
  // Sixth pass: exact horariosMisasId match
  if (candidate.horariosMisasId) {
    const horariosMisasMatch = existingChurches.find(
      (church) => church.horariosMisasId === candidate.horariosMisasId
    );
    if (horariosMisasMatch) return horariosMisasMatch;
  }
```

Update the comment on the proximity pass to say "Seventh pass".

### Step 2: Update all existing importers

In every importer that queries churches with a `select` clause containing `philmassId: true`, add:

```typescript
horariosMisasId: true,
```

In every importer that creates/pushes churches with `philmassId: null`, add:

```typescript
horariosMisasId: null,
```

**Files to update:** `import-osm-churches.ts`, `import-gcatholic.ts`, `import-baidu-churches.ts`, `import-osm-region.ts`, `import-orarimesse.ts`, `import-mass-schedules-ph.ts`, `import-philmass.ts`

### Step 3: Verify build

```bash
npx tsc --noEmit
```

Expected: No errors.

### Step 4: Commit

```bash
git add src/lib/church-matcher.ts scripts/import-*.ts
git commit -m "feat: add horariosMisasId to church matcher and all importers"
```

---

## Task 3: Create `import-horariosmisas.ts`

**Files:**
- Create: `scripts/import-horariosmisas.ts`

### Architecture

This importer follows the exact same structure as `scripts/import-mass-schedules-ph.ts`. Key differences:

- **Sitemap:** Fetches 20 post sitemaps from sitemap index (not a single sitemap)
- **URL filtering:** Church URLs have 3 path segments (`/{province}/{city}/{slug}/`). Non-church URLs (blog posts, daily readings) are filtered out.
- **Schedule parsing:** Two seasonal tables (summer/winter). Import seasonally appropriate one based on current month.
- **Day names:** Spanish (`Lunes`, `Martes`, etc.) with range support (`Lunes a Viernes`)
- **Times:** 24-hour `HH:MMh` format (e.g., `08:00h`, `20:30h`)
- **No coordinates:** Churches created with `latitude: 0, longitude: 0`, then geocoded separately
- **Geocoding:** Optional `--geocode` flag uses Nominatim public API (1 req/sec)

### Constants

```typescript
const SITE_BASE = 'https://horariosmisas.com';
const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
```

### Spanish Day Mapping

```typescript
const DAY_MAP: Record<string, number[]> = {
  'domingos y festivos': [0],
  'domingos': [0],
  'domingo': [0],
  'lunes': [1],
  'martes': [2],
  'miércoles': [3],
  'miercoles': [3],
  'jueves': [4],
  'viernes': [5],
  'sábado': [6],
  'sabado': [6],
  'sábados': [6],
  'sabados': [6],
};
```

### Sitemap Fetching

1. Fetch sitemap index → extract `post-sitemap*.xml` URLs
2. Fetch each post sitemap → extract URLs with exactly 3 path segments
3. Filter out non-church URLs (patterns: `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, `/noticias/`, `/blog/`, `/contacto/`, `/aviso-legal/`, `/politica-de-privacidad/`, `/politica-de-cookies/`)
4. Deduplicate by slug
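
Steps 2 to 4 above could be sketched as follows. This is an illustration rather than the importer's code: `isChurchUrl`, `churchSlug`, and `dedupeBySlug` are hypothetical names, and only the pattern list comes from this plan.

```typescript
// Path fragments that mark non-church pages, taken from step 3 above.
const NON_CHURCH_PATTERNS = [
  '/misas-diarias/', '/santos-del-dia/', '/oraciones/', '/noticias/', '/blog/',
  '/contacto/', '/aviso-legal/', '/politica-de-privacidad/', '/politica-de-cookies/',
];

function pathSegments(url: string): string[] {
  return new URL(url).pathname.split('/').filter(Boolean);
}

// A church URL has exactly three path segments and matches no excluded pattern.
function isChurchUrl(url: string): boolean {
  const segments = pathSegments(url);
  if (segments.length !== 3) return false;
  // Normalise to "/a/b/c/" so patterns match with or without a trailing slash.
  const normalized = `/${segments.join('/')}/`;
  return !NON_CHURCH_PATTERNS.some((pattern) => normalized.includes(pattern));
}

// The slug is the last path segment: /{province}/{city}/{slug}/
function churchSlug(url: string): string {
  const segments = pathSegments(url);
  return segments[segments.length - 1];
}

// Keep the first URL seen for each slug and drop non-church URLs.
function dedupeBySlug(urls: string[]): string[] {
  const seen = new Set<string>();
  return urls.filter((url) => {
    if (!isChurchUrl(url)) return false;
    const slug = churchSlug(url);
    if (seen.has(slug)) return false;
    seen.add(slug);
    return true;
  });
}
```

Deduplicating on the slug alone follows the step as written; if two cities ever share a slug, the full path may be a safer key.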

### HTML Parsing

**Church name:** `<h1>Church Name (City)</h1>` → strip `(City)` suffix

**Address:** `📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong>` → extract street, postal code (5-digit `\b\d{5}\b`), city (text after postal code), strip `(Province)` suffix
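
A minimal sketch of that address split, assuming the text inside `<strong>` has already been extracted. The `parseAddress` function and the `ParsedAddress` field names are assumptions for illustration.

```typescript
// Hypothetical result shape; the importer may use different field names.
interface ParsedAddress {
  street: string;
  postalCode: string | null;
  city: string | null;
}

function parseAddress(raw: string): ParsedAddress {
  // Drop a trailing "(Province)" suffix before splitting.
  const text = raw.replace(/\s*\([^)]*\)\s*$/, '').trim();

  // The first standalone 5-digit number is treated as the postal code.
  const match = text.match(/\b\d{5}\b/);
  if (!match || match.index === undefined) {
    return { street: text, postalCode: null, city: null };
  }

  // Street is everything before the postal code, city everything after it.
  const street = text.slice(0, match.index).replace(/[,\s]+$/, '');
  const city = text.slice(match.index + match[0].length).replace(/^[,\s]+/, '');
  return { street, postalCode: match[0], city: city || null };
}
```

For the sample address above this yields street `Calle Goya, 26`, postal code `28001`, and city `Madrid`.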

**Phone:** `<strong>Teléfono:</strong> <a href="tel:...">number</a>`

**Website:** `<strong>Página Web:</strong> <a href="url">...</a>`

**Schedule tables:** Find `<table>` elements with DÍA/HORARIO headers. Split by seasonal headings (☀️ verano / ⛄ invierno). Pick seasonally appropriate section (Oct-May = winter, Jun-Sep = summer). Parse `<td>` cells: first cell = day name(s), second cell = times. Times in `HH:MMh` format extracted via regex `(\d{1,2}):(\d{2})\s*h?`.
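
The season choice and time extraction might look like the sketch below. Only the month ranges and the regex come from this plan; `seasonFor`, `extractTimes`, and the zero-padded `HH:MM` output format are assumptions.

```typescript
type Season = 'summer' | 'winter';

// Jun-Sep counts as summer, Oct-May as winter, as stated above.
function seasonFor(date: Date): Season {
  const month = date.getMonth() + 1; // getMonth() is 0-based
  return month >= 6 && month <= 9 ? 'summer' : 'winter';
}

// Extract every time from one schedule cell, normalised to zero-padded "HH:MM".
function extractTimes(cellHtml: string): string[] {
  // Strip parenthesised annotations such as "(familias)".
  const withoutNotes = cellHtml.replace(/\([^)]*\)/g, '');
  const pattern = /(\d{1,2}):(\d{2})\s*h?/g;
  const times: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(withoutNotes)) !== null) {
    const hours = m[1].length === 1 ? `0${m[1]}` : m[1];
    times.push(`${hours}:${m[2]}`);
  }
  return times;
}
```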

### Day Range Resolution

Support ranges like `Lunes a Viernes` → [1,2,3,4,5] and compound entries like `Lunes, Miércoles y Viernes` → [1,3,5].
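
One way to resolve both forms is sketched here. It uses a flat `DAY_INDEX` lookup (0 = Sunday) instead of the `DAY_MAP` arrays for brevity; `DAY_INDEX` and `resolveDays` are hypothetical names, and the wrap-around behaviour for ranges that cross Saturday is an assumption.

```typescript
// Flat day lookup for illustration (0 = Sunday), with unaccented variants.
const DAY_INDEX: Record<string, number> = {
  domingo: 0, domingos: 0, lunes: 1, martes: 2, 'miércoles': 3, miercoles: 3,
  jueves: 4, viernes: 5, 'sábado': 6, sabado: 6, 'sábados': 6, sabados: 6,
};

function resolveDays(label: string): number[] {
  // "Domingos y Festivos" is treated as plain Sunday.
  const text = label.toLowerCase().replace(/\s+y\s+festivos/, '').trim();

  // Range form: "<day> a <day>".
  const range = text.match(/^(\S+)\s+a\s+(\S+)$/);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    if (start === undefined || end === undefined) return [];
    const days: number[] = [];
    // Walk forward and wrap past Saturday, so "Sábado a Lunes" gives [6, 0, 1].
    for (let d = start; ; d = (d + 1) % 7) {
      days.push(d);
      if (d === end) break;
    }
    return days;
  }

  // Compound form: split on commas and " y ", ignoring unknown names.
  return text
    .split(/\s*,\s*|\s+y\s+/)
    .map((name) => DAY_INDEX[name])
    .filter((d): d is number => d !== undefined);
}
```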

### Geocoding (--geocode / --geocode-only)

Query Nominatim with: `{address}, Spain` → fallback to `{postalCode} {city}, Spain` → fallback to `{city}, Spain`. Use `countrycodes=es` parameter. Max 1 req/sec.
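
The fallback chain can be built up front and tried in order until one query returns a result. The sketch below only constructs the queries and request URLs; the helper names are hypothetical, `format=json` and `limit=1` are assumed parameter choices, and the actual fetch with its 1 req/sec delay is omitted.

```typescript
const NOMINATIM_SEARCH_URL = 'https://nominatim.openstreetmap.org/search';

interface GeocodeInput {
  address?: string | null;
  postalCode?: string | null;
  city?: string | null;
}

// Build the queries in fallback order: full address, postal code + city, city.
function buildGeocodeQueries(input: GeocodeInput): string[] {
  const queries: string[] = [];
  if (input.address) queries.push(`${input.address}, Spain`);
  if (input.postalCode && input.city) queries.push(`${input.postalCode} ${input.city}, Spain`);
  if (input.city) queries.push(`${input.city}, Spain`);
  return queries;
}

// Build one search URL, restricted to Spain via `countrycodes=es`.
function buildSearchUrl(query: string): string {
  const params = new URLSearchParams({
    q: query,
    format: 'json',
    countrycodes: 'es',
    limit: '1',
  });
  return `${NOMINATIM_SEARCH_URL}?${params.toString()}`;
}
```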
|
||||
|
||||
### Matching Strategy
|
||||
|
||||
1. `horariosMisasId` exact match (primary — for re-imports)
|
||||
2. Name + proximity against existing Spanish OSM churches (secondary)
|
||||
3. Unmatched: create new church with `latitude: 0, longitude: 0`, country=ES

### CLI

```
--all                Import all churches from sitemaps
--province <name>    Import only churches from this province
--dry-run            No database writes
--geocode            After import, geocode unmatched churches
--geocode-only       Only geocode (skip import)
--resume-from <n>    Skip first N churches
--job-id <uuid>      Background job tracking
```

### Mass Schedule Language

Set `language: 'Spanish'` on all created mass schedules.

### Step 1: Create the file

Use `scripts/import-mass-schedules-ph.ts` as the structural template. Implement all functions described above.

### Step 2: Verify build

```bash
npx tsc --noEmit
```

### Step 3: Dry-run test

```bash
npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
```

### Step 4: Commit

```bash
git add scripts/import-horariosmisas.ts
git commit -m "feat: add horariosmisas.com Spain church importer"
```

---

## Task 4: Add to Scheduler Pipeline and npm Scripts

**Files:**
- Modify: `scripts/scheduler.ts`
- Modify: `package.json`

### Step 1: Add to PIPELINE_GROUPS

In `scripts/scheduler.ts`, in the `imports` group (line ~40-51), add after the `philmass-import` entry:

```typescript
{ name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },
```

### Step 2: Add getJobCommand case

In the `getJobCommand` function (line ~182), before the `default:` case, add:

```typescript
case 'horariosmisas-import': {
  // Default to a full import followed by geocoding of unmatched churches.
  const args = ['tsx', 'scripts/import-horariosmisas.ts', '--all', '--geocode'];
  // Optional overrides taken from the job config.
  if (config?.province) args.push('--province', String(config.province));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```

### Step 3: Add npm scripts

In `package.json`, add after the `"import:philmass"` line:

```json
"import:horariosmisas": "tsx scripts/import-horariosmisas.ts",
```

### Step 4: Verify build

```bash
npx tsc --noEmit
```

### Step 5: Commit

```bash
git add scripts/scheduler.ts package.json
git commit -m "feat: add horariosmisas import to scheduler pipeline"
```

---

## Verification

1. **Dry run on single province**: `npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run`
   - Verify: church names parsed correctly, schedules extracted, matches found
2. **Dry run on Madrid**: `npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run`
   - Verify: larger province, summer/winter schedule selection, address parsing
3. **Single province real import**: `npx tsx scripts/import-horariosmisas.ts --province navarra`
   - Verify: churches created/updated, mass schedules in database
4. **Geocode test**: `npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run`
   - Verify: finds churches needing geocoding, Nominatim returns coordinates
5. **Full import**: `npx tsx scripts/import-horariosmisas.ts --all --geocode`

## Runtime Estimate

- Sitemap fetch: 20 sitemaps x 1.5s = ~30s
- Import: ~10,000 churches x 1.5s = ~4.2 hours
- Geocode: depends on unmatched count x 1.1s
103
docs/plans/2026-03-01-weekdaymasses-importer-design.md
Normal file
@@ -0,0 +1,103 @@
# weekdaymasses.org.uk Global Importer

## Context

weekdaymasses.org.uk is a UK-based Catholic directory covering ~3,500-4,000 churches globally, with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ other countries (India, Sri Lanka, South Korea, Japan, and more). All data is served on a single HTML page per area, so no pagination or API is needed.

## Data Source

Three area pages cover the entire site:

| Page | URL | Est. Churches |
|------|-----|---------------|
| GB | `/en/area/gb/churches` | ~3,000+ |
| Ireland | `/en/area/ireland/churches` | ~300+ |
| Outside GB | `/en/area/outside-gb/churches` | ~152+ |

Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.

### Data per church

- **Name**: h3 heading, format "Church Name (Location)"
- **Address**: plain text after mass times, with postal/zip code
- **Coordinates**: in map link query params `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
- **Mass times**: format `Day: H.MMam/pm(Language), H.MMam/pm(Language)`
- **Phone**: `Tel: +XX XXXX XXXXXX`
- **Website**: occasional links
- **church_id**: unique numeric identifier in map links
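
Pulling the coordinates and `church_id` out of a map link could look like the sketch below. Only the three query parameter names come from the list above; the function name, the `MapLinkData` shape, and the link path used in the usage note are illustrative assumptions.

```typescript
interface MapLinkData {
  latitude: number;
  longitude: number;
  churchId: string;
}

// Reads lat, lon and church_id from a map link href; returns null if any is missing.
function parseMapLink(href: string): MapLinkData | null {
  const url = href.replace(/&amp;/g, '&'); // hrefs taken from raw HTML may be entity-encoded
  const lat = /[?&]lat=(-?\d+(?:\.\d+)?)/.exec(url);
  const lon = /[?&]lon=(-?\d+(?:\.\d+)?)/.exec(url);
  const id = /[?&]church_id=(\d+)/.exec(url);
  if (!lat || !lon || !id) return null;
  return { latitude: Number(lat[1]), longitude: Number(lon[1]), churchId: id[1] };
}
```

For a hypothetical link such as `/map?lat=13.0827&lon=80.2707&church_id=12345`, this returns latitude `13.0827`, longitude `80.2707`, and `churchId` `'12345'`. The ID is kept as a string so it can be stored directly in a string-typed identifier column.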

### Mass time format

```
Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)
```

Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or combinations like `Mon Tue Wed Thu Fri`. Also `Holy Day` entries.

Time format: `H.MMam/pm`; needs conversion to 24h `HH:MM`.

Language in parentheses maps to our `language` field on mass_schedules.
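
A minimal sketch of parsing one such line is shown below. The names `to24h`, `parseMassLine`, and `MassEntry` are hypothetical, and day labels are returned as raw strings rather than mapped to whatever day representation the schema uses.

```typescript
interface MassEntry {
  days: string[];
  time: string; // 24h HH:MM
  language: string | null;
}

// Converts "6.30am" / "5.30pm" to "06:30" / "17:30"; returns null for anything else.
function to24h(time: string): string | null {
  const m = /^(\d{1,2})\.(\d{2})\s*(am|pm)$/i.exec(time.trim());
  if (!m) return null;
  const h12 = Number(m[1]);
  if (h12 < 1 || h12 > 12 || Number(m[2]) > 59) return null;
  let hour = h12 % 12; // 12am becomes 0; 12pm becomes 0 and gets +12 below
  if (m[3].toLowerCase() === 'pm') hour += 12;
  return (hour < 10 ? '0' : '') + hour + ':' + m[2];
}

// Parses "Day labels: time(Language), time(Language)" into one entry per time.
function parseMassLine(line: string): MassEntry[] {
  const sep = line.indexOf(':');
  if (sep < 0) return [];
  const label = line.slice(0, sep).trim();
  const days = label === 'Holy Day' ? [label] : label.split(/\s+/);
  const rest = line.slice(sep + 1);
  const re = /(\d{1,2}\.\d{2}\s*(?:am|pm))(?:\(([^)]*)\))?/gi;
  const entries: MassEntry[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(rest)) !== null) {
    const time = to24h(m[1]);
    if (time === null) continue;
    entries.push({ days, time, language: m[2] ? m[2] : null });
  }
  return entries;
}
```

Times without a parenthesised language produce `language: null`, leaving the caller free to apply a per-country default.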

### Country detection

The address is the last line of each church entry. Country can be detected by:
- GB: UK postal code pattern (e.g. `SW1A 1AA`)
- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in address
- India: 6-digit postal code (e.g. `600088`)
- Others: country name at end of address, or fallback to the area page being scraped
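
These heuristics could be combined roughly as in the sketch below. The regular expressions are simplified approximations of the UK postcode and Eircode formats, the `COUNTRY_NAMES` table is a small illustrative sample, and returning ISO country codes is an assumption; none of this is taken from existing code.

```typescript
// Simplified approximations; uppercase only, to limit false positives such as "B2 3rd".
const UK_POSTCODE = /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/;
const EIRCODE = /\b(?:[AC-FHKNPRTV-Y]\d{2}|D6W)\s?[0-9AC-FHKNPRTV-Y]{4}\b/;
const SIX_DIGIT_CODE = /\b\d{6}\b/;

// Illustrative sample only; a real table would cover every country on the site.
const COUNTRY_NAMES: { [name: string]: string | undefined } = {
  india: 'IN',
  'sri lanka': 'LK',
  'south korea': 'KR',
  japan: 'JP',
};

// areaDefault is the country implied by the area page being scraped, if any.
function detectCountry(address: string, areaDefault: string | null): string | null {
  if (EIRCODE.test(address)) return 'IE';
  // Checked before the word "Ireland" so Northern Ireland addresses resolve to GB.
  if (UK_POSTCODE.test(address)) return 'GB';
  if (/\bIreland\b/i.test(address)) return 'IE';
  if (SIX_DIGIT_CODE.test(address)) return 'IN';
  const parts = address.split(',');
  const tail = parts[parts.length - 1].trim().toLowerCase();
  const byName = COUNTRY_NAMES[tail];
  return byName !== undefined ? byName : areaDefault;
}
```

Several other countries also use 6-digit postal codes, so the India rule is safest when applied only to pages where India is a plausible result; the sketch applies it unconditionally, as the list above describes.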

## Design

### Schema

Add to Church model in both BethelGuide and ScraperControl:

```prisma
weekdayMassesId String? @unique @map("weekday_masses_id")
@@index([weekdayMassesId])
```

### Script: `scripts/import-weekdaymasses.ts`

Single script that:

1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
2. Parses HTML into structured church entries
3. Converts mass times from `H.MMam/pm` to `HH:MM` 24h format
4. Detects country from address patterns
5. Matches against existing churches by `weekdayMassesId` (exact) then proximity+name
6. Upserts churches and replaces mass schedules

### HTML parsing strategy

Each church is a block between consecutive h3 headings. Within each block:
- h3 content = church name
- Lines with day labels + times = mass schedule
- Map link = coordinates + church_id
- Last text block before next h3 = address
- `Tel:` prefix = phone

### CLI flags

- `--all`: import all 3 area pages
- `--area <name>`: import specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
- `--dry-run`: no database writes
- `--resume-from <n>`: skip first N churches
- `--job-id <uuid>`: background job tracking

### Church matcher integration

Add `weekdayMassesId` to `ExistingChurch`, `ChurchCandidate`, and a new match pass in `findDuplicateChurch()`.

### Scheduler integration

Add `weekdaymasses-import` to the sequential imports group in the pipeline, with `getJobCommand()` case and npm script.

## Scope

- ~3,500-4,000 churches with mass schedules
- Most GB/Ireland churches already in DB from OSM (will match and add schedules)
- India/Sri Lanka/international churches partially in DB from OSM/gcatholic
- Value: mass schedule data for thousands of churches that currently have none