# Spain Church Importer (horariosmisas.com): Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.

**Architecture:** Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.

**Tech Stack:** TypeScript, Prisma, HTML parsing (regex, no Playwright), Nominatim geocoding API.

---

## Task 1: Add `horariosMisasId` to Prisma Schema

**Files:**
- Modify: `prisma/schema.prisma`

**Step 1: Add field and index**

After the `philmassId` line (around line 38), add:

```prisma
  horariosMisasId String? @unique @map("horarios_misas_id") // horariosmisas.com URL slug
```

Then add an index in the `@@index` block (around line 78). Note that `@unique` already creates a unique index on this column, so this entry is optional:

```prisma
  @@index([horariosMisasId])
```

**Step 2: Push schema to NAS database**

```bash
npx prisma db push --accept-data-loss
```

Expected: `Your database is now in sync with your Prisma schema.`

**Step 3: Regenerate Prisma client**

```bash
npx prisma generate
```

**Step 4: Push schema to Neon production**

```bash
npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss
```

**Step 5: Commit**

```bash
git add prisma/schema.prisma
git commit -m "feat: add horariosMisasId to Church model for Spain import"
```

---

## Task 2: Extend Church Matcher and Existing Importers

**Files:**
- Modify: `src/lib/church-matcher.ts`
- Modify: `scripts/import-osm-churches.ts`
- Modify: `scripts/import-gcatholic.ts`
- Modify: `scripts/import-baidu-churches.ts`
- Modify: `scripts/import-osm-region.ts`
- Modify: `scripts/import-orarimesse.ts`
- Modify: `scripts/import-mass-schedules-ph.ts`
- Modify: `scripts/import-philmass.ts`

### Step 1: Update church-matcher.ts

In the `ExistingChurch` interface (lines ~11-26), add after `philmassId`:

```typescript
  horariosMisasId: string | null;
```

In the `ChurchCandidate` type (lines ~113-122), add after `philmassId`:

```typescript
  horariosMisasId?: string;
```

In `findDuplicateChurch()`, add a new pass after the fifth pass (the `philmassId` match, lines ~169-175) and before the proximity+name pass:

```typescript
  // Sixth pass: exact horariosMisasId match
  if (candidate.horariosMisasId) {
    const horariosMisasMatch = existingChurches.find(
      (church) => church.horariosMisasId === candidate.horariosMisasId
    );
    if (horariosMisasMatch) return horariosMisasMatch;
  }
```

Update the comment on the proximity pass to say "Seventh pass".

### Step 2: Update all existing importers

In every importer that queries churches with a `select` clause containing `philmassId: true`, add:

```typescript
  horariosMisasId: true,
```

In every importer that creates/pushes churches with `philmassId: null`, add:

```typescript
  horariosMisasId: null,
```

**Files to update:** `import-osm-churches.ts`, `import-gcatholic.ts`, `import-baidu-churches.ts`, `import-osm-region.ts`, `import-orarimesse.ts`, `import-mass-schedules-ph.ts`, `import-philmass.ts`

### Step 3: Verify build

```bash
npx tsc --noEmit
```

Expected: No errors.

### Step 4: Commit

```bash
git add src/lib/church-matcher.ts scripts/import-*.ts
git commit -m "feat: add horariosMisasId to church matcher and all importers"
```

---

## Task 3: Create `import-horariosmisas.ts`

**Files:**
- Create: `scripts/import-horariosmisas.ts`

### Architecture

This importer follows the same structure as `scripts/import-mass-schedules-ph.ts`. Key differences:

- **Sitemap:** Fetches 20 post sitemaps from the sitemap index (not a single sitemap)
- **URL filtering:** Church URLs have 3 path segments (`/{province}/{city}/{slug}/`). Non-church URLs (blog posts, daily readings) are filtered out.
- **Schedule parsing:** Two seasonal tables (summer/winter). Import the seasonally appropriate one based on the current month.
- **Day names:** Spanish (`Lunes`, `Martes`, etc.) with range support (`Lunes a Viernes`)
- **Times:** 24-hour `HH:MMh` format (e.g., `08:00h`, `20:30h`)
- **No coordinates:** Churches are created with `latitude: 0, longitude: 0` and geocoded separately
- **Geocoding:** Optional `--geocode` flag uses the Nominatim public API (1 req/sec)

### Constants

```typescript
const SITE_BASE = 'https://horariosmisas.com';
const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500; // pause between horariosmisas.com requests
const NOMINATIM_DELAY_MS = 1100; // keeps geocoding under 1 request per second
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
```

### Spanish Day Mapping

```typescript
// Day indices: 0 = Sunday through 6 = Saturday.
// Accented and unaccented spellings are both listed.
const DAY_MAP: Record<string, number[]> = {
  'domingos y festivos': [0],
  'domingos': [0],
  'domingo': [0],
  'lunes': [1],
  'martes': [2],
  'miércoles': [3],
  'miercoles': [3],
  'jueves': [4],
  'viernes': [5],
  'sábado': [6],
  'sabado': [6],
  'sábados': [6],
  'sabados': [6],
};
```

### Sitemap Fetching

1. Fetch sitemap index → extract `post-sitemap*.xml` URLs
2. Fetch each post sitemap → extract URLs with exactly 3 path segments
3. Filter out non-church URLs (patterns: `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, `/noticias/`, `/blog/`, `/contacto/`, `/aviso-legal/`, `/politica-de-privacidad/`, `/politica-de-cookies/`)
4. Deduplicate by slug

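Steps 2-4 above can be sketched as two pure functions that operate on sitemap XML that has already been fetched. This is a minimal illustration, not the final implementation: the names `extractLocs`, `filterChurchUrls`, and `ChurchUrl` are hypothetical, and it assumes each sitemap lists plain `<loc>` elements without CDATA wrappers.

```typescript
// Exclusion list copied from step 3 of the plan.
const EXCLUDED_SEGMENTS = [
  'misas-diarias', 'santos-del-dia', 'oraciones', 'noticias', 'blog',
  'contacto', 'aviso-legal', 'politica-de-privacidad', 'politica-de-cookies',
];

// Hypothetical shape for a church URL discovered in a sitemap.
interface ChurchUrl {
  url: string;
  province: string;
  city: string;
  slug: string;
}

/** Pull every <loc> value out of a sitemap XML string. */
function extractLocs(xml: string): string[] {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

/** Split the path of an absolute URL into its non-empty segments. */
function pathSegments(loc: string): string[] {
  const path = loc.replace(/^https?:\/\/[^/]+/i, '').split(/[?#]/)[0];
  return path.split('/').filter(Boolean);
}

/** Keep only /{province}/{city}/{slug}/ URLs, deduplicated by slug. */
function filterChurchUrls(locs: string[]): ChurchUrl[] {
  const bySlug = new Map<string, ChurchUrl>();
  for (const loc of locs) {
    const segments = pathSegments(loc);
    if (segments.length !== 3) continue; // not a church page
    if (segments.some((s) => EXCLUDED_SEGMENTS.includes(s))) continue;
    const [province, city, slug] = segments;
    if (!bySlug.has(slug)) bySlug.set(slug, { url: loc, province, city, slug });
  }
  return [...bySlug.values()];
}
```

Keeping `province` on each entry would let a `--province` filter run on this list before any church page is fetched.
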
### HTML Parsing

**Church name:** `<h1>Church Name (City)</h1>` → strip `(City)` suffix

**Address:** `📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong>` → strip the `(Province)` suffix, then extract the street, the postal code (5 digits, `\b\d{5}\b`), and the city (text after the postal code)

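The name and address rules above could look roughly like the following. This is a sketch under assumptions: `stripTrailingParens`, `parseAddress`, and `ParsedAddress` are hypothetical names, the input is the text already extracted from the `<h1>` or `<strong>` element, and the first 5-digit token is taken to be the postal code.

```typescript
// Hypothetical result shape for a parsed address line.
interface ParsedAddress {
  street: string;
  postalCode: string | null;
  city: string | null;
}

/** Remove a trailing "(City)" or "(Province)" suffix and collapse whitespace. */
function stripTrailingParens(text: string): string {
  return text.replace(/\s*\([^)]*\)\s*$/, '').replace(/\s+/g, ' ').trim();
}

/** Split "street postalCode city (Province)" into its parts. */
function parseAddress(raw: string): ParsedAddress {
  const text = stripTrailingParens(raw);
  const match = text.match(/\b(\d{5})\b/);
  if (!match || match.index === undefined) {
    // No postal code found: keep the whole line as the street.
    return { street: text, postalCode: null, city: null };
  }
  const street = text.slice(0, match.index).replace(/[,\s]+$/, '');
  const city = text.slice(match.index + 5).replace(/^[,\s]+/, '') || null;
  return { street, postalCode: match[1], city };
}
```

For the example in the plan, `parseAddress('Calle Goya, 26 28001 Madrid (Madrid)')` yields street `Calle Goya, 26`, postal code `28001`, and city `Madrid`.
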
**Phone:** `<strong>Teléfono:</strong> <a href="tel:...">number</a>`

**Website:** `<strong>Página Web:</strong> <a href="url">...</a>`

**Schedule tables:** Find `<table>` elements with DÍA/HORARIO headers. Split them by seasonal headings (☀️ verano / ⛄ invierno) and pick the seasonally appropriate section (Oct-May = winter, Jun-Sep = summer). Parse the `<td>` cells: the first cell holds the day name(s), the second holds the times. Times are in `HH:MMh` format and are extracted with the regex `(\d{1,2}):(\d{2})\s*h?`.

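The season rule and the time regex from the paragraph above can be illustrated as follows. The function names `seasonForMonth` and `extractTimes` are hypothetical, and normalizing to zero-padded `HH:MM` is an assumption about how the times would be stored.

```typescript
type Season = 'summer' | 'winter';

/** Jun-Sep is summer, Oct-May is winter. Months are numbered 1-12. */
function seasonForMonth(month: number): Season {
  return month >= 6 && month <= 9 ? 'summer' : 'winter';
}

/** Extract zero-padded "HH:MM" strings from a cell such as "08:00h, 20:30h". */
function extractTimes(cell: string): string[] {
  const times: string[] = [];
  for (const m of cell.matchAll(/(\d{1,2}):(\d{2})\s*h?/g)) {
    const hour = Number(m[1]);
    const minute = Number(m[2]);
    if (hour > 23 || minute > 59) continue; // skip impossible times
    times.push(`${String(hour).padStart(2, '0')}:${m[2]}`);
  }
  return times;
}
```

At import time the current season would come from `seasonForMonth(new Date().getMonth() + 1)`, since `getMonth()` is zero-based.
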
### Day Range Resolution

Support ranges like `Lunes a Viernes` → [1,2,3,4,5] and compound entries like `Lunes, Miércoles y Viernes` → [1,3,5].

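One way to resolve ranges and compound entries is sketched below. It is self-contained, so it uses its own small lookup table (`DAY_INDEX`) instead of the `DAY_MAP` constant; `dayIndex` and `resolveDays` are hypothetical names. Accepting `De ... a ...` and `al` as range forms, and silently ignoring unknown words such as `festivos`, are assumptions rather than documented site behaviour.

```typescript
// 0 = Sunday through 6 = Saturday; keys are unaccented singular forms.
const DAY_INDEX: Record<string, number> = {
  domingo: 0, lunes: 1, martes: 2, miercoles: 3, jueves: 4, viernes: 5, sabado: 6,
};

/** Lowercase, strip accents, and look up one day name (singular or plural). */
function dayIndex(name: string): number | undefined {
  const key = name.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '').trim();
  return DAY_INDEX[key] ?? DAY_INDEX[key.replace(/s$/, '')];
}

/** Turn a day label into a sorted list of day indices. */
function resolveDays(label: string): number[] {
  const days = new Set<number>();
  // Split compound entries on commas and the conjunction "y".
  for (const part of label.split(/\s*,\s*|\s+y\s+/i)) {
    const range = part.trim().match(/^(?:de\s+)?(\S+)\s+al?\s+(\S+)$/i);
    if (range) {
      const start = dayIndex(range[1]);
      const end = dayIndex(range[2]);
      if (start === undefined || end === undefined) continue;
      // Walk forward from start to end, wrapping past Saturday if needed.
      for (let d = start; ; d = (d + 1) % 7) {
        days.add(d);
        if (d === end) break;
      }
    } else {
      const d = dayIndex(part);
      if (d !== undefined) days.add(d);
    }
  }
  return [...days].sort((a, b) => a - b);
}
```

With this approach `Domingos y festivos` resolves to `[0]`, which matches the explicit `DAY_MAP` entry for that label.
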
### Geocoding (--geocode / --geocode-only)

Query Nominatim with `{address}, Spain`, falling back to `{postalCode} {city}, Spain` and then to `{city}, Spain`. Use the `countrycodes=es` parameter and send at most 1 request per second.

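The fallback chain above might be implemented along these lines. This is a sketch only: `GeocodeInput`, `buildGeocodeQueries`, `nominatimUrl`, and `geocode` are hypothetical names, `format=jsonv2` and `limit=1` are choices not specified in the plan, and the `fetch` call assumes Node 18 or later.

```typescript
const NOMINATIM_SEARCH = 'https://nominatim.openstreetmap.org/search';

// Hypothetical input shape; any field may be missing after HTML parsing.
interface GeocodeInput {
  address?: string | null;
  postalCode?: string | null;
  city?: string | null;
}

/** Candidate queries from most to least specific, skipping missing fields. */
function buildGeocodeQueries({ address, postalCode, city }: GeocodeInput): string[] {
  const queries: string[] = [];
  if (address) queries.push(`${address}, Spain`);
  if (postalCode && city) queries.push(`${postalCode} ${city}, Spain`);
  if (city) queries.push(`${city}, Spain`);
  return [...new Set(queries)]; // drop accidental duplicates
}

/** Build a search URL restricted to Spain. */
function nominatimUrl(query: string): string {
  return `${NOMINATIM_SEARCH}?q=${encodeURIComponent(query)}&format=jsonv2&limit=1&countrycodes=es`;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Try each query in order and return the first hit, or null if none match. */
async function geocode(input: GeocodeInput, userAgent: string, delayMs = 1100) {
  for (const query of buildGeocodeQueries(input)) {
    const res = await fetch(nominatimUrl(query), { headers: { 'User-Agent': userAgent } });
    await sleep(delayMs); // pause after every request to stay under 1 req/sec
    if (!res.ok) continue;
    const hits = (await res.json()) as Array<{ lat: string; lon: string }>;
    if (hits.length > 0) {
      return { latitude: Number(hits[0].lat), longitude: Number(hits[0].lon) };
    }
  }
  return null;
}
```

Nominatim returns `lat` and `lon` as strings, hence the `Number(...)` conversion before writing coordinates back to the church record.
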
### Matching Strategy

1. `horariosMisasId` exact match (primary, for re-imports)
2. Name + proximity against existing Spanish OSM churches (secondary)
3. Unmatched: create new church with `latitude: 0, longitude: 0`, country=ES

### CLI

```
--all                Import all churches from sitemaps
--province <name>    Import only churches from this province
--dry-run            No database writes
--geocode            After import, geocode unmatched churches
--geocode-only       Only geocode (skip import)
--resume-from <n>    Skip first N churches
--job-id <uuid>      Background job tracking
```

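A hand-rolled parser for these flags could look like the sketch below. `CliOptions` and `parseCliArgs` are hypothetical names, and rejecting unknown options with an error is an assumption; the existing importers may handle arguments differently.

```typescript
// Hypothetical options shape mirroring the flag list above.
interface CliOptions {
  all: boolean;
  province?: string;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  resumeFrom: number;
  jobId?: string;
}

function parseCliArgs(argv: string[]): CliOptions {
  const opts: CliOptions = {
    all: false, dryRun: false, geocode: false, geocodeOnly: false, resumeFrom: 0,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case '--all': opts.all = true; break;
      case '--dry-run': opts.dryRun = true; break;
      case '--geocode': opts.geocode = true; break;
      case '--geocode-only': opts.geocodeOnly = true; break;
      case '--province':
      case '--resume-from':
      case '--job-id': {
        // These flags consume the next argument as their value.
        const value = argv[++i];
        if (value === undefined) throw new Error(`Missing value for ${arg}`);
        if (arg === '--province') opts.province = value;
        else if (arg === '--job-id') opts.jobId = value;
        else {
          const n = Number.parseInt(value, 10);
          if (Number.isNaN(n) || n < 0) throw new Error(`Invalid --resume-from: ${value}`);
          opts.resumeFrom = n;
        }
        break;
      }
      default:
        throw new Error(`Unknown option: ${arg}`);
    }
  }
  return opts;
}
```

In the script this would be called as `parseCliArgs(process.argv.slice(2))`.
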
### Mass Schedule Language

Set `language: 'Spanish'` on all created mass schedules.

### Step 1: Create the file

Use `scripts/import-mass-schedules-ph.ts` as the structural template. Implement all functions described above.

### Step 2: Verify build

```bash
npx tsc --noEmit
```

### Step 3: Dry-run test

```bash
npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
```

### Step 4: Commit

```bash
git add scripts/import-horariosmisas.ts
git commit -m "feat: add horariosmisas.com Spain church importer"
```

---

## Task 4: Add to Scheduler Pipeline and npm Scripts

**Files:**
- Modify: `scripts/scheduler.ts`
- Modify: `package.json`

### Step 1: Add to PIPELINE_GROUPS

In `scripts/scheduler.ts`, in the `imports` group (lines ~40-51), add after the `philmass-import` entry:

```typescript
  { name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },
```

### Step 2: Add getJobCommand case

In the `getJobCommand` function (around line 182), before the `default:` case, add:

```typescript
    case 'horariosmisas-import': {
      const args = ['tsx', 'scripts/import-horariosmisas.ts', '--all', '--geocode'];
      if (config?.province) args.push('--province', String(config.province));
      if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
      return { command: 'npx', args };
    }
```

### Step 3: Add npm scripts

In `package.json`, add after the `"import:philmass"` line:

```json
    "import:horariosmisas": "tsx scripts/import-horariosmisas.ts",
```

### Step 4: Verify build

```bash
npx tsc --noEmit
```

### Step 5: Commit

```bash
git add scripts/scheduler.ts package.json
git commit -m "feat: add horariosmisas import to scheduler pipeline"
```

---

## Verification

1. **Dry run on single province**: `npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run`
   - Verify: church names parsed correctly, schedules extracted, matches found
2. **Dry run on Madrid**: `npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run`
   - Verify: larger province, summer/winter schedule selection, address parsing
3. **Single province real import**: `npx tsx scripts/import-horariosmisas.ts --province navarra`
   - Verify: churches created/updated, mass schedules in database
4. **Geocode test**: `npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run`
   - Verify: finds churches needing geocoding, Nominatim returns coordinates
5. **Full import**: `npx tsx scripts/import-horariosmisas.ts --all --geocode`

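## Runtime Estimate

These figures count only the configured delays between requests, so they are lower bounds:

- Sitemap fetch: 20 sitemaps x 1.5s = ~30s
- Import: ~10,000 churches x 1.5s = ~4.2 hours
- Geocode: unmatched count x 1.1s (depends on how many churches remain unmatched)
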
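The arithmetic behind these estimates is shown below as a small sketch. `estimateSeconds` is a hypothetical helper, and the unmatched count of 3,000 is purely illustrative, not a measured figure.

```typescript
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;

/** Time spent waiting between requests, in seconds (request latency excluded). */
function estimateSeconds(count: number, delayMs: number): number {
  return (count * delayMs) / 1000;
}

// 20 sitemaps at 1.5s each: 30 seconds.
const sitemapSeconds = estimateSeconds(20, REQUEST_DELAY_MS);

// 10,000 church pages at 1.5s each: 15,000 seconds, about 4.2 hours.
const importHours = estimateSeconds(10_000, REQUEST_DELAY_MS) / 3600;

// Illustrative only: 3,000 unmatched churches at 1.1s each is 55 minutes.
const geocodeMinutes = estimateSeconds(3_000, NOMINATIM_DELAY_MS) / 60;

console.log({ sitemapSeconds, importHours: importHours.toFixed(1), geocodeMinutes });
```

Because up to three Nominatim queries may be tried per church, the geocoding pass can take up to three times the single-query figure.
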