docs: add Brazil + Spain importers design spec and implementation plan

Two new importers:
- horariodemissa.com.br: 8,895 Brazilian churches + 28,523 mass times
- misas.org: 17,919 Spanish churches with coordinates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
albertfj114
2026-03-10 19:50:54 -04:00
commit 0e468bcb94
2 changed files with 1575 additions and 0 deletions


@@ -0,0 +1,192 @@
# Design: Brazil (horariodemissa.com.br) + Spain (misas.org) Importers
## Overview
Two parallel importers targeting the highest-value uncovered regions:
- **Brazil** — zero current coverage, 8,895 churches + 28,523 mass times
- **Spain supplement** — 17,919 churches with coordinates (fills gaps vs horariosmisas.com's ~10,000)
---
## Importer 1: import-horariodemissa.ts (Brazil)
### Source
- **Site**: https://horariodemissa.com.br
- **Coverage**: All 26 Brazilian states + DF
- **Data**: 8,895 churches, 28,523 mass times (server-rendered, no auth needed)
- **robots.txt**: Only disallows `/404.php` — fully permissive
### Enumeration Strategy
Fetch `https://horariodemissa.com.br/sitemap.xml` → extract unique city URLs filtered to `hl=pt` only (~3,552 unique cities). URL pattern:
```
https://horariodemissa.com.br/search.php?uf={STATE}&cidade={CITY}&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt
```
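The sitemap filtering step could be sketched as follows. This is a minimal illustration, not the importer's actual code: `extractCityUrls` is a hypothetical helper, and it assumes the sitemap lists city pages as plain `<loc>` entries with `&` escaped as `&amp;`. De-duplicating on `uf` + `cidade` is also an assumption about what "unique" means here.

```typescript
// Hypothetical helper: pull <loc> entries out of a sitemap and keep only the
// Portuguese (hl=pt) city search URLs, de-duplicated by state + city.
function extractCityUrls(sitemapXml: string): string[] {
  const seen = new Set<string>()
  const urls: string[] = []
  for (const match of sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    // XML escapes "&" as "&amp;"; undo that before parsing the query string.
    const raw = match[1].trim().replace(/&amp;/g, '&')
    let url: URL
    try {
      url = new URL(raw)
    } catch {
      continue // skip malformed entries
    }
    if (!url.pathname.endsWith('/search.php')) continue
    if (url.searchParams.get('hl') !== 'pt') continue
    const uf = url.searchParams.get('uf')
    const cidade = url.searchParams.get('cidade')
    if (!uf || !cidade) continue
    const key = `${uf}|${cidade}`
    if (seen.has(key)) continue
    seen.add(key)
    urls.push(url.toString())
  }
  return urls
}
```

Keying on state plus city rather than on the full URL means two entries that differ only in an unrelated parameter are still treated as one city.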
### HTML Parsing
Each city page contains `.result` divs (server-rendered). Per church:
- **Key**: `href` of `.result_title` link → `igreja.php?k=XXXXX` (alphanumeric, used as `horarioDemissaId`)
- **Name**: `.result_title` link text
- **Address**: text node after the `<br/>` in the first `<p>` within `.result`
- **Phone**: `<p>` containing `Telefone:`
- **Mass schedule**: first `<table>` — rows with `<td style="text-align:right;font-weight:bold;">DAY:</td><td>TIMES</td>`
- **Confession schedule**: second `<table>` (same structure, times as ranges `HH:MM às HH:MM`)
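As a rough sketch of the per-church extraction above, the snippet below pulls the church key out of a `.result_title` href and reads the day/times cell pairs from one schedule table. Both helper names are hypothetical, and the regular expressions stand in for a real HTML parser only to keep the example dependency-free; they assume the cell markup matches the pattern quoted in the list exactly.

```typescript
interface ScheduleRow {
  dayLabel: string // e.g. "Domingo"
  times: string // raw cell text, e.g. "08:00, 10:00"
}

// Hypothetical helper: extract the alphanumeric key from "igreja.php?k=XXXXX".
function extractChurchKey(href: string): string | null {
  const m = href.match(/igreja\.php\?k=([A-Za-z0-9]+)/)
  return m ? m[1] : null
}

// Hypothetical helper: read "DAY:" / "TIMES" cell pairs from one <table>.
function parseScheduleRows(tableHtml: string): ScheduleRow[] {
  const rowPattern =
    /<td style="text-align:right;font-weight:bold;">\s*([^<]*?):\s*<\/td>\s*<td>([\s\S]*?)<\/td>/g
  const rows: ScheduleRow[] = []
  for (const m of tableHtml.matchAll(rowPattern)) {
    // Drop any nested tags in the times cell and collapse whitespace.
    const times = m[2].replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
    rows.push({ dayLabel: m[1].trim(), times })
  }
  return rows
}
```

The same row parser would apply to the confession table, since it shares the structure; only the interpretation of the times cell differs.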
### Day Name Mapping
| Portuguese | dayOfWeek |
|---|---|
| Domingo | 0 (Sunday) |
| Segunda-feira | 1 (Monday) |
| Terça-feira | 2 (Tuesday) |
| Quarta-feira | 3 (Wednesday) |
| Quinta-feira | 4 (Thursday) |
| Sexta-feira | 5 (Friday) |
| Sábado | 6 (Saturday) |
| Primeiro Sábado | 6, notes="Primeiro Sábado" |
| Segundo Domingo | 0, notes="Segundo Domingo" |
Times are in 24-hour `HH:MM` format and need no conversion. Multiple times in one cell are comma-separated.
Parenthesized notes, e.g. `(Forma Extraordinária do Rito Romano)`, are stripped from the time text and stored as `massType` or `notes`.
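The mapping and time rules above might translate into something like the following sketch. The function names are hypothetical. Treating any qualified label as "last word is the day, full label becomes the note" generalizes the two examples in the table and is an assumption, as is attaching a parenthesized note to the whole cell rather than to a single time.

```typescript
const BASE_DAYS = new Map<string, number>([
  ['domingo', 0],
  ['segunda-feira', 1],
  ['terça-feira', 2],
  ['quarta-feira', 3],
  ['quinta-feira', 4],
  ['sexta-feira', 5],
  ['sábado', 6],
])

interface ParsedDay {
  dayOfWeek: number
  notes?: string
}

// Map a Portuguese day label to dayOfWeek; qualified labels keep the label as a note.
function parseDayLabel(label: string): ParsedDay | null {
  const cleaned = label.normalize('NFC').trim().replace(/:$/, '')
  const lower = cleaned.toLowerCase()
  const direct = BASE_DAYS.get(lower)
  if (direct !== undefined) return { dayOfWeek: direct }
  // Assumption: in labels like "Primeiro Sábado" the last word names the day.
  const lastWord = lower.split(/\s+/).pop() ?? ''
  const qualified = BASE_DAYS.get(lastWord)
  if (qualified !== undefined) return { dayOfWeek: qualified, notes: cleaned }
  return null
}

interface ParsedTimes {
  times: string[]
  note?: string
}

// Split a mass-times cell into HH:MM values, pulling out parenthesized notes.
function parseTimesCell(cell: string): ParsedTimes {
  const notes: string[] = []
  const withoutNotes = cell.replace(/\(([^)]*)\)/g, (_whole, inner: string) => {
    notes.push(inner.trim())
    return ''
  })
  const times = withoutNotes
    .split(',')
    .map((part) => part.trim())
    .filter((part) => /^([01]\d|2[0-3]):[0-5]\d$/.test(part))
  return notes.length > 0 ? { times, note: notes.join('; ') } : { times }
}
```

Confession ranges (`HH:MM às HH:MM`) do not match the single-time pattern and would need their own parser.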
### Matching Strategy
1. `horarioDemissaId` exact match (for re-runs)
2. Name + proximity (200m) against existing BR churches (some may exist from OSM)
3. Unmatched: create new church, country=BR, no coordinates
### Schema Addition
```prisma
horarioDemissaId String? @unique @map("horario_demissa_id")
@@index([horarioDemissaId])
```
### CLI
```bash
npx tsx scripts/import-horariodemissa.ts --all
npx tsx scripts/import-horariodemissa.ts --all --dry-run
npx tsx scripts/import-horariodemissa.ts --state SP
npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
npx tsx scripts/import-horariodemissa.ts --all --geocode # Nominatim pass
npx tsx scripts/import-horariodemissa.ts --geocode-only
npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
```
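For illustration, the flags listed above could be parsed with a small hand-rolled loop such as the one below. `ImporterOptions` and `parseImporterArgs` are hypothetical names, and the validation rules (upper-casing the state code, rejecting unknown flags) are assumptions rather than documented behavior.

```typescript
interface ImporterOptions {
  all: boolean
  dryRun: boolean
  geocode: boolean
  geocodeOnly: boolean
  resumeFrom: number
  state?: string
  jobId?: string
}

// Parse the flag list (e.g. process.argv.slice(2)) into a typed options object.
function parseImporterArgs(argv: string[]): ImporterOptions {
  const opts: ImporterOptions = {
    all: false,
    dryRun: false,
    geocode: false,
    geocodeOnly: false,
    resumeFrom: 0,
  }
  let i = 0
  while (i < argv.length) {
    const arg = argv[i++]
    // Consume the next token as the value of a flag that requires one.
    const takeValue = (): string => {
      const value = argv[i++]
      if (value === undefined || value.startsWith('--')) {
        throw new Error(`Missing value for ${arg}`)
      }
      return value
    }
    switch (arg) {
      case '--all': opts.all = true; break
      case '--dry-run': opts.dryRun = true; break
      case '--geocode': opts.geocode = true; break
      case '--geocode-only': opts.geocodeOnly = true; break
      case '--state': opts.state = takeValue().toUpperCase(); break
      case '--job-id': opts.jobId = takeValue(); break
      case '--resume-from': {
        const n = Number.parseInt(takeValue(), 10)
        if (!Number.isInteger(n) || n < 0) {
          throw new Error('--resume-from expects a non-negative integer')
        }
        opts.resumeFrom = n
        break
      }
      default:
        throw new Error(`Unknown flag: ${arg}`)
    }
  }
  return opts
}
```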
### Rate Limiting
- City pages: 1.5s between requests (~3,552 × 1.5s ≈ 1.5 hours)
- Geocode (optional): 1.1s between Nominatim requests
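A fixed delay between sequential requests is enough to honor these limits. The sketch below shows one way to do that together with the runtime arithmetic; `forEachThrottled` and `estimateRuntimeHours` are hypothetical helpers, and treating the resume point as an index into the work list is an assumption.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms))

// Process items one at a time, pausing delayMs between consecutive items.
// startIndex lets a run skip work that was already completed.
async function forEachThrottled<T>(
  items: readonly T[],
  delayMs: number,
  handler: (item: T, index: number) => Promise<void>,
  startIndex = 0,
): Promise<void> {
  for (let i = startIndex; i < items.length; i++) {
    await handler(items[i], i)
    if (i < items.length - 1) await sleep(delayMs) // no pause after the last item
  }
}

// Lower bound on runtime from the delays alone (ignores request latency).
function estimateRuntimeHours(requests: number, delaySeconds: number): number {
  return (requests * delaySeconds) / 3600
}

// 3,552 city pages at 1.5 s each: 5,328 s, i.e. about 1.48 hours.
const cityPassHours = estimateRuntimeHours(3552, 1.5)
```

The estimate counts only the pauses, so actual wall-clock time will be somewhat higher once request latency and database writes are included.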
---
## Importer 2: import-misas.ts (Spain)
### Source
- **Site**: https://misas.org
- **Coverage**: Spain only (the site claims Latin America coverage, but the API returns 0 results for MX/AR/CO)
- **Data**: 17,919 churches with coordinates, name, address, province, zip
- **No mass schedules**: detail API returns 401 — church directory only
### API
```
GET https://misas.org/api/parishsearch?country=es&pos=[-3.7038,40.4168,999999]&offset=0&limit=500
```
Response:
```json
{
"count": 17919,
"pars": [
{
"id": 16604,
"name": "Parròquia de Sant Lliser",
"uri": "parroquia-de-sant-lliser-alos-disil",
"addr": "Carrer Bonabe, 4",
"loc": "Alòs d'Isil",
"prov": "Lérida",
"zip": "25586",
"lat": "42.701074",
"long": "1.100028"
}
]
}
```
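Since `lat` and `long` arrive as strings and `id` as a number, each record needs a small normalization step before it is written. The types below mirror the sample response; `NormalizedChurch` and `normalizeParish` are hypothetical names, and the target field names other than `misasOrgId` are illustrative assumptions.

```typescript
// Shape of one entry in "pars", as seen in the sample response.
interface MisasParish {
  id: number
  name: string
  uri: string
  addr: string
  loc: string
  prov: string
  zip: string
  lat: string
  long: string
}

interface MisasSearchResponse {
  count: number
  pars: MisasParish[]
}

// Hypothetical internal shape; only misasOrgId is named in the spec.
interface NormalizedChurch {
  misasOrgId: string
  name: string
  address: string
  city: string
  province: string
  postalCode: string
  latitude: number
  longitude: number
  country: 'ES'
}

// Convert one API record, returning null when the coordinates are unusable.
function normalizeParish(p: MisasParish): NormalizedChurch | null {
  const latitude = Number.parseFloat(p.lat)
  const longitude = Number.parseFloat(p.long)
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return null
  return {
    misasOrgId: String(p.id), // the schema field is a String, the API id a number
    name: p.name.trim(),
    address: p.addr.trim(),
    city: p.loc.trim(),
    province: p.prov.trim(),
    postalCode: p.zip.trim(),
    latitude,
    longitude,
    country: 'ES',
  }
}
```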
### Enumeration Strategy
Paginate with `offset` in steps of 500 until all 17,919 churches are fetched (~36 requests). The `pos` parameter is ordered `[longitude, latitude, radius]`; passing the Madrid city-center coordinates with a radius of 999999 covers all of Spain.
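The pagination loop could look roughly like this. It is a sketch under assumptions: the page fetcher is injected so the loop can be exercised without network access, `fetchAllPages`, `pageOffsets`, and `buildSearchUrl` are hypothetical names, and `startOffset` is only one possible reading of how a resume point maps onto the API.

```typescript
type FetchPage<T> = (
  offset: number,
  limit: number,
) => Promise<{ count: number; pars: T[] }>

// URL for one page, using the fixed Madrid-centered search from the spec.
function buildSearchUrl(offset: number, limit: number): string {
  return (
    'https://misas.org/api/parishsearch?country=es' +
    `&pos=[-3.7038,40.4168,999999]&offset=${offset}&limit=${limit}`
  )
}

// All offsets needed to cover `total` records in pages of `limit`.
function pageOffsets(total: number, limit: number): number[] {
  const offsets: number[] = []
  for (let offset = 0; offset < total; offset += limit) offsets.push(offset)
  return offsets
}

// Fetch pages sequentially until the reported count is reached or a page is empty.
async function fetchAllPages<T>(
  fetchPage: FetchPage<T>,
  limit = 500,
  delayMs = 500,
  startOffset = 0,
): Promise<T[]> {
  const all: T[] = []
  let offset = startOffset
  while (true) {
    const page = await fetchPage(offset, limit)
    all.push(...page.pars)
    offset += limit
    // Stop on an empty page as well, in case "count" is stale or wrong.
    if (page.pars.length === 0 || offset >= page.count) break
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs))
  }
  return all
}
```

With 17,919 records and a limit of 500, `pageOffsets` yields 36 offsets, the last being 17,500.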
### Matching Strategy
1. `misasOrgId` exact match (for re-runs)
2. Name + proximity (200m) against existing ES churches
3. Unmatched: create new church with coordinates, country=ES
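The three-step cascade can be illustrated with a haversine distance check. This is not the `findDuplicateChurch` implementation from `church-matcher`, whose behavior is not shown here; `matchChurch`, `normalizeName`, and the `Candidate` shape are hypothetical, and comparing names for exact equality after accent and punctuation stripping is a simplifying assumption.

```typescript
interface Candidate {
  id: string
  name: string
  lat: number
  lon: number
  misasOrgId?: string | null
}

type MatchResult =
  | { kind: 'id' | 'name+proximity'; church: Candidate }
  | { kind: 'new' }

// Great-circle distance in meters between two WGS84 points.
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371000 // mean Earth radius in meters
  const toRad = (deg: number): number => (deg * Math.PI) / 180
  const dLat = toRad(lat2 - lat1)
  const dLon = toRad(lon2 - lon1)
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2
  return 2 * R * Math.asin(Math.sqrt(a))
}

// Lower-case, strip accents and punctuation so minor spelling variants compare equal.
function normalizeName(name: string): string {
  return name
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, ' ')
    .trim()
}

function matchChurch(
  incoming: { misasOrgId: string; name: string; lat: number; lon: number },
  existing: readonly Candidate[],
  radiusMeters = 200,
): MatchResult {
  // 1. Exact source-ID match (covers re-runs).
  const byId = existing.find((c) => c.misasOrgId === incoming.misasOrgId)
  if (byId) return { kind: 'id', church: byId }
  // 2. Same normalized name within the radius.
  const wanted = normalizeName(incoming.name)
  const nearby = existing.find(
    (c) =>
      normalizeName(c.name) === wanted &&
      haversineMeters(incoming.lat, incoming.lon, c.lat, c.lon) <= radiusMeters,
  )
  if (nearby) return { kind: 'name+proximity', church: nearby }
  // 3. No match: caller creates a new record.
  return { kind: 'new' }
}
```

Checking the source ID first keeps re-runs idempotent even if a church's name or coordinates were edited after the initial import.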
No mass schedules written — church record only.
### Schema Addition
```prisma
misasOrgId String? @unique @map("misas_org_id")
@@index([misasOrgId])
```
### CLI
```bash
npx tsx scripts/import-misas.ts --all
npx tsx scripts/import-misas.ts --all --dry-run
npx tsx scripts/import-misas.ts --all --resume-from 5000
npx tsx scripts/import-misas.ts --all --job-id {uuid}
```
### Rate Limiting
- API pagination: 500ms between requests (~36 calls, minimal impact)
---
## Shared Implementation Patterns
Both scripts follow the standard importer pattern:
```typescript
// DB setup
dotenv.config(...)
const pool = new Pool({ connectionString: DATABASE_URL })
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) })
// church-matcher integration
import { findDuplicateChurch } from '../src/lib/church-matcher'
// ExistingChurch interface gets new ID fields added
// Standard flags: --all, --dry-run, --resume-from N, --job-id UUID
// Stats output shape: { total, created, updated, skipped, errors }
```
Both importers are added to:
- `package.json` scripts
- Scheduler pipeline (sequential imports group)
- `church-matcher.ts` ExistingChurch interface
---
## Estimated Scale
| | Brazil | Spain |
|---|---|---|
| Churches | 8,895 (all new) | 17,919 (~7,000 new vs horariosmisas) |
| Mass times | 28,523 | 0 (no schedule access) |
| Runtime | ~1.5h | ~5 min |
| Coordinates | No (address only) | Yes |