Two new importers: - horariodemissa.com.br: 8,895 Brazilian churches + 28,523 mass times - misas.org: 17,919 Spanish churches with coordinates Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Design: Brazil (horariodemissa.com.br) + Spain (misas.org) Importers

## Overview

Two parallel importers targeting the highest-value uncovered regions:

- **Brazil**: zero current coverage, 8,895 churches + 28,523 mass times
- **Spain supplement**: 17,919 churches with coordinates (fills gaps vs horariosmisas.com's ~10,000)

---
## Importer 1: import-horariodemissa.ts (Brazil)

### Source

- **Site**: https://horariodemissa.com.br
- **Coverage**: All 26 Brazilian states + DF
- **Data**: 8,895 churches, 28,523 mass times (server-rendered, no auth needed)
- **robots.txt**: Only disallows `/404.php`; otherwise fully permissive
### Enumeration Strategy

Fetch `https://horariodemissa.com.br/sitemap.xml`, then extract the unique city URLs, keeping only those with `hl=pt` (~3,552 unique cities). URL pattern:

```
https://horariodemissa.com.br/search.php?uf={STATE}&cidade={CITY}&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt
```
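The enumeration step above could be sketched roughly as follows. This is a minimal illustration, not the importer's actual code: the helper name `extractCityUrls` is hypothetical, and it assumes the sitemap wraps each URL in `<loc>` tags and escapes `&` as `&amp;`, as standard sitemap XML does.

```typescript
// Hypothetical helper: collect one Portuguese-language search URL per (state, city).
function extractCityUrls(sitemapXml: string): string[] {
  const byCity = new Map<string, string>();
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    // Undo XML escaping before inspecting the query string.
    const raw = match[1].replace(/&amp;/g, "&");
    let url: URL;
    try {
      url = new URL(raw);
    } catch (err) {
      continue; // skip malformed entries
    }
    const uf = url.searchParams.get("uf");
    const cidade = url.searchParams.get("cidade");
    const isCityPage =
      url.pathname === "/search.php" &&
      url.searchParams.get("hl") === "pt" &&
      uf !== null &&
      cidade !== null;
    if (!isCityPage) continue;
    // De-duplicate on state + city so each city is fetched once.
    const key = `${uf}|${cidade}`;
    if (!byCity.has(key)) byCity.set(key, raw);
  }
  return Array.from(byCity.values());
}
```

Keying the de-duplication on `uf` + `cidade` rather than on the full URL is a design choice in this sketch; it guards against the same city appearing with slightly different query strings.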
### HTML Parsing

Each city page contains `.result` divs (server-rendered). Per church:

- **Key**: `href` of `.result_title` link → `igreja.php?k=XXXXX` (alphanumeric, used as `horarioDemissaId`)
- **Name**: `.result_title` link text
- **Address**: text node after the `<br/>` in the first `<p>` within `.result`
- **Phone**: `<p>` containing `Telefone:`
- **Mass schedule**: first `<table>`, with rows of the form `<td style="text-align:right;font-weight:bold;">DAY:</td><td>TIMES</td>`
- **Confession schedule**: second `<table>` (same structure, times as ranges `HH:MM às HH:MM`)
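As a rough illustration of the key and schedule extraction listed above, the sketch below uses plain regular expressions so that it stays dependency-free. The function names are hypothetical, and a real importer would more likely use an HTML parser; the row pattern assumes the markup matches the `<td>` structure quoted in the list.

```typescript
interface ScheduleRow {
  dayLabel: string; // e.g. "Domingo"
  times: string;    // raw cell text, e.g. "08:00, 10:00"
}

// Extract the alphanumeric key from an href such as "igreja.php?k=XXXXX".
function extractChurchKey(href: string): string | null {
  const match = /igreja\.php\?k=([A-Za-z0-9]+)/.exec(href);
  return match ? match[1] : null;
}

// Pull (day label, raw times) pairs out of the HTML of one schedule table.
function parseScheduleRows(tableHtml: string): ScheduleRow[] {
  const rows: ScheduleRow[] = [];
  const rowPattern =
    /<td[^>]*font-weight:\s*bold[^>]*>\s*([^<]*?):\s*<\/td>\s*<td[^>]*>([\s\S]*?)<\/td>/g;
  let match: RegExpExecArray | null;
  while ((match = rowPattern.exec(tableHtml)) !== null) {
    rows.push({
      dayLabel: match[1].trim(),
      // Drop any nested tags inside the times cell and trim whitespace.
      times: match[2].replace(/<[^>]+>/g, "").trim(),
    });
  }
  return rows;
}
```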
### Day Name Mapping

| Portuguese | dayOfWeek |
|---|---|
| Domingo | 0 (Sunday) |
| Segunda-feira | 1 (Monday) |
| Terça-feira | 2 (Tuesday) |
| Quarta-feira | 3 (Wednesday) |
| Quinta-feira | 4 (Thursday) |
| Sexta-feira | 5 (Friday) |
| Sábado | 6 (Saturday) |
| Primeiro Sábado | 6, notes="Primeiro Sábado" |
| Segundo Domingo | 0, notes="Segundo Domingo" |

Time format: `HH:MM` (24-hour, so no conversion is needed). Multiple times are comma-separated.
Notes in parentheses, e.g. `(Forma Extraordinária do Rito Romano)`, are stripped from the time string and stored as `massType` or `notes`.
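The mapping and time rules above might translate into something like the following sketch. The names `parseDayLabel` and `parseTimes` are hypothetical, and the handling of ordinal labels (matching on the last word and keeping the full label as a note) is an assumption generalised from the two ordinal rows in the table.

```typescript
const DAY_OF_WEEK = new Map<string, number>([
  ["domingo", 0],
  ["segunda-feira", 1],
  ["terça-feira", 2],
  ["quarta-feira", 3],
  ["quinta-feira", 4],
  ["sexta-feira", 5],
  ["sábado", 6],
]);

interface ParsedDay {
  dayOfWeek: number;
  notes?: string;
}

function parseDayLabel(label: string): ParsedDay | null {
  // Normalise to NFC so accented letters compare equal however they were encoded.
  const trimmed = label.normalize("NFC").trim().replace(/:$/, "");
  const lower = trimmed.toLowerCase();
  const direct = DAY_OF_WEEK.get(lower);
  if (direct !== undefined) return { dayOfWeek: direct };
  // Ordinal labels such as "Primeiro Sábado": look up the last word, keep the label as a note.
  const words = lower.split(/\s+/);
  const viaLastWord = DAY_OF_WEEK.get(words[words.length - 1]);
  if (viaLastWord !== undefined) return { dayOfWeek: viaLastWord, notes: trimmed };
  return null;
}

interface ParsedTimes {
  times: string[];
  note?: string;
}

function parseTimes(cell: string): ParsedTimes {
  // Capture the first parenthetical note, then strip all parentheticals from the cell.
  const noteMatch = /\(([^)]*)\)/.exec(cell);
  const times = cell
    .replace(/\([^)]*\)/g, "")
    .split(",")
    .map((part) => part.trim())
    .filter((part) => /^\d{2}:\d{2}$/.test(part));
  return noteMatch ? { times, note: noteMatch[1].trim() } : { times };
}
```

Returning `null` for unknown labels lets the caller count the row as skipped instead of guessing a day.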
### Matching Strategy

1. `horarioDemissaId` exact match (for re-runs)
2. Name + proximity (200m) against existing BR churches (some may exist from OSM)
3. Unmatched: create new church, country=BR, no coordinates
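The three-step cascade above could look roughly like this. It is a standalone sketch, not the project's `findDuplicateChurch`: the interfaces, the exact-match name comparison after accent stripping, and the rule that the proximity step is skipped when the incoming church has no coordinates are all assumptions made for illustration.

```typescript
interface ExistingChurchSketch {
  id: string;
  name: string;
  horarioDemissaId: string | null;
  latitude: number | null;
  longitude: number | null;
}

interface IncomingChurch {
  horarioDemissaId: string;
  name: string;
  latitude?: number;
  longitude?: number;
}

// Great-circle distance in metres (haversine formula, mean Earth radius 6,371 km).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(lat2 - lat1) / 2);
  const sinLon = Math.sin(toRad(lon2 - lon1) / 2);
  const a = sinLat * sinLat + Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * sinLon * sinLon;
  return 2 * 6371000 * Math.asin(Math.sqrt(a));
}

// Lower-case, strip accents and punctuation so "Sé" and "Se" compare equal.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function matchChurch(
  incoming: IncomingChurch,
  existing: ExistingChurchSketch[],
  radiusMeters = 200,
): ExistingChurchSketch | null {
  // Step 1: exact source-ID match, which makes re-runs idempotent.
  const byId = existing.find((c) => c.horarioDemissaId === incoming.horarioDemissaId);
  if (byId) return byId;
  // Step 2: name + proximity, only possible when the incoming church has coordinates.
  const { latitude, longitude } = incoming;
  if (latitude === undefined || longitude === undefined) return null;
  const wanted = normalizeName(incoming.name);
  const nearby = existing.find(
    (c) =>
      c.latitude !== null &&
      c.longitude !== null &&
      normalizeName(c.name) === wanted &&
      haversineMeters(latitude, longitude, c.latitude, c.longitude) <= radiusMeters,
  );
  // Step 3: null tells the caller to create a new church.
  return nearby ?? null;
}
```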
### Schema Addition

```prisma
// Added to the existing church model
horarioDemissaId String? @unique @map("horario_demissa_id")

// Note: @unique already creates a unique index, so this explicit index is likely redundant
@@index([horarioDemissaId])
```
### CLI

```bash
npx tsx scripts/import-horariodemissa.ts --all
npx tsx scripts/import-horariodemissa.ts --all --dry-run
npx tsx scripts/import-horariodemissa.ts --state SP
npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
npx tsx scripts/import-horariodemissa.ts --all --geocode       # Nominatim pass
npx tsx scripts/import-horariodemissa.ts --geocode-only
npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
```
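One way the flags above could be parsed is sketched below. The `ImportOptions` shape and the `parseImportArgs` name are hypothetical, and the sketch only parses the flags; what `--resume-from N` indexes into (for example, the city list) is left to the importer.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  resumeFrom: number;
  state?: string;
  jobId?: string;
}

// Parse the flags listed above from an argv slice (e.g. process.argv.slice(2)).
function parseImportArgs(argv: string[]): ImportOptions {
  const options: ImportOptions = {
    all: false,
    dryRun: false,
    geocode: false,
    geocodeOnly: false,
    resumeFrom: 0,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--all": options.all = true; break;
      case "--dry-run": options.dryRun = true; break;
      case "--geocode": options.geocode = true; break;
      case "--geocode-only": options.geocodeOnly = true; break;
      // Flags below consume the next argument as their value.
      case "--state": options.state = argv[++i]; break;
      case "--job-id": options.jobId = argv[++i]; break;
      case "--resume-from": options.resumeFrom = parseInt(argv[++i], 10); break;
      default: throw new Error(`Unknown flag: ${arg}`);
    }
  }
  if (isNaN(options.resumeFrom) || options.resumeFrom < 0) {
    throw new Error("--resume-from expects a non-negative integer");
  }
  return options;
}
```

Rejecting unknown flags early is a deliberate choice here: a typo such as `--dryrun` fails instead of silently running a real import.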
### Rate Limiting

- City pages: 1.5s between requests (~3,552 × 1.5s ≈ 1.5 hours)
- Geocode (optional): 1.1s between Nominatim requests
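A sequential throttle along these lines would satisfy both delays. This is a sketch with hypothetical helper names; `estimateDelayHours` only accounts for the pauses, so actual runtime will be somewhat longer once request latency is included.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `task` over the items one at a time, pausing `delayMs` between requests.
// `startIndex` allows a run to pick up where an earlier one stopped.
async function forEachThrottled<T>(
  items: T[],
  delayMs: number,
  task: (item: T, index: number) => Promise<void>,
  startIndex = 0,
): Promise<void> {
  for (let i = startIndex; i < items.length; i++) {
    await task(items[i], i);
    // No pause is needed after the final item.
    if (i < items.length - 1) await sleep(delayMs);
  }
}

// Time spent in delays alone, in hours: 3,552 requests at 1,500 ms gives 1.48 h.
function estimateDelayHours(requestCount: number, delayMs: number): number {
  return (requestCount * delayMs) / 3600000;
}
```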
---
## Importer 2: import-misas.ts (Spain)

### Source

- **Site**: https://misas.org
- **Coverage**: Spain only (the site claims LatAm coverage, but the API returns 0 results for MX/AR/CO)
- **Data**: 17,919 churches with coordinates, name, address, province, zip
- **No mass schedules**: the detail API returns 401, so this source is a church directory only
### API

```
GET https://misas.org/api/parishsearch?country=es&pos=[-3.7038,40.4168,999999]&offset=0&limit=500
```

Response:

```json
{
  "count": 17919,
  "pars": [
    {
      "id": 16604,
      "name": "Parròquia de Sant Lliser",
      "uri": "parroquia-de-sant-lliser-alos-disil",
      "addr": "Carrer Bonabe, 4",
      "loc": "Alòs d'Isil",
      "prov": "Lérida",
      "zip": "25586",
      "lat": "42.701074",
      "long": "1.100028"
    }
  ]
}
```
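The response above could be typed and normalised along these lines. The `MisasParish` fields mirror the sample payload; `ChurchCandidate` and `toChurchCandidate` are hypothetical names for whatever intermediate shape the importer uses. Note that `lat` and `long` arrive as strings and need converting.

```typescript
// One entry of `pars`, as shown in the sample response.
interface MisasParish {
  id: number;
  name: string;
  uri: string;
  addr: string;
  loc: string;
  prov: string;
  zip: string;
  lat: string;
  long: string;
}

interface ParishSearchResponse {
  count: number;
  pars: MisasParish[];
}

// Hypothetical intermediate shape handed to the matching step.
interface ChurchCandidate {
  misasOrgId: string;
  name: string;
  address: string;
  city: string;
  province: string;
  postalCode: string;
  latitude: number;
  longitude: number;
  country: "ES";
}

function toChurchCandidate(par: MisasParish): ChurchCandidate | null {
  // parseFloat yields NaN for empty or malformed strings, which are skipped.
  const latitude = parseFloat(par.lat);
  const longitude = parseFloat(par.long);
  if (isNaN(latitude) || isNaN(longitude)) return null;
  return {
    misasOrgId: String(par.id), // stored as a string to match the String? schema field
    name: par.name.trim(),
    address: par.addr.trim(),
    city: par.loc.trim(),
    province: par.prov.trim(),
    postalCode: par.zip.trim(),
    latitude,
    longitude,
    country: "ES",
  };
}
```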
### Enumeration Strategy

Paginate with `offset` in steps of 500 until all 17,919 churches are fetched (~36 requests). Use the Madrid center coordinates with radius=999999 to cover all of Spain.
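The pagination described above might be sketched as follows. The helper names are hypothetical, and the page fetcher is injected as a parameter so the sketch does not commit to a particular HTTP client; the `pos` value is copied verbatim from the documented request.

```typescript
// Offsets needed to cover `total` records in pages of `limit`.
function pageOffsets(total: number, limit: number): number[] {
  const offsets: number[] = [];
  for (let offset = 0; offset < total; offset += limit) offsets.push(offset);
  return offsets;
}

function buildSearchUrl(offset: number, limit = 500): string {
  return (
    "https://misas.org/api/parishsearch?country=es" +
    `&pos=[-3.7038,40.4168,999999]&offset=${offset}&limit=${limit}`
  );
}

// Fetch every page in sequence, using `count` from the first response to plan the rest.
async function fetchAllParishes<T>(
  fetchPage: (url: string) => Promise<{ count: number; pars: T[] }>,
  limit = 500,
  delayMs = 500,
): Promise<T[]> {
  const first = await fetchPage(buildSearchUrl(0, limit));
  const all: T[] = first.pars.slice();
  for (const offset of pageOffsets(first.count, limit).slice(1)) {
    // Pause between requests to stay polite to the API.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    const page = await fetchPage(buildSearchUrl(offset, limit));
    if (page.pars.length === 0) break; // stop early if the API runs dry
    for (const par of page.pars) all.push(par);
  }
  return all;
}
```

With 17,919 records and a limit of 500, `pageOffsets` yields 36 offsets (0 through 17,500), matching the request count quoted above.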
### Matching Strategy

1. `misasOrgId` exact match (for re-runs)
2. Name + proximity (200m) against existing ES churches
3. Unmatched: create new church with coordinates, country=ES

No mass schedules are written; only the church record is stored.
### Schema Addition

```prisma
misasOrgId String? @unique @map("misas_org_id")

@@index([misasOrgId])
```
### CLI

```bash
npx tsx scripts/import-misas.ts --all
npx tsx scripts/import-misas.ts --all --dry-run
npx tsx scripts/import-misas.ts --all --resume-from 5000
npx tsx scripts/import-misas.ts --all --job-id {uuid}
```
### Rate Limiting

- API pagination: 500ms between requests (~36 calls, minimal impact)

---
## Shared Implementation Patterns

Both scripts follow the standard importer pattern:

```typescript
// Imports (package paths are assumed from the identifiers used below)
import dotenv from 'dotenv'
import { Pool } from 'pg'
import { PrismaClient } from '@prisma/client'
import { PrismaPg } from '@prisma/adapter-pg'
// church-matcher integration: the ExistingChurch interface gets the new ID fields added
import { findDuplicateChurch } from '../src/lib/church-matcher'

// DB setup
dotenv.config(/* ... */)
const { DATABASE_URL } = process.env
const pool = new Pool({ connectionString: DATABASE_URL })
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) })

// Standard flags: --all, --dry-run, --resume-from N, --job-id UUID

// Stats output: { total, created, updated, skipped, errors }
```

Both are added to:

- `package.json` scripts
- Scheduler pipeline (sequential imports group)
- `church-matcher.ts` ExistingChurch interface

---
## Estimated Scale

| Metric | Brazil | Spain |
|---|---|---|
| Churches | 8,895 (all new) | 17,919 (~7,000 new vs horariosmisas) |
| Mass times | 28,523 | 0 (no schedule access) |
| Runtime | ~1.5h | ~5 min |
| Coordinates | No (address only) | Yes |