ScraperControl/docs/superpowers/specs/2026-03-10-brazil-spain-importers-design.md
albertfj114 0e468bcb94 docs: add Brazil + Spain importers design spec and implementation plan
Two new importers:
- horariodemissa.com.br: 8,895 Brazilian churches + 28,523 mass times
- misas.org: 17,919 Spanish churches with coordinates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:50:54 -04:00

# Design: Brazil (horariodemissa.com.br) + Spain (misas.org) Importers
## Overview
Two parallel importers targeting the highest-value uncovered regions:
- **Brazil** — zero current coverage, 8,895 churches + 28,523 mass times
- **Spain supplement** — 17,919 churches with coordinates (fills gaps vs horariosmisas.com's ~10,000)
---
## Importer 1: import-horariodemissa.ts (Brazil)
### Source
- **Site**: https://horariodemissa.com.br
- **Coverage**: All 26 Brazilian states + DF
- **Data**: 8,895 churches, 28,523 mass times (server-rendered, no auth needed)
- **robots.txt**: Only disallows `/404.php` — fully permissive
### Enumeration Strategy
Fetch `https://horariodemissa.com.br/sitemap.xml` → extract unique city URLs filtered to `hl=pt` only (~3,552 unique cities). URL pattern:
```
https://horariodemissa.com.br/search.php?uf={STATE}&cidade={CITY}&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt
```
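The filtering step above could look roughly like the following. This is a minimal sketch, assuming the sitemap lists each URL inside a `<loc>` element with `&` escaped as `&amp;`; `extractCityUrls` is a hypothetical helper name, and deduplicating on `uf` + `cidade` is an assumption about what "unique" means here.

```typescript
// Hypothetical helper: pull unique hl=pt city search URLs out of sitemap XML.
function extractCityUrls(sitemapXml: string): string[] {
  const seen = new Set<string>();
  const urls: string[] = [];
  // Assumption: each sitemap entry is wrapped in <loc>...</loc>.
  for (const match of sitemapXml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)) {
    // XML escapes "&" as "&amp;", so undo that before parsing the query string.
    const raw = match[1].replace(/&amp;/g, "&");
    let url: URL;
    try {
      url = new URL(raw);
    } catch {
      continue; // skip malformed entries
    }
    if (!url.pathname.endsWith("/search.php")) continue;
    if (url.searchParams.get("hl") !== "pt") continue;
    const uf = url.searchParams.get("uf") ?? "";
    const cidade = url.searchParams.get("cidade") ?? "";
    if (!uf || !cidade) continue;
    // Assumption: a city is identified by state + city name.
    const key = `${uf}|${cidade}`.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    urls.push(url.toString());
  }
  return urls;
}
```

The returned array preserves sitemap order, which keeps an index-based `--resume-from` stable between runs as long as the sitemap itself does not change.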
### HTML Parsing
Each city page contains `.result` divs (server-rendered). Per church:
- **Key**: `href` of `.result_title` link → `igreja.php?k=XXXXX` (alphanumeric, used as `horarioDemissaId`)
- **Name**: `.result_title` link text
- **Address**: text node after the `<br/>` in the first `<p>` within `.result`
- **Phone**: `<p>` containing `Telefone:`
- **Mass schedule**: first `<table>` — rows with `<td style="text-align:right;font-weight:bold;">DAY:</td><td>TIMES</td>`
- **Confession schedule**: second `<table>` (same structure, times as ranges `HH:MM às HH:MM`)
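To illustrate the selectors above, here is a dependency-free sketch that extracts the church key and the DAY/TIMES pairs from a schedule table. It assumes the `<td>` markup is exactly as quoted in the list; `extractChurchKey` and `extractScheduleRows` are hypothetical names, and a real importer would more likely use an HTML parser than regular expressions.

```typescript
interface RawScheduleRow {
  dayLabel: string; // e.g. "Domingo"
  timesText: string; // e.g. "08:00, 10:00"
}

// Hypothetical helper: read the church key out of an "igreja.php?k=XXXXX" href.
function extractChurchKey(href: string): string | null {
  const match = /igreja\.php\?k=([A-Za-z0-9]+)/.exec(href);
  return match ? match[1] : null;
}

// Hypothetical helper: collect DAY/TIMES pairs from one schedule <table>.
// Assumption: the label cell carries exactly the inline style quoted in the spec.
function extractScheduleRows(tableHtml: string): RawScheduleRow[] {
  const rowPattern =
    /<td style="text-align:right;font-weight:bold;">\s*([^<]*?):\s*<\/td>\s*<td>([\s\S]*?)<\/td>/g;
  const rows: RawScheduleRow[] = [];
  for (const match of tableHtml.matchAll(rowPattern)) {
    const dayLabel = match[1].trim();
    // Drop any nested tags and collapse whitespace in the times cell.
    const timesText = match[2].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    if (dayLabel) rows.push({ dayLabel, timesText });
  }
  return rows;
}
```

Because the mass and confession tables share the same row structure, the same function can be applied to each table's HTML in turn.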
### Day Name Mapping
| Portuguese | dayOfWeek |
|---|---|
| Domingo | 0 (Sunday) |
| Segunda-feira | 1 (Monday) |
| Terça-feira | 2 (Tuesday) |
| Quarta-feira | 3 (Wednesday) |
| Quinta-feira | 4 (Thursday) |
| Sexta-feira | 5 (Friday) |
| Sábado | 6 (Saturday) |
| Primeiro Sábado | 6, notes="Primeiro Sábado" |
| Segundo Domingo | 0, notes="Segundo Domingo" |
Time format: `HH:MM` (24h, already in correct format). Multiple times comma-separated.
Parenthesised notes, e.g. `(Forma Extraordinária do Rito Romano)`, are stripped from the time string and stored as `massType` or `notes`.
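The mapping table and time format can be combined into one row parser. The sketch below is illustrative only: `parseScheduleRow` and `ParsedTime` are hypothetical names, the weekday is assumed to be the last word of the label, and a parenthesised note is assumed to apply to every time in its row (the spec does not say which time a note belongs to).

```typescript
interface ParsedTime {
  dayOfWeek: number; // 0 = Sunday ... 6 = Saturday
  time: string; // "HH:MM", 24h
  notes: string | null;
}

const WEEKDAYS: Record<string, number> = {
  "domingo": 0,
  "segunda-feira": 1,
  "terça-feira": 2,
  "quarta-feira": 3,
  "quinta-feira": 4,
  "sexta-feira": 5,
  "sábado": 6,
};

// Hypothetical parser for one schedule row (day label + times cell).
function parseScheduleRow(dayLabel: string, timesText: string): ParsedTime[] {
  const label = dayLabel.normalize("NFC").trim().replace(/:$/, "");
  const words = label.toLowerCase().split(/\s+/);
  // Assumption: the weekday is the last word ("Primeiro Sábado" -> "sábado").
  const dayOfWeek = WEEKDAYS[words[words.length - 1]];
  if (dayOfWeek === undefined) return []; // unknown label: skip the row
  const notes: string[] = [];
  // A qualifier such as "Primeiro" keeps the full label as a note.
  if (words.length > 1) notes.push(label);
  // Pull parenthesised remarks out of the times cell.
  const withoutNotes = timesText.replace(/\(([^)]*)\)/g, (_m, inner: string) => {
    if (inner.trim()) notes.push(inner.trim());
    return " ";
  });
  const times = withoutNotes.match(/\b\d{2}:\d{2}\b/g) ?? [];
  const joined = notes.length > 0 ? notes.join("; ") : null;
  return times.map((time) => ({ dayOfWeek, time, notes: joined }));
}
```

For example, `parseScheduleRow("Primeiro Sábado", "09:00")` yields a single Saturday entry whose `notes` field carries the original label.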
### Matching Strategy
1. `horarioDemissaId` exact match (for re-runs)
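2. Name + proximity (200m) against existing BR churches (some may exist from OSM); the proximity check needs coordinates, which this source does not provide until a geocode pass has run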
3. Unmatched: create new church, country=BR, no coordinates
### Schema Addition
```prisma
horarioDemissaId String? @unique @map("horario_demissa_id")
// @unique already creates an index, so a separate @@index([horarioDemissaId]) is redundant
```
### CLI
```bash
npx tsx scripts/import-horariodemissa.ts --all
npx tsx scripts/import-horariodemissa.ts --all --dry-run
npx tsx scripts/import-horariodemissa.ts --state SP
npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
npx tsx scripts/import-horariodemissa.ts --all --geocode # Nominatim pass
npx tsx scripts/import-horariodemissa.ts --geocode-only
npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
```
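One way to handle these flags without a CLI library is sketched below. `parseImportArgs` and `ImportOptions` are hypothetical names, not the script's actual code, and the spec does not define what unit `--resume-from` counts in, so the value is only validated as a non-negative integer.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  state: string | null;
  resumeFrom: number;
  jobId: string | null;
}

// Hypothetical parser for the flags listed above.
function parseImportArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = {
    all: false, dryRun: false, geocode: false, geocodeOnly: false,
    state: null, resumeFrom: 0, jobId: null,
  };
  let i = 0;
  // Value-taking flags consume the token that follows them.
  const takeValue = (flag: string): string => {
    const value = argv[i++];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`Missing value for ${flag}`);
    }
    return value;
  };
  while (i < argv.length) {
    const flag = argv[i++];
    switch (flag) {
      case "--all": opts.all = true; break;
      case "--dry-run": opts.dryRun = true; break;
      case "--geocode": opts.geocode = true; break;
      case "--geocode-only": opts.geocodeOnly = true; break;
      case "--state": opts.state = takeValue(flag).toUpperCase(); break;
      case "--job-id": opts.jobId = takeValue(flag); break;
      case "--resume-from": {
        const n = Number(takeValue(flag));
        if (!Number.isInteger(n) || n < 0) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      default:
        throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return opts;
}
```

In a script this would typically be called as `parseImportArgs(process.argv.slice(2))`. Rejecting unknown flags early avoids silently running a full import when a flag is misspelled.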
### Rate Limiting
- City pages: 1.5s between requests (~3,552 × 1.5s ≈ 1.5 hours)
- Geocode (optional): 1.1s between Nominatim requests
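The runtime figure follows directly from request count times delay. The small sketch below reproduces that arithmetic; `estimateRuntimeHours` is a hypothetical helper, it ignores network and parsing time (so it is a lower bound), and the geocode figure assumes one Nominatim request per church, which the spec does not state.

```typescript
// Lower-bound runtime: one delay between each pair of consecutive requests.
function estimateRuntimeHours(requestCount: number, delaySeconds: number): number {
  const totalSeconds = Math.max(0, requestCount - 1) * delaySeconds;
  return totalSeconds / 3600;
}

// City pages: ~3,552 requests at 1.5s spacing.
const cityPassHours = estimateRuntimeHours(3552, 1.5);

// Geocode pass, assuming one request per church (8,895) at 1.1s spacing.
const geocodePassHours = estimateRuntimeHours(8895, 1.1);

console.log(cityPassHours.toFixed(1), geocodePassHours.toFixed(1));
```

Under these assumptions the optional geocode pass would take longer than the scrape itself, which may be a reason to keep it behind a separate flag.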
---
## Importer 2: import-misas.ts (Spain)
### Source
- **Site**: https://misas.org
- **Coverage**: Spain only (despite claiming LatAm — API returns 0 for MX/AR/CO)
- **Data**: 17,919 churches with coordinates, name, address, province, zip
- **No mass schedules**: detail API returns 401 — church directory only
### API
```
GET https://misas.org/api/parishsearch?country=es&pos=[-3.7038,40.4168,999999]&offset=0&limit=500
```
Response:
```json
{
"count": 17919,
"pars": [
{
"id": 16604,
"name": "Parròquia de Sant Lliser",
"uri": "parroquia-de-sant-lliser-alos-disil",
"addr": "Carrer Bonabe, 4",
"loc": "Alòs d'Isil",
"prov": "Lérida",
"zip": "25586",
"lat": "42.701074",
"long": "1.100028"
}
]
}
```
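Note that `lat` and `long` arrive as strings and `id` as a number, while `misasOrgId` is a `String?` column. A typed normalizer could handle both conversions; in this sketch the interface names, `toChurchCandidate`, and the target field names other than `misasOrgId` are hypothetical.

```typescript
// Shape of the response shown above.
interface MisasParish {
  id: number;
  name: string;
  uri: string;
  addr: string;
  loc: string;
  prov: string;
  zip: string;
  lat: string; // numeric string
  long: string; // numeric string
}

interface MisasSearchResponse {
  count: number;
  pars: MisasParish[];
}

// Hypothetical internal shape; only misasOrgId is named in the spec.
interface ChurchCandidate {
  misasOrgId: string;
  name: string;
  address: string;
  city: string;
  province: string;
  postalCode: string;
  latitude: number;
  longitude: number;
  country: "ES";
}

// Convert one API record, returning null when coordinates are unusable.
function toChurchCandidate(p: MisasParish): ChurchCandidate | null {
  const latitude = Number.parseFloat(p.lat);
  const longitude = Number.parseFloat(p.long);
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null;
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return null;
  return {
    misasOrgId: String(p.id), // numeric API id stored as a string
    name: p.name.trim(),
    address: p.addr.trim(),
    city: p.loc.trim(),
    province: p.prov.trim(),
    postalCode: p.zip.trim(),
    latitude,
    longitude,
    country: "ES",
  };
}

// Apply the normalizer to a whole page of results.
function normalizePage(page: MisasSearchResponse): ChurchCandidate[] {
  return page.pars
    .map(toChurchCandidate)
    .filter((c): c is ChurchCandidate => c !== null);
}
```

Returning `null` for bad coordinates lets the caller count such records under `skipped` instead of writing a church at an invalid position.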
### Enumeration Strategy
Paginate with `offset` in steps of 500 until all 17,919 churches are fetched (~36 requests). The `pos` parameter is `[longitude, latitude, radius]`: use Madrid's center (-3.7038, 40.4168) with radius=999999 to cover all of Spain.
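The page sequence can be computed up front once the first response reports `count`. This is a sketch under stated assumptions: `buildPageUrls` is a hypothetical helper, and treating a resume value as a record offset is an assumption about how `--resume-from` would map onto this API.

```typescript
const MISAS_SEARCH_URL = "https://misas.org/api/parishsearch";
// Madrid center with a very large radius, as in the request shown above.
const MISAS_POS = "[-3.7038,40.4168,999999]";

// Hypothetical helper: one URL per page, from startOffset up to total.
function buildPageUrls(total: number, limit = 500, startOffset = 0): string[] {
  if (limit <= 0) throw new Error("limit must be positive");
  const urls: string[] = [];
  for (let offset = startOffset; offset < total; offset += limit) {
    // Built by hand so the brackets in pos stay unescaped, matching the spec.
    urls.push(
      `${MISAS_SEARCH_URL}?country=es&pos=${MISAS_POS}&offset=${offset}&limit=${limit}`
    );
  }
  return urls;
}
```

With `total = 17919` and the default limit this produces 36 URLs, the last one at `offset=17500`, which matches the request estimate above.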
### Matching Strategy
1. `misasOrgId` exact match (for re-runs)
2. Name + proximity (200m) against existing ES churches
3. Unmatched: create new church with coordinates, country=ES
No mass schedules written — church record only.
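The three-step cascade can be sketched as a pure function. This does not use the real `findDuplicateChurch` from `church-matcher.ts`, whose signature is not shown here; `matchChurch`, `normalizeName`, and the record shapes are hypothetical, and exact equality of accent-stripped names is a simplifying assumption (the real matcher may be fuzzier).

```typescript
interface ExistingChurchRow {
  id: string;
  name: string;
  misasOrgId: string | null;
  latitude: number | null;
  longitude: number | null;
}

interface IncomingChurch {
  misasOrgId: string;
  name: string;
  latitude: number;
  longitude: number;
}

type MatchResult =
  | { kind: "id" | "proximity"; church: ExistingChurchRow }
  | { kind: "new" };

// Great-circle distance in meters (haversine formula).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371000; // mean Earth radius in meters
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Strip accents, case, and punctuation so "Parròquia" equals "Parroquia".
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function matchChurch(
  incoming: IncomingChurch,
  existing: ExistingChurchRow[],
  maxMeters = 200
): MatchResult {
  // Step 1: exact source-ID match (re-runs).
  const byId = existing.find((c) => c.misasOrgId === incoming.misasOrgId);
  if (byId) return { kind: "id", church: byId };
  // Step 2: same normalized name within maxMeters.
  const wanted = normalizeName(incoming.name);
  for (const c of existing) {
    if (c.latitude === null || c.longitude === null) continue;
    if (normalizeName(c.name) !== wanted) continue;
    const d = haversineMeters(incoming.latitude, incoming.longitude, c.latitude, c.longitude);
    if (d <= maxMeters) return { kind: "proximity", church: c };
  }
  // Step 3: nothing matched, so the caller creates a new church.
  return { kind: "new" };
}
```

A linear scan is fine for illustration; at 17,919 incoming records against a large table, a bounding-box query or spatial index would be the more realistic choice.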
### Schema Addition
```prisma
misasOrgId String? @unique @map("misas_org_id")
// @unique already creates an index, so a separate @@index([misasOrgId]) is redundant
```
### CLI
```bash
npx tsx scripts/import-misas.ts --all
npx tsx scripts/import-misas.ts --all --dry-run
npx tsx scripts/import-misas.ts --all --resume-from 5000
npx tsx scripts/import-misas.ts --all --job-id {uuid}
```
### Rate Limiting
- API pagination: 500ms between requests (~36 calls, minimal impact)
---
## Shared Implementation Patterns
Both scripts follow the standard importer pattern:
```typescript
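// Import paths for dotenv, pg, and Prisma are assumed; adjust to the repo's setup
import dotenv from 'dotenv'
import { Pool } from 'pg'
import { PrismaClient } from '@prisma/client'
import { PrismaPg } from '@prisma/adapter-pg'
// church-matcher integration (ExistingChurch interface gets new ID fields added)
import { findDuplicateChurch } from '../src/lib/church-matcher'

// DB setup
dotenv.config(/* ... */)
const pool = new Pool({ connectionString: process.env.DATABASE_URL })
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) })

// Standard flags: --all, --dry-run, --resume-from N, --job-id UUID

// Stats output
const stats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0 }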
```
Both are added to:
- `package.json` scripts
- Scheduler pipeline (sequential imports group)
- `church-matcher.ts` ExistingChurch interface
---
## Estimated Scale
| | Brazil | Spain |
|---|---|---|
| Churches | 8,895 (all new) | 17,919 (~7,000 new vs horariosmisas) |
| Mass times | 28,523 | 0 (no schedule access) |
| Runtime | ~1.5h | ~5 min |
| Coordinates | No (address only) | Yes |