# Design: Brazil (horariodemissa.com.br) + Spain (misas.org) Importers

## Overview

Two parallel importers targeting the highest-value uncovered regions:

- **Brazil**: zero current coverage, 8,895 churches + 28,523 mass times
- **Spain supplement**: 17,919 churches with coordinates (fills gaps vs horariosmisas.com's ~10,000)

---

## Importer 1: import-horariodemissa.ts (Brazil)

### Source

- **Site**: https://horariodemissa.com.br
- **Coverage**: All 26 Brazilian states + DF
- **Data**: 8,895 churches, 28,523 mass times (server-rendered, no auth needed)
- **robots.txt**: Only disallows `/404.php`; otherwise fully permissive

### Enumeration Strategy

Fetch `https://horariodemissa.com.br/sitemap.xml` → extract unique city URLs filtered to `hl=pt` only (~3,552 unique cities).

URL pattern:

```
https://horariodemissa.com.br/search.php?uf={STATE}&cidade={CITY}&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt
```

### HTML Parsing

Each city page contains `.result` divs (server-rendered). Per church:

- **Key**: `href` of `.result_title` link → `igreja.php?k=XXXXX` (alphanumeric, used as `horarioDemissaId`)
- **Name**: `.result_title` link text
- **Address**: text node after the line break in the first block-level element within `.result`
- **Phone**: block-level element containing `Telefone:`
- **Mass schedule**: first table, with rows of the form `DAY:` followed by `TIMES`
- **Confession schedule**: second table (same structure, times as ranges `HH:MM às HH:MM`)

### Day Name Mapping

| Portuguese | dayOfWeek |
|---|---|
| Domingo | 0 (Sunday) |
| Segunda-feira | 1 (Monday) |
| Terça-feira | 2 (Tuesday) |
| Quarta-feira | 3 (Wednesday) |
| Quinta-feira | 4 (Thursday) |
| Sexta-feira | 5 (Friday) |
| Sábado | 6 (Saturday) |
| Primeiro Sábado | 6, notes="Primeiro Sábado" |
| Segundo Domingo | 0, notes="Segundo Domingo" |

Time format: `HH:MM` (24h, already in correct format). Multiple times are comma-separated. Notes in parentheses, e.g. `(Forma Extraordinária do Rito Romano)` → strip and store as `massType` or `notes`.

### Matching Strategy

1. `horarioDemissaId` exact match (for re-runs)
2. Name + proximity (200m) against existing BR churches (some may exist from OSM)
3. Unmatched: create new church, country=BR, no coordinates

### Schema Addition

```prisma
horarioDemissaId String? @unique @map("horario_demissa_id")
@@index([horarioDemissaId])
```

### CLI

```bash
npx tsx scripts/import-horariodemissa.ts --all
npx tsx scripts/import-horariodemissa.ts --all --dry-run
npx tsx scripts/import-horariodemissa.ts --state SP
npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
npx tsx scripts/import-horariodemissa.ts --all --geocode  # Nominatim pass
npx tsx scripts/import-horariodemissa.ts --geocode-only
npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
```

### Rate Limiting

- City pages: 1.5s between requests (~3,552 × 1.5s ≈ 1.5 hours)
- Geocode (optional): 1.1s between Nominatim requests

---

## Importer 2: import-misas.ts (Spain)

### Source

- **Site**: https://misas.org
- **Coverage**: Spain only (despite claiming LatAm, the API returns 0 for MX/AR/CO)
- **Data**: 17,919 churches with coordinates, name, address, province, zip
- **No mass schedules**: detail API returns 401, so church directory only

### API

```
GET https://misas.org/api/parishsearch?country=es&pos=[-3.7038,40.4168,999999]&offset=0&limit=500
```

Response:

```json
{
  "count": 17919,
  "pars": [
    {
      "id": 16604,
      "name": "Parròquia de Sant Lliser",
      "uri": "parroquia-de-sant-lliser-alos-disil",
      "addr": "Carrer Bonabe, 4",
      "loc": "Alòs d'Isil",
      "prov": "Lérida",
      "zip": "25586",
      "lat": "42.701074",
      "long": "1.100028"
    }
  ]
}
```

### Enumeration Strategy

Paginate with `offset` in steps of 500 until all 17,919 churches are fetched (~36 requests). Use Madrid center coordinates with radius=999999 to cover all of Spain.

### Matching Strategy

1. `misasOrgId` exact match (for re-runs)
2. Name + proximity (200m) against existing ES churches
3. Unmatched: create new church with coordinates, country=ES

No mass schedules written; church record only.

### Schema Addition

```prisma
misasOrgId String? @unique @map("misas_org_id")
@@index([misasOrgId])
```

### CLI

```bash
npx tsx scripts/import-misas.ts --all
npx tsx scripts/import-misas.ts --all --dry-run
npx tsx scripts/import-misas.ts --all --resume-from 5000
npx tsx scripts/import-misas.ts --all --job-id {uuid}
```

### Rate Limiting

- API pagination: 500ms between requests (~36 calls, minimal impact)

---

## Shared Implementation Patterns

Both scripts follow the standard importer pattern:

```typescript
// DB setup
dotenv.config(...)
const pool = new Pool({ connectionString: DATABASE_URL })
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) })

// church-matcher integration
import { findDuplicateChurch } from '../src/lib/church-matcher'
// ExistingChurch interface gets new ID fields added

// Standard flags
--all, --dry-run, --resume-from N, --job-id UUID

// Stats output
{ total, created, updated, skipped, errors }
```

Both added to:

- `package.json` scripts
- Scheduler pipeline (sequential imports group)
- `church-matcher.ts` ExistingChurch interface

---

## Estimated Scale

| | Brazil | Spain |
|---|---|---|
| Churches | 8,895 (all new) | 17,919 (~7,000 new vs horariosmisas) |
| Mass times | 28,523 | 0 (no schedule access) |
| Runtime | ~1.5h | ~5 min |
| Coordinates | No (address only) | Yes |
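
The sitemap enumeration described for Importer 1 could look roughly like the sketch below. It assumes the sitemap lists `search.php` URLs inside `<loc>` elements with `&` escaped as `&amp;`, and that a city is uniquely identified by its `uf` + `cidade` pair. The function name `extractCityUrls` is hypothetical and not part of the existing codebase.

```typescript
// Sketch only: assumes <loc> entries hold search.php URLs (not verified against the live sitemap).
function extractCityUrls(sitemapXml: string): string[] {
  const seen = new Set<string>()
  const urls: string[] = []
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g

  for (const match of sitemapXml.matchAll(locPattern)) {
    // XML escapes "&" as "&amp;", so undo that before parsing the URL.
    const raw = match[1].replace(/&amp;/g, '&')

    let url: URL
    try {
      url = new URL(raw)
    } catch {
      continue // skip malformed entries
    }

    // Keep only Portuguese-language city search pages.
    if (!url.pathname.endsWith('/search.php')) continue
    if (url.searchParams.get('hl') !== 'pt') continue

    const uf = url.searchParams.get('uf')
    const cidade = url.searchParams.get('cidade')
    if (!uf || !cidade) continue

    // Deduplicate on state + city (assumed to identify a city page).
    const key = `${uf}|${cidade}`
    if (seen.has(key)) continue
    seen.add(key)
    urls.push(url.toString())
  }
  return urls
}
```

Deduplicating on the query parameters rather than the raw URL string keeps the result stable if the sitemap lists the same city with parameters in a different order.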
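
As an illustration of the Day Name Mapping and time format rules for Importer 1, here is a minimal sketch of the two parsing steps. It assumes ordinal qualifiers such as `Primeiro` or `Segundo` always appear as a single word before the day name, and that parenthesised notes can appear anywhere in the times cell. `parseDayLabel` and `parseTimes` are hypothetical helper names.

```typescript
interface ParsedDay {
  dayOfWeek: number
  notes?: string
}

interface ParsedTimes {
  times: string[]
  notes?: string
}

// Lowercased Portuguese day names mapped to 0 (Sunday) .. 6 (Saturday).
const DAY_OF_WEEK = new Map<string, number>([
  ['domingo', 0],
  ['segunda-feira', 1],
  ['terça-feira', 2],
  ['quarta-feira', 3],
  ['quinta-feira', 4],
  ['sexta-feira', 5],
  ['sábado', 6],
])

const TIME_PATTERN = /^([01]\d|2[0-3]):[0-5]\d$/

function parseDayLabel(raw: string): ParsedDay | null {
  // Normalise Unicode form and drop the trailing colon of a "DAY:" cell.
  const label = raw.normalize('NFC').trim().replace(/:$/, '').trim()
  const key = label.toLowerCase()

  const direct = DAY_OF_WEEK.get(key)
  if (direct !== undefined) return { dayOfWeek: direct }

  // Assumed shape: "<qualifier> <day name>", e.g. "Primeiro Sábado".
  const space = key.indexOf(' ')
  if (space === -1) return null
  const qualified = DAY_OF_WEEK.get(key.slice(space + 1).trim())
  if (qualified === undefined) return null
  return { dayOfWeek: qualified, notes: label }
}

function parseTimes(raw: string): ParsedTimes {
  const notes: string[] = []
  // Pull out "(...)" notes first so commas inside them cannot split a time.
  const withoutNotes = raw.replace(/\(([^)]*)\)/g, (_match: string, inner: string) => {
    notes.push(inner.trim())
    return ''
  })
  const times = withoutNotes
    .split(',')
    .map((t) => t.trim())
    .filter((t) => TIME_PATTERN.test(t))
  return notes.length > 0 ? { times, notes: notes.join('; ') } : { times }
}
```

Unrecognised day labels return `null` rather than a guessed weekday, so the importer can count them as skipped instead of writing a wrong schedule.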
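
The offset pagination described for Importer 2 can be sketched as follows. The response shape (`count`, `pars`) and the URL come from the API section above; everything else is an assumption for illustration, including the names `fetchAllParishes`, `PageFetcher`, and `toCoordinates`, and the use of the global `fetch` available in Node 18+. The page fetcher is injected so the loop can be exercised without hitting the network.

```typescript
interface MisasParish {
  id: number
  name: string
  uri: string
  addr: string
  loc: string
  prov: string
  zip: string
  lat: string // the API returns coordinates as strings
  long: string
}

interface MisasPage {
  count: number
  pars: MisasParish[]
}

type PageFetcher = (offset: number, limit: number) => Promise<MisasPage>

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Madrid center with a very large radius, as in the documented request.
function buildSearchUrl(offset: number, limit: number): string {
  return `https://misas.org/api/parishsearch?country=es&pos=[-3.7038,40.4168,999999]&offset=${offset}&limit=${limit}`
}

// Hypothetical default fetcher using the global fetch.
const fetchPageFromApi: PageFetcher = async (offset, limit) => {
  const res = await fetch(buildSearchUrl(offset, limit))
  if (!res.ok) throw new Error(`misas.org returned HTTP ${res.status}`)
  return (await res.json()) as MisasPage
}

async function fetchAllParishes(
  fetchPage: PageFetcher = fetchPageFromApi,
  limit = 500,
  delayMs = 500,
): Promise<MisasParish[]> {
  const all: MisasParish[] = []
  let offset = 0
  while (true) {
    const page = await fetchPage(offset, limit)
    all.push(...page.pars)
    offset += limit
    // Stop when the reported total is covered or the API returns an empty page.
    if (page.pars.length === 0 || offset >= page.count) break
    await sleep(delayMs) // rate limit between requests
  }
  return all
}

// Convert the string coordinates into numbers before storing them.
function toCoordinates(parish: MisasParish): { lat: number; lon: number } {
  return { lat: Number(parish.lat), lon: Number(parish.long) }
}
```

Stopping on an empty page as well as on `count` guards against an endless loop if the reported total changes while the import is running.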
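
The three-step matching strategy shared by both importers (source ID, then name + 200m proximity, then create) can be illustrated with the sketch below. It does not reproduce `findDuplicateChurch` from `church-matcher.ts`, whose signature is not shown in this document. The types, the exact-equality comparison of normalised names, and the haversine distance are all assumptions made for the example.

```typescript
interface Candidate {
  sourceId: string
  name: string
  lat: number | null
  lon: number | null
}

interface Existing extends Candidate {
  id: string
}

type MatchResult =
  | { kind: 'id'; church: Existing }
  | { kind: 'proximity'; church: Existing }
  | { kind: 'none' }

// Great-circle distance in meters (haversine formula, mean Earth radius).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6_371_000
  const toRad = (deg: number) => (deg * Math.PI) / 180
  const dLat = toRad(lat2 - lat1)
  const dLon = toRad(lon2 - lon1)
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2
  return 2 * R * Math.asin(Math.sqrt(a))
}

// Strip accents, case, and punctuation so "Parròquia" and "parroquia" compare equal.
function normalizeName(name: string): string {
  return name
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, ' ')
    .trim()
}

function matchChurch(candidate: Candidate, existing: Existing[], radiusMeters = 200): MatchResult {
  // Step 1: exact source ID match (re-runs).
  const byId = existing.find((c) => c.sourceId === candidate.sourceId)
  if (byId) return { kind: 'id', church: byId }

  // Step 2: same normalised name within the radius; needs coordinates on both sides.
  if (candidate.lat !== null && candidate.lon !== null) {
    const name = normalizeName(candidate.name)
    for (const c of existing) {
      if (c.lat === null || c.lon === null) continue
      if (normalizeName(c.name) !== name) continue
      if (haversineMeters(candidate.lat, candidate.lon, c.lat, c.lon) <= radiusMeters) {
        return { kind: 'proximity', church: c }
      }
    }
  }

  // Step 3: no match, so the caller creates a new church.
  return { kind: 'none' }
}
```

In this sketch a candidate without coordinates can only match by ID, which is the situation for Brazilian records until an optional geocoding pass has run.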