ScraperControl/docs/superpowers/specs/2026-03-10-brazil-spain-importers-design.md
albertfj114 0e468bcb94 docs: add Brazil + Spain importers design spec and implementation plan
Two new importers:
- horariodemissa.com.br: 8,895 Brazilian churches + 28,523 mass times
- misas.org: 17,919 Spanish churches with coordinates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:50:54 -04:00


Design: Brazil (horariodemissa.com.br) + Spain (misas.org) Importers

Overview

Two parallel importers targeting the highest-value uncovered regions:

  • Brazil — zero current coverage, 8,895 churches + 28,523 mass times
  • Spain supplement — 17,919 churches with coordinates (fills gaps vs horariosmisas.com's ~10,000)

Importer 1: import-horariodemissa.ts (Brazil)

Source

  • Site: https://horariodemissa.com.br
  • Coverage: All 26 Brazilian states + DF
  • Data: 8,895 churches, 28,523 mass times (server-rendered, no auth needed)
  • robots.txt: Only disallows /404.php — fully permissive

Enumeration Strategy

Fetch https://horariodemissa.com.br/sitemap.xml and extract the city search URLs, keeping only the hl=pt variants and de-duplicating them (~3,552 unique cities). URL pattern:

https://horariodemissa.com.br/search.php?uf={STATE}&cidade={CITY}&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt
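
A minimal sketch of the sitemap filtering step, assuming the sitemap lists each city page as a `<loc>` entry in the URL pattern above. The function name `extractCityUrls` and the de-duplication key (`uf` plus `cidade`) are illustrative choices, not part of the spec.

```typescript
// Hypothetical helper: collect one hl=pt search URL per (uf, cidade) pair
// from the raw sitemap XML.
function extractCityUrls(sitemapXml: string): string[] {
  const byCity = new Map<string, string>();
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    // XML escapes "&" as "&amp;" inside <loc>; undo that before parsing.
    const raw = (match[1] ?? "").replace(/&amp;/g, "&");
    let url: URL;
    try {
      url = new URL(raw);
    } catch {
      continue; // skip malformed entries
    }
    if (url.pathname !== "/search.php") continue;
    if (url.searchParams.get("hl") !== "pt") continue;
    const uf = url.searchParams.get("uf");
    const cidade = url.searchParams.get("cidade");
    if (!uf || !cidade) continue;
    const key = `${uf}|${cidade}`;
    if (!byCity.has(key)) byCity.set(key, raw);
  }
  return Array.from(byCity.values());
}
```

Keying on the state and city rather than the full URL means two entries that differ only in parameter order or other query values still count as one city.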

HTML Parsing

Each city page contains .result divs (server-rendered). Per church:

  • Key: href of .result_title link → igreja.php?k=XXXXX (alphanumeric, used as horarioDemissaId)
  • Name: .result_title link text
  • Address: text node after the <br/> in the first <p> within .result
  • Phone: <p> containing Telefone:
  • Mass schedule: first <table> — rows with <td style="text-align:right;font-weight:bold;">DAY:</td><td>TIMES</td>
  • Confession schedule: second <table> (same structure, times as ranges HH:MM às HH:MM)
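
The field-level extraction above could be sketched as follows. This assumes an HTML parser has already yielded the href and text of each element, so only the string handling is shown. The helper names (`extractChurchKey`, `parsePhone`, `parseTimeRange`) are hypothetical.

```typescript
// Hypothetical helper: read the church key from a ".result_title" href such as
// "igreja.php?k=AbC123". Returns null when the href has no k parameter.
function extractChurchKey(href: string): string | null {
  const match = /igreja\.php\?(?:[^#]*&)?k=([A-Za-z0-9]+)/.exec(href);
  return match?.[1] ?? null;
}

// Hypothetical helper: drop the "Telefone:" label from the phone paragraph text.
function parsePhone(paragraphText: string): string | null {
  const match = /Telefone:\s*(.+)/i.exec(paragraphText);
  const phone = match?.[1]?.trim();
  return phone ? phone : null;
}

interface TimeRange {
  start: string;
  end: string;
}

// Hypothetical helper: parse a confession range written as "HH:MM às HH:MM".
function parseTimeRange(text: string): TimeRange | null {
  const match = /(\d{2}:\d{2})\s*às\s*(\d{2}:\d{2})/.exec(text);
  const start = match?.[1];
  const end = match?.[2];
  return start !== undefined && end !== undefined ? { start, end } : null;
}
```

Each helper returns null on unexpected input, so a malformed church block can be counted as skipped instead of aborting the city page.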

Day Name Mapping

| Portuguese      | dayOfWeek                  |
|-----------------|----------------------------|
| Domingo         | 0 (Sunday)                 |
| Segunda-feira   | 1 (Monday)                 |
| Terça-feira     | 2 (Tuesday)                |
| Quarta-feira    | 3 (Wednesday)              |
| Quinta-feira    | 4 (Thursday)               |
| Sexta-feira     | 5 (Friday)                 |
| Sábado          | 6 (Saturday)               |
| Primeiro Sábado | 6, notes="Primeiro Sábado" |
| Segundo Domingo | 0, notes="Segundo Domingo" |

Time format: HH:MM (24-hour, stored as-is). Multiple times are comma-separated. Parenthetical notes, e.g. (Forma Extraordinária do Rito Romano), are stripped from the time string and stored as massType or notes.
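
A minimal sketch of the day and time parsing, under two assumptions that the spec does not state: a qualified label such as "Primeiro Sábado" is resolved by its final word, and a parenthetical note belongs to the time it directly follows. `parseDayLabel` and `parseTimesCell` are hypothetical names.

```typescript
interface DaySlot {
  dayOfWeek: number;
  notes: string | null;
}

const BASE_DAYS = new Map<string, number>([
  ["domingo", 0],
  ["segunda-feira", 1],
  ["terça-feira", 2],
  ["quarta-feira", 3],
  ["quinta-feira", 4],
  ["sexta-feira", 5],
  ["sábado", 6],
]);

// Map a label such as "Domingo:" or "Primeiro Sábado" to a dayOfWeek number.
function parseDayLabel(label: string): DaySlot | null {
  const cleaned = label.normalize("NFC").replace(/:\s*$/, "").trim();
  const lower = cleaned.toLowerCase();
  const exact = BASE_DAYS.get(lower);
  if (exact !== undefined) return { dayOfWeek: exact, notes: null };
  // Assumption: qualified labels end with the plain day name.
  const lastWord = lower.split(/\s+/).pop() ?? "";
  const qualified = BASE_DAYS.get(lastWord);
  return qualified === undefined ? null : { dayOfWeek: qualified, notes: cleaned };
}

interface MassTime {
  time: string;
  notes: string | null;
}

// Split a cell such as "07:00, 09:00 (nota), 19:00" into individual times.
function parseTimesCell(cell: string): MassTime[] {
  const pattern = /(\d{2}:\d{2})(?:\s*\(([^)]*)\))?/g;
  const result: MassTime[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(cell)) !== null) {
    const time = match[1];
    if (time === undefined) continue;
    const note = match[2]?.trim();
    result.push({ time, notes: note ? note : null });
  }
  return result;
}
```

Labels that match no known day return null, which lets the importer log them for review rather than guess a weekday.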

Matching Strategy

  1. horarioDemissaId exact match (for re-runs)
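  2. Name + proximity (200m) against existing BR churches (some may exist from OSM). This source provides no coordinates, so proximity can only be evaluated once an incoming church has been geocoded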
  3. Unmatched: create new church, country=BR, no coordinates
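
The actual matching is delegated to findDuplicateChurch, whose internals this spec does not describe. As an illustration only, a name + 200m proximity check could look like the sketch below. The accent-insensitive name comparison, the haversine distance, and all names here are assumptions.

```typescript
const EARTH_RADIUS_M = 6371000;

// Great-circle distance in meters between two lat/lon points (haversine formula).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Lowercase, strip accents and punctuation so "Paróquia São José" ~ "paroquia sao jose".
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

interface MatchCandidate {
  name: string;
  latitude: number | null;
  longitude: number | null;
}

// Hypothetical check: same normalized name and within maxMeters of each other.
function isLikelySameChurch(a: MatchCandidate, b: MatchCandidate, maxMeters = 200): boolean {
  if (a.latitude === null || a.longitude === null) return false;
  if (b.latitude === null || b.longitude === null) return false;
  if (normalizeName(a.name) !== normalizeName(b.name)) return false;
  return haversineMeters(a.latitude, a.longitude, b.latitude, b.longitude) <= maxMeters;
}
```

A candidate without coordinates never matches by proximity in this sketch, so it would fall through to the "create new church" step.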

Schema Addition

horarioDemissaId  String?  @unique @map("horario_demissa_id")
@@index([horarioDemissaId])

CLI

npx tsx scripts/import-horariodemissa.ts --all
npx tsx scripts/import-horariodemissa.ts --all --dry-run
npx tsx scripts/import-horariodemissa.ts --state SP
npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
npx tsx scripts/import-horariodemissa.ts --all --geocode       # Nominatim pass
npx tsx scripts/import-horariodemissa.ts --geocode-only
npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
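
One way to parse these flags without a CLI library is sketched below. The `ImportOptions` shape, the defaults, and the upper-casing of the state code are assumptions. The spec lists the flags but does not define what the number passed to --resume-from counts, so the sketch only validates it.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  state: string | null;
  resumeFrom: number;
  jobId: string | null;
}

// Return the value following a flag, or fail if it is missing.
function requireValue(argv: readonly string[], index: number, flag: string): string {
  const value = argv[index];
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`${flag} requires a value`);
  }
  return value;
}

// Hypothetical parser for the flags listed above; pass process.argv.slice(2).
function parseImportArgs(argv: readonly string[]): ImportOptions {
  const opts: ImportOptions = {
    all: false, dryRun: false, geocode: false, geocodeOnly: false,
    state: null, resumeFrom: 0, jobId: null,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--all": opts.all = true; break;
      case "--dry-run": opts.dryRun = true; break;
      case "--geocode": opts.geocode = true; break;
      case "--geocode-only": opts.geocodeOnly = true; break;
      case "--state": opts.state = requireValue(argv, ++i, "--state").toUpperCase(); break;
      case "--job-id": opts.jobId = requireValue(argv, ++i, "--job-id"); break;
      case "--resume-from": {
        const n = Number(requireValue(argv, ++i, "--resume-from"));
        if (!Number.isInteger(n) || n < 0) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      default:
        throw new Error(`Unknown argument: ${String(arg)}`);
    }
  }
  return opts;
}
```

Rejecting unknown arguments up front avoids a long run that silently ignores a mistyped flag.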

Rate Limiting

  • City pages: 1.5s between requests (~3,552 × 1.5s ≈ 1.5 hours)
  • Geocode (optional): 1.1s between Nominatim requests
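
The delays above amount to a fixed pause between sequential requests. A rough sketch, with hypothetical helper names, plus the arithmetic behind the 1.5 hour estimate (pause time only, excluding request latency):

```typescript
// Resolve after the given number of milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run the handler on each item in order, pausing delayMs between calls.
async function forEachThrottled<T>(
  items: readonly T[],
  delayMs: number,
  handler: (item: T, index: number) => Promise<void>,
): Promise<void> {
  let index = 0;
  for (const item of items) {
    if (index > 0) await sleep(delayMs);
    await handler(item, index);
    index += 1;
  }
}

// Lower bound on runtime: one pause per request, ignoring network time.
function minimumRuntimeHours(requestCount: number, delayMs: number): number {
  return (requestCount * delayMs) / 3600000;
}
```

For the city pages, `minimumRuntimeHours(3552, 1500)` comes to about 1.48 hours, which is where the ~1.5 hour figure comes from. Actual runtime will be longer by the time spent on each request and on database writes.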

Importer 2: import-misas.ts (Spain)

Source

  • Site: https://misas.org
  • Coverage: Spain only. The site claims Latin American coverage, but the API returns 0 results for MX, AR, and CO
  • Data: 17,919 churches with coordinates, name, address, province, zip
  • No mass schedules: the detail API returns 401, so this source serves as a church directory only

API

GET https://misas.org/api/parishsearch?country=es&pos=[-3.7038,40.4168,999999]&offset=0&limit=500

Response:

{
  "count": 17919,
  "pars": [
    {
      "id": 16604,
      "name": "Parròquia de Sant Lliser",
      "uri": "parroquia-de-sant-lliser-alos-disil",
      "addr": "Carrer Bonabe, 4",
      "loc": "Alòs d'Isil",
      "prov": "Lérida",
      "zip": "25586",
      "lat": "42.701074",
      "long": "1.100028"
    }
  ]
}
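
The response shape above can be typed and mapped roughly as follows. The `MisasParish` fields mirror the sample response. `ChurchDraft` and its field names other than misasOrgId are hypothetical. Note that lat and long arrive as strings, and that id is numeric while misasOrgId is declared as a String, so both need converting.

```typescript
interface MisasParish {
  id: number;
  name: string;
  uri: string;
  addr: string;
  loc: string;
  prov: string;
  zip: string;
  lat: string;
  long: string;
}

interface MisasSearchResponse {
  count: number;
  pars: MisasParish[];
}

// Hypothetical intermediate record; only misasOrgId comes from the spec.
interface ChurchDraft {
  misasOrgId: string;
  name: string;
  address: string;
  city: string;
  province: string;
  postalCode: string;
  latitude: number;
  longitude: number;
  country: "ES";
}

// Convert one API record, or return null when the coordinates are unusable.
function toChurchDraft(parish: MisasParish): ChurchDraft | null {
  const latitude = Number.parseFloat(parish.lat);
  const longitude = Number.parseFloat(parish.long);
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null;
  return {
    misasOrgId: String(parish.id),
    name: parish.name.trim(),
    address: parish.addr.trim(),
    city: parish.loc.trim(),
    province: parish.prov.trim(),
    postalCode: parish.zip.trim(),
    latitude,
    longitude,
    country: "ES",
  };
}
```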

Enumeration Strategy

Paginate with offset in steps of 500 until all 17,919 churches are fetched (~36 requests). In the pos parameter the first two values are Madrid's longitude and latitude and the third is the radius; a radius of 999999 covers all of Spain.
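
A short sketch of the offset arithmetic, using the URL from the API section. `pageOffsets` and `buildPageUrl` are hypothetical helpers. In a real run the total would presumably be read from the count field of the first response rather than hard-coded.

```typescript
const MISAS_SEARCH_URL = "https://misas.org/api/parishsearch";

// Build the search URL for one page, centered on Madrid with the wide radius.
function buildPageUrl(offset: number, limit = 500): string {
  return `${MISAS_SEARCH_URL}?country=es&pos=[-3.7038,40.4168,999999]&offset=${offset}&limit=${limit}`;
}

// List every offset needed to cover `total` records in pages of `limit`.
function pageOffsets(total: number, limit = 500): number[] {
  const offsets: number[] = [];
  for (let offset = 0; offset < total; offset += limit) {
    offsets.push(offset);
  }
  return offsets;
}
```

With 17,919 records and a page size of 500, this yields 36 offsets, the last one being 17500.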

Matching Strategy

  1. misasOrgId exact match (for re-runs)
  2. Name + proximity (200m) against existing ES churches
  3. Unmatched: create new church with coordinates, country=ES

No mass schedules written — church record only.

Schema Addition

misasOrgId  String?  @unique @map("misas_org_id")
@@index([misasOrgId])

CLI

npx tsx scripts/import-misas.ts --all
npx tsx scripts/import-misas.ts --all --dry-run
npx tsx scripts/import-misas.ts --all --resume-from 5000
npx tsx scripts/import-misas.ts --all --job-id {uuid}

Rate Limiting

  • API pagination: 500ms between requests (~36 calls, minimal impact)

Shared Implementation Patterns

Both scripts follow the standard importer pattern:

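// DB setup (imports assume the usual pg + Prisma driver-adapter packages)
import dotenv from 'dotenv'
import { Pool } from 'pg'
import { PrismaClient } from '@prisma/client'
import { PrismaPg } from '@prisma/adapter-pg'

dotenv.config(...)
const pool = new Pool({ connectionString: process.env.DATABASE_URL })
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) })

// church-matcher integration
import { findDuplicateChurch } from '../src/lib/church-matcher'
// ExistingChurch interface gets new ID fields added

// Standard flags: --all, --dry-run, --resume-from N, --job-id UUID

// Stats output: { total, created, updated, skipped, errors }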

Both importers are added to:

  • package.json scripts
  • Scheduler pipeline (sequential imports group)
  • church-matcher.ts ExistingChurch interface

Estimated Scale

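|             | Brazil            | Spain                                |
|-------------|-------------------|--------------------------------------|
| Churches    | 8,895 (all new)   | 17,919 (~7,000 new vs horariosmisas) |
| Mass times  | 28,523            | 0 (no schedule access)               |
| Runtime     | ~1.5h             | ~5 min                               |
| Coordinates | No (address only) | Yes                                  |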