ScraperControl/docs/superpowers/specs/2026-03-12-buscarmisas-network-importer-design.md
albertfj114 ef01616ad8 docs: add design spec for buscarmisas network importer
Covers 7-country Latin America + UK + Switzerland mass times
network (horariosmissa.com.br and sister sites), all sharing
identical WordPress structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 00:16:37 -04:00

Design: BuscarMisas Network Importer

Date: 2026-03-12
Script: scripts/import-buscarmisas-network.ts

Overview

A single importer script covering a network of 7 Catholic mass-time directory sites that share identical WordPress/Yoast structure. All sites use the URL pattern /{region}/{city}/{slug}/ and the same HTML layout with a mass schedule table. One script with a --site <code> flag handles all countries.

Sites

| Code | Domain | Country | Language | Est. Churches |
|------|--------|---------|----------|---------------|
| br | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
| mx | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
| ar | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
| co | buscarmisas.co | Colombia | Spanish | ~1,000 |
| cl | horariomisa.cl | Chile | Spanish | ~1,000 |
| gb | masstime.co.uk | United Kingdom | English | ~1,000 |
| ch | horairemesses.ch | Switzerland | French | ~500 |

Total: ~9,500 new churches across 7 countries (sum of the per-site estimates above)

Architecture

Site Registry

const SITES: Record<string, SiteConfig> = {
  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
  mx: { baseUrl: 'https://buscarmisas.com.mx',  country: 'MX', language: 'es' },
  ar: { baseUrl: 'https://horariosmisa.com.ar',  country: 'AR', language: 'es' },
  co: { baseUrl: 'https://buscarmisas.co',        country: 'CO', language: 'es' },
  cl: { baseUrl: 'https://horariomisa.cl',         country: 'CL', language: 'es' },
  gb: { baseUrl: 'https://masstime.co.uk',         country: 'GB', language: 'en' },
  ch: { baseUrl: 'https://horairemesses.ch',        country: 'CH', language: 'fr' },
};

Each SiteConfig includes a dayMap: Record<string, number> mapping localized day names to 0–6 (Sun–Sat).
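The registry above omits the SiteConfig type and the dayMap field. A minimal sketch of how they might fit together follows; the interface layout and the PT_DAYS constant name are assumptions for illustration, not taken from the script.

```typescript
// Hypothetical shape of SiteConfig; only the four field names come from the spec.
interface SiteConfig {
  baseUrl: string;
  country: string;                 // two-letter country code, e.g. 'BR'
  language: string;                // two-letter language code, e.g. 'pt'
  dayMap: Record<string, number>;  // localized day name -> 0 (Sunday) .. 6 (Saturday)
}

// Portuguese day names as listed under "Day Maps" (constant name is illustrative).
const PT_DAYS: Record<string, number> = {
  'Domingo': 0,
  'Segunda-feira': 1,
  'Terça-feira': 2,
  'Quarta-feira': 3,
  'Quinta-feira': 4,
  'Sexta-feira': 5,
  'Sábado': 6,
};

// One registry entry with the day map attached.
const brSite: SiteConfig = {
  baseUrl: 'https://horariosmissa.com.br',
  country: 'BR',
  language: 'pt',
  dayMap: PT_DAYS,
};
```

Sharing one map constant per language would let the four Spanish-language sites reuse a single table.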

Processing Flow

  1. Sitemap discovery — fetch {baseUrl}/sitemap_index.xml → extract page-sitemap*.xml URLs
  2. URL collection — fetch each page-sitemap → extract church URLs filtering to exactly 3 path segments (/{region}/{city}/{slug}/)
  3. Dedup skip — load existing DB churches for that country; skip URLs whose slug is already stored as source+sourceId
  4. Per-church fetch — GET church page, parse HTML:
    • Name: H1 heading
    • Address: contact info table (Endereço/Dirección/Address/Adresse)
    • City/Region: from URL path segments or table
    • Phone: table row
    • Website: table row
    • Mass schedule: <table> with day column and time column
  5. Geocoding — if no coordinates: Nominatim lookup on {name}, {city}, {country}
  6. Upsert: findDuplicateChurch() match → create or update church + mass schedules
  7. Rate limiting — 3s between HTTP requests; 1.1s between Nominatim requests
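Step 1 of the flow can be sketched as two pure functions over the fetched XML. This is an illustration under assumptions: the function names are hypothetical, and the regex-based `<loc>` extraction is a simplification that ignores CDATA sections and XML entities, which a real XML parser would handle.

```typescript
// Extract every <loc> value from a sitemap or sitemap-index XML string.
// Simplified: assumes plain-text <loc> contents without CDATA or entities.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of xml.matchAll(pattern)) {
    locs.push(match[1]);
  }
  return locs;
}

// Keep only the page sitemaps (page-sitemap.xml, page-sitemap2.xml, ...).
function pageSitemapUrls(indexXml: string): string[] {
  return extractLocs(indexXml).filter((url) => /\/page-sitemap\d*\.xml$/.test(url));
}

// Example sitemap index in the layout Yoast-style sitemaps commonly use.
const sampleIndex = `
<sitemapindex>
  <sitemap><loc>https://horariosmissa.com.br/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://horariosmissa.com.br/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://horariosmissa.com.br/page-sitemap2.xml</loc></sitemap>
</sitemapindex>`;

const discovered = pageSitemapUrls(sampleIndex);
```

Keeping the parsing separate from the HTTP calls makes it straightforward to apply the 3s delay in one place, around whatever fetch helper the script uses.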

Data Extraction

Church page HTML structure (identical across all sites):

<h1>Church Name</h1>
<table>
  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
  <tr><td>Site</td><td>https://...</td></tr>
</table>
<table>  <!-- mass schedule -->
  <tr><th>Dia</th><th>Horário da Missa</th></tr>
  <tr><td>Segunda-feira</td><td>08:00</td></tr>
  ...
</table>

Multiple times per day are comma- or newline-separated within the time cell.
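Splitting a time cell could look like the sketch below. The function name is hypothetical, and two behaviours are assumptions rather than documented rules: tokens that are not in H:MM or HH:MM form are dropped, and single-digit hours are zero-padded. It also assumes the cell's text content has already been extracted from the HTML.

```typescript
// Split a schedule cell such as "08:00, 9:30\n19:00" into normalized HH:MM strings.
function parseTimes(cell: string): string[] {
  return cell
    .split(/[,\r\n]+/)                               // comma- or newline-separated
    .map((part) => part.trim())
    .filter((part) => /^\d{1,2}:\d{2}$/.test(part))  // assumption: drop anything else
    .map((part) => part.padStart(5, '0'));           // "9:30" -> "09:30"
}

const times = parseTimes('08:00, 9:30\n19:00');
```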

Day Maps

  • Portuguese (br): Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
  • Spanish (mx/ar/co/cl): Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
  • English (gb): Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
  • French (ch): Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0
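Scraped day labels may differ from the map keys in case, surrounding whitespace, or accents. The spec does not say how lookups are normalized, so the following is only one possible approach, with hypothetical function names: compare labels after stripping diacritics and lowercasing.

```typescript
// Lowercase, trim, and strip combining diacritics so "SÁBADO " matches "Sábado".
function normalizeDayLabel(label: string): string {
  return label
    .normalize('NFD')                 // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, '')  // remove the combining marks
    .trim()
    .toLowerCase();
}

// Resolve a scraped label to 0 (Sunday) .. 6 (Saturday), or undefined if unknown.
function resolveDay(label: string, dayMap: Record<string, number>): number | undefined {
  const wanted = normalizeDayLabel(label);
  for (const [name, index] of Object.entries(dayMap)) {
    if (normalizeDayLabel(name) === wanted) return index;
  }
  return undefined;
}

// Spanish map as listed above.
const ES_DAYS: Record<string, number> = {
  Lunes: 1, Martes: 2, 'Miércoles': 3, Jueves: 4, Viernes: 5, 'Sábado': 6, Domingo: 0,
};

const wednesday = resolveDay('miercoles', ES_DAYS);
const saturday = resolveDay(' SÁBADO ', ES_DAYS);
```

Returning undefined for an unknown label lets the caller treat the row as a parse failure and skip it.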

URL Exclusion Patterns

Skip non-church pages matching:

  • /about*/, /contact*/, /privacy*/, /cookie*/, /terms*/
  • /blog/, /news*/, /noticias/, /actualidad/
  • Any page whose path does not have exactly 3 segments (for example, state/city index pages)
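A possible filter combining these rules with the three-segment requirement from the processing flow is sketched below. The function and constant names are hypothetical, and matching the exclusion patterns against only the first path segment is an assumption about how the patterns are meant to apply.

```typescript
// Patterns written with a trailing * in the spec are treated as prefixes.
const EXCLUDED_PREFIXES = ['about', 'contact', 'privacy', 'cookie', 'terms', 'news'];
// Patterns without * are matched exactly.
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

function isChurchUrl(url: string): boolean {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  // Church pages follow /{region}/{city}/{slug}/.
  if (segments.length !== 3) return false;
  // Assumption: exclusions are checked against the first segment only.
  const first = segments[0].toLowerCase();
  if (EXCLUDED_EXACT.includes(first)) return false;
  return !EXCLUDED_PREFIXES.some((prefix) => first.startsWith(prefix));
}
```

For example, `isChurchUrl('https://horariomisa.cl/region/ciudad/parroquia-x/')` passes, while a city index page or a three-segment news article does not.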

CLI Interface

npx tsx scripts/import-buscarmisas-network.ts --all --site br
npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}

| Flag | Description |
|------|-------------|
| --site <code> | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
| --all | Run full import |
| --dry-run | Parse only, no DB writes |
| --limit N | Cap churches processed per run |
| --resume-from N | Skip first N church URLs |
| --job-id <uuid> | Bind to background job record |
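The flags above could be parsed roughly as follows. This is a dependency-free sketch: CliOptions, parseCli, and toCount are hypothetical names, and rejecting unknown flags and negative counts is an assumption about the desired behaviour.

```typescript
interface CliOptions {
  site: string;
  all: boolean;
  dryRun: boolean;
  limit?: number;
  resumeFrom?: number;
  jobId?: string;
}

const VALID_SITES = ['br', 'mx', 'ar', 'co', 'cl', 'gb', 'ch'];

// Parse a non-negative integer flag value, or fail with a readable message.
function toCount(flag: string, raw: string | undefined): number {
  const n = Number(raw);
  if (raw === undefined || !Number.isInteger(n) || n < 0) {
    throw new Error(`${flag} expects a non-negative integer`);
  }
  return n;
}

function parseCli(argv: string[]): CliOptions {
  const opts: Partial<CliOptions> = { all: false, dryRun: false };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case '--all': opts.all = true; break;
      case '--dry-run': opts.dryRun = true; break;
      case '--site': opts.site = argv[++i]; break;
      case '--limit': opts.limit = toCount(flag, argv[++i]); break;
      case '--resume-from': opts.resumeFrom = toCount(flag, argv[++i]); break;
      case '--job-id': opts.jobId = argv[++i]; break;
      default: throw new Error(`Unknown flag: ${flag}`);
    }
  }
  // --site is the only required flag.
  if (!opts.site || !VALID_SITES.includes(opts.site)) {
    throw new Error(`--site is required and must be one of ${VALID_SITES.join('/')}`);
  }
  return opts as CliOptions;
}

const parsed = parseCli(['--all', '--site', 'ar', '--resume-from', '500']);
```

In the script itself the input would be `process.argv.slice(2)`.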

Scheduler Integration

7 new sequential phases appended to the imports pipeline group in scripts/scheduler.ts:

{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },

New getJobCommand case:

case 'buscarmisas-network-import': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

Error Handling

  • Failed fetches: log and skip, continue with next church
  • Parse failures: log and skip (don't crash)
  • DB errors during upsert: log and skip
  • Unhandled exception in main(): catch → update job to failed → rethrow (same pattern as other importers)
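The two levels of error handling listed above can be sketched as follows. All names here are hypothetical: importChurch stands in for the per-church fetch, parse, and upsert, and markJobFailed for whatever updates the job record in the other importers.

```typescript
interface RunStats {
  processed: number;
  failed: number;
}

// Per-church level: any failure is logged and skipped so the run continues.
async function processAll(
  urls: string[],
  importChurch: (url: string) => Promise<void>,
  log: (message: string) => void = console.error,
): Promise<RunStats> {
  const stats: RunStats = { processed: 0, failed: 0 };
  for (const url of urls) {
    try {
      await importChurch(url);
      stats.processed++;
    } catch (err) {
      stats.failed++;
      log(`skip ${url}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return stats;
}

// Top level: an unhandled exception marks the job as failed, then propagates.
async function runJob(
  main: () => Promise<void>,
  markJobFailed: (reason: string) => Promise<void>,
): Promise<void> {
  try {
    await main();
  } catch (err) {
    await markJobFailed(err instanceof Error ? err.message : String(err));
    throw err;
  }
}
```

Rethrowing after the job update keeps the non-zero exit status, so a scheduler watching the process would still see the failure.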

Files Modified

  • New: scripts/import-buscarmisas-network.ts
  • Modified: scripts/scheduler.ts — add 7 pipeline phases + getJobCommand case