BuscarMisas Network Importer — Design Spec

Date: 2026-03-16
Status: Approved


Overview

Add a single importer, scripts/import-buscarmisas-network.ts, that scrapes church data and mass schedules from a network of 5 identically structured WordPress-based Catholic mass-time directories covering 5 Latin American countries (~15,294 churches total).


Network Sites

| Domain | Country | Churches | Language | Sitemap Type |
|---|---|---|---|---|
| horariosmissa.com.br | BR (Brazil) | ~4,732 | Portuguese | page-sitemap*.xml |
| buscarmisas.com.mx | MX (Mexico) | ~3,950 | Spanish | page-sitemap*.xml |
| horariosmisa.com.ar | AR (Argentina) | ~3,012 | Spanish | page-sitemap*.xml |
| buscarmisas.co | CO (Colombia) | ~2,665 | Spanish | page-sitemap*.xml |
| horariomisa.cl | CL (Chile) | ~935 | Spanish | post-sitemap.xml |

Schema Migration (prerequisite)

A new column must be added in BethelGuide (schema source of truth) before implementation:

buscarmisasNetworkId  String?  @unique  @map("buscarmisas_network_id")
@@index([buscarmisasNetworkId])

After merging the migration in BethelGuide, copy the updated schema.prisma to ScraperControl and run npx prisma generate.

The external ID format is {domain-slug}/{church-slug}, e.g.: horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios

where domain-slug replaces . with -, and church-slug is the final path segment of the church URL.
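The ID rule above can be sketched as a small pure function. This is a minimal sketch, not the importer's actual code: the function name buildExternalId is hypothetical, the region and city segments in the example URL are made up, and the church slug is kept exactly as it appears in the URL (no decoding), which the spec does not state either way.

```typescript
// Builds the external ID "{domain-slug}/{church-slug}" described above.
function buildExternalId(domain: string, churchUrl: string): string {
  // domain-slug: every "." in the domain becomes "-".
  const domainSlug = domain.replace(/\./g, '-');
  // church-slug: last non-empty path segment (a trailing slash is ignored).
  const segments = new URL(churchUrl).pathname
    .split('/')
    .filter((s) => s.length > 0);
  const churchSlug = segments[segments.length - 1];
  if (churchSlug === undefined) {
    throw new Error(`No path segments in church URL: ${churchUrl}`);
  }
  return `${domainSlug}/${churchSlug}`;
}

// The region and city segments below are invented for illustration.
const exampleId = buildExternalId(
  'horariosmissa.com.br',
  'https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/',
);
console.log(exampleId); // horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios
```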

church-matcher.ts and the ExistingChurch / ChurchCandidate interfaces must be updated to include buscarmisasNetworkId alongside the existing external ID fields, with a corresponding ID-match pass in findDuplicateChurch().
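A rough sketch of what the new ID-match pass could look like. The two interfaces here are minimal stand-ins with hypothetical names; the real ExistingChurch and ChurchCandidate in church-matcher.ts carry more fields, and how findDuplicateChurch() orders its passes is not described in this spec.

```typescript
// Minimal stand-ins for the real interfaces (assumed shapes, illustration only).
interface ExistingChurchSketch {
  id: string;
  buscarmisasNetworkId?: string | null;
}

interface ChurchCandidateSketch {
  buscarmisasNetworkId?: string | null;
}

// Exact-match pass on the external ID; returns undefined when the candidate
// has no ID or no existing church shares it.
function matchByNetworkId(
  candidate: ChurchCandidateSketch,
  existing: ExistingChurchSketch[],
): ExistingChurchSketch | undefined {
  const wanted = candidate.buscarmisasNetworkId;
  if (!wanted) return undefined;
  return existing.find((church) => church.buscarmisasNetworkId === wanted);
}
```

Running an exact ID pass before any fuzzy name or distance comparison keeps re-imports of the same church cheap and unambiguous.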


Architecture

Config Map

interface SiteConfig {
  country: string;   // ISO 3166-1 alpha-2
  language: 'pt' | 'es';
  sitemapType: 'page' | 'post';
}

const NETWORK_SITES: Record<string, SiteConfig> = {
  'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
  'buscarmisas.com.mx':   { country: 'MX', language: 'es', sitemapType: 'page' },
  'horariosmisa.com.ar':  { country: 'AR', language: 'es', sitemapType: 'page' },
  'buscarmisas.co':       { country: 'CO', language: 'es', sitemapType: 'page' },
  'horariomisa.cl':       { country: 'CL', language: 'es', sitemapType: 'post' },
};

CLI Interface

# Single domain
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br

# Single domain with resume (--resume-from only valid with --domain)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 1200

# All domains sequentially (no --resume-from; use --domain for resuming individual runs)
npx tsx scripts/import-buscarmisas-network.ts --all

# Dry run (no DB writes)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run

Validation: If --domain is provided but not present in NETWORK_SITES, exit immediately with a clear error message listing valid domains. --resume-from combined with --all is also an error — exit with usage message.
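The validation rules can be sketched as a pure argument parser that throws on invalid combinations. Names such as parseCliArgs and CliOptions are hypothetical. Requiring exactly one of --domain or --all, and rejecting a non-integer --resume-from, are assumptions that go beyond what the spec states.

```typescript
const VALID_DOMAINS: string[] = [
  'horariosmissa.com.br',
  'buscarmisas.com.mx',
  'horariosmisa.com.ar',
  'buscarmisas.co',
  'horariomisa.cl',
];

interface CliOptions {
  domains: string[];
  resumeFrom: number;
  dryRun: boolean;
}

function parseCliArgs(argv: string[]): CliOptions {
  // Returns the token following a flag, if any.
  const valueOf = (flag: string): string | undefined => {
    const i = argv.indexOf(flag);
    return i >= 0 ? argv[i + 1] : undefined;
  };
  const all = argv.includes('--all');
  const hasDomain = argv.includes('--domain');
  const hasResume = argv.includes('--resume-from');
  const domain = valueOf('--domain');

  if (all && hasResume) {
    throw new Error('Usage: --resume-from is only valid with --domain, not --all');
  }
  if (all === hasDomain) {
    // Assumption: neither flag, or both flags, is treated as a usage error.
    throw new Error('Usage: pass exactly one of --domain <domain> or --all');
  }
  if (!all && (domain === undefined || !VALID_DOMAINS.includes(domain))) {
    throw new Error(
      `Unknown domain "${domain}". Valid domains: ${VALID_DOMAINS.join(', ')}`,
    );
  }
  const resumeFrom = hasResume ? Number(valueOf('--resume-from')) : 0;
  if (!Number.isInteger(resumeFrom) || resumeFrom < 0) {
    throw new Error('Usage: --resume-from expects a non-negative integer');
  }
  return {
    domains: all ? [...VALID_DOMAINS] : [domain as string],
    resumeFrom,
    dryRun: argv.includes('--dry-run'),
  };
}
```

In the script itself this would be called with the process arguments after the script name, and a thrown error would be printed before exiting with a non-zero status.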

The DB source field stores the slug buscarmisas-network for every domain; buscarmisasNetworkId identifies the individual church.


Data Flow

1. Sitemap Discovery

  • Fetch https://{domain}/sitemap_index.xml
  • Extract child sitemap URLs
  • For sitemapType: 'page': collect all page-sitemap*.xml URLs and ignore post-sitemap*.xml
  • For sitemapType: 'post' (Chile): collect post-sitemap.xml only; Chile's page-sitemap.xml lists only city pages and is ignored
  • Fetch each child sitemap, filter to 3-segment church URLs (path segments = /{region}/{city}/{church-slug}/)
  • Collect deduplicated list of church page URLs
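The discovery steps above can be sketched with pure functions over already-fetched XML strings. This is an illustrative sketch with hypothetical function names. It assumes each loc element contains a plain URL (no CDATA) and that the page-sitemap*.xml wildcard stands for an optional numeric suffix.

```typescript
type SitemapType = 'page' | 'post';

// Pulls every <loc> URL out of a sitemap or sitemap index.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    locs.push(m[1]);
  }
  return locs;
}

// Keeps only the child sitemaps relevant for the site's sitemapType.
function selectChildSitemaps(indexXml: string, type: SitemapType): string[] {
  const wanted = type === 'page' ? /^page-sitemap\d*\.xml$/ : /^post-sitemap\.xml$/;
  return extractLocs(indexXml).filter((loc) => {
    const file = new URL(loc).pathname.split('/').pop() ?? '';
    return wanted.test(file);
  });
}

// Keeps URLs whose path has exactly 3 segments (/{region}/{city}/{church-slug}/)
// and removes duplicates while preserving first-seen order.
function collectChurchUrls(sitemapXmls: string[]): string[] {
  const seen = new Set<string>();
  for (const xml of sitemapXmls) {
    for (const loc of extractLocs(xml)) {
      const segments = new URL(loc).pathname.split('/').filter((s) => s.length > 0);
      if (segments.length === 3) seen.add(loc);
    }
  }
  return Array.from(seen);
}
```

Keeping the fetch calls outside these functions makes the filtering rules testable without network access.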

2. Church Page Parsing

For each church URL, fetch the HTML and extract:

| Field | Source |
|---|---|
| Name | Table cell after <strong>Nome</strong> (PT) or <strong>Nombre</strong> (ES) |
| Address | Table cell after <strong>Endereço</strong> (PT) or <strong>Dirección</strong> (ES) |
| Phone | href="tel:..." anchor |
| Latitude/Longitude | Google Maps iframe src, center={lat}%2C{lng} parameter (confirmed present on all 5 network sites; same API key AIzaSyCNTEOso0tZG6YMSJFoaJEY5Th1stEWrJI used across the network) |
| Country | From SiteConfig.country |
| State/Region | 1st path segment of church URL (URL-decoded) |
| City | 2nd path segment of church URL (URL-decoded) |
| Mass schedule | Mon-Sun table rows: day name → time string (skip - entries) |
| External ID | {domain-slug}/{church-slug} |

If center= is absent from the iframe src, skip the church with a warning log. Do not fall back to the q= parameter (it contains a search query, not coordinates).
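The coordinate rule can be sketched as follows. The function name extractCoordinates is hypothetical, and the input is assumed to be the raw iframe src attribute value, possibly with HTML-escaped ampersands. URLSearchParams decodes %2C to a comma, so no manual decoding is needed.

```typescript
interface Coordinates {
  lat: number;
  lng: number;
}

// Returns the coordinates from the center= parameter, or null if it is
// missing or malformed. It deliberately never falls back to q=.
function extractCoordinates(iframeSrc: string): Coordinates | null {
  // HTML attributes often escape "&" as "&amp;"; undo that before parsing.
  const src = iframeSrc.replace(/&amp;/g, '&');
  let center: string | null;
  try {
    center = new URL(src).searchParams.get('center');
  } catch {
    return null; // not an absolute URL
  }
  if (center === null) return null;
  const parts = center.split(',');
  if (parts.length !== 2 || parts.some((p) => p.trim() === '')) return null;
  const lat = Number(parts[0]);
  const lng = Number(parts[1]);
  if (!Number.isFinite(lat) || !Number.isFinite(lng)) return null;
  return { lat, lng };
}
```

A null result corresponds to the "skip the church with a warning log" case; the caller decides how to log it.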

3. Schedule Parsing

  • Use getDayNamesForCountry(config.country) from src/scrapers/i18n/day-names.ts to get the day-name map keyed by country code ('BR', 'MX', etc.)
  • Build patterns with buildDayPatterns(dayNames) and match against the table's day-name cell text
  • Times are comma-separated within a single cell (e.g. 10:00, 18:00) — split on , and create one schedule entry per time
  • - entries indicate no mass that day — skip
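A minimal sketch of the cell-splitting step. It assumes the day name has already been resolved to a canonical key by the project's getDayNamesForCountry and buildDayPatterns helpers, whose signatures are not given here, so they are not called. The function and type names are hypothetical.

```typescript
interface ScheduleEntry {
  day: string;  // canonical day key, resolved elsewhere
  time: string; // time string exactly as it appears in the cell, trimmed
}

// Splits a cell such as "10:00, 18:00" into individual time strings.
// A cell that is empty or just "-" means no mass that day.
function parseTimesCell(cellText: string): string[] {
  const text = cellText.trim();
  if (text === '' || text === '-') return [];
  return text
    .split(',')
    .map((t) => t.trim())
    .filter((t) => t !== '' && t !== '-');
}

// Produces one schedule entry per time, per day row.
function buildScheduleEntries(
  rows: Array<{ day: string; timesCell: string }>,
): ScheduleEntry[] {
  const entries: ScheduleEntry[] = [];
  for (const row of rows) {
    for (const time of parseTimesCell(row.timesCell)) {
      entries.push({ day: row.day, time });
    }
  }
  return entries;
}
```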

4. Upsert to Database

  • Load all existing buscarmisasNetworkId values from DB into a Set at startup — skip already-imported churches (same pattern as import-discovermass.ts)
  • Use findDuplicateChurch() for new churches to detect cross-source duplicates
  • Upsert church with source = 'buscarmisas-network' and buscarmisasNetworkId = externalId
  • Upsert mass schedules linked to church
  • Log progress every 100 churches; support --resume-from {index} (single-domain mode only)
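The skip-and-resume logic can be sketched without a database. The sketch assumes --resume-from is a zero-based index into the deduplicated URL list, which the spec implies but does not state. The Set stands in for the IDs loaded from the DB at startup, and all names are hypothetical.

```typescript
interface PendingChurch {
  index: number;
  url: string;
  externalId: string;
}

// Returns the churches that still need importing, starting at resumeFrom
// and skipping any whose external ID is already in the database.
function selectPending(
  urls: string[],
  importedIds: Set<string>,
  resumeFrom: number,
  toExternalId: (url: string) => string,
): PendingChurch[] {
  const pending: PendingChurch[] = [];
  for (let index = resumeFrom; index < urls.length; index++) {
    const url = urls[index];
    const externalId = toExternalId(url);
    if (importedIds.has(externalId)) continue; // already imported
    pending.push({ index, url, externalId });
  }
  return pending;
}
```

Keeping the original index on each item lets progress logs report a position that can be passed back to --resume-from after an interrupted run.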

5. Rate Limiting

  • 2-second delay between church page fetches
  • 5-second delay between domains when running --all
  • On HTTP 429 or 503: exponential backoff, up to 3 retries, then skip with warning
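A sketch of the retry rule with the fetch and sleep functions injected, so it can be exercised without network access or real delays. The 2000 ms base delay and doubling factor are assumptions, since the spec only says "exponential backoff". Treating other non-OK statuses as an immediate skip is also an assumption. Network errors thrown by the fetch function are left to the caller.

```typescript
interface ResponseLike {
  status: number;
  ok: boolean;
  text(): Promise<string>;
}
type FetchLike = (url: string) => Promise<ResponseLike>;
type SleepFn = (ms: number) => Promise<void>;

// Real sleep for production use; tests can pass a fake instead.
const realSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the response body, or null when the page should be skipped.
async function fetchWithBackoff(
  url: string,
  doFetch: FetchLike,
  sleep: SleepFn = realSleep,
  maxRetries = 3,
  baseDelayMs = 2000,
): Promise<string | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await doFetch(url);
    if (res.ok) return res.text();
    const retryable = res.status === 429 || res.status === 503;
    if (!retryable) {
      console.warn(`Skipping ${url}: HTTP ${res.status}`);
      return null;
    }
    if (attempt < maxRetries) {
      // Delays grow as base, 2x base, 4x base.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  console.warn(`Skipping ${url}: still rate-limited after ${maxRetries} retries`);
  return null;
}
```

The fixed 2-second pause between church pages would sit in the calling loop, separate from this per-request backoff.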

Error Handling

  • Skip churches where center= lat/lng is absent (log warning, continue)
  • Skip churches where name is empty after parsing
  • On fetch error: log and continue to next URL
  • On DB error: log and continue
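The "log and continue" policy can be sketched as a loop that never lets one church abort the run. The names processAll, Outcome, and RunSummary are hypothetical, and processOne stands in for the fetch, parse, and upsert steps of a single church. The progress line every 100 churches follows the upsert section.

```typescript
type Outcome = 'imported' | 'skipped' | 'failed';

interface RunSummary {
  imported: number;
  skipped: number;
  failed: number;
}

async function processAll(
  urls: string[],
  processOne: (url: string) => Promise<Outcome>,
  log: (msg: string) => void = console.log,
): Promise<RunSummary> {
  const summary: RunSummary = { imported: 0, skipped: 0, failed: 0 };
  for (let i = 0; i < urls.length; i++) {
    try {
      // 'skipped' covers missing coordinates or an empty name.
      const outcome = await processOne(urls[i]);
      summary[outcome]++;
    } catch (err) {
      // Fetch and DB errors land here: log and move on to the next URL.
      summary.failed++;
      log(`Error on ${urls[i]}: ${err instanceof Error ? err.message : String(err)}`);
    }
    if ((i + 1) % 100 === 0) {
      log(`Processed ${i + 1}/${urls.length}`);
    }
  }
  return summary;
}
```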

Integration

package.json

"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts"

Scheduler (scripts/scheduler.ts)

Add one PipelinePhase per domain (5 total) so each country can be scheduled and monitored independently. Each phase's type string must match exactly between PIPELINE_GROUPS and the case label in getJobCommand(); because type is a plain String, a mismatch is not caught at build time and only surfaces as an "Unknown job type" error when the job runs. The type field in the DB BackgroundJob model is a plain String, consistent with existing values like 'discovermass-import'.

All 5 phases and their corresponding case blocks:

| Phase type | --domain |
|---|---|
| buscarmisas-network-BR | horariosmissa.com.br |
| buscarmisas-network-MX | buscarmisas.com.mx |
| buscarmisas-network-AR | horariosmisa.com.ar |
| buscarmisas-network-CO | buscarmisas.co |
| buscarmisas-network-CL | horariomisa.cl |

case 'buscarmisas-network-BR':
  return `npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br`;
case 'buscarmisas-network-MX':
  return `npx tsx scripts/import-buscarmisas-network.ts --domain buscarmisas.com.mx`;
// ... etc for AR, CO, CL
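One possible way to reduce the risk of a type-string mismatch is to derive the command from a single lookup table instead of five hand-written case blocks. This is an alternative sketch, not how scripts/scheduler.ts is currently structured. The names NETWORK_PHASES and getNetworkJobCommand are hypothetical.

```typescript
// Single source of truth: phase type -> domain.
const NETWORK_PHASES: Record<string, string> = {
  'buscarmisas-network-BR': 'horariosmissa.com.br',
  'buscarmisas-network-MX': 'buscarmisas.com.mx',
  'buscarmisas-network-AR': 'horariosmisa.com.ar',
  'buscarmisas-network-CO': 'buscarmisas.co',
  'buscarmisas-network-CL': 'horariomisa.cl',
};

function getNetworkJobCommand(type: string): string {
  const domain = NETWORK_PHASES[type];
  if (!domain) {
    throw new Error(`Unknown job type: ${type}`);
  }
  return `npx tsx scripts/import-buscarmisas-network.ts --domain ${domain}`;
}
```

The keys of the same table could also be used to build the phase list, so the two sides cannot drift apart.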

Out of Scope

  • The horairemesses.ch (Switzerland), gottesdienstheute.de (Germany), and masstime.co.uk (UK) network sites are excluded — those countries already have dedicated importers
  • Chile's page-sitemap.xml contains only city pages (not churches) — only post-sitemap.xml is used for Chile