BuscarMisas Network Importer — Design Spec
Date: 2026-03-16 Status: Approved
Overview
Add a single importer scripts/import-buscarmisas-network.ts that scrapes church data and mass schedules from a network of 5 identical WordPress-based Catholic mass-time directories covering 5 Latin American countries (~15,294 churches total).
Network Sites
| Domain | Country | Churches | Language | Sitemap Type |
|---|---|---|---|---|
| horariosmissa.com.br | BR (Brazil) | ~4,732 | Portuguese | page-sitemap*.xml |
| buscarmisas.com.mx | MX (Mexico) | ~3,950 | Spanish | page-sitemap*.xml |
| horariosmisa.com.ar | AR (Argentina) | ~3,012 | Spanish | page-sitemap*.xml |
| buscarmisas.co | CO (Colombia) | ~2,665 | Spanish | page-sitemap*.xml |
| horariomisa.cl | CL (Chile) | ~935 | Spanish | post-sitemap.xml |
Schema Migration (prerequisite)
A new column must be added in BethelGuide (schema source of truth) before implementation:
buscarmisasNetworkId String? @unique @map("buscarmisas_network_id")
@@index([buscarmisasNetworkId])
After merging the migration in BethelGuide, copy the updated schema.prisma to ScraperControl and run npx prisma generate.
The external ID format is {domain-slug}/{church-slug}, e.g.:
horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios
where domain-slug replaces . with -, and church-slug is the final path segment of the church URL.
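The ID rule above can be sketched as a small pure function. The name buildExternalId and its signature are hypothetical, and the region and city segments in the example URL are made up for illustration; only the slug rules come from this spec.

```typescript
// Hypothetical helper: the name and signature are illustrative, not from the codebase.
function buildExternalId(domain: string, churchUrl: string): string {
  // 'horariosmissa.com.br' -> 'horariosmissa-com-br'
  const domainSlug = domain.toLowerCase().replace(/\./g, '-');

  // Final non-empty path segment; filter(Boolean) drops the empty
  // strings produced by leading and trailing slashes.
  const segments = new URL(churchUrl).pathname.split('/').filter(Boolean);
  const churchSlug = segments[segments.length - 1];
  if (!churchSlug) {
    throw new Error(`No church slug in URL: ${churchUrl}`);
  }
  return `${domainSlug}/${churchSlug}`;
}

// The region/city segments below are placeholders.
const exampleId = buildExternalId(
  'horariosmissa.com.br',
  'https://horariosmissa.com.br/sao-paulo/campinas/paroquia-nossa-senhora-dos-remedios/',
);
console.log(exampleId); // horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios
```

Keeping this as a pure function means the importer and church-matcher.ts could share one definition of the format, assuming both need to build or compare IDs.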
church-matcher.ts and the ExistingChurch / ChurchCandidate interfaces must be updated to include buscarmisasNetworkId alongside the existing external ID fields, with a corresponding ID-match pass in findDuplicateChurch().
Architecture
Config Map
interface SiteConfig {
country: string; // ISO 3166-1 alpha-2
language: 'pt' | 'es';
sitemapType: 'page' | 'post';
}
const NETWORK_SITES: Record<string, SiteConfig> = {
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
'buscarmisas.com.mx': { country: 'MX', language: 'es', sitemapType: 'page' },
'horariosmisa.com.ar': { country: 'AR', language: 'es', sitemapType: 'page' },
'buscarmisas.co': { country: 'CO', language: 'es', sitemapType: 'page' },
'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
};
CLI Interface
# Single domain
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br
# Single domain with resume (--resume-from only valid with --domain)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 1200
# All domains sequentially (no --resume-from; use --domain for resuming individual runs)
npx tsx scripts/import-buscarmisas-network.ts --all
# Dry run (no DB writes)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
Validation: If --domain is provided but not present in NETWORK_SITES, exit immediately with a clear error message listing valid domains. --resume-from combined with --all is also an error — exit with usage message.
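A minimal sketch of these validation rules, written as a pure function over already-parsed flags. The CliOptions shape and the function name are assumptions, and the third check (neither --domain nor --all given) is not stated in this spec; it is included only as a plausible default.

```typescript
// Hypothetical shape for already-parsed CLI flags; the real script may differ.
interface CliOptions {
  domain?: string;
  all?: boolean;
  resumeFrom?: number;
  dryRun?: boolean;
}

// Returns an error message, or null when the flag combination is valid.
function validateCliOptions(opts: CliOptions, validDomains: string[]): string | null {
  if (opts.all && opts.resumeFrom !== undefined) {
    return '--resume-from cannot be combined with --all; use --domain to resume a single run';
  }
  if (opts.domain !== undefined && !validDomains.includes(opts.domain)) {
    return `Unknown domain "${opts.domain}". Valid domains: ${validDomains.join(', ')}`;
  }
  // Assumption: running with neither flag is treated as a usage error.
  if (opts.domain === undefined && !opts.all) {
    return 'Specify either --domain <domain> or --all';
  }
  return null;
}
```

The caller would pass Object.keys(NETWORK_SITES) as validDomains, print any returned message, and exit with a non-zero status.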
Source slug stored in DB source field: buscarmisas-network (same value for all domains — the buscarmisasNetworkId distinguishes per-church).
Data Flow
1. Sitemap Discovery
- Fetch https://{domain}/sitemap_index.xml
- Extract child sitemap URLs
- For sitemapType: 'page': collect all page-sitemap*.xml URLs (ignore post-sitemap*.xml and page-sitemap.xml city-only entries for Chile)
- For sitemapType: 'post': collect post-sitemap.xml only
- Fetch each child sitemap, filter to 3-segment church URLs (path segments = /{region}/{city}/{church-slug}/)
- Collect deduplicated list of church page URLs
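The selection and filtering steps above might look like the following sketch. All three function names are hypothetical, and the reading of page-sitemap*.xml as "page-sitemap, an optional numeric suffix, then .xml" is an assumption about how the sites name their child sitemaps.

```typescript
type SitemapType = 'page' | 'post';

// Pick the child sitemaps to crawl from the URLs listed in sitemap_index.xml.
function selectChildSitemaps(childUrls: string[], sitemapType: SitemapType): string[] {
  const pattern =
    sitemapType === 'page'
      ? /\/page-sitemap\d*\.xml$/ // page-sitemap.xml, page-sitemap2.xml, ... (assumed naming)
      : /\/post-sitemap\.xml$/; // post-sitemap.xml only
  return childUrls.filter((url) => pattern.test(new URL(url).pathname));
}

// A church URL has exactly three path segments: /{region}/{city}/{church-slug}/
function isChurchUrl(url: string): boolean {
  return new URL(url).pathname.split('/').filter(Boolean).length === 3;
}

// Keep only church URLs and drop duplicates, preserving first-seen order.
function collectChurchUrls(sitemapUrls: string[]): string[] {
  return Array.from(new Set(sitemapUrls.filter(isChurchUrl)));
}
```

Because a Set preserves insertion order, the resulting list is stable across runs as long as the sitemaps themselves do not change, which matters if --resume-from is interpreted as an index into this list.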
2. Church Page Parsing
For each church URL, fetch the HTML and extract:
| Field | Source |
|---|---|
| Name | Table cell after <strong>Nome</strong> (PT) or <strong>Nombre</strong> (ES) |
| Address | Table cell after <strong>Endereço</strong> (PT) or <strong>Dirección</strong> (ES) |
| Phone | href="tel:..." anchor |
| Latitude/Longitude | Google Maps iframe src — center={lat}%2C{lng} parameter (confirmed present on all 5 network sites; same API key AIzaSyCNTEOso0tZG6YMSJFoaJEY5Th1stEWrJI used across the network) |
| Country | From SiteConfig.country |
| State/Region | 1st path segment of church URL (URL-decoded) |
| City | 2nd path segment of church URL (URL-decoded) |
| Mass schedule | Mon–Sun table rows: day name → time string (skip - entries) |
| External ID | {domain-slug}/{church-slug} |
If center= is absent from the iframe src, skip the church with a warning log. Do not fall back to the q= parameter (it contains a search query, not coordinates).
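A sketch of the coordinate extraction, assuming the iframe src has already been pulled out of the HTML as a string. The function name, the LatLng type, and the range check are assumptions; the example URL is illustrative and uses a placeholder key.

```typescript
interface LatLng {
  lat: number;
  lng: number;
}

// Returns null when center= is absent or malformed, so the caller can
// log a warning and skip the church. The q= parameter is never consulted.
function extractCenter(iframeSrc: string): LatLng | null {
  // Accepts the encoded (%2C) and literal comma separators. The ';' in the
  // leading class covers '&amp;center=' when the attribute was not entity-decoded.
  const match = /[?&;]center=(-?\d+(?:\.\d+)?)(?:%2C|,)(-?\d+(?:\.\d+)?)/i.exec(iframeSrc);
  if (!match) return null;

  const lat = Number(match[1]);
  const lng = Number(match[2]);
  // Assumption: out-of-range values are treated the same as a missing parameter.
  if (Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
  return { lat, lng };
}

const exampleSrc =
  'https://www.google.com/maps/embed/v1/place?key=KEY&q=Igreja&center=-23.5505%2C-46.6333&zoom=15';
console.log(extractCenter(exampleSrc)); // { lat: -23.5505, lng: -46.6333 }
```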
3. Schedule Parsing
- Use getDayNamesForCountry(config.country) from src/scrapers/i18n/day-names.ts to get the day-name map keyed by country code ('BR', 'MX', etc.)
- Build patterns with buildDayPatterns(dayNames) and match against the table's day-name cell text
- Times are comma-separated within a single cell (e.g. 10:00, 18:00): split on , and create one schedule entry per time
- A - entry indicates no mass that day: skip it
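The time-cell handling can be sketched as below. It leaves day-name matching to the existing getDayNamesForCountry and buildDayPatterns helpers, whose signatures are not reproduced here. The function name is hypothetical, and dropping any fragment that is not in H:MM or HH:MM form is an assumption about how malformed cells should be treated.

```typescript
// Split a schedule cell such as "10:00, 18:00" into individual times.
// A cell containing only "-" means no mass that day and yields an empty list.
function parseTimesCell(cellText: string): string[] {
  const text = cellText.trim();
  if (text === '' || text === '-') return [];
  return text
    .split(',')
    .map((part) => part.trim())
    // Assumption: fragments that are not H:MM or HH:MM are dropped.
    .filter((part) => /^\d{1,2}:\d{2}$/.test(part));
}

console.log(parseTimesCell('10:00, 18:00')); // [ '10:00', '18:00' ]
console.log(parseTimesCell('-')); // []
```

Each returned time would become one schedule entry for the matched day.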
4. Upsert to Database
- Load all existing buscarmisasNetworkId values from DB into a Set at startup; skip already-imported churches (same pattern as import-discovermass.ts)
- Use findDuplicateChurch() for new churches to detect cross-source duplicates
- Upsert church with source = 'buscarmisas-network' and buscarmisasNetworkId = externalId
- Upsert mass schedules linked to church
- Log progress every 100 churches; support --resume-from {index} (single-domain mode only)
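The skip-set and resume behaviour could be sketched as follows, with the database left out entirely. The helper name and types are hypothetical, and the sketch assumes --resume-from is a 0-based index into the deduplicated church URL list, which this spec does not state explicitly.

```typescript
interface PendingChurch {
  index: number; // position in the deduplicated URL list (assumed meaning of --resume-from)
  url: string;
  externalId: string;
}

// Hypothetical helper: returns the churches that still need fetching.
function selectPending(
  churchUrls: string[],
  knownIds: Set<string>,
  toExternalId: (url: string) => string,
  resumeFrom = 0,
): PendingChurch[] {
  const pending: PendingChurch[] = [];
  for (let index = resumeFrom; index < churchUrls.length; index++) {
    const url = churchUrls[index];
    const externalId = toExternalId(url);
    // Already imported in an earlier run: skip without fetching the page.
    if (knownIds.has(externalId)) continue;
    pending.push({ index, url, externalId });
  }
  return pending;
}
```

Carrying the original index on each item lets the progress log print a value that can be passed straight back to --resume-from after an interrupted run.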
5. Rate Limiting
- 2-second delay between church page fetches
- 5-second delay between domains when running --all
- On HTTP 429 or 503: exponential backoff, up to 3 retries, then skip with warning
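A sketch of the retry rule, with the fetcher and sleep function injected so the logic can be exercised without network access. The names, the 2-second base delay, and the doubling factor are assumptions; this spec fixes only the status codes and the retry count.

```typescript
type Fetcher = (url: string) => Promise<{ status: number; text(): Promise<string> }>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the response body, or null when the page should be skipped.
// Errors thrown by the fetcher propagate to the caller, which logs and moves on.
async function fetchWithBackoff(
  url: string,
  fetcher: Fetcher,
  sleep: Sleep = defaultSleep,
  maxRetries = 3,
  baseDelayMs = 2000, // assumed base delay
): Promise<string | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetcher(url);
    if (res.status !== 429 && res.status !== 503) {
      // Other non-2xx statuses are not retried; the church is skipped.
      return res.status >= 200 && res.status < 300 ? res.text() : null;
    }
    if (attempt < maxRetries) {
      await sleep(baseDelayMs * 2 ** attempt); // 2s, 4s, 8s with the defaults
    }
  }
  console.warn(`Skipping ${url}: still rate-limited after ${maxRetries} retries`);
  return null;
}
```

On Node 18 or later the global fetch can be passed directly as the fetcher. The fixed 2-second delay between church pages would sit in the calling loop, separate from this backoff.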
Error Handling
- Skip churches where center= lat/lng is absent (log warning, continue)
- Skip churches where name is empty after parsing
- On fetch error: log and continue to next URL
- On DB error: log and continue
Integration
package.json
"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts"
Scheduler (scripts/scheduler.ts)
Add one PipelinePhase per domain (5 total) so each country can be scheduled and monitored independently. Each phase's type string must match exactly between PIPELINE_GROUPS and the case label in getJobCommand(); a mismatch is not caught at compile time and only surfaces at run time as an "Unknown job type" error. The type field in the DB BackgroundJob model is a plain String, consistent with existing values like 'discovermass-import'.
All 5 phases and their corresponding case blocks:
| Phase type | --domain |
|---|---|
| buscarmisas-network-BR | horariosmissa.com.br |
| buscarmisas-network-MX | buscarmisas.com.mx |
| buscarmisas-network-AR | horariosmisa.com.ar |
| buscarmisas-network-CO | buscarmisas.co |
| buscarmisas-network-CL | horariomisa.cl |
case 'buscarmisas-network-BR':
return `npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br`;
case 'buscarmisas-network-MX':
return `npx tsx scripts/import-buscarmisas-network.ts --domain buscarmisas.com.mx`;
// ... etc for AR, CO, CL
Out of Scope
- The horairemesses.ch (Switzerland), gottesdienstheute.de (Germany), and masstime.co.uk (UK) network sites are excluded; those countries already have dedicated importers
- Chile's page-sitemap.xml contains only city pages (not churches), so only post-sitemap.xml is used for Chile