Covers 7-country Latin America + UK + Switzerland mass times network (horariosmissa.com.br and sister sites), all sharing identical WordPress structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design: BuscarMisas Network Importer
Date: 2026-03-12
Script: scripts/import-buscarmisas-network.ts
Overview
A single importer script covering a network of 7 Catholic mass-time directory sites that share identical WordPress/Yoast structure. All sites use the URL pattern /{region}/{city}/{slug}/ and the same HTML layout with a mass schedule table. One script with a --site <code> flag handles all countries.
Sites
| Code | Domain | Country | Language | Est. Churches |
|---|---|---|---|---|
| br | horariosmissa.com.br | Brazil | Portuguese | ~2,000 |
| mx | buscarmisas.com.mx | Mexico | Spanish | ~2,000 |
| ar | horariosmisa.com.ar | Argentina | Spanish | ~2,000 |
| co | buscarmisas.co | Colombia | Spanish | ~1,000 |
| cl | horariomisa.cl | Chile | Spanish | ~1,000 |
| gb | masstime.co.uk | United Kingdom | English | ~1,000 |
| ch | horairemesses.ch | Switzerland | French | ~500 |
Total: ~9,500 new churches across 7 countries (sum of the per-site estimates above)
Architecture
Site Registry
```typescript
const SITES: Record<string, SiteConfig> = {
  br: { baseUrl: 'https://horariosmissa.com.br', country: 'BR', language: 'pt' },
  mx: { baseUrl: 'https://buscarmisas.com.mx', country: 'MX', language: 'es' },
  ar: { baseUrl: 'https://horariosmisa.com.ar', country: 'AR', language: 'es' },
  co: { baseUrl: 'https://buscarmisas.co', country: 'CO', language: 'es' },
  cl: { baseUrl: 'https://horariomisa.cl', country: 'CL', language: 'es' },
  gb: { baseUrl: 'https://masstime.co.uk', country: 'GB', language: 'en' },
  ch: { baseUrl: 'https://horairemesses.ch', country: 'CH', language: 'fr' },
};
```
Each SiteConfig includes a dayMap: Record<string, number> mapping localized day names to 0–6 (Sun–Sat).
Processing Flow
- Sitemap discovery: fetch `{baseUrl}/sitemap_index.xml` → extract `page-sitemap*.xml` URLs
- URL collection: fetch each page-sitemap → extract church URLs, filtering to exactly 3 path segments (`/{region}/{city}/{slug}/`)
- Dedup skip: load existing DB churches for that country; skip URLs whose slug is already stored as `source` + `sourceId`
- Per-church fetch: GET church page, parse HTML:
  - Name: H1 heading
  - Address: contact info table (`Endereço` / `Dirección` / `Address` / `Adresse`)
  - City/Region: from URL path segments or table
  - Phone: table row
  - Website: table row
  - Mass schedule: `<table>` with day column and time column
- Geocoding: if no coordinates, Nominatim lookup on `{name}, {city}, {country}`
- Upsert: `findDuplicateChurch()` match → create or update church + mass schedules
- Rate limiting: 3s between HTTP requests; 1.1s between Nominatim requests
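
The first step of the flow (sitemap discovery) could be sketched roughly as follows. The helper names `extractLocs` and `pageSitemapUrls` are hypothetical and not taken from the script. The sketch assumes the sitemaps contain plain `<loc>…</loc>` entries without CDATA wrappers; a real XML parser would be the safer choice if that assumption does not hold.

```typescript
// Hypothetical helper: collect every <loc> value from a sitemap XML string.
// Assumes flat <loc>URL</loc> entries with no CDATA wrapping.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Hypothetical helper: from sitemap_index.xml, keep only the page-sitemap files
// (page-sitemap.xml, page-sitemap2.xml, ...), dropping post/category sitemaps.
function pageSitemapUrls(indexXml: string): string[] {
  return extractLocs(indexXml).filter((url) => /\/page-sitemap\d*\.xml$/.test(url));
}
```

The same `extractLocs` helper could then be reused on each page-sitemap to collect candidate church URLs before filtering.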
Data Extraction
Church page HTML structure (identical across all sites):
```html
<h1>Church Name</h1>
<table>
  <tr><td>Endereço</td><td>Rua X, 123, City</td></tr>
  <tr><td>Telefone</td><td>+55 11 1234-5678</td></tr>
  <tr><td>Site</td><td>https://...</td></tr>
</table>
<table> <!-- mass schedule -->
  <tr><th>Dia</th><th>Horário da Missa</th></tr>
  <tr><td>Segunda-feira</td><td>08:00</td></tr>
  ...
</table>
```
Multiple times per day are comma- or newline-separated within the time cell.
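
A minimal sketch of splitting such a time cell is shown below. The function name `parseTimeCell` is hypothetical. It assumes the cell text has already been extracted from the HTML (with any `<br>` converted to a newline) and that times are written as `H:MM` or `HH:MM` in 24-hour form; anything else is dropped rather than guessed at.

```typescript
// Hypothetical helper: turn the text of a schedule time cell into "HH:MM" strings.
// Entries that do not look like a valid 24-hour time are silently discarded.
function parseTimeCell(cell: string): string[] {
  return cell
    .split(/[,\n]/)                                   // comma- or newline-separated
    .map((part) => /^(\d{1,2}):(\d{2})$/.exec(part.trim()))
    .filter((m): m is RegExpExecArray => m !== null)  // keep only H:MM / HH:MM
    .filter((m) => Number(m[1]) <= 23 && Number(m[2]) <= 59)
    .map((m) => `${m[1].padStart(2, '0')}:${m[2]}`);  // zero-pad the hour
}

// Example: parseTimeCell('08:00, 9:30\n19:00') yields ['08:00', '09:30', '19:00'].
```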
Day Maps
- Portuguese (br): Segunda-feira=1, Terça-feira=2, Quarta-feira=3, Quinta-feira=4, Sexta-feira=5, Sábado=6, Domingo=0
- Spanish (mx/ar/co/cl): Lunes=1, Martes=2, Miércoles=3, Jueves=4, Viernes=5, Sábado=6, Domingo=0
- English (gb): Monday=1, Tuesday=2, Wednesday=3, Thursday=4, Friday=5, Saturday=6, Sunday=0
- French (ch): Lundi=1, Mardi=2, Mercredi=3, Jeudi=4, Vendredi=5, Samedi=6, Dimanche=0
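
One way these maps might be attached to a site and queried is sketched below. The `SiteConfig` fields come from the registry and the `dayMap` description above; the lowercase keys, the `DAY_MAP_PT` constant, and the `dayOfWeek` helper are assumptions made for illustration. Lowercasing and NFC normalization are used so that `Sábado`, `sábado`, and a decomposed `á` all resolve to the same key.

```typescript
interface SiteConfig {
  baseUrl: string;
  country: string;
  language: string;
  dayMap: Record<string, number>; // localized day name -> 0-6 (Sun-Sat)
}

// Assumed representation: keys stored lowercase so lookups can be case-insensitive.
const DAY_MAP_PT: Record<string, number> = {
  'segunda-feira': 1,
  'terça-feira': 2,
  'quarta-feira': 3,
  'quinta-feira': 4,
  'sexta-feira': 5,
  'sábado': 6,
  'domingo': 0,
};

// Hypothetical helper: resolve a table label to a day index, or undefined if unknown.
function dayOfWeek(label: string, dayMap: Record<string, number>): number | undefined {
  const key = label.trim().normalize('NFC').toLowerCase();
  // hasOwnProperty guards against inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(dayMap, key) ? dayMap[key] : undefined;
}
```

Returning `undefined` for unknown labels lets the caller log and skip a row instead of storing a wrong day.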
URL Exclusion Patterns
Skip non-church pages matching:
- `/about*/`, `/contact*/`, `/privacy*/`, `/cookie*/`, `/terms*/`
- `/blog/`, `/news*/`, `/noticias/`, `/actualidad/`
- Any page with < 3 path segments (state/city index pages)
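
The exclusion rules, combined with the 3-segment requirement from the processing flow, might look like the sketch below. The name `isChurchUrl` is hypothetical, and applying the patterns to the first path segment only is an assumption; the design does not say which segment they are matched against.

```typescript
// Patterns with a trailing * in the list above are treated as prefixes.
const EXCLUDED_PREFIXES = ['about', 'contact', 'privacy', 'cookie', 'terms', 'news'];
// Patterns without a wildcard are treated as exact segment matches.
const EXCLUDED_EXACT = ['blog', 'noticias', 'actualidad'];

// Hypothetical helper: true only for /{region}/{city}/{slug}/ style URLs
// whose first segment is not one of the excluded site sections.
function isChurchUrl(url: string): boolean {
  let pathname: string;
  try {
    pathname = new URL(url).pathname;
  } catch {
    return false; // not a parseable absolute URL
  }
  const segments = pathname.split('/').filter((s) => s.length > 0);
  if (segments.length !== 3) return false; // index pages and deeper paths
  const first = segments[0].toLowerCase();
  if (EXCLUDED_EXACT.includes(first)) return false;
  return !EXCLUDED_PREFIXES.some((prefix) => first.startsWith(prefix));
}
```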
CLI Interface
```bash
npx tsx scripts/import-buscarmisas-network.ts --all --site br
npx tsx scripts/import-buscarmisas-network.ts --all --site mx --dry-run
npx tsx scripts/import-buscarmisas-network.ts --all --site gb --limit 300
npx tsx scripts/import-buscarmisas-network.ts --all --site ar --resume-from 500
npx tsx scripts/import-buscarmisas-network.ts --all --site co --job-id {uuid}
```
| Flag | Description |
|---|---|
| `--site <code>` | Required. Which site to import (br/mx/ar/co/cl/gb/ch) |
| `--all` | Run full import |
| `--dry-run` | Parse only, no DB writes |
| `--limit N` | Cap churches processed per run |
| `--resume-from N` | Skip first N church URLs |
| `--job-id <uuid>` | Bind to background job record |
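
As an illustration of how these flags could be parsed, here is a small hand-rolled sketch. The `CliOptions` shape and the `parseCliArgs` and `toCount` helpers are hypothetical; the actual script may well use a library or a different structure. The flag names and the list of site codes are taken from the table above.

```typescript
interface CliOptions {
  site: string;
  all: boolean;
  dryRun: boolean;
  limit?: number;
  resumeFrom?: number;
  jobId?: string;
}

const SITE_CODES = ['br', 'mx', 'ar', 'co', 'cl', 'gb', 'ch'];

// Parse a non-negative integer flag value, rejecting missing or malformed input.
function toCount(flag: string, raw: string | undefined): number {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new Error(`${flag} expects a non-negative integer`);
  }
  return Number(raw);
}

// argv is expected without the node/script prefix, e.g. process.argv.slice(2).
function parseCliArgs(argv: string[]): CliOptions {
  let site: string | undefined;
  const opts: Omit<CliOptions, 'site'> = { all: false, dryRun: false };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case '--all': opts.all = true; break;
      case '--dry-run': opts.dryRun = true; break;
      case '--site': site = argv[++i]; break;
      case '--limit': opts.limit = toCount(flag, argv[++i]); break;
      case '--resume-from': opts.resumeFrom = toCount(flag, argv[++i]); break;
      case '--job-id': opts.jobId = argv[++i]; break;
      default: throw new Error(`Unknown flag: ${flag}`);
    }
  }
  if (site === undefined || !SITE_CODES.includes(site)) {
    throw new Error(`--site is required and must be one of ${SITE_CODES.join('/')}`);
  }
  return { site, ...opts };
}
```

Failing fast on an unknown site code or a malformed number keeps a scheduler misconfiguration from silently importing nothing.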
Scheduler Integration
7 new sequential phases appended to the imports pipeline group in scripts/scheduler.ts:
```typescript
{ name: 'buscarmisas-br', type: 'buscarmisas-network-import', config: { site: 'br', limit: 300 } },
{ name: 'buscarmisas-mx', type: 'buscarmisas-network-import', config: { site: 'mx', limit: 300 } },
{ name: 'buscarmisas-ar', type: 'buscarmisas-network-import', config: { site: 'ar', limit: 300 } },
{ name: 'buscarmisas-co', type: 'buscarmisas-network-import', config: { site: 'co', limit: 300 } },
{ name: 'buscarmisas-cl', type: 'buscarmisas-network-import', config: { site: 'cl', limit: 300 } },
{ name: 'buscarmisas-gb', type: 'buscarmisas-network-import', config: { site: 'gb', limit: 300 } },
{ name: 'buscarmisas-ch', type: 'buscarmisas-network-import', config: { site: 'ch', limit: 300 } },
```
New getJobCommand case:
```typescript
case 'buscarmisas-network-import': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--all'];
  if (config?.site) args.push('--site', String(config.site));
  if (config?.limit) args.push('--limit', String(config.limit));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
Error Handling
- Failed fetches: log and skip, continue with next church
- Parse failures: log and skip (don't crash)
- DB errors during upsert: log and skip
- Unhandled exception in `main()`: catch → update job to `failed` → rethrow (same pattern as other importers)
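
The two levels of handling listed above (skip per church, fail the job on anything unhandled) could be sketched as follows. All names here (`processAll`, `runJob`, `importOne`, `markFailed`) are hypothetical placeholders; how the real script updates the job record is not specified in this design.

```typescript
interface ImportStats {
  processed: number;
  skipped: number;
}

// Per-church level: any fetch, parse, or DB error is logged and the loop continues.
async function processAll(
  urls: string[],
  importOne: (url: string) => Promise<void>, // placeholder for fetch + parse + upsert
): Promise<ImportStats> {
  const stats: ImportStats = { processed: 0, skipped: 0 };
  for (const url of urls) {
    try {
      await importOne(url);
      stats.processed++;
    } catch (err) {
      stats.skipped++;
      console.error(`Skipping ${url}:`, err instanceof Error ? err.message : err);
    }
  }
  return stats;
}

// Top level: an unhandled exception marks the job as failed, then is rethrown.
async function runJob(
  main: () => Promise<void>,
  markFailed: (reason: string) => Promise<void>, // placeholder for the job-record update
): Promise<void> {
  try {
    await main();
  } catch (err) {
    await markFailed(err instanceof Error ? err.message : String(err));
    throw err;
  }
}
```

Rethrowing after `markFailed` keeps the process exit code non-zero, so the scheduler can also see the failure.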
Files Modified
- New: `scripts/import-buscarmisas-network.ts`
- Modified: `scripts/scheduler.ts` (add 7 pipeline phases + `getJobCommand` case)