From f1a0d458e4aefd3e8dbdb309c0eeb3c58a7c12e5 Mon Sep 17 00:00:00 2001 From: albertfj114 Date: Tue, 17 Mar 2026 12:21:36 -0400 Subject: [PATCH] docs: add BuscarMisas network importer implementation plan --- ...2026-03-17-buscarmisas-network-importer.md | 1051 +++++++++++++++++ 1 file changed, 1051 insertions(+) create mode 100644 docs/superpowers/plans/2026-03-17-buscarmisas-network-importer.md diff --git a/docs/superpowers/plans/2026-03-17-buscarmisas-network-importer.md b/docs/superpowers/plans/2026-03-17-buscarmisas-network-importer.md new file mode 100644 index 0000000..dbb039b --- /dev/null +++ b/docs/superpowers/plans/2026-03-17-buscarmisas-network-importer.md @@ -0,0 +1,1051 @@ +# BuscarMisas Network Importer — Implementation Plan + +> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a single config-driven importer that scrapes ~15,294 Catholic churches and mass schedules from 5 Latin American WordPress-based directories (Brazil, Mexico, Argentina, Colombia, Chile). + +**Architecture:** A `NETWORK_SITES` config map drives a single `import-buscarmisas-network.ts` script. Church HTML parsing extracts name, address, phone, coordinates, and weekly schedule. The external ID `{domain-slug}/{church-slug}` stored in a new `buscarmisasNetworkId` column prevents duplicate inserts on re-runs. + +**Tech Stack:** TypeScript, tsx, Prisma 7 + pg adapter, existing `church-matcher.ts` + `day-names.ts` utilities. + +--- + +## Chunk 1: Schema prerequisite + church-matcher update + +### Task 1: Add `buscarmisasNetworkId` to BethelGuide schema + +> ⚠️ BethelGuide is the schema source of truth. Never run `prisma migrate` in ScraperControl. 
+ +**Files:** +- Modify (in BethelGuide repo): `prisma/schema.prisma` +- Modify (in BethelGuide repo): migration SQL file + +- [ ] **Step 1: In BethelGuide, open `prisma/schema.prisma` and add the column to the `Church` model** + +Add after the existing `discovermassId` line: +```prisma +buscarmisasNetworkId String? @unique @map("buscarmisas_network_id") +``` + +And add to the `@@index` block at the bottom of the `Church` model: +```prisma +@@index([buscarmisasNetworkId]) +``` + +- [ ] **Step 2: In BethelGuide, create and run the migration** + +```bash +npx prisma migrate dev --name add_buscarmisas_network_id +``` + +Expected: migration file created, column added to the shared PostgreSQL database. + +- [ ] **Step 3: Sync the updated schema to ScraperControl** + +```bash +cp prisma/schema.prisma ~/Documents/ScraperControl/prisma/schema.prisma +``` + +- [ ] **Step 4: Regenerate Prisma client in ScraperControl** + +```bash +cd ~/Documents/ScraperControl +npx prisma generate +``` + +Expected: no errors, `@prisma/client` regenerated with `buscarmisasNetworkId` field. + +- [ ] **Step 5: Verify the field is available** + +```bash +npx tsx -e " +import { Pool } from 'pg'; +import { PrismaPg } from '@prisma/adapter-pg'; +import { PrismaClient } from '@prisma/client'; +const pool = new Pool({ connectionString: process.env.DATABASE_URL }); +const prisma = new PrismaClient({ adapter: new PrismaPg(pool) }); +prisma.church.findFirst({ select: { buscarmisasNetworkId: true } }).then(r => { + console.log('buscarmisasNetworkId field present:', JSON.stringify(r)); + return prisma.\$disconnect().then(() => pool.end()); +}); +" +``` + +Expected: prints `buscarmisasNetworkId field present: null` or `{}` (not a type error). 
+ +- [ ] **Step 6: Commit in ScraperControl** + +```bash +git add prisma/schema.prisma +git commit -m "chore: sync schema — add buscarmisasNetworkId column" +``` + +--- + +### Task 2: Update `church-matcher.ts` with new field + ID-match pass + +**Files:** +- Modify: `src/lib/church-matcher.ts` + +- [ ] **Step 1: Add `buscarmisasNetworkId` to `ExistingChurch` interface** + +In `src/lib/church-matcher.ts`, find the `ExistingChurch` interface (line ~11). The interface currently ends with `gottesdienstzeitenId: string | null;` followed by `source: string;`. Insert the two new fields immediately before the `source:` line: + +```ts + discovermassId: string | null; + buscarmisasNetworkId: string | null; + source: string; // ← already exists, shown for placement only +``` + +Note: `discovermassId` was missing from the interface (pre-existing gap) — adding it here ensures the `loadExistingChurches` select in Task 5 compiles correctly. + +- [ ] **Step 2: Add `buscarmisasNetworkId` to `ChurchCandidate` type** + +Find the `ChurchCandidate` type (line ~122). After the existing `horariosMisasId?: string;` and all other existing optional ID fields, add: + +```ts + discovermassId?: string; + buscarmisasNetworkId?: string; +``` + +- [ ] **Step 3: Add ID-match passes in `findDuplicateChurch`** + +The existing passes run 1–13 (osmId through gottesdienstzeitenId), with pass 14 being proximity+name at line ~259. Find the **Thirteenth pass** block (gottesdienstzeitenId, line ~251): + +```ts +// Thirteenth pass: exact gottesdienstzeitenId match +if (candidate.gottesdienstzeitenId) { + ... 
+} +``` + +Insert two new passes **after** it and **before** the proximity pass comment (`// Fourteenth pass: proximity + name match`): + +```ts +// Fourteenth pass: exact discovermassId match +if (candidate.discovermassId) { + const match = existingChurches.find( + (church) => church.discovermassId === candidate.discovermassId + ); + if (match) return match; +} + +// Fifteenth pass: exact buscarmisasNetworkId match +if (candidate.buscarmisasNetworkId) { + const match = existingChurches.find( + (church) => church.buscarmisasNetworkId === candidate.buscarmisasNetworkId + ); + if (match) return match; +} +``` + +Then update the existing proximity pass comment from `// Fourteenth pass:` to `// Sixteenth pass:`. + +- [ ] **Step 4: Verify TypeScript compiles** + +```bash +npx tsc --noEmit +``` + +Expected: 0 errors. + +- [ ] **Step 5: Commit** + +```bash +git add src/lib/church-matcher.ts +git commit -m "feat: add buscarmisasNetworkId (and discovermassId) to church-matcher interfaces and ID-match passes" +``` + +--- + +## Chunk 2: Parsing functions + +### Task 3: Write pure parsing functions with unit tests + +**Files:** +- Create: `scripts/import-buscarmisas-network.ts` (scaffold with parsing functions only) + +We write the parsing functions first as pure functions, then test them with real HTML snippets before wiring them to the HTTP layer. + +- [ ] **Step 1: Create `scripts/import-buscarmisas-network.ts` with the file header and types** + +```ts +#!/usr/bin/env tsx +/** + * Import Catholic churches and mass schedules from the BuscarMisas network. 
+ * + * A group of 5 identical WordPress-based directories covering Latin America: + * - horariosmissa.com.br (Brazil, ~4,732 churches) + * - buscarmisas.com.mx (Mexico, ~3,950 churches) + * - horariosmisa.com.ar (Argentina, ~3,012 churches) + * - buscarmisas.co (Colombia, ~2,665 churches) + * - horariomisa.cl (Chile, ~935 churches) + * + * Usage: + * npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br + * npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 500 + * npx tsx scripts/import-buscarmisas-network.ts --all + * npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run + */ + +import dotenv from 'dotenv'; +import path from 'path'; + +dotenv.config({ path: path.resolve(process.cwd(), '.env.local') }); +dotenv.config({ path: path.resolve(process.cwd(), '.env') }); + +import { Pool } from 'pg'; +import { PrismaPg } from '@prisma/adapter-pg'; +import { PrismaClient } from '@prisma/client'; + +import { findDuplicateChurch } from '../src/lib/church-matcher'; +import type { ExistingChurch } from '../src/lib/church-matcher'; +import { getDayNamesForCountry, buildDayPatterns } from '../src/scrapers/i18n/day-names'; + +// ─── Site Config ───────────────────────────────────────────────────────────── + +interface SiteConfig { + country: string; // ISO 3166-1 alpha-2 + language: 'pt' | 'es'; + sitemapType: 'page' | 'post'; +} + +const NETWORK_SITES: Record<string, SiteConfig> = { + 'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' }, + 'buscarmisas.com.mx': { country: 'MX', language: 'es', sitemapType: 'page' }, + 'horariosmisa.com.ar': { country: 'AR', language: 'es', sitemapType: 'page' }, + 'buscarmisas.co': { country: 'CO', language: 'es', sitemapType: 'page' }, + 'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' }, +}; + +// ─── Types ──────────────────────────────────────────────────────────────────── + +interface ParsedChurch { + name: string; + 
address: string | null; + city: string | null; + state: string | null; + phone: string | null; + lat: number; + lng: number; + externalId: string; + country: string; +} + +interface ParsedMass { + dayOfWeek: number; // 0 = Sunday, 6 = Saturday + time: string; // HH:MM 24-hour +} + +interface CLIArgs { + domain: string | null; + all: boolean; + dryRun: boolean; + resumeFrom: number; + limit: number | null; + jobId: string | null; +} + +interface ImportStats { + total: number; + created: number; + updated: number; + skipped: number; + errors: number; + massSchedulesCreated: number; +} +``` + +- [ ] **Step 2: Add `buildExternalId` helper** + +```ts +// ─── Helpers ───────────────────────────────────────────────────────────────── + +/** + * Build external ID for a church URL. + * Format: "{domain-slug}/{church-slug}" + * e.g. "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios" + */ +export function buildExternalId(domain: string, churchUrl: string): string { + const domainSlug = domain.replace(/\./g, '-'); + // URL path: /{region}/{city}/{church-slug}/ + const segments = churchUrl.replace(/\/$/, '').split('/').filter(Boolean); + const churchSlug = segments[segments.length - 1] || ''; + return `${domainSlug}/${churchSlug}`; +} +``` + +- [ ] **Step 3: Verify `buildExternalId` manually** + +```bash +npx tsx -e " +import { buildExternalId } from './scripts/import-buscarmisas-network'; +console.log(buildExternalId('horariosmissa.com.br', 'https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/')); +// Expected: horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios +console.log(buildExternalId('buscarmisas.co', 'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/')); +// Expected: buscarmisas-co/parroquia-san-pedro +" +``` + +- [ ] **Step 4: Add `parseChurchPage` function** + +```ts +/** + * Parse church data from a church page HTML string. + * Returns null if name or coordinates cannot be extracted. 
+ */ +export function parseChurchPage( + html: string, + domain: string, + churchUrl: string, + config: SiteConfig, +): ParsedChurch | null { + // Name: the <td> that follows the <strong>Nome</strong> (PT) or <strong>Nombre</strong> (ES) label cell + const nameLabel = config.language === 'pt' ? 'Nome' : 'Nombre'; + const nameMatch = html.match( + new RegExp(`${nameLabel}<\\/strong><\\/td>\\s*<td[^>]*>([^<]+)<\\/td>`, 'i') + ); + const name = nameMatch?.[1]?.trim() ?? ''; + if (!name) return null; + + // Coordinates: Google Maps iframe center= parameter + const coordMatch = html.match(/center=([-\d.]+)%2C([-\d.]+)/i); + if (!coordMatch) return null; + const lat = parseFloat(coordMatch[1]); + const lng = parseFloat(coordMatch[2]); + if (!isFinite(lat) || !isFinite(lng) || Math.abs(lat) > 90 || Math.abs(lng) > 180) return null; + + // Address: the <td> that follows the <strong>Endereço</strong> (PT) or <strong>Dirección</strong> (ES) label cell + const addrLabel = config.language === 'pt' ? 'Endere[çc]o' : 'Direcci[oó]n'; + const addrMatch = html.match( + new RegExp(`${addrLabel}<\\/strong><\\/td>\\s*<td[^>]*>([^<]+)<\\/td>`, 'i') + ); + const address = addrMatch?.[1]?.trim() ?? null; + + // Phone: tel: href + const phoneMatch = html.match(/href="tel:([^"]+)"/i); + const phone = phoneMatch?.[1]?.trim() ?? null; + + // State and city from the URL path segments + // URL form: https://{domain}/{state}/{city}/{slug}/ + const urlPath = new URL(churchUrl).pathname.split('/').filter(Boolean); + const state = urlPath[0] ? decodeURIComponent(urlPath[0].replace(/-/g, ' ')) : null; + const city = urlPath[1] ? decodeURIComponent(urlPath[1].replace(/-/g, ' ')) : null; + + return { + name, + address, + city, + state, + phone, + lat, + lng, + externalId: buildExternalId(domain, churchUrl), + country: config.country, + }; +} +``` + +- [ ] **Step 5: Add `parseMassSchedule` function** + +```ts +/** + * Parse the weekly mass schedule table from church page HTML. 
+ * Table format: day-name cell | time cell (comma-separated times, "-" = no mass) + */ +export function parseMassSchedule(html: string, countryCode: string): ParsedMass[] { + const dayPatterns = buildDayPatterns(getDayNamesForCountry(countryCode)); + const results: ParsedMass[] = []; + + // Extract all <td> cells as pairs [day, time] + const cells = [...html.matchAll(/<td[^>]*>(.*?)<\/td>/gis)].map(m => + m[1].replace(/<[^>]+>/g, '').trim() + ); + + for (let i = 0; i + 1 < cells.length; i += 2) { + const dayCell = cells[i].toLowerCase(); + const timeCell = cells[i + 1]; + + const dayOfWeek = dayPatterns[dayCell]; + if (dayOfWeek === undefined) continue; + if (timeCell === '-' || !timeCell) continue; + + // Split comma-separated times: "10:00, 18:00" → ["10:00", "18:00"] + for (const rawTime of timeCell.split(',')) { + const time = rawTime.trim(); + if (/^\d{1,2}:\d{2}$/.test(time)) { + results.push({ dayOfWeek, time }); + } + } + } + return results; +} +``` + +- [ ] **Step 6: Test `parseChurchPage` and `parseMassSchedule` with real HTML** + +```bash +npx tsx -e " +import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network'; + +const NETWORK_SITES = { + 'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' }, +}; + +async function test() { + const res = await fetch('https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/'); + const html = await res.text(); + const config = NETWORK_SITES['horariosmissa.com.br']; + const parsed = parseChurchPage(html, 'horariosmissa.com.br', res.url, config); + console.log('Church:', JSON.stringify(parsed, null, 2)); + const masses = parseMassSchedule(html, config.country); + console.log('Masses:', JSON.stringify(masses, null, 2)); +} +test().catch(console.error); +" +``` + +Expected output (exact values are illustrative — website content may change): +``` +Church: { + "name": "Paróquia Nossa Senhora dos Remédios", // or current name + "address": "R. Ten. 
Azevedo, 182 ...", + "city": "sao paulo", + "state": "sao paulo", + "phone": "+55 11 ...", + "lat": -23.56..., + "lng": -46.62..., + "externalId": "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios", + "country": "BR" +} +Masses: [ { "dayOfWeek": 2, "time": "17:00" }, ... ] +``` + +Verify: `church` is non-null, `lat`/`lng` are non-zero finite numbers, `externalId` matches `horariosmissa-com-br/{slug}` pattern, `masses` array is non-empty with dayOfWeek 0–6 and HH:MM times. + +- [ ] **Step 7: Test with a Spanish-language site (Mexico)** + +```bash +npx tsx -e " +import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network'; +const config = { country: 'MX', language: 'es', sitemapType: 'page' }; +const domain = 'buscarmisas.com.mx'; +const url = 'https://buscarmisas.com.mx/nuevo-leon/monterrey/parroquia-anunciacion-a-maria/'; +fetch(url).then(r => r.text()).then(html => { + console.log('Church:', JSON.stringify(parseChurchPage(html, domain, url, config), null, 2)); + console.log('Masses:', JSON.stringify(parseMassSchedule(html, config.country), null, 2)); +}).catch(console.error); +" +``` + +Expected: name, coordinates, and Spanish-language schedule rows parsed correctly. 
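
The two live-site checks above depend on the network and on page content that may change. As a complement, here is a minimal offline sketch of the same cell-pairing logic run against a hand-written fixture. `DAY_MAP`, `parseScheduleFixture`, and `FIXTURE` are hypothetical names introduced for illustration; the real day lookup comes from `buildDayPatterns`, whose exact return shape is assumed here to be a lowercase-name-to-index map.

```typescript
// Hypothetical stand-in for the map returned by buildDayPatterns();
// the real keys live in src/scrapers/i18n/day-names.ts.
const DAY_MAP: Record<string, number> = {
  domingo: 0, lunes: 1, martes: 2, 'miércoles': 3, jueves: 4, viernes: 5, 'sábado': 6,
};

interface Mass { dayOfWeek: number; time: string }

// Same pairing logic as parseMassSchedule: <td> cells are read two at a time
// as [day, times]; "-" or an empty cell means no mass that day.
function parseScheduleFixture(html: string, dayMap: Record<string, number>): Mass[] {
  const cells = [...html.matchAll(/<td[^>]*>(.*?)<\/td>/gis)].map(m =>
    m[1].replace(/<[^>]+>/g, '').trim()
  );
  const out: Mass[] = [];
  for (let i = 0; i + 1 < cells.length; i += 2) {
    const dayOfWeek = dayMap[cells[i].toLowerCase()];
    const times = cells[i + 1];
    if (dayOfWeek === undefined || !times || times === '-') continue;
    for (const raw of times.split(',')) {
      const time = raw.trim();
      if (/^\d{1,2}:\d{2}$/.test(time)) out.push({ dayOfWeek, time });
    }
  }
  return out;
}

// Hand-written fixture; the real pages may use different markup.
const FIXTURE = `
<table>
  <tr><td><strong>Lunes</strong></td><td>-</td></tr>
  <tr><td>Martes</td><td>19:00</td></tr>
  <tr><td>Domingo</td><td>8:00, 12:00, 18:30</td></tr>
</table>`;

const fixtureMasses = parseScheduleFixture(FIXTURE, DAY_MAP);
console.log(fixtureMasses);
```

Note that the time regex accepts a one-digit hour, so `8:00` passes through unpadded even though `ParsedMass.time` is documented as `HH:MM`. If zero-padding matters downstream, that would need an explicit normalisation step.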
+ +- [ ] **Step 8: Commit parsing scaffold** + +```bash +git add scripts/import-buscarmisas-network.ts +git commit -m "feat: add buscarmisas-network importer — parsing functions" +``` + +--- + +### Task 4: Sitemap discovery function + +**Files:** +- Modify: `scripts/import-buscarmisas-network.ts` + +- [ ] **Step 1: Add HTTP helpers** + +```ts +// ─── HTTP Helpers ───────────────────────────────────────────────────────────── + +const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)'; +const REQUEST_DELAY_MS = 2_000; +const DOMAIN_DELAY_MS = 5_000; + +async function fetchText(url: string): Promise<string> { + const res = await fetch(url, { headers: { 'User-Agent': USER_AGENT } }); + if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`); + return res.text(); +} + +async function fetchWithRetry(url: string, retries = 3): Promise<string> { + for (let attempt = 1; attempt <= retries; attempt++) { + try { + return await fetchText(url); + } catch (err) { + const msg = err instanceof Error ? err.message : String(err); + if (attempt === retries) throw err; + const isRetryable = msg.includes('429') || msg.includes('503'); + if (!isRetryable) throw err; + const backoff = attempt * 30_000; // 30s, 60s, 90s + console.warn(` [retry ${attempt}/${retries}] ${msg} — waiting ${backoff / 1000}s`); + await sleep(backoff); + } + } + throw new Error('unreachable'); +} + +function sleep(ms: number): Promise<void> { + return new Promise(resolve => setTimeout(resolve, ms)); +} +``` + +- [ ] **Step 2: Add `getChurchUrls` function** + +```ts +// ─── Sitemap Discovery ──────────────────────────────────────────────────────── + +/** + * Fetch all church page URLs for a domain from its sitemap. 
+ * Church URLs have exactly 3 path segments: /{region}/{city}/{slug}/ + */ +export async function getChurchUrls(domain: string, config: SiteConfig): Promise<string[]> { + const indexUrl = `https://${domain}/sitemap_index.xml`; + console.log(`Fetching sitemap index: ${indexUrl}`); + const indexXml = await fetchWithRetry(indexUrl); + + // Extract child sitemap URLs matching the sitemapType + const childPattern = config.sitemapType === 'page' + ? /https:\/\/[^<]*\/page-sitemap\d*\.xml/g + : /https:\/\/[^<]*\/post-sitemap\.xml/g; + + const childUrls = [...indexXml.matchAll(childPattern)].map(m => m[0]); + console.log(` Found ${childUrls.length} child sitemaps`); + + const churchUrls: string[] = []; + for (const sitemapUrl of childUrls) { + const xml = await fetchWithRetry(sitemapUrl); + const locs = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map(m => m[1].trim()); + for (const loc of locs) { + // Church URLs: exactly 3 non-empty path segments after the domain + try { + const segments = new URL(loc).pathname.split('/').filter(Boolean); + if (segments.length === 3) { + churchUrls.push(loc); + } + } catch { /* skip malformed URLs */ } + } + } + + // Deduplicate + const unique = [...new Set(churchUrls)]; + console.log(` Total church URLs: ${unique.length}`); + return unique; +} +``` + +- [ ] **Step 3: Verify sitemap discovery against known counts** + +```bash +npx tsx -e " +import { getChurchUrls } from './scripts/import-buscarmisas-network'; +const NETWORK_SITES = { + 'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' }, + 'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' }, +}; +// Wrapped in an async function, as in Task 3 Step 6, so the snippet does not rely on top-level await +async function test() { + for (const [domain, config] of Object.entries(NETWORK_SITES)) { + const urls = await getChurchUrls(domain, config); + console.log(domain, '->', urls.length, 'churches'); + console.log(' Sample:', urls[0]); + } +} +test().catch(console.error); +" +``` + +Expected: Brazil ~4,700+ URLs, Chile ~930+ URLs. 
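
The three-segment rule can also be checked without touching the live sitemaps. The following is a minimal sketch, assuming each sitemap lists its URLs in `<loc>` elements; `extractChurchUrls` and `SAMPLE_SITEMAP` are hypothetical names, and `example.test` is a placeholder domain rather than one of the network sites.

```typescript
// Keep only URLs whose path has exactly three non-empty segments:
// /{region}/{city}/{slug}/
function extractChurchUrls(sitemapXml: string): string[] {
  const locs = [...sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)].map(m => m[1].trim());
  const churchUrls = locs.filter(loc => {
    try {
      return new URL(loc).pathname.split('/').filter(Boolean).length === 3;
    } catch {
      return false; // malformed URL: skip it
    }
  });
  // Deduplicate while preserving first-seen order
  return [...new Set(churchUrls)];
}

// Hand-written sample: home page, region page, city page, one church (listed twice), one bad entry.
const SAMPLE_SITEMAP = `<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url><loc>https://example.test/</loc></url>
  <url><loc>https://example.test/region-a/</loc></url>
  <url><loc>https://example.test/region-a/city-b/</loc></url>
  <url><loc>https://example.test/region-a/city-b/parish-c/</loc></url>
  <url><loc>https://example.test/region-a/city-b/parish-c/</loc></url>
  <url><loc>not a url</loc></url>
</urlset>`;

const sampleChurchUrls = extractChurchUrls(SAMPLE_SITEMAP);
console.log(sampleChurchUrls);
```

Region and city index pages have one and two segments respectively, so they fall out of the filter. Any non-church page that happens to sit three levels deep would still slip through, which is why the live counts in Step 3 are worth comparing against the expected totals.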
+ +- [ ] **Step 4: Commit** + +```bash +git add scripts/import-buscarmisas-network.ts +git commit -m "feat: add buscarmisas-network importer — sitemap discovery" +``` + +--- + +## Chunk 3: Main importer + +### Task 5: DB helpers and church processing loop + +**Files:** +- Modify: `scripts/import-buscarmisas-network.ts` + +- [ ] **Step 1: Add DB connection and `loadExistingChurches`** + +At the top of the file (after dotenv), add the DB setup: + +```ts +const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass'; +console.log(`Connecting to: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`); +const pool = new Pool({ connectionString: dbUrl, ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined }); +const adapter = new PrismaPg(pool); +const prisma = new PrismaClient({ adapter }); +``` + +Then add `loadExistingChurches`: + +```ts +// ─── DB Helpers ─────────────────────────────────────────────────────────────── + +async function loadExistingChurches(country: string): Promise<ExistingChurch[]> { + console.log(`Loading existing ${country} churches from DB...`); + const churches = await prisma.church.findMany({ + where: { country }, + select: { + id: true, name: true, latitude: true, longitude: true, + osmId: true, baiduId: true, masstimesId: true, + orarimesseId: true, massSchedulesPhId: true, philmassId: true, + horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true, + messesInfoId: true, bohosluzbyId: true, miserendId: true, + kerknetId: true, gottesdienstzeitenId: true, discovermassId: true, + buscarmisasNetworkId: true, + source: true, website: true, phone: true, address: true, country: true, + }, + }); + console.log(` Loaded ${churches.length} existing ${country} churches`); + return churches as ExistingChurch[]; +} +``` + +- [ ] **Step 2: Add `processChurch` function** + +```ts +// ─── Church Processing ──────────────────────────────────────────────────────── + +async function processChurch( + url: string, + domain: string, 
+ config: SiteConfig, + existingChurches: ExistingChurch[], + args: CLIArgs, + stats: ImportStats, +): Promise<void> { + stats.total++; + try { + const html = await fetchWithRetry(url); + const parsed = parseChurchPage(html, domain, url, config); + if (!parsed) { + console.log(` [skip] No name/coords: ${url}`); + stats.skipped++; + return; + } + + const masses = parseMassSchedule(html, config.country); + + if (args.dryRun) { + console.log(` [dry-run] ${parsed.name} — ${masses.length} masses`); + return; + } + + const candidate = { + name: parsed.name, + lat: parsed.lat, + lng: parsed.lng, + buscarmisasNetworkId: parsed.externalId, + }; + const duplicate = findDuplicateChurch(candidate, existingChurches); + + if (duplicate) { + const updateData: Record<string, unknown> = { buscarmisasNetworkId: parsed.externalId }; + if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone; + if (parsed.lat !== 0 && duplicate.latitude === 0) { + updateData.latitude = parsed.lat; + updateData.longitude = parsed.lng; + } + + await prisma.$transaction(async (tx) => { + await tx.church.update({ where: { id: duplicate.id }, data: updateData }); + if (masses.length > 0) { + await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } }); + await tx.massSchedule.createMany({ + data: masses.map(m => ({ churchId: duplicate.id, dayOfWeek: m.dayOfWeek, time: m.time, language: config.language === 'pt' ? 
'Portuguese' : 'Spanish', notes: null })), + }); + } + await tx.church.update({ where: { id: duplicate.id }, data: { lastScrapedAt: new Date() } }); + }); + duplicate.buscarmisasNetworkId = parsed.externalId; + stats.updated++; + } else { + const church = await prisma.church.create({ + data: { + name: parsed.name, + address: parsed.address, + city: parsed.city, + state: parsed.state, + country: parsed.country, + phone: parsed.phone, + latitude: parsed.lat, + longitude: parsed.lng, + buscarmisasNetworkId: parsed.externalId, + source: 'buscarmisas-network', + hasWebsite: false, + }, + }); + + existingChurches.push({ + id: church.id, name: parsed.name, latitude: parsed.lat, longitude: parsed.lng, + osmId: null, baiduId: null, masstimesId: null, orarimesseId: null, + massSchedulesPhId: null, philmassId: null, horariosMisasId: null, + mszeInfoId: null, weekdayMassesId: null, messesInfoId: null, + bohosluzbyId: null, miserendId: null, kerknetId: null, + gottesdienstzeitenId: null, discovermassId: null, + buscarmisasNetworkId: parsed.externalId, + source: 'buscarmisas-network', website: null, phone: parsed.phone, + address: parsed.address, country: parsed.country, + }); + + if (masses.length > 0) { + await prisma.massSchedule.createMany({ + data: masses.map(m => ({ + churchId: church.id, + dayOfWeek: m.dayOfWeek, + time: m.time, + language: config.language === 'pt' ? 'Portuguese' : 'Spanish', + notes: null, + })), + }); + await prisma.church.update({ where: { id: church.id }, data: { lastScrapedAt: new Date() } }); + } + stats.created++; + } + + stats.massSchedulesCreated += masses.length; + console.log( + ` [${duplicate ? 'update' : 'create'}] ${parsed.name} — ${masses.length} masses — ` + + `${stats.total} total (${stats.created}↑ ${stats.updated}↻ ${stats.errors}✗)` + ); + } catch (err) { + stats.errors++; + console.error(` [error] ${url}: ${err instanceof Error ? 
err.message : err}`); + } +} +``` + +- [ ] **Step 3: Compile-check the file so far** + +```bash +npx tsc --noEmit +``` + +Expected: 0 errors. + +- [ ] **Step 4: Commit** + +```bash +git add scripts/import-buscarmisas-network.ts +git commit -m "feat: add buscarmisas-network importer — DB helpers and church processing" +``` + +--- + +### Task 6: CLI parsing and `main()` function + +**Files:** +- Modify: `scripts/import-buscarmisas-network.ts` + +- [ ] **Step 1: Add `parseCLIArgs`** + +```ts +// ─── CLI ────────────────────────────────────────────────────────────────────── + +function parseCLIArgs(): CLIArgs { + const argv = process.argv.slice(2); + const result: CLIArgs = { domain: null, all: false, dryRun: false, resumeFrom: 0, limit: null, jobId: null }; + for (let i = 0; i < argv.length; i++) { + switch (argv[i]) { + case '--domain': result.domain = argv[++i]; break; + case '--all': result.all = true; break; + case '--dry-run': result.dryRun = true; break; + case '--resume-from': result.resumeFrom = parseInt(argv[++i], 10); break; + case '--limit': result.limit = parseInt(argv[++i], 10); break; + case '--job-id': result.jobId = argv[++i]; break; + } + } + return result; +} + +function validateArgs(args: CLIArgs): void { + if (!args.domain && !args.all) { + console.error('Usage:'); + console.error(' npx tsx scripts/import-buscarmisas-network.ts --domain <domain>'); + console.error(' npx tsx scripts/import-buscarmisas-network.ts --all'); + console.error('\nValid domains:', Object.keys(NETWORK_SITES).join(', ')); + process.exit(1); + } + if (args.domain && !NETWORK_SITES[args.domain]) { + console.error(`Unknown domain: ${args.domain}`); + console.error('Valid domains:', Object.keys(NETWORK_SITES).join(', ')); + process.exit(1); + } + if (args.all && args.resumeFrom > 0) { + console.error('--resume-from cannot be used with --all. 
Use --domain to resume a specific site.'); + process.exit(1); + } +} +``` + +- [ ] **Step 2: Add `runDomain` function** + +```ts +async function runDomain(domain: string, config: SiteConfig, args: CLIArgs): Promise<ImportStats> { + const stats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 }; + + const allUrls = await getChurchUrls(domain, config); + const existingChurches = await loadExistingChurches(config.country); + + // Build set of already-imported IDs for fast skip + const importedIds = new Set( + existingChurches.filter(c => c.buscarmisasNetworkId).map(c => c.buscarmisasNetworkId!) + ); + + let candidateUrls = allUrls.slice(args.resumeFrom).filter(url => { + const externalId = buildExternalId(domain, url); + return !importedIds.has(externalId); + }); + if (args.limit !== null) candidateUrls = candidateUrls.slice(0, args.limit); + + console.log(`\n${domain}: ${allUrls.length} total | ${importedIds.size} already imported | ${candidateUrls.length} to process\n`); + + for (let i = 0; i < candidateUrls.length; i++) { + const url = candidateUrls[i]; + console.log(`[${i + 1}/${candidateUrls.length}] ${url}`); + await processChurch(url, domain, config, existingChurches, args, stats); + if (i < candidateUrls.length - 1) await sleep(REQUEST_DELAY_MS); + } + + return stats; +} +``` + +- [ ] **Step 3: Add `main()` function** + +```ts +// ─── Main ───────────────────────────────────────────────────────────────────── + +async function main() { + const args = parseCLIArgs(); + validateArgs(args); + + if (args.jobId) { + try { + await prisma.backgroundJob.update({ + where: { id: args.jobId }, + data: { status: 'running', startedAt: new Date() }, + }); + } catch { /* job may not exist yet */ } + } + + const domainsToRun: [string, SiteConfig][] = args.all + ? 
Object.entries(NETWORK_SITES) + : [[args.domain!, NETWORK_SITES[args.domain!]]]; + + const totalStats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 }; + + try { + for (let d = 0; d < domainsToRun.length; d++) { + const [domain, config] = domainsToRun[d]; + console.log(`\n${'─'.repeat(60)}`); + console.log(`Domain ${d + 1}/${domainsToRun.length}: ${domain} (${config.country})`); + console.log('─'.repeat(60)); + const stats = await runDomain(domain, config, args); + totalStats.total += stats.total; + totalStats.created += stats.created; + totalStats.updated += stats.updated; + totalStats.skipped += stats.skipped; + totalStats.errors += stats.errors; + totalStats.massSchedulesCreated += stats.massSchedulesCreated; + if (d < domainsToRun.length - 1) await sleep(DOMAIN_DELAY_MS); + } + } finally { + console.log('\n─── Import Complete ───────────────────────────────────────'); + console.log(`Total processed: ${totalStats.total}`); + console.log(`Created: ${totalStats.created}`); + console.log(`Updated: ${totalStats.updated}`); + console.log(`Skipped: ${totalStats.skipped}`); + console.log(`Errors: ${totalStats.errors}`); + console.log(`Mass schedules: ${totalStats.massSchedulesCreated}`); + + if (args.jobId) { + const status = totalStats.errors > totalStats.total * 0.1 ? 'failed' : 'completed'; + try { + await prisma.backgroundJob.update({ + where: { id: args.jobId }, + data: { + status, + completedAt: new Date(), + processed: totalStats.total, + succeeded: totalStats.created + totalStats.updated, + failed: totalStats.errors, + itemsFound: totalStats.massSchedulesCreated, + }, + }); + } catch { /* ignore */ } + } + + await prisma.$disconnect(); + await pool.end(); + } +} + +main().catch(err => { + console.error('Fatal error:', err); + process.exit(1); +}); +``` + +- [ ] **Step 4: Final compile check** + +```bash +npx tsc --noEmit +``` + +Expected: 0 errors. 
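
The order of operations in `runDomain` (slice by `--resume-from`, drop already-imported IDs, then apply `--limit`) determines what a re-run does. The sketch below isolates that selection as a pure function so it can be reasoned about without a database. `selectCandidates` and `slugOf` are hypothetical names, and `slugOf` is a simplified stand-in for `buildExternalId`.

```typescript
// Mirrors the candidate selection in runDomain:
// 1. skip the first `resumeFrom` URLs, 2. drop already-imported IDs, 3. cap at `limit`.
function selectCandidates(
  allUrls: string[],
  importedIds: Set<string>,
  toExternalId: (url: string) => string,
  resumeFrom = 0,
  limit: number | null = null,
): string[] {
  const pending = allUrls
    .slice(resumeFrom)
    .filter(url => !importedIds.has(toExternalId(url)));
  return limit === null ? pending : pending.slice(0, limit);
}

// Simplified stand-in for buildExternalId: last path segment only.
const slugOf = (url: string): string => url.replace(/\/$/, '').split('/').pop() ?? '';

const sampleUrls = ['a', 'b', 'c', 'd', 'e'].map(s => `https://example.test/r/c/${s}/`);

// First run with --limit 3 on an empty database.
const firstRun = selectCandidates(sampleUrls, new Set<string>(), slugOf, 0, 3);

// Second run with the same flags, after the first three were imported.
const secondRun = selectCandidates(sampleUrls, new Set<string>(firstRun.map(slugOf)), slugOf, 0, 3);

console.log(firstRun.map(slugOf), secondRun.map(slugOf));
```

Because the imported-ID filter runs before the limit, repeating a `--limit 3` run does not stop at zero; it moves on to the next unimported URLs. That is convenient for incremental imports, but it is worth keeping in mind when reading the idempotency check in Task 8.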
+ +- [ ] **Step 5: Commit** + +```bash +git add scripts/import-buscarmisas-network.ts +git commit -m "feat: add buscarmisas-network importer — CLI + main loop" +``` + +--- + +## Chunk 4: Integration + smoke test + +### Task 7: package.json and scheduler integration + +**Files:** +- Modify: `package.json` +- Modify: `scripts/scheduler.ts` + +- [ ] **Step 1: Add npm script to `package.json`** + +In the `"scripts"` section, add after `"import:gcatholic"`: +```json +"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts", +``` + +- [ ] **Step 2: Add 5 case blocks to `getJobCommand` in `scheduler.ts`** + +In `scripts/scheduler.ts`, find the `case 'discovermass-import':` block (around line 240). After it, before the `default:` case, add: + +```ts +case 'buscarmisas-network-BR': { + const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmissa.com.br']; + if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom)); + return { command: 'npx', args }; +} +case 'buscarmisas-network-MX': { + const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.com.mx']; + if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom)); + return { command: 'npx', args }; +} +case 'buscarmisas-network-AR': { + const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmisa.com.ar']; + if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom)); + return { command: 'npx', args }; +} +case 'buscarmisas-network-CO': { + const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.co']; + if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom)); + return { command: 'npx', args }; +} +case 'buscarmisas-network-CL': { + const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariomisa.cl']; + if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom)); + return { 
command: 'npx', args }; +} +``` + +- [ ] **Step 3: Add 5 pipeline phases to `PIPELINE_GROUPS[0].phases` in `scheduler.ts`** + +In `scripts/scheduler.ts`, find the `PIPELINE_GROUPS` array. Inside the first group (`name: 'imports'`), add after the `discovermass-import` phase: + +```ts +{ name: 'buscarmisas-network-BR', type: 'buscarmisas-network-BR', config: {} }, +{ name: 'buscarmisas-network-MX', type: 'buscarmisas-network-MX', config: {} }, +{ name: 'buscarmisas-network-AR', type: 'buscarmisas-network-AR', config: {} }, +{ name: 'buscarmisas-network-CO', type: 'buscarmisas-network-CO', config: {} }, +{ name: 'buscarmisas-network-CL', type: 'buscarmisas-network-CL', config: {} }, +``` + +- [ ] **Step 4: TypeScript compile check** + +```bash +npx tsc --noEmit +``` + +Expected: 0 errors. + +- [ ] **Step 5: Commit** + +```bash +git add package.json scripts/scheduler.ts +git commit -m "feat: add buscarmisas-network to package.json and scheduler pipeline" +``` + +--- + +### Task 8: Smoke test against live sites + +- [ ] **Step 1: Dry-run Brazil (verifies parsing + sitemap, no DB writes)** + +```bash +npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run +``` + +Expected: prints church names with mass counts, no DB errors, >4,000 URLs discovered. + +- [ ] **Step 2: Live run — 3 churches from Brazil** + +```bash +npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --limit 3 +``` + +Expected: 3 churches created in DB, mass schedules created, no errors. 
+ +- [ ] **Step 3: Verify in DB** + +```bash +npx tsx -e " +import { Pool } from 'pg'; +import { PrismaPg } from '@prisma/adapter-pg'; +import { PrismaClient } from '@prisma/client'; +const pool = new Pool({ connectionString: process.env.DATABASE_URL }); +const prisma = new PrismaClient({ adapter: new PrismaPg(pool) }); +// Wrapped in an async function so the snippet does not rely on top-level await +async function verify() { + const churches = await prisma.church.findMany({ + where: { source: 'buscarmisas-network' }, + select: { name: true, country: true, buscarmisasNetworkId: true, latitude: true, longitude: true }, + take: 5, + }); + console.table(churches); + const massCount = await prisma.massSchedule.count({ + where: { church: { source: 'buscarmisas-network' } }, + }); + console.log('Mass schedules created:', massCount); + await prisma.\$disconnect(); await pool.end(); +} +verify().catch(console.error); +" +``` + +Expected: 3 rows with source `buscarmisas-network`, valid lat/lng, `buscarmisasNetworkId` populated. + +- [ ] **Step 4: Test idempotency (re-run should skip already-imported)** + +Re-run the same limited test. Expected: the three churches from Step 2 are counted under `already imported` in the summary line and are not fetched again. Because the `importedIds` filter is applied before `--limit`, the re-run does not report `0 to process`; it moves on to the next three unimported URLs. + +- [ ] **Step 5: Dry-run Chile (verifies post-sitemap path)** + +```bash +npx tsx scripts/import-buscarmisas-network.ts --domain horariomisa.cl --dry-run +``` + +Expected: ~935 URLs discovered, Spanish day names parsed correctly. + +- [ ] **Step 6: Final commit** + +```bash +git add scripts/import-buscarmisas-network.ts package.json scripts/scheduler.ts +git commit -m "feat: complete buscarmisas-network importer — Brazil, Mexico, Argentina, Colombia, Chile" +```
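
One assumption behind the external ID is worth checking before a full run. `buildExternalId` keeps only the last path segment, so two churches on the same domain that share a slug in different cities would map to the same `buscarmisasNetworkId`, and the ID-match pass would then treat the second as a duplicate of the first. The sketch below is a hypothetical pre-flight helper (`findSlugCollisions` is not part of the plan) that could be run over the output of `getChurchUrls`; whether any collisions actually exist on the live sites is not known here.

```typescript
// Group URLs by their final path segment and return only the groups
// that contain more than one URL, i.e. slugs that would collide.
function findSlugCollisions(urls: string[]): Map<string, string[]> {
  const bySlug = new Map<string, string[]>();
  for (const url of urls) {
    const slug = new URL(url).pathname.split('/').filter(Boolean).pop() ?? '';
    const group = bySlug.get(slug) ?? [];
    group.push(url);
    bySlug.set(slug, group);
  }
  return new Map([...bySlug].filter(([, group]) => group.length > 1));
}

// Placeholder URLs on a placeholder domain, for illustration only.
const collisionSample = [
  'https://example.test/region-a/city-x/parroquia-san-jose/',
  'https://example.test/region-b/city-y/parroquia-san-jose/',
  'https://example.test/region-a/city-x/parroquia-san-pedro/',
];

const collisions = findSlugCollisions(collisionSample);
for (const [slug, group] of collisions) {
  console.log(`${slug}: ${group.length} URLs`);
}
```

If the check reports collisions, one option would be to include the city segment in the external ID, at the cost of changing the ID format described in the Architecture summary.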