BuscarMisas Network Importer — Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add a single config-driven importer that scrapes ~15,294 Catholic churches and mass schedules from 5 Latin American WordPress-based directories (Brazil, Mexico, Argentina, Colombia, Chile).

Architecture: A NETWORK_SITES config map drives a single import-buscarmisas-network.ts script. Church HTML parsing extracts name, address, phone, coordinates, and weekly schedule. The external ID {domain-slug}/{church-slug} stored in a new buscarmisasNetworkId column prevents duplicate inserts on re-runs.

Tech Stack: TypeScript, tsx, Prisma 7 + pg adapter, existing church-matcher.ts + day-names.ts utilities.


Chunk 1: Schema prerequisite + church-matcher update

Task 1: Add buscarmisasNetworkId to BethelGuide schema

⚠️ BethelGuide is the schema source of truth. Never run prisma migrate in ScraperControl.

Files:

  • Modify (in BethelGuide repo): prisma/schema.prisma

  • Create (in BethelGuide repo): migration SQL file (generated by prisma migrate dev in Step 2)

  • Step 1: In BethelGuide, open prisma/schema.prisma and add the column to the Church model

Add after the existing discovermassId line:

buscarmisasNetworkId  String?  @unique  @map("buscarmisas_network_id")

And add to the @@index block at the bottom of the Church model:

@@index([buscarmisasNetworkId])
  • Step 2: In BethelGuide, create and run the migration
npx prisma migrate dev --name add_buscarmisas_network_id

Expected: migration file created, column added to the shared PostgreSQL database.

  • Step 3: Sync the updated schema to ScraperControl
cp prisma/schema.prisma ~/Documents/ScraperControl/prisma/schema.prisma
  • Step 4: Regenerate Prisma client in ScraperControl
cd ~/Documents/ScraperControl
npx prisma generate

Expected: no errors, @prisma/client regenerated with buscarmisasNetworkId field.

  • Step 5: Verify the field is available
npx tsx -e "
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) });
prisma.church.findFirst({ select: { buscarmisasNetworkId: true } }).then(r => {
  console.log('buscarmisasNetworkId field present:', JSON.stringify(r));
  return prisma.\$disconnect().then(() => pool.end());
});
"

Expected: prints buscarmisasNetworkId field present: followed by null (no churches in the table yet) or an object such as {"buscarmisasNetworkId":null}, with no type error.

  • Step 6: Commit in ScraperControl
git add prisma/schema.prisma
git commit -m "chore: sync schema — add buscarmisasNetworkId column"

Task 2: Update church-matcher.ts with new field + ID-match pass

Files:

  • Modify: src/lib/church-matcher.ts

  • Step 1: Add buscarmisasNetworkId to ExistingChurch interface

In src/lib/church-matcher.ts, find the ExistingChurch interface (line ~11). The interface currently ends with gottesdienstzeitenId: string | null; followed by source: string;. Insert the two new fields immediately before the source: line:

  discovermassId: string | null;
  buscarmisasNetworkId: string | null;
  source: string;   // ← already exists, shown for placement only

Note: discovermassId was missing from the interface (pre-existing gap) — adding it here ensures the loadExistingChurches select in Task 5 compiles correctly.

  • Step 2: Add buscarmisasNetworkId to ChurchCandidate type

Find the ChurchCandidate type (line ~122). After the existing horariosMisasId?: string; and all other existing optional ID fields, add:

  discovermassId?: string;
  buscarmisasNetworkId?: string;
  • Step 3: Add ID-match passes in findDuplicateChurch

The existing passes run 1 to 13 (osmId through gottesdienstzeitenId), with pass 14 being proximity+name at line ~259. Find the Thirteenth pass block (gottesdienstzeitenId, line ~251):

// Thirteenth pass: exact gottesdienstzeitenId match
if (candidate.gottesdienstzeitenId) {
  ...
}

Insert two new passes after it and before the proximity pass comment (// Fourteenth pass: proximity + name match):

// Fourteenth pass: exact discovermassId match
if (candidate.discovermassId) {
  const match = existingChurches.find(
    (church) => church.discovermassId === candidate.discovermassId
  );
  if (match) return match;
}

// Fifteenth pass: exact buscarmisasNetworkId match
if (candidate.buscarmisasNetworkId) {
  const match = existingChurches.find(
    (church) => church.buscarmisasNetworkId === candidate.buscarmisasNetworkId
  );
  if (match) return match;
}

Then update the existing proximity pass comment from // Fourteenth pass: to // Sixteenth pass:.
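
The ordering is the point of this step: exact external-ID passes run before the fuzzy proximity pass, so a re-imported church is recognised even if its name or coordinates have drifted. Below is a minimal sketch of that ordering. The types `Existing` and `Candidate`, the function `findDuplicateSketch`, and the 100 m radius are all hypothetical simplifications; the real `findDuplicateChurch` has more passes and its proximity rules may differ.

```typescript
// Hypothetical, simplified shapes: the real ExistingChurch and ChurchCandidate
// in src/lib/church-matcher.ts carry many more ID fields.
interface Existing {
  id: string;
  name: string;
  latitude: number;
  longitude: number;
  buscarmisasNetworkId: string | null;
}

interface Candidate {
  name: string;
  lat: number;
  lng: number;
  buscarmisasNetworkId?: string;
}

// Great-circle distance in metres (haversine formula).
function distanceMeters(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const earthRadius = 6371000;
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * earthRadius * Math.asin(Math.sqrt(a));
}

function findDuplicateSketch(candidate: Candidate, existing: Existing[]): Existing | null {
  // Pass 1: an exact external-ID match wins outright.
  if (candidate.buscarmisasNetworkId) {
    const idMatch = existing.find(
      (church) => church.buscarmisasNetworkId === candidate.buscarmisasNetworkId
    );
    if (idMatch) return idMatch;
  }
  // Pass 2 (fallback): same normalised name within an assumed 100 m radius.
  const norm = (s: string): string => s.trim().toLowerCase();
  const nearMatch = existing.find(
    (church) =>
      norm(church.name) === norm(candidate.name) &&
      distanceMeters(church.latitude, church.longitude, candidate.lat, candidate.lng) <= 100
  );
  return nearMatch ?? null;
}
```

Placing the ID passes first means the cheaper, unambiguous check short-circuits the fuzzy one, which is why the new passes are inserted before the proximity block rather than after it.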

  • Step 4: Verify TypeScript compiles
npx tsc --noEmit

Expected: 0 errors.

  • Step 5: Commit
git add src/lib/church-matcher.ts
git commit -m "feat: add buscarmisasNetworkId (and discovermassId) to church-matcher interfaces and ID-match passes"

Chunk 2: Parsing functions

Task 3: Write pure parsing functions with unit tests

Files:

  • Create: scripts/import-buscarmisas-network.ts (scaffold with parsing functions only)

We write the parsing functions first as pure functions, then test them with real HTML snippets before wiring them to the HTTP layer.

  • Step 1: Create scripts/import-buscarmisas-network.ts with the file header and types
#!/usr/bin/env tsx
/**
 * Import Catholic churches and mass schedules from the BuscarMisas network.
 *
 * A group of 5 identical WordPress-based directories covering Latin America:
 *   - horariosmissa.com.br  (Brazil,    ~4,732 churches)
 *   - buscarmisas.com.mx   (Mexico,    ~3,950 churches)
 *   - horariosmisa.com.ar  (Argentina, ~3,012 churches)
 *   - buscarmisas.co       (Colombia,  ~2,665 churches)
 *   - horariomisa.cl       (Chile,       ~935 churches)
 *
 * Usage:
 *   npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br
 *   npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 500
 *   npx tsx scripts/import-buscarmisas-network.ts --all
 *   npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
 */

import dotenv from 'dotenv';
import path from 'path';

dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
import { getDayNamesForCountry, buildDayPatterns } from '../src/scrapers/i18n/day-names';

// ─── Site Config ─────────────────────────────────────────────────────────────

interface SiteConfig {
  country: string;          // ISO 3166-1 alpha-2
  language: 'pt' | 'es';
  sitemapType: 'page' | 'post';
}

const NETWORK_SITES: Record<string, SiteConfig> = {
  'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
  'buscarmisas.com.mx':   { country: 'MX', language: 'es', sitemapType: 'page' },
  'horariosmisa.com.ar':  { country: 'AR', language: 'es', sitemapType: 'page' },
  'buscarmisas.co':       { country: 'CO', language: 'es', sitemapType: 'page' },
  'horariomisa.cl':       { country: 'CL', language: 'es', sitemapType: 'post' },
};

// ─── Types ────────────────────────────────────────────────────────────────────

interface ParsedChurch {
  name: string;
  address: string | null;
  city: string | null;
  state: string | null;
  phone: string | null;
  lat: number;
  lng: number;
  externalId: string;
  country: string;
}

interface ParsedMass {
  dayOfWeek: number;  // 0 = Sunday, 6 = Saturday
  time: string;       // HH:MM 24-hour
}

interface CLIArgs {
  domain: string | null;
  all: boolean;
  dryRun: boolean;
  resumeFrom: number;
  limit: number | null;
  jobId: string | null;
}

interface ImportStats {
  total: number;
  created: number;
  updated: number;
  skipped: number;
  errors: number;
  massSchedulesCreated: number;
}
  • Step 2: Add buildExternalId helper
// ─── Helpers ─────────────────────────────────────────────────────────────────

/**
 * Build external ID for a church URL.
 * Format: "{domain-slug}/{church-slug}"
 * e.g. "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios"
 */
export function buildExternalId(domain: string, churchUrl: string): string {
  const domainSlug = domain.replace(/\./g, '-');
  // URL path: /{region}/{city}/{church-slug}/
  const segments = churchUrl.replace(/\/$/, '').split('/').filter(Boolean);
  const churchSlug = segments[segments.length - 1] || '';
  return `${domainSlug}/${churchSlug}`;
}
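
Because the external ID keeps only the last path segment, two churches that share a slug in different cities would receive the same ID and be treated as one record. Whether the five sites actually contain such repeats is not established in this plan, so the following is only a sketch of how a URL list could be checked before importing. `externalIdFor` mirrors the derivation above, and `findExternalIdCollisions` is a hypothetical helper, not part of the importer.

```typescript
// Same derivation as buildExternalId: dots in the domain become dashes,
// and only the last path segment of the URL is kept.
function externalIdFor(domain: string, churchUrl: string): string {
  const segments = churchUrl.replace(/\/$/, '').split('/').filter(Boolean);
  const slug = segments[segments.length - 1] ?? '';
  return `${domain.replace(/\./g, '-')}/${slug}`;
}

// Group URLs by external ID and return only the IDs shared by more than one URL.
function findExternalIdCollisions(domain: string, urls: string[]): Map<string, string[]> {
  const byId = new Map<string, string[]>();
  for (const url of urls) {
    const id = externalIdFor(domain, url);
    const bucket = byId.get(id) ?? [];
    bucket.push(url);
    byId.set(id, bucket);
  }
  const collisions = new Map<string, string[]>();
  byId.forEach((group, id) => {
    if (group.length > 1) collisions.set(id, group);
  });
  return collisions;
}
```

If a check like this reports collisions on real sitemap data, including the city segment in the ID would be one possible remedy; that would be a design change beyond what this plan specifies.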
  • Step 3: Verify buildExternalId manually
npx tsx -e "
import { buildExternalId } from './scripts/import-buscarmisas-network';
console.log(buildExternalId('horariosmissa.com.br', 'https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/'));
// Expected: horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios
console.log(buildExternalId('buscarmisas.co', 'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/'));
// Expected: buscarmisas-co/parroquia-san-pedro
"
  • Step 4: Add parseChurchPage function
/**
 * Parse church data from a church page HTML string.
 * Returns null if name or coordinates cannot be extracted.
 */
export function parseChurchPage(
  html: string,
  domain: string,
  churchUrl: string,
  config: SiteConfig,
): ParsedChurch | null {
  // Name: cell after <strong>Nome</strong> (PT) or <strong>Nombre</strong> (ES)
  const nameLabel = config.language === 'pt' ? 'Nome' : 'Nombre';
  const nameMatch = html.match(
    new RegExp(`<strong>${nameLabel}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i')
  );
  const name = nameMatch?.[1]?.trim() ?? '';
  if (!name) return null;

  // Coordinates: Google Maps iframe center= parameter
  const coordMatch = html.match(/center=([-\d.]+)%2C([-\d.]+)/i);
  if (!coordMatch) return null;
  const lat = parseFloat(coordMatch[1]);
  const lng = parseFloat(coordMatch[2]);
  if (!isFinite(lat) || !isFinite(lng) || Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;

  // Address: cell after <strong>Endereço</strong> (PT) or <strong>Dirección</strong> (ES)
  const addrLabel = config.language === 'pt' ? 'Endere[çc]o' : 'Direcci[oó]n';
  const addrMatch = html.match(
    new RegExp(`<strong>${addrLabel}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i')
  );
  const address = addrMatch?.[1]?.trim() ?? null;

  // Phone: tel: href
  const phoneMatch = html.match(/href="tel:([^"]+)"/i);
  const phone = phoneMatch?.[1]?.trim() ?? null;

  // City and state from the URL path: https://{domain}/{state}/{city}/{slug}/
  const urlPath = new URL(churchUrl).pathname.split('/').filter(Boolean);
  const state = urlPath[0] ? decodeURIComponent(urlPath[0].replace(/-/g, ' ')) : null;
  const city  = urlPath[1] ? decodeURIComponent(urlPath[1].replace(/-/g, ' ')) : null;

  return {
    name,
    address,
    city,
    state,
    phone,
    lat,
    lng,
    externalId: buildExternalId(domain, churchUrl),
    country: config.country,
  };
}
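
Since the parser is a pure function of the HTML, its two key regexes can also be exercised offline against a fixture, which avoids depending on the live site during development. The sketch below assumes a fixture modelled on the markup those regexes expect; the HTML is hypothetical (not copied from a real page), and `cellAfterLabel` and `coordsFromEmbed` are illustrative helper names.

```typescript
// Hypothetical fixture: shaped like the markup the regexes expect, not a real page.
const FIXTURE_HTML = `
<table>
  <tr><td><strong>Nombre</strong></td> <td>Parroquia San Pedro</td></tr>
  <tr><td><strong>Dirección</strong></td> <td>Calle 10 # 5-20</td></tr>
</table>
<iframe src="https://www.google.com/maps/embed/v1/view?center=4.5981%2C-74.0760&zoom=16"></iframe>
`;

// Text of the <td> that follows a <strong>label</strong> cell, or null if absent.
function cellAfterLabel(html: string, labelPattern: string): string | null {
  const re = new RegExp(`<strong>${labelPattern}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i');
  const match = re.exec(html);
  return match?.[1]?.trim() ?? null;
}

// Latitude and longitude from a URL-encoded "center=lat%2Clng" parameter, range-checked.
function coordsFromEmbed(html: string): { lat: number; lng: number } | null {
  const match = /center=([-\d.]+)%2C([-\d.]+)/i.exec(html);
  if (!match) return null;
  const lat = parseFloat(match[1] ?? '');
  const lng = parseFloat(match[2] ?? '');
  if (!Number.isFinite(lat) || !Number.isFinite(lng)) return null;
  if (Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
  return { lat, lng };
}

console.log(cellAfterLabel(FIXTURE_HTML, 'Nombre'));
console.log(cellAfterLabel(FIXTURE_HTML, 'Direcci[oó]n'));
console.log(coordsFromEmbed(FIXTURE_HTML));
```

One limitation worth keeping in mind: the `[^<]+` capture returns nothing when the value cell contains nested tags (for example a link), so a page with that markup would be skipped rather than parsed.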
  • Step 5: Add parseMassSchedule function
/**
 * Parse the weekly mass schedule table from church page HTML.
 * Table format: day-name cell | time cell (comma-separated times, "-" = no mass)
 */
export function parseMassSchedule(html: string, countryCode: string): ParsedMass[] {
  const dayPatterns = buildDayPatterns(getDayNamesForCountry(countryCode));
  const results: ParsedMass[] = [];

  // Extract all <td> cells as pairs [day, time]
  const cells = [...html.matchAll(/<td[^>]*>(.*?)<\/td>/gis)].map(m =>
    m[1].replace(/<[^>]+>/g, '').trim()
  );

  for (let i = 0; i + 1 < cells.length; i += 2) {
    const dayCell = cells[i].toLowerCase();
    const timeCell = cells[i + 1];

    const dayOfWeek = dayPatterns[dayCell];
    if (dayOfWeek === undefined) continue;
    if (timeCell === '-' || !timeCell) continue;

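    // Split comma-separated times: "10:00, 18:00" → ["10:00", "18:00"]
    for (const rawTime of timeCell.split(',')) {
      const time = rawTime.trim();
      if (/^\d{1,2}:\d{2}$/.test(time)) {
        // Zero-pad single-digit hours so values match the HH:MM contract ("9:00" → "09:00")
        results.push({ dayOfWeek, time: time.padStart(5, '0') });
      }
    }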
  }
  return results;
}
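
The pairwise walk above assumes every table on the page has exactly two cells per row; a single odd-width row earlier in the HTML would shift the pairing. A row-based variant is one way to avoid that dependency. The sketch below is a swapped-in technique, not the plan's parser: `parseScheduleRows` is a hypothetical name, `SPANISH_DAYS` is a hand-written stand-in for whatever map `buildDayPatterns` returns, and the fixture HTML is invented for illustration.

```typescript
interface MassSlot {
  dayOfWeek: number; // 0 = Sunday, 6 = Saturday
  time: string;      // HH:MM, zero-padded
}

// Hypothetical stand-in for the map produced by buildDayPatterns().
const SPANISH_DAYS: Record<string, number> = {
  domingo: 0, lunes: 1, martes: 2, 'miércoles': 3, jueves: 4, viernes: 5, 'sábado': 6,
};

function parseScheduleRows(html: string, days: Record<string, number>): MassSlot[] {
  const slots: MassSlot[] = [];
  const rowRe = /<tr[^>]*>([\s\S]*?)<\/tr>/gi;
  let row: RegExpExecArray | null;
  while ((row = rowRe.exec(html)) !== null) {
    // Collect the text of each cell in this row only, with inner tags stripped.
    const cellRe = /<td[^>]*>([\s\S]*?)<\/td>/gi;
    const cells: string[] = [];
    let cell: RegExpExecArray | null;
    while ((cell = cellRe.exec(row[1] ?? '')) !== null) {
      cells.push((cell[1] ?? '').replace(/<[^>]+>/g, '').trim());
    }
    if (cells.length < 2) continue;
    const dayKey = (cells[0] ?? '').toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(days, dayKey)) continue;
    const dayOfWeek = days[dayKey] as number;
    for (const raw of (cells[1] ?? '').split(',')) {
      const m = /^(\d{1,2}):(\d{2})$/.exec(raw.trim());
      if (!m) continue; // "-" and free text are skipped here
      const hour = Number(m[1]);
      const minute = Number(m[2]);
      if (hour > 23 || minute > 59) continue; // reject impossible clock times
      slots.push({ dayOfWeek, time: `${hour < 10 ? '0' : ''}${hour}:${m[2]}` });
    }
  }
  return slots;
}

// Hypothetical fixture: a details row followed by schedule rows.
const SCHEDULE_HTML = `
<table>
  <tr><td><strong>Nombre</strong></td><td>Parroquia X</td></tr>
  <tr><td>Domingo</td><td>9:00, 18:30</td></tr>
  <tr><td>Lunes</td><td>-</td></tr>
  <tr><td>Martes</td><td>25:00</td></tr>
</table>`;
console.log(parseScheduleRows(SCHEDULE_HTML, SPANISH_DAYS));
```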
  • Step 6: Test parseChurchPage and parseMassSchedule with real HTML
npx tsx -e "
import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network';

const NETWORK_SITES = {
  'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
};

async function test() {
  const res = await fetch('https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/');
  const html = await res.text();
  const config = NETWORK_SITES['horariosmissa.com.br'];
  const parsed = parseChurchPage(html, 'horariosmissa.com.br', res.url, config);
  console.log('Church:', JSON.stringify(parsed, null, 2));
  const masses = parseMassSchedule(html, config.country);
  console.log('Masses:', JSON.stringify(masses, null, 2));
}
test().catch(console.error);
"

Expected output (exact values are illustrative — website content may change):

Church: {
  "name": "Paróquia Nossa Senhora dos Remédios",   // or current name
  "address": "R. Ten. Azevedo, 182 ...",
  "city": "sao paulo",
  "state": "sao paulo",
  "phone": "+55 11 ...",
  "lat": -23.56...,
  "lng": -46.62...,
  "externalId": "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios",
  "country": "BR"
}
Masses: [ { "dayOfWeek": 2, "time": "17:00" }, ... ]

Verify: church is non-null, lat/lng are non-zero finite numbers, externalId matches horariosmissa-com-br/{slug} pattern, masses array is non-empty with dayOfWeek 0 to 6 and HH:MM times.

  • Step 7: Test with a Spanish-language site (Mexico)
npx tsx -e "
import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network';
const config = { country: 'MX', language: 'es', sitemapType: 'page' };
const domain = 'buscarmisas.com.mx';
const url = 'https://buscarmisas.com.mx/nuevo-leon/monterrey/parroquia-anunciacion-a-maria/';
fetch(url).then(r => r.text()).then(html => {
  console.log('Church:', JSON.stringify(parseChurchPage(html, domain, url, config), null, 2));
  console.log('Masses:', JSON.stringify(parseMassSchedule(html, config.country), null, 2));
}).catch(console.error);
"

Expected: name, coordinates, and Spanish-language schedule rows parsed correctly.

  • Step 8: Commit parsing scaffold
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — parsing functions"

Task 4: Sitemap discovery function

Files:

  • Modify: scripts/import-buscarmisas-network.ts

  • Step 1: Add HTTP helpers

// ─── HTTP Helpers ─────────────────────────────────────────────────────────────

const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 2_000;
const DOMAIN_DELAY_MS  = 5_000;

async function fetchText(url: string): Promise<string> {
  const res = await fetch(url, { headers: { 'User-Agent': USER_AGENT } });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}

async function fetchWithRetry(url: string, retries = 3): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fetchText(url);
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      if (attempt === retries) throw err;
      const isRetryable = msg.includes('429') || msg.includes('503');
      if (!isRetryable) throw err;
      const backoff = attempt * 30_000; // 30s, 60s, 90s
      console.warn(`  [retry ${attempt}/${retries}] ${msg} — waiting ${backoff / 1000}s`);
      await sleep(backoff);
    }
  }
  throw new Error('unreachable');
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
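
The retry helper above decides retryability by searching the error message for "429" or "503", which would also match a URL that happens to contain those digits. An alternative is to carry the status code on the error itself. The sketch below illustrates that idea under stated assumptions: `HttpStatusError`, `withRetry`, and the injectable `wait` parameter are hypothetical names, and the linear backoff simply mirrors the 30s/60s schedule used above.

```typescript
// Error that carries the HTTP status so callers do not have to parse messages.
class HttpStatusError extends Error {
  readonly status: number;
  constructor(status: number, url: string) {
    super(`HTTP ${status} for ${url}`);
    this.name = 'HttpStatusError';
    this.status = status;
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, HttpStatusError.prototype);
  }
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

const RETRYABLE_STATUSES = [429, 503];

// Runs task up to `retries` times; only retryable HTTP statuses trigger another attempt.
async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 30000,
  wait: Sleep = realSleep,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const retryable =
        err instanceof HttpStatusError && RETRYABLE_STATUSES.indexOf(err.status) !== -1;
      if (!retryable || attempt >= retries) throw err;
      await wait(attempt * baseDelayMs); // linear backoff: 1x, 2x, ... the base delay
    }
  }
}
```

Passing `wait` as a parameter keeps the helper testable without real delays; in the importer it would default to the real timer.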
  • Step 2: Add getChurchUrls function
// ─── Sitemap Discovery ────────────────────────────────────────────────────────

/**
 * Fetch all church page URLs for a domain from its sitemap.
 * Church URLs have exactly 3 path segments: /{region}/{city}/{slug}/
 */
export async function getChurchUrls(domain: string, config: SiteConfig): Promise<string[]> {
  const indexUrl = `https://${domain}/sitemap_index.xml`;
  console.log(`Fetching sitemap index: ${indexUrl}`);
  const indexXml = await fetchWithRetry(indexUrl);

  // Extract child sitemap URLs matching the sitemapType
  const childPattern = config.sitemapType === 'page'
    ? /https:\/\/[^<]*\/page-sitemap\d*\.xml/g
    : /https:\/\/[^<]*\/post-sitemap\.xml/g;

  const childUrls = [...indexXml.matchAll(childPattern)].map(m => m[0]);
  console.log(`  Found ${childUrls.length} child sitemaps`);

  const churchUrls: string[] = [];
  for (const sitemapUrl of childUrls) {
    const xml = await fetchWithRetry(sitemapUrl);
    const locs = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map(m => m[1].trim());
    for (const loc of locs) {
      // Church URLs: exactly 3 non-empty path segments after the domain
      try {
        const segments = new URL(loc).pathname.split('/').filter(Boolean);
        if (segments.length === 3) {
          churchUrls.push(loc);
        }
      } catch { /* skip malformed URLs */ }
    }
  }

  // Deduplicate
  const unique = [...new Set(churchUrls)];
  console.log(`  Total church URLs: ${unique.length}`);
  return unique;
}
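
The depth filter is what separates church pages from region and city index pages, so it is worth checking in isolation. The following sketch applies the same rule (exactly three path segments, then deduplicate) to an in-memory sitemap fragment. The XML is a made-up example, and `pathSegments` and `extractChurchUrls` are hypothetical helpers rather than functions from the importer.

```typescript
// Hypothetical sitemap fragment; real sitemaps list thousands of <loc> entries.
const SITEMAP_XML = `
<urlset>
  <url><loc>https://horariosmissa.com.br/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/sao-paulo/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-x/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-x/</loc></url>
  <url><loc>https://horariosmissa.com.br/rio-de-janeiro/niteroi/paroquia-y/</loc></url>
  <url><loc>not a url</loc></url>
</urlset>`;

// Non-empty path segments of an http(s) URL, or null if the string is not one.
function pathSegments(url: string): string[] | null {
  const match = /^https?:\/\/[^/?#]+([^?#]*)/i.exec(url);
  if (!match) return null;
  return (match[1] ?? '').split('/').filter(Boolean);
}

// Keep only <loc> values with exactly three path segments, without duplicates.
function extractChurchUrls(xml: string): string[] {
  const seen = new Set<string>();
  const locRe = /<loc>([^<]+)<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locRe.exec(xml)) !== null) {
    const loc = (match[1] ?? '').trim();
    const segments = pathSegments(loc);
    if (segments !== null && segments.length === 3) seen.add(loc);
  }
  return Array.from(seen);
}

console.log(extractChurchUrls(SITEMAP_XML));
```

The rule relies on every church page sitting exactly three levels deep; any other three-segment page in a sitemap would be picked up as well, which is why Step 3 compares the totals against the known counts.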
  • Step 3: Verify sitemap discovery against known counts
npx tsx -e "
import { getChurchUrls } from './scripts/import-buscarmisas-network';
const NETWORK_SITES = {
  'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
  'horariomisa.cl':       { country: 'CL', language: 'es', sitemapType: 'post' },
};
// Wrapped in an async function because tsx -e may not support top-level await
async function check() {
  for (const [domain, config] of Object.entries(NETWORK_SITES)) {
    const urls = await getChurchUrls(domain, config);
    console.log(domain, '->', urls.length, 'churches');
    console.log('  Sample:', urls[0]);
  }
}
check().catch(console.error);
"

Expected: Brazil ~4,700+ URLs, Chile ~930+ URLs.

  • Step 4: Commit
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — sitemap discovery"

Chunk 3: Main importer

Task 5: DB helpers and church processing loop

Files:

  • Modify: scripts/import-buscarmisas-network.ts

  • Step 1: Add DB connection and loadExistingChurches

At the top of the file (after dotenv), add the DB setup:

const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({ connectionString: dbUrl, ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

Then add loadExistingChurches:

// ─── DB Helpers ───────────────────────────────────────────────────────────────

async function loadExistingChurches(country: string): Promise<ExistingChurch[]> {
  console.log(`Loading existing ${country} churches from DB...`);
  const churches = await prisma.church.findMany({
    where: { country },
    select: {
      id: true, name: true, latitude: true, longitude: true,
      osmId: true, baiduId: true, masstimesId: true,
      orarimesseId: true, massSchedulesPhId: true, philmassId: true,
      horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
      messesInfoId: true, bohosluzbyId: true, miserendId: true,
      kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
      buscarmisasNetworkId: true,
      source: true, website: true, phone: true, address: true, country: true,
    },
  });
  console.log(`  Loaded ${churches.length} existing ${country} churches`);
  return churches as ExistingChurch[];
}
  • Step 2: Add processChurch function
// ─── Church Processing ────────────────────────────────────────────────────────

async function processChurch(
  url: string,
  domain: string,
  config: SiteConfig,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats,
): Promise<void> {
  stats.total++;
  try {
    const html = await fetchWithRetry(url);
    const parsed = parseChurchPage(html, domain, url, config);
    if (!parsed) {
      console.log(`  [skip] No name/coords: ${url}`);
      stats.skipped++;
      return;
    }

    const masses = parseMassSchedule(html, config.country);

    if (args.dryRun) {
      console.log(`  [dry-run] ${parsed.name}: ${masses.length} masses`);
      return;
    }

    const candidate = {
      name: parsed.name,
      lat: parsed.lat,
      lng: parsed.lng,
      buscarmisasNetworkId: parsed.externalId,
    };
    const duplicate = findDuplicateChurch(candidate, existingChurches);

    if (duplicate) {
      const updateData: Record<string, unknown> = { buscarmisasNetworkId: parsed.externalId };
      if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
      if (parsed.lat !== 0 && duplicate.latitude === 0) {
        updateData.latitude = parsed.lat;
        updateData.longitude = parsed.lng;
      }

      await prisma.$transaction(async (tx) => {
        await tx.church.update({ where: { id: duplicate.id }, data: updateData });
        if (masses.length > 0) {
          await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
          await tx.massSchedule.createMany({
            data: masses.map(m => ({ churchId: duplicate.id, dayOfWeek: m.dayOfWeek, time: m.time, language: config.language === 'pt' ? 'Portuguese' : 'Spanish', notes: null })),
          });
        }
        await tx.church.update({ where: { id: duplicate.id }, data: { lastScrapedAt: new Date() } });
      });
      duplicate.buscarmisasNetworkId = parsed.externalId;
      stats.updated++;
    } else {
      const church = await prisma.church.create({
        data: {
          name: parsed.name,
          address: parsed.address,
          city: parsed.city,
          state: parsed.state,
          country: parsed.country,
          phone: parsed.phone,
          latitude: parsed.lat,
          longitude: parsed.lng,
          buscarmisasNetworkId: parsed.externalId,
          source: 'buscarmisas-network',
          hasWebsite: false,
        },
      });

      existingChurches.push({
        id: church.id, name: parsed.name, latitude: parsed.lat, longitude: parsed.lng,
        osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
        massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
        mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
        bohosluzbyId: null, miserendId: null, kerknetId: null,
        gottesdienstzeitenId: null, discovermassId: null,
        buscarmisasNetworkId: parsed.externalId,
        source: 'buscarmisas-network', website: null, phone: parsed.phone,
        address: parsed.address, country: parsed.country,
      });

      if (masses.length > 0) {
        await prisma.massSchedule.createMany({
          data: masses.map(m => ({
            churchId: church.id,
            dayOfWeek: m.dayOfWeek,
            time: m.time,
            language: config.language === 'pt' ? 'Portuguese' : 'Spanish',
            notes: null,
          })),
        });
        await prisma.church.update({ where: { id: church.id }, data: { lastScrapedAt: new Date() } });
      }
      stats.created++;
    }

    stats.massSchedulesCreated += masses.length;
    console.log(
      `  [${duplicate ? 'update' : 'create'}] ${parsed.name}: ${masses.length} masses | ` +
      `${stats.total} total (${stats.created} created, ${stats.updated} updated, ${stats.errors} errors)`
    );
  } catch (err) {
    stats.errors++;
    console.error(`  [error] ${url}: ${err instanceof Error ? err.message : err}`);
  }
}
  • Step 3: Compile-check the file so far
npx tsc --noEmit

Expected: 0 errors.

  • Step 4: Commit
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — DB helpers and church processing"

Task 6: CLI parsing and main() function

Files:

  • Modify: scripts/import-buscarmisas-network.ts

  • Step 1: Add parseCLIArgs

// ─── CLI ──────────────────────────────────────────────────────────────────────

function parseCLIArgs(): CLIArgs {
  const argv = process.argv.slice(2);
  const result: CLIArgs = { domain: null, all: false, dryRun: false, resumeFrom: 0, limit: null, jobId: null };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case '--domain':      result.domain     = argv[++i]; break;
      case '--all':         result.all        = true;      break;
      case '--dry-run':     result.dryRun     = true;      break;
      case '--resume-from': result.resumeFrom = parseInt(argv[++i], 10); break;
      case '--limit':       result.limit      = parseInt(argv[++i], 10); break;
      case '--job-id':      result.jobId      = argv[++i]; break;
    }
  }
  return result;
}

function validateArgs(args: CLIArgs): void {
  if (!args.domain && !args.all) {
    console.error('Usage:');
    console.error('  npx tsx scripts/import-buscarmisas-network.ts --domain <domain>');
    console.error('  npx tsx scripts/import-buscarmisas-network.ts --all');
    console.error('\nValid domains:', Object.keys(NETWORK_SITES).join(', '));
    process.exit(1);
  }
  if (args.domain && !NETWORK_SITES[args.domain]) {
    console.error(`Unknown domain: ${args.domain}`);
    console.error('Valid domains:', Object.keys(NETWORK_SITES).join(', '));
    process.exit(1);
  }
  if (args.all && args.resumeFrom > 0) {
    console.error('--resume-from cannot be used with --all. Use --domain to resume a specific site.');
    process.exit(1);
  }
}
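
Note that parseInt returns NaN when a flag value is missing or non-numeric. With the slicing used later, a NaN --resume-from behaves like 0, while a NaN --limit yields an empty list, so a typo can silently produce a run that processes nothing. A stricter parse is one way to surface that early. `parseNonNegativeInt` below is a hypothetical helper, offered as a sketch rather than as part of the plan.

```typescript
// Accepts only plain decimal digits; anything else (missing value, "abc",
// "-5", another flag such as "--dry-run") raises a descriptive error.
function parseNonNegativeInt(flag: string, raw: string | undefined): number {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new Error(`${flag} expects a non-negative integer, got: ${raw ?? '(missing)'}`);
  }
  return Number(raw);
}

console.log(parseNonNegativeInt('--resume-from', '500'));
```

In parseCLIArgs this could replace the two bare parseInt calls, for example `parseNonNegativeInt('--limit', argv[++i])`, with the thrown error reported through the existing fatal-error handler.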
  • Step 2: Add runDomain function
async function runDomain(domain: string, config: SiteConfig, args: CLIArgs): Promise<ImportStats> {
  const stats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };

  const allUrls = await getChurchUrls(domain, config);
  const existingChurches = await loadExistingChurches(config.country);

  // Build set of already-imported IDs for fast skip
  const importedIds = new Set(
    existingChurches.filter(c => c.buscarmisasNetworkId).map(c => c.buscarmisasNetworkId!)
  );

  let candidateUrls = allUrls.slice(args.resumeFrom).filter(url => {
    const externalId = buildExternalId(domain, url);
    return !importedIds.has(externalId);
  });
  if (args.limit !== null) candidateUrls = candidateUrls.slice(0, args.limit);

  console.log(`\n${domain}: ${allUrls.length} total | ${importedIds.size} already imported | ${candidateUrls.length} to process\n`);

  for (let i = 0; i < candidateUrls.length; i++) {
    const url = candidateUrls[i];
    console.log(`[${i + 1}/${candidateUrls.length}] ${url}`);
    await processChurch(url, domain, config, existingChurches, args, stats);
    if (i < candidateUrls.length - 1) await sleep(REQUEST_DELAY_MS);
  }

  return stats;
}
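
The order of operations in the URL selection matters: the resume offset is applied first, already-imported IDs are filtered out next, and the limit is applied last. One consequence is that repeating a run with the same --limit picks up the next unimported URLs rather than re-selecting the same ones. The pure function below restates that logic so it can be checked without a database; `selectCandidates` is a hypothetical name and the identity ID function in the usage is a simplification.

```typescript
// Mirrors the selection in runDomain: offset, then skip imported, then cap.
function selectCandidates(
  allUrls: string[],
  toExternalId: (url: string) => string,
  importedIds: Set<string>,
  resumeFrom: number,
  limit: number | null,
): string[] {
  const pending = allUrls
    .slice(resumeFrom)
    .filter((url) => !importedIds.has(toExternalId(url)));
  return limit === null ? pending : pending.slice(0, limit);
}

// Usage with placeholder URLs; the ID function is the identity for brevity.
const demoUrls = ['u1', 'u2', 'u3', 'u4', 'u5'];
const firstRun = selectCandidates(demoUrls, (u) => u, new Set<string>(), 0, 2);
const secondRun = selectCandidates(demoUrls, (u) => u, new Set<string>(firstRun), 0, 2);
console.log(firstRun, secondRun);
```

Because the offset is applied before the filter, --resume-from counts positions in the full sitemap list, not positions among the URLs still pending.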
  • Step 3: Add main() function
// ─── Main ─────────────────────────────────────────────────────────────────────

async function main() {
  const args = parseCLIArgs();
  validateArgs(args);

  if (args.jobId) {
    try {
      await prisma.backgroundJob.update({
        where: { id: args.jobId },
        data: { status: 'running', startedAt: new Date() },
      });
    } catch { /* job may not exist yet */ }
  }

  const domainsToRun: [string, SiteConfig][] = args.all
    ? Object.entries(NETWORK_SITES)
    : [[args.domain!, NETWORK_SITES[args.domain!]]];

  const totalStats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };

  try {
    for (let d = 0; d < domainsToRun.length; d++) {
      const [domain, config] = domainsToRun[d];
      console.log(`\n${'─'.repeat(60)}`);
      console.log(`Domain ${d + 1}/${domainsToRun.length}: ${domain} (${config.country})`);
      console.log('─'.repeat(60));
      const stats = await runDomain(domain, config, args);
      totalStats.total              += stats.total;
      totalStats.created            += stats.created;
      totalStats.updated            += stats.updated;
      totalStats.skipped            += stats.skipped;
      totalStats.errors             += stats.errors;
      totalStats.massSchedulesCreated += stats.massSchedulesCreated;
      if (d < domainsToRun.length - 1) await sleep(DOMAIN_DELAY_MS);
    }
  } finally {
    console.log('\n─── Import Complete ───────────────────────────────────────');
    console.log(`Total processed:   ${totalStats.total}`);
    console.log(`Created:           ${totalStats.created}`);
    console.log(`Updated:           ${totalStats.updated}`);
    console.log(`Skipped:           ${totalStats.skipped}`);
    console.log(`Errors:            ${totalStats.errors}`);
    console.log(`Mass schedules:    ${totalStats.massSchedulesCreated}`);

    if (args.jobId) {
      const status = totalStats.errors > totalStats.total * 0.1 ? 'failed' : 'completed';
      try {
        await prisma.backgroundJob.update({
          where: { id: args.jobId },
          data: {
            status,
            completedAt: new Date(),
            processed: totalStats.total,
            succeeded: totalStats.created + totalStats.updated,
            failed: totalStats.errors,
            itemsFound: totalStats.massSchedulesCreated,
          },
        });
      } catch { /* ignore */ }
    }

    await prisma.$disconnect();
    await pool.end();
  }
}

main().catch(err => {
  console.error('Fatal error:', err);
  process.exit(1);
});
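
The per-domain accumulation and the 10% failure threshold in main() are both small pure computations, so they can be factored out and checked on their own. The sketch below assumes hypothetical helper names (`addStats`, `jobStatus`) and a local `Stats` type that mirrors ImportStats; it restates the logic above rather than changing it.

```typescript
// Local mirror of ImportStats so the sketch is self-contained.
interface Stats {
  total: number;
  created: number;
  updated: number;
  skipped: number;
  errors: number;
  massSchedulesCreated: number;
}

// Field-by-field sum, equivalent to the six += lines in main().
function addStats(a: Stats, b: Stats): Stats {
  return {
    total: a.total + b.total,
    created: a.created + b.created,
    updated: a.updated + b.updated,
    skipped: a.skipped + b.skipped,
    errors: a.errors + b.errors,
    massSchedulesCreated: a.massSchedulesCreated + b.massSchedulesCreated,
  };
}

// A job fails only when errors strictly exceed the allowed share of the total.
function jobStatus(stats: Stats, maxErrorRate = 0.1): 'failed' | 'completed' {
  return stats.errors > stats.total * maxErrorRate ? 'failed' : 'completed';
}
```

With the strict comparison, exactly 10 errors out of 100 still counts as completed, and a run that processed nothing is also reported as completed.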
  • Step 4: Final compile check
npx tsc --noEmit

Expected: 0 errors.

  • Step 5: Commit
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — CLI + main loop"

Chunk 4: Integration + smoke test

Task 7: package.json and scheduler integration

Files:

  • Modify: package.json

  • Modify: scripts/scheduler.ts

  • Step 1: Add npm script to package.json

In the "scripts" section, add after "import:gcatholic":

"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts",
  • Step 2: Add 5 case blocks to getJobCommand in scheduler.ts

In scripts/scheduler.ts, find the case 'discovermass-import': block (around line 240). After it, before the default: case, add:

case 'buscarmisas-network-BR': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmissa.com.br'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-MX': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.com.mx'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-AR': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmisa.com.ar'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-CO': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.co'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-CL': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariomisa.cl'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
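
The five case blocks differ only in the domain, so a lookup table is one way to express the same mapping with less repetition. This is an alternative sketch, not what the plan prescribes: `JOB_DOMAINS`, `JobCommand`, and `buildNetworkJobCommand` are hypothetical names, and how such a helper would be wired into getJobCommand depends on code in scheduler.ts that is not shown here.

```typescript
// Job type → domain, taken from the five case blocks above.
const JOB_DOMAINS: Record<string, string> = {
  'buscarmisas-network-BR': 'horariosmissa.com.br',
  'buscarmisas-network-MX': 'buscarmisas.com.mx',
  'buscarmisas-network-AR': 'horariosmisa.com.ar',
  'buscarmisas-network-CO': 'buscarmisas.co',
  'buscarmisas-network-CL': 'horariomisa.cl',
};

interface JobCommand {
  command: string;
  args: string[];
}

// Returns the npx invocation for a network job type, or null for other job types.
function buildNetworkJobCommand(
  jobType: string,
  config?: { resumeFrom?: number },
): JobCommand | null {
  if (!Object.prototype.hasOwnProperty.call(JOB_DOMAINS, jobType)) return null;
  const domain = JOB_DOMAINS[jobType] as string;
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', domain];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

console.log(buildNetworkJobCommand('buscarmisas-network-CL', { resumeFrom: 200 }));
```

The explicit case blocks have the advantage of matching the existing style of getJobCommand, so the table form is mainly worth considering if more sites join the network.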
  • Step 3: Add 5 pipeline phases to PIPELINE_GROUPS[0].phases in scheduler.ts

In scripts/scheduler.ts, find the PIPELINE_GROUPS array. Inside the first group (name: 'imports'), add after the discovermass-import phase:

{ name: 'buscarmisas-network-BR', type: 'buscarmisas-network-BR', config: {} },
{ name: 'buscarmisas-network-MX', type: 'buscarmisas-network-MX', config: {} },
{ name: 'buscarmisas-network-AR', type: 'buscarmisas-network-AR', config: {} },
{ name: 'buscarmisas-network-CO', type: 'buscarmisas-network-CO', config: {} },
{ name: 'buscarmisas-network-CL', type: 'buscarmisas-network-CL', config: {} },
  • Step 4: TypeScript compile check
npx tsc --noEmit

Expected: 0 errors.

  • Step 5: Commit
git add package.json scripts/scheduler.ts
git commit -m "feat: add buscarmisas-network to package.json and scheduler pipeline"

Task 8: Smoke test against live sites

  • Step 1: Dry-run Brazil (verifies parsing + sitemap, no DB writes)
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run

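Expected: prints church names with mass counts, no DB errors, >4,000 URLs discovered. Without --limit the dry run still fetches every church page; at the 2 s request delay that is at least 2.6 hours for ~4,732 pages, so append --limit 20 for a quick check.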

  • Step 2: Live run — 3 churches from Brazil
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --limit 3

Expected: 3 churches created in DB, mass schedules created, no errors.

  • Step 3: Verify in DB
npx tsx -e "
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) });
// Wrapped in an async function because tsx -e may not support top-level await
async function verify() {
  const churches = await prisma.church.findMany({
    where: { source: 'buscarmisas-network' },
    select: { name: true, country: true, buscarmisasNetworkId: true, latitude: true, longitude: true },
    take: 5,
  });
  console.table(churches);
  const massCount = await prisma.massSchedule.count({
    where: { church: { source: 'buscarmisas-network' } },
  });
  console.log('Mass schedules created:', massCount);
  await prisma.\$disconnect();
  await pool.end();
}
verify().catch(console.error);
"

Expected: 3 rows with source buscarmisas-network, valid lat/lng, buscarmisasNetworkId populated.

  • Step 4: Test idempotency (re-run should skip already-imported)

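Re-run the same limited test. Expected: the summary line reports 3 already imported, and none of the three churches from Step 2 is fetched again (they are filtered out via the importedIds Set). Because --limit is applied after that filter, the re-run processes the next 3 URLs rather than 0.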

  • Step 5: Dry-run Chile (verifies post-sitemap path)
npx tsx scripts/import-buscarmisas-network.ts --domain horariomisa.cl --dry-run

Expected: ~935 URLs discovered, Spanish day names parsed correctly.

  • Step 6: Final commit
git add scripts/import-buscarmisas-network.ts package.json scripts/scheduler.ts
git commit -m "feat: complete buscarmisas-network importer — Brazil, Mexico, Argentina, Colombia, Chile"