# BuscarMisas Network Importer — Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a single config-driven importer that scrapes ~15,294 Catholic churches and mass schedules from 5 Latin American WordPress-based directories (Brazil, Mexico, Argentina, Colombia, Chile).
**Architecture:** A `NETWORK_SITES` config map drives a single `import-buscarmisas-network.ts` script. Church HTML parsing extracts name, address, phone, coordinates, and weekly schedule. The external ID `{domain-slug}/{church-slug}` stored in a new `buscarmisasNetworkId` column prevents duplicate inserts on re-runs.
**Tech Stack:** TypeScript, tsx, Prisma 7 + pg adapter, existing `church-matcher.ts` + `day-names.ts` utilities.
---
## Chunk 1: Schema prerequisite + church-matcher update
### Task 1: Add `buscarmisasNetworkId` to BethelGuide schema
> ⚠️ BethelGuide is the schema source of truth. Never run `prisma migrate` in ScraperControl.
**Files:**
- Modify (in BethelGuide repo): `prisma/schema.prisma`
- Modify (in BethelGuide repo): migration SQL file
- [ ] **Step 1: In BethelGuide, open `prisma/schema.prisma` and add the column to the `Church` model**
Add after the existing `discovermassId` line:
```prisma
buscarmisasNetworkId String? @unique @map("buscarmisas_network_id")
```
And add to the `@@index` block at the bottom of the `Church` model:
```prisma
@@index([buscarmisasNetworkId])
```
- [ ] **Step 2: In BethelGuide, create and run the migration**
```bash
npx prisma migrate dev --name add_buscarmisas_network_id
```
Expected: migration file created, column added to the shared PostgreSQL database.
- [ ] **Step 3: Sync the updated schema to ScraperControl**
```bash
cp prisma/schema.prisma ~/Documents/ScraperControl/prisma/schema.prisma
```
- [ ] **Step 4: Regenerate Prisma client in ScraperControl**
```bash
cd ~/Documents/ScraperControl
npx prisma generate
```
Expected: no errors, `@prisma/client` regenerated with `buscarmisasNetworkId` field.
- [ ] **Step 5: Verify the field is available**
```bash
npx tsx -e "
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) });
prisma.church.findFirst({ select: { buscarmisasNetworkId: true } }).then(r => {
console.log('buscarmisasNetworkId field present:', JSON.stringify(r));
return prisma.\$disconnect().then(() => pool.end());
});
"
```
Expected: prints `buscarmisasNetworkId field present: null` (no churches in the table) or `{"buscarmisasNetworkId":null}` (or an existing ID value), not a type error.
- [ ] **Step 6: Commit in ScraperControl**
```bash
git add prisma/schema.prisma
git commit -m "chore: sync schema — add buscarmisasNetworkId column"
```
---
### Task 2: Update `church-matcher.ts` with new field + ID-match pass
**Files:**
- Modify: `src/lib/church-matcher.ts`
- [ ] **Step 1: Add `buscarmisasNetworkId` to `ExistingChurch` interface**
In `src/lib/church-matcher.ts`, find the `ExistingChurch` interface (line ~11). The interface currently ends with `gottesdienstzeitenId: string | null;` followed by `source: string;`. Insert the two new fields immediately before the `source:` line:
```ts
discovermassId: string | null;
buscarmisasNetworkId: string | null;
source: string; // ← already exists, shown for placement only
```
Note: `discovermassId` was missing from the interface (pre-existing gap) — adding it here ensures the `loadExistingChurches` select in Task 5 compiles correctly.
- [ ] **Step 2: Add `buscarmisasNetworkId` to `ChurchCandidate` type**
Find the `ChurchCandidate` type (line ~122). After the existing `horariosMisasId?: string;` and all other existing optional ID fields, add:
```ts
discovermassId?: string;
buscarmisasNetworkId?: string;
```
- [ ] **Step 3: Add ID-match passes in `findDuplicateChurch`**
The existing passes run 1-13 (osmId through gottesdienstzeitenId), with pass 14 being proximity+name at line ~259. Find the **Thirteenth pass** block (gottesdienstzeitenId, line ~251):
```ts
// Thirteenth pass: exact gottesdienstzeitenId match
if (candidate.gottesdienstzeitenId) {
...
}
```
Insert two new passes **after** it and **before** the proximity pass comment (`// Fourteenth pass: proximity + name match`):
```ts
// Fourteenth pass: exact discovermassId match
if (candidate.discovermassId) {
const match = existingChurches.find(
(church) => church.discovermassId === candidate.discovermassId
);
if (match) return match;
}
// Fifteenth pass: exact buscarmisasNetworkId match
if (candidate.buscarmisasNetworkId) {
const match = existingChurches.find(
(church) => church.buscarmisasNetworkId === candidate.buscarmisasNetworkId
);
if (match) return match;
}
```
Then update the existing proximity pass comment from `// Fourteenth pass:` to `// Sixteenth pass:`.
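The two new passes follow the same shape as the earlier ID passes: if the candidate carries an ID, look for an existing church with the same value and return it. A minimal self-contained sketch of that pattern, assuming simplified stand-in types (`MiniChurch`, `MiniCandidate`) and a hypothetical helper `findByExternalId` that are not part of `church-matcher.ts`:

```ts
// Simplified stand-ins for ExistingChurch / ChurchCandidate (hypothetical, illustration only).
interface MiniChurch {
  id: string;
  discovermassId: string | null;
  buscarmisasNetworkId: string | null;
}
interface MiniCandidate {
  discovermassId?: string;
  buscarmisasNetworkId?: string;
}

// Order matters: earlier fields win, mirroring the numbered passes.
const ID_FIELDS = ['discovermassId', 'buscarmisasNetworkId'] as const;

function findByExternalId(candidate: MiniCandidate, existing: MiniChurch[]): MiniChurch | null {
  for (const field of ID_FIELDS) {
    const id = candidate[field];
    if (!id) continue; // candidate has no ID for this source, try the next pass
    const match = existing.find((church) => church[field] === id);
    if (match) return match;
  }
  return null; // fall through to the proximity + name pass
}

const existingSample: MiniChurch[] = [
  { id: 'a', discovermassId: null, buscarmisasNetworkId: 'buscarmisas-co/parroquia-san-pedro' },
  { id: 'b', discovermassId: 'dm-42', buscarmisasNetworkId: null },
];
const hit = findByExternalId({ buscarmisasNetworkId: 'buscarmisas-co/parroquia-san-pedro' }, existingSample);
const miss = findByExternalId({ buscarmisasNetworkId: 'buscarmisas-co/unknown' }, existingSample);
console.log(hit?.id, miss); // a null
```

The plan keeps the explicit per-field blocks so the new passes match the style of the existing thirteen; the loop form is only a compact way to see the logic.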
- [ ] **Step 4: Verify TypeScript compiles**
```bash
npx tsc --noEmit
```
Expected: 0 errors.
- [ ] **Step 5: Commit**
```bash
git add src/lib/church-matcher.ts
git commit -m "feat: add buscarmisasNetworkId (and discovermassId) to church-matcher interfaces and ID-match passes"
```
---
## Chunk 2: Parsing functions
### Task 3: Write pure parsing functions with unit tests
**Files:**
- Create: `scripts/import-buscarmisas-network.ts` (scaffold with parsing functions only)
We write the parsing functions first as pure functions, then test them with real HTML snippets before wiring them to the HTTP layer.
- [ ] **Step 1: Create `scripts/import-buscarmisas-network.ts` with the file header and types**
```ts
#!/usr/bin/env tsx
/**
* Import Catholic churches and mass schedules from the BuscarMisas network.
*
* A group of 5 identical WordPress-based directories covering Latin America:
* - horariosmissa.com.br (Brazil, ~4,732 churches)
* - buscarmisas.com.mx (Mexico, ~3,950 churches)
* - horariosmisa.com.ar (Argentina, ~3,012 churches)
* - buscarmisas.co (Colombia, ~2,665 churches)
* - horariomisa.cl (Chile, ~935 churches)
*
* Usage:
* npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br
* npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 500
* npx tsx scripts/import-buscarmisas-network.ts --all
* npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
import { getDayNamesForCountry, buildDayPatterns } from '../src/scrapers/i18n/day-names';
// ─── Site Config ─────────────────────────────────────────────────────────────
interface SiteConfig {
country: string; // ISO 3166-1 alpha-2
language: 'pt' | 'es';
sitemapType: 'page' | 'post';
}
const NETWORK_SITES: Record<string, SiteConfig> = {
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
'buscarmisas.com.mx': { country: 'MX', language: 'es', sitemapType: 'page' },
'horariosmisa.com.ar': { country: 'AR', language: 'es', sitemapType: 'page' },
'buscarmisas.co': { country: 'CO', language: 'es', sitemapType: 'page' },
'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
};
// ─── Types ────────────────────────────────────────────────────────────────────
interface ParsedChurch {
name: string;
address: string | null;
city: string | null;
state: string | null;
phone: string | null;
lat: number;
lng: number;
externalId: string;
country: string;
}
interface ParsedMass {
dayOfWeek: number; // 0 = Sunday, 6 = Saturday
time: string; // HH:MM 24-hour
}
interface CLIArgs {
domain: string | null;
all: boolean;
dryRun: boolean;
resumeFrom: number;
limit: number | null;
jobId: string | null;
}
interface ImportStats {
total: number;
created: number;
updated: number;
skipped: number;
errors: number;
massSchedulesCreated: number;
}
```
- [ ] **Step 2: Add `buildExternalId` helper**
```ts
// ─── Helpers ─────────────────────────────────────────────────────────────────
/**
* Build external ID for a church URL.
* Format: "{domain-slug}/{church-slug}"
* e.g. "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios"
*/
export function buildExternalId(domain: string, churchUrl: string): string {
const domainSlug = domain.replace(/\./g, '-');
// URL path: /{region}/{city}/{church-slug}/
const segments = churchUrl.replace(/\/$/, '').split('/').filter(Boolean);
const churchSlug = segments[segments.length - 1] || '';
return `${domainSlug}/${churchSlug}`;
}
```
- [ ] **Step 3: Verify `buildExternalId` manually**
```bash
npx tsx -e "
import { buildExternalId } from './scripts/import-buscarmisas-network';
console.log(buildExternalId('horariosmissa.com.br', 'https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/'));
// Expected: horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios
console.log(buildExternalId('buscarmisas.co', 'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/'));
// Expected: buscarmisas-co/parroquia-san-pedro
"
```
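Two edge cases are worth checking offline as well. The sketch below copies `buildExternalId` from Step 2 and runs it on a URL without a trailing slash and on a hypothetical URL that reuses the same church slug under a different city. The second URL is an assumption for illustration; whether the live sites contain such pairs has not been verified here.

```ts
// Copied from Step 2 so the snippet runs on its own.
function buildExternalId(domain: string, churchUrl: string): string {
  const domainSlug = domain.replace(/\./g, '-');
  const segments = churchUrl.replace(/\/$/, '').split('/').filter(Boolean);
  const churchSlug = segments[segments.length - 1] || '';
  return `${domainSlug}/${churchSlug}`;
}

const withSlash = buildExternalId('buscarmisas.co', 'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/');
const withoutSlash = buildExternalId('buscarmisas.co', 'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro');
// Hypothetical: same slug, different region/city.
const otherCity = buildExternalId('buscarmisas.co', 'https://buscarmisas.co/antioquia/medellin/parroquia-san-pedro/');

console.log(withSlash);                  // buscarmisas-co/parroquia-san-pedro
console.log(withSlash === withoutSlash); // true
console.log(withSlash === otherCity);    // true: only the last path segment is used
```

Because the column is `@unique`, two churches that share a slug on one domain would map to the same ID. If that turns out to happen in practice, including the region and city segments in the ID would be one way to avoid the collision.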
- [ ] **Step 4: Add `parseChurchPage` function**
```ts
/**
* Parse church data from a church page HTML string.
* Returns null if name or coordinates cannot be extracted.
*/
export function parseChurchPage(
html: string,
domain: string,
churchUrl: string,
config: SiteConfig,
): ParsedChurch | null {
// Name: cell after <strong>Nome</strong> (PT) or <strong>Nombre</strong> (ES)
const nameLabel = config.language === 'pt' ? 'Nome' : 'Nombre';
const nameMatch = html.match(
new RegExp(`<strong>${nameLabel}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i')
);
const name = nameMatch?.[1]?.trim() ?? '';
if (!name) return null;
// Coordinates: Google Maps iframe center= parameter
const coordMatch = html.match(/center=([-\d.]+)%2C([-\d.]+)/i);
if (!coordMatch) return null;
const lat = parseFloat(coordMatch[1]);
const lng = parseFloat(coordMatch[2]);
if (!isFinite(lat) || !isFinite(lng) || Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
// Address: cell after <strong>Endereço</strong> (PT) or <strong>Dirección</strong> (ES)
const addrLabel = config.language === 'pt' ? 'Endere[çc]o' : 'Direcci[oó]n';
const addrMatch = html.match(
new RegExp(`<strong>${addrLabel}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i')
);
const address = addrMatch?.[1]?.trim() ?? null;
// Phone: tel: href
const phoneMatch = html.match(/href="tel:([^"]+)"/i);
const phone = phoneMatch?.[1]?.trim() ?? null;
// City and state from the URL path: https://{domain}/{state}/{city}/{slug}/
const urlPath = new URL(churchUrl).pathname.split('/').filter(Boolean);
const state = urlPath[0] ? decodeURIComponent(urlPath[0].replace(/-/g, ' ')) : null;
const city = urlPath[1] ? decodeURIComponent(urlPath[1].replace(/-/g, ' ')) : null;
return {
name,
address,
city,
state,
phone,
lat,
lng,
externalId: buildExternalId(domain, churchUrl),
country: config.country,
};
}
```
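Before hitting the live sites, the extraction regexes can be exercised against an inline fixture. The sketch below is a minimal example under assumptions: the HTML fragment is hypothetical and merely shaped to fit the patterns used in `parseChurchPage`, the iframe URL is a placeholder, and `cellAfterLabel` is an illustrative helper name.

```ts
// Hypothetical fragment shaped to match the regexes in parseChurchPage; real pages may differ.
const fixture = `
<table>
  <tr><td><strong>Nome</strong></td> <td>Paróquia Exemplo</td></tr>
  <tr><td><strong>Endereço</strong></td> <td>Rua Exemplo, 100</td></tr>
</table>
<a href="tel:+551100000000">Ligar</a>
<iframe src="https://maps.example/embed?center=-23.5505%2C-46.6333&zoom=15"></iframe>
`;

// Returns the text of the <td> that follows a <strong>label</strong> cell, or null.
function cellAfterLabel(html: string, labelPattern: string): string | null {
  const m = html.match(
    new RegExp(`<strong>${labelPattern}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i'),
  );
  return m?.[1]?.trim() ?? null;
}

const churchName = cellAfterLabel(fixture, 'Nome');
const churchAddress = cellAfterLabel(fixture, 'Endere[çc]o');
const coord = fixture.match(/center=([-\d.]+)%2C([-\d.]+)/i);
const fixtureLat = coord ? parseFloat(coord[1]) : NaN;
const fixtureLng = coord ? parseFloat(coord[2]) : NaN;
const fixturePhone = fixture.match(/href="tel:([^"]+)"/i)?.[1] ?? null;

console.log({ churchName, churchAddress, fixtureLat, fixtureLng, fixturePhone });
```

Note that the value pattern `([^<]+)` only matches plain-text cells. A cell that wraps its value in a link or another tag would return null, so the live tests in the following steps remain necessary.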
- [ ] **Step 5: Add `parseMassSchedule` function**
```ts
/**
* Parse the weekly mass schedule table from church page HTML.
* Table format: day-name cell | time cell (comma-separated times, "-" = no mass)
*/
export function parseMassSchedule(html: string, countryCode: string): ParsedMass[] {
const dayPatterns = buildDayPatterns(getDayNamesForCountry(countryCode));
const results: ParsedMass[] = [];
// Extract all <td> cells as pairs [day, time]
const cells = [...html.matchAll(/<td[^>]*>(.*?)<\/td>/gis)].map(m =>
m[1].replace(/<[^>]+>/g, '').trim()
);
for (let i = 0; i + 1 < cells.length; i += 2) {
const dayCell = cells[i].toLowerCase();
const timeCell = cells[i + 1];
const dayOfWeek = dayPatterns[dayCell];
if (dayOfWeek === undefined) continue;
if (timeCell === '-' || !timeCell) continue;
// Split comma-separated times: "10:00, 18:00" → ["10:00", "18:00"]
for (const rawTime of timeCell.split(',')) {
const time = rawTime.trim();
if (/^\d{1,2}:\d{2}$/.test(time)) {
// Zero-pad single-digit hours so "7:00" is stored as "07:00" (HH:MM)
results.push({ dayOfWeek, time: time.padStart(5, '0') });
}
}
}
return results;
}
```
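`parseMassSchedule` pairs every two `<td>` cells across the whole page, so an odd number of cells before the schedule table would shift the day/time alignment. One possible variation, sketched here as an alternative rather than the plan's implementation, pairs cells per `<tr>` instead. The day map `PT_DAYS` is a hypothetical stand-in for whatever `buildDayPatterns` returns, and the sample table is invented for illustration.

```ts
// Hypothetical stand-in for buildDayPatterns(getDayNamesForCountry('BR')).
const PT_DAYS: Record<string, number> = {
  domingo: 0, 'segunda-feira': 1, 'terça-feira': 2, 'quarta-feira': 3,
  'quinta-feira': 4, 'sexta-feira': 5, 'sábado': 6,
};

interface Mass { dayOfWeek: number; time: string }

function parseRows(html: string, days: Record<string, number>): Mass[] {
  const out: Mass[] = [];
  // Pair cells within each <tr> so unrelated cells cannot shift the alignment.
  for (const row of html.matchAll(/<tr[^>]*>(.*?)<\/tr>/gis)) {
    const cells = [...row[1].matchAll(/<td[^>]*>(.*?)<\/td>/gis)].map((m) =>
      m[1].replace(/<[^>]+>/g, '').trim(),
    );
    if (cells.length !== 2) continue;
    const dayOfWeek = days[cells[0].toLowerCase()];
    if (dayOfWeek === undefined || cells[1] === '-') continue;
    for (const raw of cells[1].split(',')) {
      const t = raw.trim();
      // Accept H:MM or HH:MM and normalize to HH:MM.
      if (/^\d{1,2}:\d{2}$/.test(t)) out.push({ dayOfWeek, time: t.padStart(5, '0') });
    }
  }
  return out;
}

const sampleTable = `
<table>
  <tr><td><strong>Nome</strong></td><td>Paróquia Exemplo</td></tr>
  <tr><td>Domingo</td><td>8:00, 10:00, 18:00</td></tr>
  <tr><td>Segunda-feira</td><td>-</td></tr>
  <tr><td>Terça-feira</td><td>17:00</td></tr>
</table>`;
const sampleMasses = parseRows(sampleTable, PT_DAYS);
console.log(sampleMasses);
```

For the sample above this yields four entries: three on Sunday (`08:00`, `10:00`, `18:00`) and one on Tuesday (`17:00`). The name row and the `-` row are skipped.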
- [ ] **Step 6: Test `parseChurchPage` and `parseMassSchedule` with real HTML**
```bash
npx tsx -e "
import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network';
const NETWORK_SITES = {
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
};
async function test() {
const res = await fetch('https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/');
const html = await res.text();
const config = NETWORK_SITES['horariosmissa.com.br'];
const parsed = parseChurchPage(html, 'horariosmissa.com.br', res.url, config);
console.log('Church:', JSON.stringify(parsed, null, 2));
const masses = parseMassSchedule(html, config.country);
console.log('Masses:', JSON.stringify(masses, null, 2));
}
test().catch(console.error);
"
```
Expected output (exact values are illustrative — website content may change):
```
Church: {
"name": "Paróquia Nossa Senhora dos Remédios", // or current name
"address": "R. Ten. Azevedo, 182 ...",
"city": "sao paulo",
"state": "sao paulo",
"phone": "+55 11 ...",
"lat": -23.56...,
"lng": -46.62...,
"externalId": "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios",
"country": "BR"
}
Masses: [ { "dayOfWeek": 2, "time": "17:00" }, ... ]
```
Verify: the `Church` output is non-null, `lat`/`lng` are non-zero finite numbers, `externalId` matches the `horariosmissa-com-br/{slug}` pattern, and the `Masses` array is non-empty with dayOfWeek 0-6 and HH:MM times.
- [ ] **Step 7: Test with a Spanish-language site (Mexico)**
```bash
npx tsx -e "
import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network';
const config = { country: 'MX', language: 'es', sitemapType: 'page' };
const domain = 'buscarmisas.com.mx';
const url = 'https://buscarmisas.com.mx/nuevo-leon/monterrey/parroquia-anunciacion-a-maria/';
fetch(url).then(r => r.text()).then(html => {
console.log('Church:', JSON.stringify(parseChurchPage(html, domain, url, config), null, 2));
console.log('Masses:', JSON.stringify(parseMassSchedule(html, config.country), null, 2));
}).catch(console.error);
"
```
Expected: name, coordinates, and Spanish-language schedule rows parsed correctly.
- [ ] **Step 8: Commit parsing scaffold**
```bash
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — parsing functions"
```
---
### Task 4: Sitemap discovery function
**Files:**
- Modify: `scripts/import-buscarmisas-network.ts`
- [ ] **Step 1: Add HTTP helpers**
```ts
// ─── HTTP Helpers ─────────────────────────────────────────────────────────────
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 2_000;
const DOMAIN_DELAY_MS = 5_000;
async function fetchText(url: string): Promise<string> {
const res = await fetch(url, { headers: { 'User-Agent': USER_AGENT } });
if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
return res.text();
}
async function fetchWithRetry(url: string, retries = 3): Promise<string> {
for (let attempt = 1; attempt <= retries; attempt++) {
try {
return await fetchText(url);
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
if (attempt === retries) throw err;
const isRetryable = msg.includes('429') || msg.includes('503');
if (!isRetryable) throw err;
const backoff = attempt * 30_000; // 30s after attempt 1, 60s after attempt 2; the final attempt rethrows
console.warn(` [retry ${attempt}/${retries}] ${msg} — waiting ${backoff / 1000}s`);
await sleep(backoff);
}
}
throw new Error('unreachable');
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
```
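Two properties of the retry helper are easy to check in isolation: how long it waits in total, and which failures it retries. The sketch below uses hypothetical names (`backoffSchedule`, `HttpError`, `isRetryable`) and shows one possible alternative to matching on the error message, which also contains the URL and can therefore match a `429` that is part of a path.

```ts
// Waits (ms) performed by a linear backoff: the last attempt rethrows instead of sleeping.
function backoffSchedule(retries: number, stepMs: number): number[] {
  const waits: number[] = [];
  for (let attempt = 1; attempt < retries; attempt++) waits.push(attempt * stepMs);
  return waits;
}

// Hypothetical error type that carries the status code separately from the message.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number, url: string) {
    super(`HTTP ${status} for ${url}`);
    this.status = status;
  }
}

function isRetryable(err: unknown): boolean {
  return err instanceof HttpError && (err.status === 429 || err.status === 503);
}

const retryWaits = backoffSchedule(3, 30_000);
console.log(retryWaits); // [ 30000, 60000 ]

// A 404 whose URL happens to contain "429" (hypothetical URL).
const notFound = new HttpError(404, 'https://example.com/parroquia-429/');
console.log(notFound.message.includes('429')); // true: message matching would retry this
console.log(isRetryable(notFound));            // false
console.log(isRetryable(new HttpError(429, 'https://example.com/'))); // true
```

With three attempts a persistently throttled URL therefore costs about 90 seconds of waiting before the error is counted.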
- [ ] **Step 2: Add `getChurchUrls` function**
```ts
// ─── Sitemap Discovery ────────────────────────────────────────────────────────
/**
* Fetch all church page URLs for a domain from its sitemap.
* Church URLs have exactly 3 path segments: /{region}/{city}/{slug}/
*/
export async function getChurchUrls(domain: string, config: SiteConfig): Promise<string[]> {
const indexUrl = `https://${domain}/sitemap_index.xml`;
console.log(`Fetching sitemap index: ${indexUrl}`);
const indexXml = await fetchWithRetry(indexUrl);
// Extract child sitemap URLs matching the sitemapType
const childPattern = config.sitemapType === 'page'
? /https:\/\/[^<]*\/page-sitemap\d*\.xml/g
: /https:\/\/[^<]*\/post-sitemap\.xml/g;
const childUrls = [...indexXml.matchAll(childPattern)].map(m => m[0]);
console.log(` Found ${childUrls.length} child sitemaps`);
const churchUrls: string[] = [];
for (const sitemapUrl of childUrls) {
const xml = await fetchWithRetry(sitemapUrl);
const locs = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map(m => m[1].trim());
for (const loc of locs) {
// Church URLs: exactly 3 non-empty path segments after the domain
try {
const segments = new URL(loc).pathname.split('/').filter(Boolean);
if (segments.length === 3) {
churchUrls.push(loc);
}
} catch { /* skip malformed URLs */ }
}
}
// Deduplicate
const unique = [...new Set(churchUrls)];
console.log(` Total church URLs: ${unique.length}`);
return unique;
}
```
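The three-segment filter can be checked without network access. A minimal sketch, assuming a hypothetical pure helper `extractChurchUrls` that applies the same `<loc>` regex and segment rule as `getChurchUrls`, run against an invented sitemap fragment:

```ts
// Same <loc> extraction and 3-segment rule as getChurchUrls, without the HTTP layer.
function extractChurchUrls(xml: string): string[] {
  const urls = new Set<string>();
  for (const m of xml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    const loc = m[1].trim();
    try {
      const segments = new URL(loc).pathname.split('/').filter(Boolean);
      if (segments.length === 3) urls.add(loc); // /{region}/{city}/{slug}/
    } catch {
      // skip malformed URLs
    }
  }
  return [...urls];
}

// Invented sitemap fragment: home page, region page, city page, one church listed twice, one bad entry.
const sampleXml = `<urlset>
  <url><loc>https://horariosmissa.com.br/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/sao-paulo/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/</loc></url>
  <url><loc>https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/</loc></url>
  <url><loc>not a url</loc></url>
</urlset>`;

const sampleChurchUrls = extractChurchUrls(sampleXml);
console.log(sampleChurchUrls.length); // 1
```

Only the church URL survives: shallower listing pages are dropped by the segment count, the duplicate by the `Set`, and the malformed entry by the `try`/`catch`.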
- [ ] **Step 3: Verify sitemap discovery against known counts**
```bash
npx tsx -e "
import { getChurchUrls } from './scripts/import-buscarmisas-network';
const NETWORK_SITES = {
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
};
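// Wrapped in an async function: top-level await may not be available in `tsx -e`.
async function main() {
for (const [domain, config] of Object.entries(NETWORK_SITES)) {
const urls = await getChurchUrls(domain, config);
console.log(domain, '->', urls.length, 'churches');
console.log(' Sample:', urls[0]);
}
}
main().catch(console.error);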
"
```
Expected: Brazil ~4,700+ URLs, Chile ~930+ URLs.
- [ ] **Step 4: Commit**
```bash
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — sitemap discovery"
```
---
## Chunk 3: Main importer
### Task 5: DB helpers and church processing loop
**Files:**
- Modify: `scripts/import-buscarmisas-network.ts`
- [ ] **Step 1: Add DB connection and `loadExistingChurches`**
At the top of the file (after dotenv), add the DB setup:
```ts
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({ connectionString: dbUrl, ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
```
Then add `loadExistingChurches`:
```ts
// ─── DB Helpers ───────────────────────────────────────────────────────────────
async function loadExistingChurches(country: string): Promise<ExistingChurch[]> {
console.log(`Loading existing ${country} churches from DB...`);
const churches = await prisma.church.findMany({
where: { country },
select: {
id: true, name: true, latitude: true, longitude: true,
osmId: true, baiduId: true, masstimesId: true,
orarimesseId: true, massSchedulesPhId: true, philmassId: true,
horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
messesInfoId: true, bohosluzbyId: true, miserendId: true,
kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
buscarmisasNetworkId: true,
source: true, website: true, phone: true, address: true, country: true,
},
});
console.log(` Loaded ${churches.length} existing ${country} churches`);
return churches as ExistingChurch[];
}
```
- [ ] **Step 2: Add `processChurch` function**
```ts
// ─── Church Processing ────────────────────────────────────────────────────────
async function processChurch(
url: string,
domain: string,
config: SiteConfig,
existingChurches: ExistingChurch[],
args: CLIArgs,
stats: ImportStats,
): Promise<void> {
stats.total++;
try {
const html = await fetchWithRetry(url);
const parsed = parseChurchPage(html, domain, url, config);
if (!parsed) {
console.log(` [skip] No name/coords: ${url}`);
stats.skipped++;
return;
}
const masses = parseMassSchedule(html, config.country);
if (args.dryRun) {
console.log(` [dry-run] ${parsed.name}: ${masses.length} masses`);
return;
}
const candidate = {
name: parsed.name,
lat: parsed.lat,
lng: parsed.lng,
buscarmisasNetworkId: parsed.externalId,
};
const duplicate = findDuplicateChurch(candidate, existingChurches);
if (duplicate) {
const updateData: Record<string, unknown> = { buscarmisasNetworkId: parsed.externalId };
if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
if (parsed.lat !== 0 && duplicate.latitude === 0) {
updateData.latitude = parsed.lat;
updateData.longitude = parsed.lng;
}
await prisma.$transaction(async (tx) => {
await tx.church.update({ where: { id: duplicate.id }, data: updateData });
if (masses.length > 0) {
await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
await tx.massSchedule.createMany({
data: masses.map(m => ({ churchId: duplicate.id, dayOfWeek: m.dayOfWeek, time: m.time, language: config.language === 'pt' ? 'Portuguese' : 'Spanish', notes: null })),
});
}
await tx.church.update({ where: { id: duplicate.id }, data: { lastScrapedAt: new Date() } });
});
duplicate.buscarmisasNetworkId = parsed.externalId;
stats.updated++;
} else {
const church = await prisma.church.create({
data: {
name: parsed.name,
address: parsed.address,
city: parsed.city,
state: parsed.state,
country: parsed.country,
phone: parsed.phone,
latitude: parsed.lat,
longitude: parsed.lng,
buscarmisasNetworkId: parsed.externalId,
source: 'buscarmisas-network',
hasWebsite: false,
},
});
existingChurches.push({
id: church.id, name: parsed.name, latitude: parsed.lat, longitude: parsed.lng,
osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
bohosluzbyId: null, miserendId: null, kerknetId: null,
gottesdienstzeitenId: null, discovermassId: null,
buscarmisasNetworkId: parsed.externalId,
source: 'buscarmisas-network', website: null, phone: parsed.phone,
address: parsed.address, country: parsed.country,
});
if (masses.length > 0) {
await prisma.massSchedule.createMany({
data: masses.map(m => ({
churchId: church.id,
dayOfWeek: m.dayOfWeek,
time: m.time,
language: config.language === 'pt' ? 'Portuguese' : 'Spanish',
notes: null,
})),
});
await prisma.church.update({ where: { id: church.id }, data: { lastScrapedAt: new Date() } });
}
stats.created++;
}
stats.massSchedulesCreated += masses.length;
console.log(
` [${duplicate ? 'update' : 'create'}] ${parsed.name}: ${masses.length} masses | ` +
`${stats.total} total (${stats.created} created, ${stats.updated} updated, ${stats.errors} errors)`
);
} catch (err) {
stats.errors++;
console.error(` [error] ${url}: ${err instanceof Error ? err.message : err}`);
}
}
```
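`processChurch` builds the same `massSchedule` row shape in both the update and the create branch. One possible way to keep the two in sync is a small pure mapper. The sketch below is an illustration only: `toScheduleRows`, `ScheduleRow`, and `LANGUAGE_LABEL` are hypothetical names, and `churchId` is assumed to be a string.

```ts
interface ParsedMass { dayOfWeek: number; time: string }
type SiteLanguage = 'pt' | 'es';

// Assumed row shape, mirroring the fields passed to massSchedule.createMany in the plan.
interface ScheduleRow {
  churchId: string;
  dayOfWeek: number;
  time: string;
  language: string;
  notes: null;
}

const LANGUAGE_LABEL: Record<SiteLanguage, string> = { pt: 'Portuguese', es: 'Spanish' };

function toScheduleRows(churchId: string, masses: ParsedMass[], language: SiteLanguage): ScheduleRow[] {
  return masses.map((m) => ({
    churchId,
    dayOfWeek: m.dayOfWeek,
    time: m.time,
    language: LANGUAGE_LABEL[language],
    notes: null,
  }));
}

const scheduleRows = toScheduleRows(
  'church-1',
  [{ dayOfWeek: 0, time: '10:00' }, { dayOfWeek: 2, time: '17:00' }],
  'pt',
);
console.log(scheduleRows);
```

Both branches could then pass `toScheduleRows(id, masses, config.language)` as the `data` argument.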
- [ ] **Step 3: Compile-check the file so far**
```bash
npx tsc --noEmit
```
Expected: 0 errors.
- [ ] **Step 4: Commit**
```bash
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — DB helpers and church processing"
```
---
### Task 6: CLI parsing and `main()` function
**Files:**
- Modify: `scripts/import-buscarmisas-network.ts`
- [ ] **Step 1: Add `parseCLIArgs`**
```ts
// ─── CLI ──────────────────────────────────────────────────────────────────────
function parseCLIArgs(): CLIArgs {
const argv = process.argv.slice(2);
const result: CLIArgs = { domain: null, all: false, dryRun: false, resumeFrom: 0, limit: null, jobId: null };
for (let i = 0; i < argv.length; i++) {
switch (argv[i]) {
case '--domain': result.domain = argv[++i]; break;
case '--all': result.all = true; break;
case '--dry-run': result.dryRun = true; break;
case '--resume-from': result.resumeFrom = parseInt(argv[++i], 10); break;
case '--limit': result.limit = parseInt(argv[++i], 10); break;
case '--job-id': result.jobId = argv[++i]; break;
}
}
return result;
}
function validateArgs(args: CLIArgs): void {
if (!args.domain && !args.all) {
console.error('Usage:');
console.error(' npx tsx scripts/import-buscarmisas-network.ts --domain <domain>');
console.error(' npx tsx scripts/import-buscarmisas-network.ts --all');
console.error('\nValid domains:', Object.keys(NETWORK_SITES).join(', '));
process.exit(1);
}
if (args.domain && !NETWORK_SITES[args.domain]) {
console.error(`Unknown domain: ${args.domain}`);
console.error('Valid domains:', Object.keys(NETWORK_SITES).join(', '));
process.exit(1);
}
if (args.all && args.resumeFrom > 0) {
console.error('--resume-from cannot be used with --all. Use --domain to resume a specific site.');
process.exit(1);
}
}
```
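`parseInt` returns `NaN` for a missing or non-numeric value, and `Array.prototype.slice(NaN)` behaves like `slice(0)`, so a mistyped `--resume-from` would silently restart from the beginning. A minimal sketch of a stricter parser, using a hypothetical helper name `parseNonNegativeInt`:

```ts
// Throws on missing or non-numeric values instead of letting NaN through.
function parseNonNegativeInt(flag: string, raw: string | undefined): number {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new Error(`${flag} expects a non-negative integer, got: ${raw ?? '(missing)'}`);
  }
  return Number(raw);
}

console.log(parseNonNegativeInt('--resume-from', '500')); // 500

let rejected = false;
try {
  parseNonNegativeInt('--limit', 'abc');
} catch {
  rejected = true;
}
console.log(rejected); // true

// Why it matters: slice(NaN) keeps the whole array.
console.log([1, 2, 3].slice(Number.NaN).length); // 3
```

In `parseCLIArgs` the two numeric cases could call this helper with `argv[++i]` in place of `parseInt`.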
- [ ] **Step 2: Add `runDomain` function**
```ts
async function runDomain(domain: string, config: SiteConfig, args: CLIArgs): Promise<ImportStats> {
const stats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };
const allUrls = await getChurchUrls(domain, config);
const existingChurches = await loadExistingChurches(config.country);
// Build set of already-imported IDs for fast skip
const importedIds = new Set(
existingChurches.filter(c => c.buscarmisasNetworkId).map(c => c.buscarmisasNetworkId!)
);
let candidateUrls = allUrls.slice(args.resumeFrom).filter(url => {
const externalId = buildExternalId(domain, url);
return !importedIds.has(externalId);
});
if (args.limit !== null) candidateUrls = candidateUrls.slice(0, args.limit);
console.log(`\n${domain}: ${allUrls.length} total | ${importedIds.size} already imported | ${candidateUrls.length} to process\n`);
for (let i = 0; i < candidateUrls.length; i++) {
const url = candidateUrls[i];
console.log(`[${i + 1}/${candidateUrls.length}] ${url}`);
await processChurch(url, domain, config, existingChurches, args, stats);
if (i < candidateUrls.length - 1) await sleep(REQUEST_DELAY_MS);
}
return stats;
}
```
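The order of operations in `runDomain` (apply `--resume-from`, drop already-imported IDs, then apply `--limit`) determines what a repeated limited run does. A minimal sketch of that selection as a pure function, with the hypothetical name `selectCandidates` and plain string IDs standing in for URLs:

```ts
// Same order as runDomain: slice by resumeFrom, filter imported, then cap by limit.
function selectCandidates(
  allIds: string[],
  imported: Set<string>,
  resumeFrom: number,
  limit: number | null,
): string[] {
  const remaining = allIds.slice(resumeFrom).filter((id) => !imported.has(id));
  return limit === null ? remaining : remaining.slice(0, limit);
}

const sitemapIds = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];

const firstRun = selectCandidates(sitemapIds, new Set<string>(), 0, 3);
console.log(firstRun); // [ 'a', 'b', 'c' ]

// Second run with the first three now imported: the limit moves on to the next three.
const secondRun = selectCandidates(sitemapIds, new Set<string>(firstRun), 0, 3);
console.log(secondRun); // [ 'd', 'e', 'f' ]

// Without a limit, only the not-yet-imported IDs remain.
const fullRerun = selectCandidates(sitemapIds, new Set<string>(sitemapIds), 0, null);
console.log(fullRerun.length); // 0
```

So `--limit N` means "the next N not yet imported", and a run only reports nothing to process once every URL for the domain carries a `buscarmisasNetworkId`.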
- [ ] **Step 3: Add `main()` function**
```ts
// ─── Main ─────────────────────────────────────────────────────────────────────
async function main() {
const args = parseCLIArgs();
validateArgs(args);
if (args.jobId) {
try {
await prisma.backgroundJob.update({
where: { id: args.jobId },
data: { status: 'running', startedAt: new Date() },
});
} catch { /* job may not exist yet */ }
}
const domainsToRun: [string, SiteConfig][] = args.all
? Object.entries(NETWORK_SITES)
: [[args.domain!, NETWORK_SITES[args.domain!]]];
const totalStats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };
try {
for (let d = 0; d < domainsToRun.length; d++) {
const [domain, config] = domainsToRun[d];
console.log(`\n${'─'.repeat(60)}`);
console.log(`Domain ${d + 1}/${domainsToRun.length}: ${domain} (${config.country})`);
console.log('─'.repeat(60));
const stats = await runDomain(domain, config, args);
totalStats.total += stats.total;
totalStats.created += stats.created;
totalStats.updated += stats.updated;
totalStats.skipped += stats.skipped;
totalStats.errors += stats.errors;
totalStats.massSchedulesCreated += stats.massSchedulesCreated;
if (d < domainsToRun.length - 1) await sleep(DOMAIN_DELAY_MS);
}
} finally {
console.log('\n─── Import Complete ───────────────────────────────────────');
console.log(`Total processed: ${totalStats.total}`);
console.log(`Created: ${totalStats.created}`);
console.log(`Updated: ${totalStats.updated}`);
console.log(`Skipped: ${totalStats.skipped}`);
console.log(`Errors: ${totalStats.errors}`);
console.log(`Mass schedules: ${totalStats.massSchedulesCreated}`);
if (args.jobId) {
const status = totalStats.errors > totalStats.total * 0.1 ? 'failed' : 'completed';
try {
await prisma.backgroundJob.update({
where: { id: args.jobId },
data: {
status,
completedAt: new Date(),
processed: totalStats.total,
succeeded: totalStats.created + totalStats.updated,
failed: totalStats.errors,
itemsFound: totalStats.massSchedulesCreated,
},
});
} catch { /* ignore */ }
}
await prisma.$disconnect();
await pool.end();
}
}
main().catch(err => {
console.error('Fatal error:', err);
process.exit(1);
});
```
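The job status written in the `finally` block depends only on the error ratio. A minimal sketch of that rule in isolation, using the hypothetical name `finalJobStatus`:

```ts
type FinalStatus = 'failed' | 'completed';

// Same rule as main(): failed only when errors exceed 10% of processed churches.
function finalJobStatus(errors: number, total: number): FinalStatus {
  return errors > total * 0.1 ? 'failed' : 'completed';
}

console.log(finalJobStatus(10, 100)); // completed (exactly 10% is not above the threshold)
console.log(finalJobStatus(11, 100)); // failed
console.log(finalJobStatus(0, 0));    // completed (0 > 0 is false)
```

The last case is worth keeping in mind: if `runDomain` throws before any church is processed, for example while fetching the sitemap, the totals are still zero and the job would be recorded as `completed` even though the script exits with a fatal error. Tracking whether the `try` block finished normally would be one way to distinguish the two outcomes.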
- [ ] **Step 4: Final compile check**
```bash
npx tsc --noEmit
```
Expected: 0 errors.
- [ ] **Step 5: Commit**
```bash
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — CLI + main loop"
```
---
## Chunk 4: Integration + smoke test
### Task 7: package.json and scheduler integration
**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`
- [ ] **Step 1: Add npm script to `package.json`**
In the `"scripts"` section, add after `"import:gcatholic"`:
```json
"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts",
```
- [ ] **Step 2: Add 5 case blocks to `getJobCommand` in `scheduler.ts`**
In `scripts/scheduler.ts`, find the `case 'discovermass-import':` block (around line 240). After it, before the `default:` case, add:
```ts
case 'buscarmisas-network-BR': {
const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmissa.com.br'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
case 'buscarmisas-network-MX': {
const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.com.mx'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
case 'buscarmisas-network-AR': {
const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmisa.com.ar'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
case 'buscarmisas-network-CO': {
const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.co'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
case 'buscarmisas-network-CL': {
const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariomisa.cl'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
```
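The five case blocks differ only in the domain. If the repetition becomes a maintenance concern, a lookup table is one possible alternative. This is a sketch under assumptions: `BUSCARMISAS_JOB_DOMAINS`, `JobCommand`, and `buscarmisasJobCommand` are hypothetical names, and the real `getJobCommand` signature in `scheduler.ts` may differ.

```ts
// Job type -> domain, taken from the five case labels above.
const BUSCARMISAS_JOB_DOMAINS: Record<string, string> = {
  'buscarmisas-network-BR': 'horariosmissa.com.br',
  'buscarmisas-network-MX': 'buscarmisas.com.mx',
  'buscarmisas-network-AR': 'horariosmisa.com.ar',
  'buscarmisas-network-CO': 'buscarmisas.co',
  'buscarmisas-network-CL': 'horariomisa.cl',
};

interface JobCommand { command: string; args: string[] }

// Returns null for job types that do not belong to the BuscarMisas network.
function buscarmisasJobCommand(type: string, config?: { resumeFrom?: number }): JobCommand | null {
  const domain = BUSCARMISAS_JOB_DOMAINS[type];
  if (!domain) return null;
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', domain];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

const chileCommand = buscarmisasJobCommand('buscarmisas-network-CL', { resumeFrom: 500 });
console.log(chileCommand?.args.join(' '));
// tsx scripts/import-buscarmisas-network.ts --domain horariomisa.cl --resume-from 500
console.log(buscarmisasJobCommand('discovermass-import')); // null
```

The explicit case blocks in the plan match the existing style of `getJobCommand`, so the table form is optional.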
- [ ] **Step 3: Add 5 pipeline phases to `PIPELINE_GROUPS[0].phases` in `scheduler.ts`**
In `scripts/scheduler.ts`, find the `PIPELINE_GROUPS` array. Inside the first group (`name: 'imports'`), add after the `discovermass-import` phase:
```ts
{ name: 'buscarmisas-network-BR', type: 'buscarmisas-network-BR', config: {} },
{ name: 'buscarmisas-network-MX', type: 'buscarmisas-network-MX', config: {} },
{ name: 'buscarmisas-network-AR', type: 'buscarmisas-network-AR', config: {} },
{ name: 'buscarmisas-network-CO', type: 'buscarmisas-network-CO', config: {} },
{ name: 'buscarmisas-network-CL', type: 'buscarmisas-network-CL', config: {} },
```
- [ ] **Step 4: TypeScript compile check**
```bash
npx tsc --noEmit
```
Expected: 0 errors.
- [ ] **Step 5: Commit**
```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add buscarmisas-network to package.json and scheduler pipeline"
```
---
### Task 8: Smoke test against live sites
- [ ] **Step 1: Dry-run Brazil (verifies parsing + sitemap, no DB writes)**
```bash
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
```
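Expected: prints church names with mass counts, no DB errors, >4,000 URLs discovered. The dry run still fetches every church page with the 2-second `REQUEST_DELAY_MS` pause, so a full pass over Brazil takes well over two hours; add `--limit 20` for a quick check.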
- [ ] **Step 2: Live run — 3 churches from Brazil**
```bash
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --limit 3
```
Expected: 3 churches created in DB, mass schedules created, no errors.
- [ ] **Step 3: Verify in DB**
```bash
npx tsx -e "
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) });
// Wrapped in an async function: top-level await may not be available in `tsx -e`.
async function main() {
const churches = await prisma.church.findMany({
where: { source: 'buscarmisas-network' },
select: { name: true, country: true, buscarmisasNetworkId: true, latitude: true, longitude: true },
take: 5,
});
console.table(churches);
const massCount = await prisma.massSchedule.count({
where: { church: { source: 'buscarmisas-network' } },
});
console.log('Mass schedules created:', massCount);
await prisma.\$disconnect();
await pool.end();
}
main().catch(console.error);
"
```
Expected: 3 rows with source `buscarmisas-network`, valid lat/lng, `buscarmisasNetworkId` populated.
- [ ] **Step 4: Test idempotency (re-run should skip already-imported)**
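Re-run the same limited test with `--dry-run` added. Because `runDomain` removes already-imported IDs before applying `--limit`, the summary line should read `3 already imported | 3 to process` (assuming no earlier BuscarMisas imports for Brazil), and the three URLs listed should be the next ones in the sitemap, not the three created in Step 2. Running without `--dry-run` would import three more churches rather than print `0 to process`.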
- [ ] **Step 5: Dry-run Chile (verifies post-sitemap path)**
```bash
npx tsx scripts/import-buscarmisas-network.ts --domain horariomisa.cl --dry-run
```
Expected: ~935 URLs discovered, Spanish day names parsed correctly.
- [ ] **Step 6: Final commit**
```bash
git add scripts/import-buscarmisas-network.ts package.json scripts/scheduler.ts
git commit -m "feat: complete buscarmisas-network importer — Brazil, Mexico, Argentina, Colombia, Chile"
```