# BuscarMisas Network Importer — Implementation Plan
|
|||
|
|
|
|||
|
|
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
|
|||
|
|
|
|||
|
|
**Goal:** Add a single config-driven importer that scrapes ~15,294 Catholic churches and mass schedules from 5 Latin American WordPress-based directories (Brazil, Mexico, Argentina, Colombia, Chile).
|
|||
|
|
|
|||
|
|
**Architecture:** A `NETWORK_SITES` config map drives a single `import-buscarmisas-network.ts` script. Church HTML parsing extracts name, address, phone, coordinates, and weekly schedule. The external ID `{domain-slug}/{church-slug}` stored in a new `buscarmisasNetworkId` column prevents duplicate inserts on re-runs.
|
|||
|
|
|
|||
|
|
**Tech Stack:** TypeScript, tsx, Prisma 7 + pg adapter, existing `church-matcher.ts` + `day-names.ts` utilities.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Chunk 1: Schema prerequisite + church-matcher update
|
|||
|
|
|
|||
|
|
### Task 1: Add `buscarmisasNetworkId` to BethelGuide schema
|
|||
|
|
|
|||
|
|
> ⚠️ BethelGuide is the schema source of truth. Never run `prisma migrate` in ScraperControl.
|
|||
|
|
|
|||
|
|
**Files:**
|
|||
|
|
- Modify (in BethelGuide repo): `prisma/schema.prisma`
|
|||
|
|
- Modify (in BethelGuide repo): migration SQL file
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: In BethelGuide, open `prisma/schema.prisma` and add the column to the `Church` model**
|
|||
|
|
|
|||
|
|
Add after the existing `discovermassId` line:
|
|||
|
|
```prisma
|
|||
|
|
buscarmisasNetworkId String? @unique @map("buscarmisas_network_id")
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
And add to the `@@index` block at the bottom of the `Church` model:
|
|||
|
|
```prisma
|
|||
|
|
@@index([buscarmisasNetworkId])
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
- [ ] **Step 2: In BethelGuide, create and run the migration**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx prisma migrate dev --name add_buscarmisas_network_id
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: migration file created, column added to the shared PostgreSQL database.
|
|||
|
|
|
|||
|
|
- [ ] **Step 3: Sync the updated schema to ScraperControl**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
cp prisma/schema.prisma ~/Documents/ScraperControl/prisma/schema.prisma
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
- [ ] **Step 4: Regenerate Prisma client in ScraperControl**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
cd ~/Documents/ScraperControl
|
|||
|
|
npx prisma generate
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: no errors, `@prisma/client` regenerated with `buscarmisasNetworkId` field.
|
|||
|
|
|
|||
|
|
- [ ] **Step 5: Verify the field is available**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsx -e "
|
|||
|
|
import { Pool } from 'pg';
|
|||
|
|
import { PrismaPg } from '@prisma/adapter-pg';
|
|||
|
|
import { PrismaClient } from '@prisma/client';
|
|||
|
|
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
|||
|
|
const prisma = new PrismaClient({ adapter: new PrismaPg(pool) });
|
|||
|
|
prisma.church.findFirst({ select: { buscarmisasNetworkId: true } }).then(r => {
|
|||
|
|
console.log('buscarmisasNetworkId field present:', JSON.stringify(r));
|
|||
|
|
return prisma.\$disconnect().then(() => pool.end());
|
|||
|
|
});
|
|||
|
|
"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: prints `buscarmisasNetworkId field present: {"buscarmisasNetworkId":null}` (or `buscarmisasNetworkId field present: null` if the `Church` table is empty), not a type error.
|
|||
|
|
|
|||
|
|
- [ ] **Step 6: Commit in ScraperControl**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
git add prisma/schema.prisma
|
|||
|
|
git commit -m "chore: sync schema — add buscarmisasNetworkId column"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### Task 2: Update `church-matcher.ts` with new field + ID-match pass
|
|||
|
|
|
|||
|
|
**Files:**
|
|||
|
|
- Modify: `src/lib/church-matcher.ts`
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: Add `buscarmisasNetworkId` to `ExistingChurch` interface**
|
|||
|
|
|
|||
|
|
In `src/lib/church-matcher.ts`, find the `ExistingChurch` interface (line ~11). The interface currently ends with `gottesdienstzeitenId: string | null;` followed by `source: string;`. Insert the two new fields immediately before the `source:` line:
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
discovermassId: string | null;
|
|||
|
|
buscarmisasNetworkId: string | null;
|
|||
|
|
source: string; // ← already exists, shown for placement only
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Note: `discovermassId` was missing from the interface (pre-existing gap) — adding it here ensures the `loadExistingChurches` select in Task 5 compiles correctly.
|
|||
|
|
|
|||
|
|
- [ ] **Step 2: Add `buscarmisasNetworkId` to `ChurchCandidate` type**
|
|||
|
|
|
|||
|
|
Find the `ChurchCandidate` type (line ~122). After the last of the existing optional ID fields (the group that includes `horariosMisasId?: string;`), add:
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
discovermassId?: string;
|
|||
|
|
buscarmisasNetworkId?: string;
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
- [ ] **Step 3: Add ID-match passes in `findDuplicateChurch`**
|
|||
|
|
|
|||
|
|
The existing passes run 1–13 (osmId through gottesdienstzeitenId), with pass 14 being proximity+name at line ~259. Find the **Thirteenth pass** block (gottesdienstzeitenId, line ~251):
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// Thirteenth pass: exact gottesdienstzeitenId match
|
|||
|
|
if (candidate.gottesdienstzeitenId) {
|
|||
|
|
...
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Insert two new passes **after** it and **before** the proximity pass comment (`// Fourteenth pass: proximity + name match`):
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// Fourteenth pass: exact discovermassId match
|
|||
|
|
if (candidate.discovermassId) {
|
|||
|
|
const match = existingChurches.find(
|
|||
|
|
(church) => church.discovermassId === candidate.discovermassId
|
|||
|
|
);
|
|||
|
|
if (match) return match;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// Fifteenth pass: exact buscarmisasNetworkId match
|
|||
|
|
if (candidate.buscarmisasNetworkId) {
|
|||
|
|
const match = existingChurches.find(
|
|||
|
|
(church) => church.buscarmisasNetworkId === candidate.buscarmisasNetworkId
|
|||
|
|
);
|
|||
|
|
if (match) return match;
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Then update the existing proximity pass comment from `// Fourteenth pass:` to `// Sixteenth pass:`.
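The two new passes have the same shape as the thirteen before them: look up one ID field, return on the first hit, otherwise fall through. As a rough illustration of that shape only (this is not how `church-matcher.ts` is written), the sketch below drives the exact-ID passes from an ordered list of keys. `IdKey`, `MiniChurch`, `MiniCandidate`, and `findByExactId` are hypothetical names, the types are heavily trimmed, and the proximity pass is left out.

```typescript
// Hypothetical, trimmed-down shapes for illustration only; the real
// ExistingChurch and ChurchCandidate types carry many more fields.
type IdKey = 'gottesdienstzeitenId' | 'discovermassId' | 'buscarmisasNetworkId';

interface MiniChurch {
  id: string;
  gottesdienstzeitenId: string | null;
  discovermassId: string | null;
  buscarmisasNetworkId: string | null;
}

type MiniCandidate = Partial<Record<IdKey, string>>;

// Order matters: earlier keys win when a candidate carries several IDs.
const ID_PASSES: IdKey[] = ['gottesdienstzeitenId', 'discovermassId', 'buscarmisasNetworkId'];

function findByExactId(candidate: MiniCandidate, existing: MiniChurch[]): MiniChurch | null {
  for (const key of ID_PASSES) {
    const wanted = candidate[key];
    if (!wanted) continue; // candidate has no ID for this pass
    for (const church of existing) {
      if (church[key] === wanted) return church;
    }
  }
  return null; // a caller would fall through to the proximity + name pass here
}

const sampleChurches: MiniChurch[] = [
  { id: 'a', gottesdienstzeitenId: null, discovermassId: 'dm-1', buscarmisasNetworkId: null },
  { id: 'b', gottesdienstzeitenId: null, discovermassId: null, buscarmisasNetworkId: 'buscarmisas-co/parroquia-san-pedro' },
];

const idMatch = findByExactId(
  { buscarmisasNetworkId: 'buscarmisas-co/parroquia-san-pedro' },
  sampleChurches,
);
console.log(idMatch ? idMatch.id : 'no match');
```

Because an exact-ID hit returns before the proximity pass runs, a re-imported church is matched by its stored ID even if its name or coordinates changed on the source site.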
|
|||
|
|
|
|||
|
|
- [ ] **Step 4: Verify TypeScript compiles**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsc --noEmit
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: 0 errors.
|
|||
|
|
|
|||
|
|
- [ ] **Step 5: Commit**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
git add src/lib/church-matcher.ts
|
|||
|
|
git commit -m "feat: add buscarmisasNetworkId (and discovermassId) to church-matcher interfaces and ID-match passes"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Chunk 2: Parsing functions
|
|||
|
|
|
|||
|
|
### Task 3: Write pure parsing functions with unit tests
|
|||
|
|
|
|||
|
|
**Files:**
|
|||
|
|
- Create: `scripts/import-buscarmisas-network.ts` (scaffold with parsing functions only)
|
|||
|
|
|
|||
|
|
We write the parsing functions first as pure functions, then test them with real HTML snippets before wiring them to the HTTP layer.
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: Create `scripts/import-buscarmisas-network.ts` with the file header and types**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
#!/usr/bin/env tsx
|
|||
|
|
/**
|
|||
|
|
* Import Catholic churches and mass schedules from the BuscarMisas network.
|
|||
|
|
*
|
|||
|
|
* A group of 5 identical WordPress-based directories covering Latin America:
|
|||
|
|
* - horariosmissa.com.br (Brazil, ~4,732 churches)
|
|||
|
|
* - buscarmisas.com.mx (Mexico, ~3,950 churches)
|
|||
|
|
* - horariosmisa.com.ar (Argentina, ~3,012 churches)
|
|||
|
|
* - buscarmisas.co (Colombia, ~2,665 churches)
|
|||
|
|
* - horariomisa.cl (Chile, ~935 churches)
|
|||
|
|
*
|
|||
|
|
* Usage:
|
|||
|
|
* npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br
|
|||
|
|
* npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --resume-from 500
|
|||
|
|
* npx tsx scripts/import-buscarmisas-network.ts --all
|
|||
|
|
* npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
|
|||
|
|
*/
|
|||
|
|
|
|||
|
|
import dotenv from 'dotenv';
|
|||
|
|
import path from 'path';
|
|||
|
|
|
|||
|
|
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
|
|||
|
|
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
|||
|
|
|
|||
|
|
import { Pool } from 'pg';
|
|||
|
|
import { PrismaPg } from '@prisma/adapter-pg';
|
|||
|
|
import { PrismaClient } from '@prisma/client';
|
|||
|
|
|
|||
|
|
import { findDuplicateChurch } from '../src/lib/church-matcher';
|
|||
|
|
import type { ExistingChurch } from '../src/lib/church-matcher';
|
|||
|
|
import { getDayNamesForCountry, buildDayPatterns } from '../src/scrapers/i18n/day-names';
|
|||
|
|
|
|||
|
|
// ─── Site Config ─────────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
interface SiteConfig {
|
|||
|
|
country: string; // ISO 3166-1 alpha-2
|
|||
|
|
language: 'pt' | 'es';
|
|||
|
|
sitemapType: 'page' | 'post';
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
const NETWORK_SITES: Record<string, SiteConfig> = {
|
|||
|
|
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
|
|||
|
|
'buscarmisas.com.mx': { country: 'MX', language: 'es', sitemapType: 'page' },
|
|||
|
|
'horariosmisa.com.ar': { country: 'AR', language: 'es', sitemapType: 'page' },
|
|||
|
|
'buscarmisas.co': { country: 'CO', language: 'es', sitemapType: 'page' },
|
|||
|
|
'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
|
|||
|
|
};
|
|||
|
|
|
|||
|
|
// ─── Types ────────────────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
interface ParsedChurch {
|
|||
|
|
name: string;
|
|||
|
|
address: string | null;
|
|||
|
|
city: string | null;
|
|||
|
|
state: string | null;
|
|||
|
|
phone: string | null;
|
|||
|
|
lat: number;
|
|||
|
|
lng: number;
|
|||
|
|
externalId: string;
|
|||
|
|
country: string;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
interface ParsedMass {
|
|||
|
|
dayOfWeek: number; // 0 = Sunday, 6 = Saturday
|
|||
|
|
time: string; // HH:MM 24-hour
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
interface CLIArgs {
|
|||
|
|
domain: string | null;
|
|||
|
|
all: boolean;
|
|||
|
|
dryRun: boolean;
|
|||
|
|
resumeFrom: number;
|
|||
|
|
limit: number | null;
|
|||
|
|
jobId: string | null;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
interface ImportStats {
|
|||
|
|
total: number;
|
|||
|
|
created: number;
|
|||
|
|
updated: number;
|
|||
|
|
skipped: number;
|
|||
|
|
errors: number;
|
|||
|
|
massSchedulesCreated: number;
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
- [ ] **Step 2: Add `buildExternalId` helper**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// ─── Helpers ─────────────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
/**
|
|||
|
|
* Build external ID for a church URL.
|
|||
|
|
* Format: "{domain-slug}/{church-slug}"
|
|||
|
|
* e.g. "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios"
|
|||
|
|
*/
|
|||
|
|
export function buildExternalId(domain: string, churchUrl: string): string {
|
|||
|
|
const domainSlug = domain.replace(/\./g, '-');
|
|||
|
|
// URL path: /{region}/{city}/{church-slug}/
|
|||
|
|
const segments = churchUrl.replace(/\/$/, '').split('/').filter(Boolean);
|
|||
|
|
const churchSlug = segments[segments.length - 1] || '';
|
|||
|
|
return `${domainSlug}/${churchSlug}`;
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
- [ ] **Step 3: Verify `buildExternalId` manually**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsx -e "
|
|||
|
|
import { buildExternalId } from './scripts/import-buscarmisas-network';
|
|||
|
|
console.log(buildExternalId('horariosmissa.com.br', 'https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/'));
|
|||
|
|
// Expected: horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios
|
|||
|
|
console.log(buildExternalId('buscarmisas.co', 'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/'));
|
|||
|
|
// Expected: buscarmisas-co/parroquia-san-pedro
|
|||
|
|
"
|
|||
|
|
```
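Because `buildExternalId` keeps only the last path segment and the column is `@unique`, two churches that share a slug under different cities on the same domain would map to the same ID. Whether that actually happens on these sites is not something this plan establishes, so a quick check over the sitemap URLs may be worthwhile before a full import. The sketch below is one way to do it; `findSlugCollisions` is a hypothetical helper and the sample URLs are made up.

```typescript
// Hypothetical helper: group church URLs by their final path segment and
// keep only the slugs that appear more than once.
function findSlugCollisions(urls: string[]): Map<string, string[]> {
  const bySlug = new Map<string, string[]>();
  for (const url of urls) {
    const segments = new URL(url).pathname.split('/').filter(Boolean);
    const slug = segments[segments.length - 1];
    if (!slug) continue; // root URL, nothing to group
    const group = bySlug.get(slug) || [];
    group.push(url);
    bySlug.set(slug, group);
  }
  const collisions = new Map<string, string[]>();
  bySlug.forEach((group, slug) => {
    if (group.length > 1) collisions.set(slug, group);
  });
  return collisions;
}

// Made-up URLs for illustration; not taken from the real sitemaps.
const sampleUrls: string[] = [
  'https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/',
  'https://buscarmisas.co/antioquia/medellin/parroquia-san-pedro/',
  'https://buscarmisas.co/antioquia/medellin/parroquia-santa-ana/',
];

findSlugCollisions(sampleUrls).forEach((group, slug) => {
  console.log(slug, '->', group.length, 'URLs');
});
```

If the check reports collisions on a real domain, the ID format would need a wider key (for example, including the city segment); if it reports none, the slug-only format is fine as specified.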
|
|||
|
|
|
|||
|
|
- [ ] **Step 4: Add `parseChurchPage` function**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
/**
|
|||
|
|
* Parse church data from a church page HTML string.
|
|||
|
|
* Returns null if name or coordinates cannot be extracted.
|
|||
|
|
*/
|
|||
|
|
export function parseChurchPage(
|
|||
|
|
html: string,
|
|||
|
|
domain: string,
|
|||
|
|
churchUrl: string,
|
|||
|
|
config: SiteConfig,
|
|||
|
|
): ParsedChurch | null {
|
|||
|
|
// Name: cell after <strong>Nome</strong> (PT) or <strong>Nombre</strong> (ES)
|
|||
|
|
const nameLabel = config.language === 'pt' ? 'Nome' : 'Nombre';
|
|||
|
|
const nameMatch = html.match(
|
|||
|
|
new RegExp(`<strong>${nameLabel}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i')
|
|||
|
|
);
|
|||
|
|
const name = nameMatch?.[1]?.trim() ?? '';
|
|||
|
|
if (!name) return null;
|
|||
|
|
|
|||
|
|
// Coordinates: Google Maps iframe center= parameter
|
|||
|
|
const coordMatch = html.match(/center=([-\d.]+)%2C([-\d.]+)/i);
|
|||
|
|
if (!coordMatch) return null;
|
|||
|
|
const lat = parseFloat(coordMatch[1]);
|
|||
|
|
const lng = parseFloat(coordMatch[2]);
|
|||
|
|
if (!isFinite(lat) || !isFinite(lng) || Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
|
|||
|
|
|
|||
|
|
// Address: cell after <strong>Endereço</strong> (PT) or <strong>Dirección</strong> (ES)
|
|||
|
|
const addrLabel = config.language === 'pt' ? 'Endere[çc]o' : 'Direcci[oó]n';
|
|||
|
|
const addrMatch = html.match(
|
|||
|
|
new RegExp(`<strong>${addrLabel}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i')
|
|||
|
|
);
|
|||
|
|
const address = addrMatch?.[1]?.trim() ?? null;
|
|||
|
|
|
|||
|
|
// Phone: tel: href
|
|||
|
|
const phoneMatch = html.match(/href="tel:([^"]+)"/i);
|
|||
|
|
const phone = phoneMatch?.[1]?.trim() ?? null;
|
|||
|
|
|
|||
|
|
// City and state from URL path segments.
// URL form: https://{domain}/{state}/{city}/{slug}/
const urlPath = new URL(churchUrl).pathname.split('/').filter(Boolean);
const state = urlPath[0] ? decodeURIComponent(urlPath[0].replace(/-/g, ' ')) : null;
const city = urlPath[1] ? decodeURIComponent(urlPath[1].replace(/-/g, ' ')) : null;
|
|||
|
|
|
|||
|
|
return {
|
|||
|
|
name,
|
|||
|
|
address,
|
|||
|
|
city,
|
|||
|
|
state,
|
|||
|
|
phone,
|
|||
|
|
lat,
|
|||
|
|
lng,
|
|||
|
|
externalId: buildExternalId(domain, churchUrl),
|
|||
|
|
country: config.country,
|
|||
|
|
};
|
|||
|
|
}
|
|||
|
|
```
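The two regex techniques used in `parseChurchPage` (a value cell that follows a `<strong>` label cell, and a `center=` parameter in the map iframe) can be exercised offline before any live fetch. The following is a minimal sketch under stated assumptions: the fixture HTML is invented to match the patterns above, `extractLabeledCell` and `extractCenter` are hypothetical helpers, and accepting a literal comma in addition to `%2C` is an assumption rather than something observed on the sites.

```typescript
// Hypothetical fixture mimicking the label/value rows and the map iframe
// that parseChurchPage expects; real pages may differ.
const fixtureHtml = `
<table>
  <tr><td><strong>Nombre</strong></td> <td>Parroquia San Pedro</td></tr>
  <tr><td><strong>Dirección</strong></td> <td>Calle 10 # 5-20</td></tr>
</table>
<iframe src="https://www.google.com/maps/embed/v1/view?key=KEY&center=4.5981%2C-74.0758&zoom=16"></iframe>
`;

// Read the <td> that directly follows a <strong>label</strong> cell.
function extractLabeledCell(html: string, labelPattern: string): string | null {
  const re = new RegExp(`<strong>${labelPattern}<\\/strong><\\/td>\\s*<td>([^<]+)<\\/td>`, 'i');
  const match = html.match(re);
  return match ? match[1].trim() : null;
}

// Read "lat,lng" from the iframe's center= parameter and range-check it.
function extractCenter(html: string): { lat: number; lng: number } | null {
  const match = html.match(/center=(-?\d+(?:\.\d+)?)(?:%2C|,)(-?\d+(?:\.\d+)?)/i);
  if (!match) return null;
  const lat = parseFloat(match[1]);
  const lng = parseFloat(match[2]);
  if (Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
  return { lat, lng };
}

console.log(extractLabeledCell(fixtureHtml, 'Nombre'));
console.log(extractLabeledCell(fixtureHtml, 'Direcci[oó]n'));
console.log(extractCenter(fixtureHtml));
```

A fixture like this keeps the parsing checks repeatable when a live page changes or is temporarily unavailable.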
|
|||
|
|
|
|||
|
|
- [ ] **Step 5: Add `parseMassSchedule` function**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
/**
|
|||
|
|
* Parse the weekly mass schedule table from church page HTML.
|
|||
|
|
* Table format: day-name cell | time cell (comma-separated times, "-" = no mass)
|
|||
|
|
*/
|
|||
|
|
export function parseMassSchedule(html: string, countryCode: string): ParsedMass[] {
|
|||
|
|
const dayPatterns = buildDayPatterns(getDayNamesForCountry(countryCode));
|
|||
|
|
const results: ParsedMass[] = [];
|
|||
|
|
|
|||
|
|
// Extract all <td> cells as pairs [day, time]
|
|||
|
|
const cells = [...html.matchAll(/<td[^>]*>(.*?)<\/td>/gis)].map(m =>
|
|||
|
|
m[1].replace(/<[^>]+>/g, '').trim()
|
|||
|
|
);
|
|||
|
|
|
|||
|
|
for (let i = 0; i + 1 < cells.length; i += 2) {
|
|||
|
|
const dayCell = cells[i].toLowerCase();
|
|||
|
|
const timeCell = cells[i + 1];
|
|||
|
|
|
|||
|
|
const dayOfWeek = dayPatterns[dayCell];
|
|||
|
|
if (dayOfWeek === undefined) continue;
|
|||
|
|
if (timeCell === '-' || !timeCell) continue;
|
|||
|
|
|
|||
|
|
// Split comma-separated times: "10:00, 18:00" → ["10:00", "18:00"]
|
|||
|
|
for (const rawTime of timeCell.split(',')) {
|
|||
|
|
const time = rawTime.trim();
if (/^\d{1,2}:\d{2}$/.test(time)) {
// Zero-pad single-digit hours so stored values match the HH:MM contract of ParsedMass
results.push({ dayOfWeek, time: time.padStart(5, '0') });
}
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
return results;
|
|||
|
|
}
|
|||
|
|
```
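The time-cell handling in `parseMassSchedule` can be looked at in isolation. The sketch below is a hypothetical helper, `parseTimeCell`, that splits one cell, normalizes each entry to zero-padded `HH:MM`, and additionally rejects values that are not valid clock times; the range check is an extra assumption, not part of the function above.

```typescript
// Hypothetical helper: turn one schedule cell into normalized HH:MM strings.
function parseTimeCell(cell: string): string[] {
  const times: string[] = [];
  for (const raw of cell.split(',')) {
    const match = /^(\d{1,2}):(\d{2})$/.exec(raw.trim());
    if (!match) continue; // skips "-", empty cells, and free text
    const hours = parseInt(match[1], 10);
    const minutes = parseInt(match[2], 10);
    if (hours > 23 || minutes > 59) continue; // not a valid clock time
    times.push((hours < 10 ? '0' : '') + hours + ':' + match[2]);
  }
  return times;
}

console.log(parseTimeCell('7:00, 18:30')); // [ '07:00', '18:30' ]
console.log(parseTimeCell('-'));           // []
```

Normalizing here means later comparisons and sorts on the stored `time` strings behave consistently regardless of how a site formats single-digit hours.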
|
|||
|
|
|
|||
|
|
- [ ] **Step 6: Test `parseChurchPage` and `parseMassSchedule` with real HTML**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsx -e "
|
|||
|
|
import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network';
|
|||
|
|
|
|||
|
|
const NETWORK_SITES = {
|
|||
|
|
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
|
|||
|
|
};
|
|||
|
|
|
|||
|
|
async function test() {
|
|||
|
|
const res = await fetch('https://horariosmissa.com.br/sao-paulo/sao-paulo/paroquia-nossa-senhora-dos-remedios/');
|
|||
|
|
const html = await res.text();
|
|||
|
|
const config = NETWORK_SITES['horariosmissa.com.br'];
|
|||
|
|
const parsed = parseChurchPage(html, 'horariosmissa.com.br', res.url, config);
|
|||
|
|
console.log('Church:', JSON.stringify(parsed, null, 2));
|
|||
|
|
const masses = parseMassSchedule(html, config.country);
|
|||
|
|
console.log('Masses:', JSON.stringify(masses, null, 2));
|
|||
|
|
}
|
|||
|
|
test().catch(console.error);
|
|||
|
|
"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected output (exact values are illustrative — website content may change):
|
|||
|
|
```
|
|||
|
|
Church: {
|
|||
|
|
"name": "Paróquia Nossa Senhora dos Remédios", // or current name
|
|||
|
|
"address": "R. Ten. Azevedo, 182 ...",
|
|||
|
|
"city": "sao paulo",
|
|||
|
|
"state": "sao paulo",
|
|||
|
|
"phone": "+55 11 ...",
|
|||
|
|
"lat": -23.56...,
|
|||
|
|
"lng": -46.62...,
|
|||
|
|
"externalId": "horariosmissa-com-br/paroquia-nossa-senhora-dos-remedios",
|
|||
|
|
"country": "BR"
|
|||
|
|
}
|
|||
|
|
Masses: [ { "dayOfWeek": 2, "time": "17:00" }, ... ]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Verify: the `Church` object is non-null, `lat`/`lng` are non-zero finite numbers, `externalId` matches the `horariosmissa-com-br/{slug}` pattern, and the `Masses` array is non-empty with `dayOfWeek` values 0–6 and HH:MM times.
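The checks listed above can also be written down as a small function so they are applied the same way to every sample page. This is a sketch only: `findParseProblems`, `CheckedChurch`, and `CheckedMass` are hypothetical names, and the types are reduced to the fields the checks need.

```typescript
interface CheckedChurch { lat: number; lng: number; externalId: string }
interface CheckedMass { dayOfWeek: number; time: string }

// Returns a list of human-readable problems; an empty list means the
// parsed result passes every check.
function findParseProblems(
  church: CheckedChurch | null,
  masses: CheckedMass[],
  domainSlug: string,
): string[] {
  if (church === null) return ['church is null'];
  const problems: string[] = [];
  if (!isFinite(church.lat) || !isFinite(church.lng) || church.lat === 0 || church.lng === 0) {
    problems.push('lat/lng must be non-zero finite numbers');
  }
  const prefix = domainSlug + '/';
  if (church.externalId.indexOf(prefix) !== 0 || church.externalId.length <= prefix.length) {
    problems.push('externalId does not match ' + prefix + '{slug}');
  }
  if (masses.length === 0) problems.push('masses array is empty');
  for (const mass of masses) {
    const dayOk = Number.isInteger(mass.dayOfWeek) && mass.dayOfWeek >= 0 && mass.dayOfWeek <= 6;
    const timeOk = /^\d{1,2}:\d{2}$/.test(mass.time);
    if (!dayOk || !timeOk) problems.push('bad mass entry: ' + JSON.stringify(mass));
  }
  return problems;
}

const sampleProblems = findParseProblems(
  { lat: -23.56, lng: -46.62, externalId: 'horariosmissa-com-br/paroquia-exemplo' },
  [{ dayOfWeek: 2, time: '17:00' }],
  'horariosmissa-com-br',
);
console.log(sampleProblems.length === 0 ? 'ok' : sampleProblems.join('; '));
```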
|
|||
|
|
|
|||
|
|
- [ ] **Step 7: Test with a Spanish-language site (Mexico)**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsx -e "
|
|||
|
|
import { parseChurchPage, parseMassSchedule } from './scripts/import-buscarmisas-network';
|
|||
|
|
const config = { country: 'MX', language: 'es', sitemapType: 'page' };
|
|||
|
|
const domain = 'buscarmisas.com.mx';
|
|||
|
|
const url = 'https://buscarmisas.com.mx/nuevo-leon/monterrey/parroquia-anunciacion-a-maria/';
|
|||
|
|
fetch(url).then(r => r.text()).then(html => {
|
|||
|
|
console.log('Church:', JSON.stringify(parseChurchPage(html, domain, url, config), null, 2));
|
|||
|
|
console.log('Masses:', JSON.stringify(parseMassSchedule(html, config.country), null, 2));
|
|||
|
|
}).catch(console.error);
|
|||
|
|
"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: name, coordinates, and Spanish-language schedule rows parsed correctly.
|
|||
|
|
|
|||
|
|
- [ ] **Step 8: Commit parsing scaffold**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
git add scripts/import-buscarmisas-network.ts
|
|||
|
|
git commit -m "feat: add buscarmisas-network importer — parsing functions"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### Task 4: Sitemap discovery function
|
|||
|
|
|
|||
|
|
**Files:**
|
|||
|
|
- Modify: `scripts/import-buscarmisas-network.ts`
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: Add HTTP helpers**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// ─── HTTP Helpers ─────────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
|
|||
|
|
const REQUEST_DELAY_MS = 2_000;
|
|||
|
|
const DOMAIN_DELAY_MS = 5_000;
|
|||
|
|
|
|||
|
|
async function fetchText(url: string): Promise<string> {
|
|||
|
|
const res = await fetch(url, { headers: { 'User-Agent': USER_AGENT } });
|
|||
|
|
if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
|
|||
|
|
return res.text();
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
async function fetchWithRetry(url: string, retries = 3): Promise<string> {
|
|||
|
|
for (let attempt = 1; attempt <= retries; attempt++) {
|
|||
|
|
try {
|
|||
|
|
return await fetchText(url);
|
|||
|
|
} catch (err) {
|
|||
|
|
const msg = err instanceof Error ? err.message : String(err);
|
|||
|
|
if (attempt === retries) throw err;
|
|||
|
|
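// Match the status at the start of the message: the URL is part of the
// message too, so a bare includes('429') could match digits in the URL
const isRetryable = /^HTTP (429|503)\b/.test(msg);
if (!isRetryable) throw err;
const backoff = attempt * 30_000; // 30s, then 60s; the final attempt rethrows instead of waiting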
|
|||
|
|
console.warn(` [retry ${attempt}/${retries}] ${msg} — waiting ${backoff / 1000}s`);
|
|||
|
|
await sleep(backoff);
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
throw new Error('unreachable');
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
function sleep(ms: number): Promise<void> {
|
|||
|
|
return new Promise(resolve => setTimeout(resolve, ms));
|
|||
|
|
}
|
|||
|
|
```
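The retry policy above reduces to two small decisions: which statuses are worth retrying, and how long to wait between attempts. The sketch below restates them as pure functions so the timing can be reasoned about without making requests; `isRetryableStatus` and `backoffSchedule` are hypothetical names that do not appear in the importer.

```typescript
// Hypothetical pure helpers mirroring the retry policy above.
function isRetryableStatus(status: number): boolean {
  return status === 429 || status === 503;
}

// Waits that occur between attempts. The last attempt never sleeps,
// because a failure there is rethrown to the caller.
function backoffSchedule(retries: number, stepMs: number): number[] {
  const waits: number[] = [];
  for (let attempt = 1; attempt < retries; attempt++) {
    waits.push(attempt * stepMs);
  }
  return waits;
}

const defaultWaits = backoffSchedule(3, 30000);
const totalWaitMs = defaultWaits.reduce((sum, ms) => sum + ms, 0);
console.log(defaultWaits, totalWaitMs);
```

With the defaults (`retries = 3`, 30-second step), a URL that keeps returning 429 or 503 costs 90 seconds of waiting in total before the error surfaces.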
|
|||
|
|
|
|||
|
|
- [ ] **Step 2: Add `getChurchUrls` function**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// ─── Sitemap Discovery ────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
/**
|
|||
|
|
* Fetch all church page URLs for a domain from its sitemap.
|
|||
|
|
* Church URLs have exactly 3 path segments: /{region}/{city}/{slug}/
|
|||
|
|
*/
|
|||
|
|
export async function getChurchUrls(domain: string, config: SiteConfig): Promise<string[]> {
|
|||
|
|
const indexUrl = `https://${domain}/sitemap_index.xml`;
|
|||
|
|
console.log(`Fetching sitemap index: ${indexUrl}`);
|
|||
|
|
const indexXml = await fetchWithRetry(indexUrl);
|
|||
|
|
|
|||
|
|
// Extract child sitemap URLs matching the sitemapType
|
|||
|
|
const childPattern = config.sitemapType === 'page'
|
|||
|
|
? /https:\/\/[^<]*\/page-sitemap\d*\.xml/g
|
|||
|
|
: /https:\/\/[^<]*\/post-sitemap\.xml/g;
|
|||
|
|
|
|||
|
|
const childUrls = [...indexXml.matchAll(childPattern)].map(m => m[0]);
|
|||
|
|
console.log(` Found ${childUrls.length} child sitemaps`);
|
|||
|
|
|
|||
|
|
const churchUrls: string[] = [];
|
|||
|
|
for (const sitemapUrl of childUrls) {
|
|||
|
|
const xml = await fetchWithRetry(sitemapUrl);
|
|||
|
|
const locs = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map(m => m[1].trim());
|
|||
|
|
for (const loc of locs) {
|
|||
|
|
// Church URLs: exactly 3 non-empty path segments after the domain
|
|||
|
|
try {
|
|||
|
|
const segments = new URL(loc).pathname.split('/').filter(Boolean);
|
|||
|
|
if (segments.length === 3) {
|
|||
|
|
churchUrls.push(loc);
|
|||
|
|
}
|
|||
|
|
} catch { /* skip malformed URLs */ }
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// Deduplicate
|
|||
|
|
const unique = [...new Set(churchUrls)];
|
|||
|
|
console.log(` Total church URLs: ${unique.length}`);
|
|||
|
|
return unique;
|
|||
|
|
}
|
|||
|
|
```
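The filtering rule inside `getChurchUrls` (keep `<loc>` entries with exactly three path segments, skip malformed URLs, drop duplicates) can be checked against a small inline sitemap before pointing it at a live site. The following is a minimal sketch; `extractChurchUrls` is a hypothetical helper and the XML fragment is invented for illustration.

```typescript
// Hypothetical sitemap fragment; real sitemaps carry more fields per <url>.
const sampleSitemapXml = `
<urlset>
  <url><loc>https://buscarmisas.co/</loc></url>
  <url><loc>https://buscarmisas.co/bogota/</loc></url>
  <url><loc>https://buscarmisas.co/bogota/bogota/</loc></url>
  <url><loc>https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/</loc></url>
  <url><loc>https://buscarmisas.co/bogota/bogota/parroquia-san-pedro/</loc></url>
  <url><loc>not a url</loc></url>
</urlset>`;

function extractChurchUrls(xml: string): string[] {
  const locPattern = /<loc>([^<]+)<\/loc>/g;
  const unique: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const loc = match[1].trim();
    let segments: string[];
    try {
      segments = new URL(loc).pathname.split('/').filter(Boolean);
    } catch {
      continue; // skip malformed URLs
    }
    // Church pages: exactly /{region}/{city}/{slug}/
    if (segments.length === 3 && unique.indexOf(loc) === -1) unique.push(loc);
  }
  return unique;
}

console.log(extractChurchUrls(sampleSitemapXml));
```

Region and city index pages have one or two segments, so the three-segment rule is what separates them from church pages in this URL scheme.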
|
|||
|
|
|
|||
|
|
- [ ] **Step 3: Verify sitemap discovery against known counts**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsx -e "
|
|||
|
|
import { getChurchUrls } from './scripts/import-buscarmisas-network';
|
|||
|
|
const NETWORK_SITES = {
|
|||
|
|
'horariosmissa.com.br': { country: 'BR', language: 'pt', sitemapType: 'page' },
|
|||
|
|
'horariomisa.cl': { country: 'CL', language: 'es', sitemapType: 'post' },
|
|||
|
|
};
|
|||
|
|
// Wrapped in an async function because top-level await can fail when -e code is evaluated as CommonJS
async function main() {
for (const [domain, config] of Object.entries(NETWORK_SITES)) {
const urls = await getChurchUrls(domain, config);
console.log(domain, '->', urls.length, 'churches');
console.log(' Sample:', urls[0]);
}
}
main().catch(console.error);
|
|||
|
|
"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: roughly 4,700 URLs for Brazil and roughly 930 for Chile.
|
|||
|
|
|
|||
|
|
- [ ] **Step 4: Commit**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
git add scripts/import-buscarmisas-network.ts
|
|||
|
|
git commit -m "feat: add buscarmisas-network importer — sitemap discovery"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Chunk 3: Main importer
|
|||
|
|
|
|||
|
|
### Task 5: DB helpers and church processing loop
|
|||
|
|
|
|||
|
|
**Files:**
|
|||
|
|
- Modify: `scripts/import-buscarmisas-network.ts`
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: Add DB connection and `loadExistingChurches`**
|
|||
|
|
|
|||
|
|
Near the top of the file, after the `dotenv.config` calls and the import block, add the DB setup:
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
|
|||
|
|
console.log(`Connecting to: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
|
|||
|
|
const pool = new Pool({ connectionString: dbUrl, ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined });
|
|||
|
|
const adapter = new PrismaPg(pool);
|
|||
|
|
const prisma = new PrismaClient({ adapter });
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Then add `loadExistingChurches`:
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// ─── DB Helpers ───────────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
async function loadExistingChurches(country: string): Promise<ExistingChurch[]> {
|
|||
|
|
console.log(`Loading existing ${country} churches from DB...`);
|
|||
|
|
const churches = await prisma.church.findMany({
|
|||
|
|
where: { country },
|
|||
|
|
select: {
|
|||
|
|
id: true, name: true, latitude: true, longitude: true,
|
|||
|
|
osmId: true, baiduId: true, masstimesId: true,
|
|||
|
|
orarimesseId: true, massSchedulesPhId: true, philmassId: true,
|
|||
|
|
horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
|
|||
|
|
messesInfoId: true, bohosluzbyId: true, miserendId: true,
|
|||
|
|
kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
|
|||
|
|
buscarmisasNetworkId: true,
|
|||
|
|
source: true, website: true, phone: true, address: true, country: true,
|
|||
|
|
},
|
|||
|
|
});
|
|||
|
|
console.log(` Loaded ${churches.length} existing ${country} churches`);
|
|||
|
|
return churches as ExistingChurch[];
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
- [ ] **Step 2: Add `processChurch` function**
|
|||
|
|
|
|||
|
|
```ts
|
|||
|
|
// ─── Church Processing ────────────────────────────────────────────────────────
|
|||
|
|
|
|||
|
|
async function processChurch(
|
|||
|
|
url: string,
|
|||
|
|
domain: string,
|
|||
|
|
config: SiteConfig,
|
|||
|
|
existingChurches: ExistingChurch[],
|
|||
|
|
args: CLIArgs,
|
|||
|
|
stats: ImportStats,
|
|||
|
|
): Promise<void> {
|
|||
|
|
stats.total++;
|
|||
|
|
try {
|
|||
|
|
const html = await fetchWithRetry(url);
|
|||
|
|
const parsed = parseChurchPage(html, domain, url, config);
|
|||
|
|
if (!parsed) {
|
|||
|
|
console.log(` [skip] No name/coords: ${url}`);
|
|||
|
|
stats.skipped++;
|
|||
|
|
return;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
const masses = parseMassSchedule(html, config.country);
|
|||
|
|
|
|||
|
|
if (args.dryRun) {
|
|||
|
|
console.log(` [dry-run] ${parsed.name} — ${masses.length} masses`);
|
|||
|
|
return;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
const candidate = {
|
|||
|
|
name: parsed.name,
|
|||
|
|
lat: parsed.lat,
|
|||
|
|
lng: parsed.lng,
|
|||
|
|
buscarmisasNetworkId: parsed.externalId,
|
|||
|
|
};
|
|||
|
|
const duplicate = findDuplicateChurch(candidate, existingChurches);
|
|||
|
|
|
|||
|
|
if (duplicate) {
|
|||
|
|
const updateData: Record<string, unknown> = { buscarmisasNetworkId: parsed.externalId };
|
|||
|
|
if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
|
|||
|
|
if (parsed.lat !== 0 && duplicate.latitude === 0) {
|
|||
|
|
updateData.latitude = parsed.lat;
|
|||
|
|
updateData.longitude = parsed.lng;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
await prisma.$transaction(async (tx) => {
|
|||
|
|
await tx.church.update({ where: { id: duplicate.id }, data: updateData });
|
|||
|
|
if (masses.length > 0) {
|
|||
|
|
await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
|
|||
|
|
await tx.massSchedule.createMany({
|
|||
|
|
data: masses.map(m => ({ churchId: duplicate.id, dayOfWeek: m.dayOfWeek, time: m.time, language: config.language === 'pt' ? 'Portuguese' : 'Spanish', notes: null })),
|
|||
|
|
});
|
|||
|
|
}
|
|||
|
|
await tx.church.update({ where: { id: duplicate.id }, data: { lastScrapedAt: new Date() } });
|
|||
|
|
});
|
|||
|
|
duplicate.buscarmisasNetworkId = parsed.externalId;
|
|||
|
|
stats.updated++;
|
|||
|
|
} else {
|
|||
|
|
const church = await prisma.church.create({
|
|||
|
|
data: {
|
|||
|
|
name: parsed.name,
|
|||
|
|
address: parsed.address,
|
|||
|
|
city: parsed.city,
|
|||
|
|
state: parsed.state,
|
|||
|
|
country: parsed.country,
|
|||
|
|
phone: parsed.phone,
|
|||
|
|
latitude: parsed.lat,
|
|||
|
|
longitude: parsed.lng,
|
|||
|
|
buscarmisasNetworkId: parsed.externalId,
|
|||
|
|
source: 'buscarmisas-network',
|
|||
|
|
hasWebsite: false,
|
|||
|
|
},
|
|||
|
|
});
|
|||
|
|
|
|||
|
|
existingChurches.push({
|
|||
|
|
id: church.id, name: parsed.name, latitude: parsed.lat, longitude: parsed.lng,
|
|||
|
|
osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
|
|||
|
|
massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
|
|||
|
|
mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
|
|||
|
|
bohosluzbyId: null, miserendId: null, kerknetId: null,
|
|||
|
|
gottesdienstzeitenId: null, discovermassId: null,
|
|||
|
|
buscarmisasNetworkId: parsed.externalId,
|
|||
|
|
source: 'buscarmisas-network', website: null, phone: parsed.phone,
|
|||
|
|
address: parsed.address, country: parsed.country,
|
|||
|
|
});
|
|||
|
|
|
|||
|
|
if (masses.length > 0) {
|
|||
|
|
await prisma.massSchedule.createMany({
|
|||
|
|
data: masses.map(m => ({
|
|||
|
|
churchId: church.id,
|
|||
|
|
dayOfWeek: m.dayOfWeek,
|
|||
|
|
time: m.time,
|
|||
|
|
language: config.language === 'pt' ? 'Portuguese' : 'Spanish',
|
|||
|
|
notes: null,
|
|||
|
|
})),
|
|||
|
|
});
|
|||
|
|
await prisma.church.update({ where: { id: church.id }, data: { lastScrapedAt: new Date() } });
|
|||
|
|
}
|
|||
|
|
stats.created++;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
stats.massSchedulesCreated += masses.length;
|
|||
|
|
console.log(
|
|||
|
|
` [${duplicate ? 'update' : 'create'}] ${parsed.name} — ${masses.length} masses — ` +
|
|||
|
|
`${stats.total} total (${stats.created}↑ ${stats.updated}↻ ${stats.errors}✗)`
|
|||
|
|
);
|
|||
|
|
} catch (err) {
|
|||
|
|
stats.errors++;
|
|||
|
|
console.error(` [error] ${url}: ${err instanceof Error ? err.message : err}`);
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
```
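The update branch of `processChurch` follows a fill-the-blanks rule: the network ID is always written, while phone and coordinates are only written when the stored record has none. Pulling that rule into a pure function makes it easy to check without a database. This is a sketch; `buildUpdateData`, `StoredChurch`, `IncomingChurch`, and `ChurchUpdate` are hypothetical names with trimmed fields.

```typescript
interface StoredChurch { phone: string | null; latitude: number; longitude: number }
interface IncomingChurch { phone: string | null; lat: number; lng: number; externalId: string }

interface ChurchUpdate {
  buscarmisasNetworkId: string;
  phone?: string;
  latitude?: number;
  longitude?: number;
}

// Always attach the network ID; only fill phone and coordinates when the
// stored record is missing them, so existing data is never overwritten.
function buildUpdateData(stored: StoredChurch, incoming: IncomingChurch): ChurchUpdate {
  const update: ChurchUpdate = { buscarmisasNetworkId: incoming.externalId };
  if (!stored.phone && incoming.phone) update.phone = incoming.phone;
  if (incoming.lat !== 0 && stored.latitude === 0) {
    update.latitude = incoming.lat;
    update.longitude = incoming.lng;
  }
  return update;
}

const sampleUpdate = buildUpdateData(
  { phone: '+57 1 000 0000', latitude: 0, longitude: 0 },
  { phone: '+57 1 111 1111', lat: 4.5981, lng: -74.0758, externalId: 'buscarmisas-co/parroquia-san-pedro' },
);
console.log(JSON.stringify(sampleUpdate));
```

In the sample, the stored phone is kept and only the missing coordinates are filled in, which is the behavior the update branch relies on when a church was first imported from another source.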
|
|||
|
|
|
|||
|
|
- [ ] **Step 3: Compile-check the file so far**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
npx tsc --noEmit
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Expected: 0 errors.
|
|||
|
|
|
|||
|
|
- [ ] **Step 4: Commit**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
git add scripts/import-buscarmisas-network.ts
|
|||
|
|
git commit -m "feat: add buscarmisas-network importer — DB helpers and church processing"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### Task 6: CLI parsing and `main()` function
|
|||
|
|
|
|||
|
|
**Files:**
|
|||
|
|
- Modify: `scripts/import-buscarmisas-network.ts`
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: Add `parseCLIArgs`**
|

```ts
// ─── CLI ──────────────────────────────────────────────────────────────────────

function parseCLIArgs(): CLIArgs {
  const argv = process.argv.slice(2);
  const result: CLIArgs = { domain: null, all: false, dryRun: false, resumeFrom: 0, limit: null, jobId: null };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case '--domain':      result.domain = argv[++i]; break;
      case '--all':         result.all = true; break;
      case '--dry-run':     result.dryRun = true; break;
      case '--resume-from': result.resumeFrom = parseInt(argv[++i], 10); break;
      case '--limit':       result.limit = parseInt(argv[++i], 10); break;
      case '--job-id':      result.jobId = argv[++i]; break;
    }
  }
  return result;
}

function validateArgs(args: CLIArgs): void {
  if (!args.domain && !args.all) {
    console.error('Usage:');
    console.error(' npx tsx scripts/import-buscarmisas-network.ts --domain <domain>');
    console.error(' npx tsx scripts/import-buscarmisas-network.ts --all');
    console.error('\nValid domains:', Object.keys(NETWORK_SITES).join(', '));
    process.exit(1);
  }
  if (args.domain && !NETWORK_SITES[args.domain]) {
    console.error(`Unknown domain: ${args.domain}`);
    console.error('Valid domains:', Object.keys(NETWORK_SITES).join(', '));
    process.exit(1);
  }
  // parseInt yields NaN for a missing or non-numeric value; reject it here,
  // because slice(NaN) and slice(0, NaN) would otherwise fail silently.
  if (!Number.isInteger(args.resumeFrom) || args.resumeFrom < 0) {
    console.error('--resume-from expects a non-negative integer.');
    process.exit(1);
  }
  if (args.limit !== null && (!Number.isInteger(args.limit) || args.limit < 0)) {
    console.error('--limit expects a non-negative integer.');
    process.exit(1);
  }
  if (args.all && args.resumeFrom > 0) {
    console.error('--resume-from cannot be used with --all. Use --domain to resume a specific site.');
    process.exit(1);
  }
}
```

- [ ] **Step 2: Add `runDomain` function**

```ts
async function runDomain(domain: string, config: SiteConfig, args: CLIArgs): Promise<ImportStats> {
  const stats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };

  const allUrls = await getChurchUrls(domain, config);
  const existingChurches = await loadExistingChurches(config.country);

  // Build set of already-imported IDs for fast skip
  const importedIds = new Set(
    existingChurches.filter(c => c.buscarmisasNetworkId).map(c => c.buscarmisasNetworkId!)
  );

  let candidateUrls = allUrls.slice(args.resumeFrom).filter(url => {
    const externalId = buildExternalId(domain, url);
    return !importedIds.has(externalId);
  });
  if (args.limit !== null) candidateUrls = candidateUrls.slice(0, args.limit);

  console.log(`\n${domain}: ${allUrls.length} total | ${importedIds.size} already imported | ${candidateUrls.length} to process\n`);

  for (let i = 0; i < candidateUrls.length; i++) {
    const url = candidateUrls[i];
    console.log(`[${i + 1}/${candidateUrls.length}] ${url}`);
    await processChurch(url, domain, config, existingChurches, args, stats);
    if (i < candidateUrls.length - 1) await sleep(REQUEST_DELAY_MS);
  }

  return stats;
}
```
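
The order of operations in `runDomain` matters for re-runs: `--resume-from` slices the sitemap list first, already-imported IDs are filtered out next, and `--limit` is applied last. The sketch below isolates that selection as a pure function so the behaviour can be checked without network or database access. `selectCandidates`, `idOf`, and the `example.test` URLs are hypothetical names used only for illustration; `idOf` merely stands in for `buildExternalId` and does not reproduce its real format.

```ts
// Sketch only: mirrors the slice -> filter -> limit order used in runDomain.
function selectCandidates(
  allUrls: string[],
  importedIds: Set<string>,
  toExternalId: (url: string) => string,
  resumeFrom: number,
  limit: number | null,
): string[] {
  const pending = allUrls
    .slice(resumeFrom)
    .filter(url => !importedIds.has(toExternalId(url)));
  return limit === null ? pending : pending.slice(0, limit);
}

// Hypothetical data: five church URLs on a made-up domain.
const urls = ['a', 'b', 'c', 'd', 'e'].map(slug => `https://example.test/${slug}/`);

// Hypothetical stand-in for buildExternalId: uses the last path segment as the slug.
const idOf = (url: string): string => `example/${url.split('/').filter(Boolean).pop()}`;

// First run with --limit 3 picks the first three URLs.
const firstRun = selectCandidates(urls, new Set<string>(), idOf, 0, 3);

// Second run with the same flags: the first three are filtered out,
// so the limit now applies to whatever remains.
const secondRun = selectCandidates(urls, new Set(firstRun.map(idOf)), idOf, 0, 3);

console.log(firstRun.length, secondRun.length);
```

One consequence of this ordering: repeating a `--limit 3` run does not produce an empty batch as long as unimported URLs remain. It moves on to the next unimported URLs instead.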

- [ ] **Step 3: Add `main()` function**

```ts
// ─── Main ─────────────────────────────────────────────────────────────────────

async function main() {
  const args = parseCLIArgs();
  validateArgs(args);

  if (args.jobId) {
    try {
      await prisma.backgroundJob.update({
        where: { id: args.jobId },
        data: { status: 'running', startedAt: new Date() },
      });
    } catch { /* job may not exist yet */ }
  }

  const domainsToRun: [string, SiteConfig][] = args.all
    ? Object.entries(NETWORK_SITES)
    : [[args.domain!, NETWORK_SITES[args.domain!]]];

  const totalStats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };
  // Set when a domain run throws, so an aborted job is not reported as completed.
  let fatal = false;

  try {
    for (let d = 0; d < domainsToRun.length; d++) {
      const [domain, config] = domainsToRun[d];
      console.log(`\n${'─'.repeat(60)}`);
      console.log(`Domain ${d + 1}/${domainsToRun.length}: ${domain} (${config.country})`);
      console.log('─'.repeat(60));
      const stats = await runDomain(domain, config, args);
      totalStats.total += stats.total;
      totalStats.created += stats.created;
      totalStats.updated += stats.updated;
      totalStats.skipped += stats.skipped;
      totalStats.errors += stats.errors;
      totalStats.massSchedulesCreated += stats.massSchedulesCreated;
      if (d < domainsToRun.length - 1) await sleep(DOMAIN_DELAY_MS);
    }
  } catch (err) {
    fatal = true;
    throw err; // rethrown after `finally` runs, so main().catch still reports it
  } finally {
    console.log('\n─── Import Complete ───────────────────────────────────────');
    console.log(`Total processed: ${totalStats.total}`);
    console.log(`Created: ${totalStats.created}`);
    console.log(`Updated: ${totalStats.updated}`);
    console.log(`Skipped: ${totalStats.skipped}`);
    console.log(`Errors: ${totalStats.errors}`);
    console.log(`Mass schedules: ${totalStats.massSchedulesCreated}`);

    if (args.jobId) {
      // Failed if the run aborted or more than 10% of processed churches errored.
      const status = (fatal || totalStats.errors > totalStats.total * 0.1) ? 'failed' : 'completed';
      try {
        await prisma.backgroundJob.update({
          where: { id: args.jobId },
          data: {
            status,
            completedAt: new Date(),
            processed: totalStats.total,
            succeeded: totalStats.created + totalStats.updated,
            failed: totalStats.errors,
            itemsFound: totalStats.massSchedulesCreated,
          },
        });
      } catch { /* ignore */ }
    }

    await prisma.$disconnect();
    await pool.end();
  }
}

main().catch(err => {
  console.error('Fatal error:', err);
  process.exit(1);
});
```
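
The job status written by `main()` depends on a 10% error-rate threshold, and its edge cases are easy to misread. The sketch below restates just that comparison as a standalone function; `resolveJobStatus` and its `threshold` parameter are hypothetical names for illustration and are not part of the script.

```ts
type JobStatus = 'completed' | 'failed';

// Sketch only: the same strict "greater than" comparison that main() uses.
// `threshold` is a hypothetical parameter; the plan hard-codes 0.1.
function resolveJobStatus(total: number, errors: number, threshold = 0.1): JobStatus {
  return errors > total * threshold ? 'failed' : 'completed';
}

// Edge cases worth keeping in mind:
//  - an error rate of exactly 10% still counts as completed (strict comparison);
//  - a run that processed nothing and hit no errors counts as completed;
//  - any error with zero processed items counts as failed.
console.log(resolveJobStatus(100, 10), resolveJobStatus(100, 11), resolveJobStatus(0, 0));
```

Because an empty run reports `completed`, a scheduler that needs to tell "nothing left to import" from "sitemap returned no URLs" would have to look at the `processed` count as well as the status.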

- [ ] **Step 4: Final compile check**

```bash
npx tsc --noEmit
```

Expected: 0 errors.

- [ ] **Step 5: Commit**

```bash
git add scripts/import-buscarmisas-network.ts
git commit -m "feat: add buscarmisas-network importer — CLI + main loop"
```

---

## Chunk 4: Integration + smoke test

### Task 7: package.json and scheduler integration

**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`

- [ ] **Step 1: Add npm script to `package.json`**

In the `"scripts"` section, add after `"import:gcatholic"`:
```json
"import:buscarmisas-network": "tsx scripts/import-buscarmisas-network.ts",
```

- [ ] **Step 2: Add 5 case blocks to `getJobCommand` in `scheduler.ts`**

In `scripts/scheduler.ts`, find the `case 'discovermass-import':` block (around line 240). After it, before the `default:` case, add:

```ts
case 'buscarmisas-network-BR': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmissa.com.br'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-MX': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.com.mx'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-AR': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariosmisa.com.ar'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-CO': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'buscarmisas.co'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
case 'buscarmisas-network-CL': {
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', 'horariomisa.cl'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
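
The five case blocks differ only in the domain string, so the same mapping can also be expressed as a lookup table. The sketch below is an optional alternative, not what the existing `getJobCommand` does; `BUSCARMISAS_JOB_DOMAINS`, `JobCommand`, and `buscarmisasJobCommand` are hypothetical names, and the shape of `config` is assumed from the case blocks above.

```ts
// Hypothetical lookup table: job type -> domain passed to --domain.
const BUSCARMISAS_JOB_DOMAINS: Record<string, string> = {
  'buscarmisas-network-BR': 'horariosmissa.com.br',
  'buscarmisas-network-MX': 'buscarmisas.com.mx',
  'buscarmisas-network-AR': 'horariosmisa.com.ar',
  'buscarmisas-network-CO': 'buscarmisas.co',
  'buscarmisas-network-CL': 'horariomisa.cl',
};

// Assumed return shape, taken from the `{ command, args }` objects in the case blocks.
interface JobCommand {
  command: string;
  args: string[];
}

// Returns null for job types that are not buscarmisas-network jobs,
// so a caller could fall through to its existing switch statement.
function buscarmisasJobCommand(
  type: string,
  config?: { resumeFrom?: number },
): JobCommand | null {
  const domain = BUSCARMISAS_JOB_DOMAINS[type];
  if (!domain) return null;
  const args = ['tsx', 'scripts/import-buscarmisas-network.ts', '--domain', domain];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

console.log(buscarmisasJobCommand('buscarmisas-network-CL', { resumeFrom: 200 }));
```

The explicit case blocks keep the scheduler consistent with how the other importers are registered; the table form mainly pays off if more sites join the network later.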

- [ ] **Step 3: Add 5 pipeline phases to `PIPELINE_GROUPS[0].phases` in `scheduler.ts`**

In `scripts/scheduler.ts`, find the `PIPELINE_GROUPS` array. Inside the first group (`name: 'imports'`), add after the `discovermass-import` phase:

```ts
{ name: 'buscarmisas-network-BR', type: 'buscarmisas-network-BR', config: {} },
{ name: 'buscarmisas-network-MX', type: 'buscarmisas-network-MX', config: {} },
{ name: 'buscarmisas-network-AR', type: 'buscarmisas-network-AR', config: {} },
{ name: 'buscarmisas-network-CO', type: 'buscarmisas-network-CO', config: {} },
{ name: 'buscarmisas-network-CL', type: 'buscarmisas-network-CL', config: {} },
```

- [ ] **Step 4: TypeScript compile check**

```bash
npx tsc --noEmit
```

Expected: 0 errors.

- [ ] **Step 5: Commit**

```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add buscarmisas-network to package.json and scheduler pipeline"
```

---

### Task 8: Smoke test against live sites

- [ ] **Step 1: Dry-run Brazil (verifies parsing + sitemap, no DB writes)**

```bash
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --dry-run
```

Expected: prints church names with mass counts, no DB errors, >4,000 URLs discovered.

- [ ] **Step 2: Live run — 3 churches from Brazil**

```bash
npx tsx scripts/import-buscarmisas-network.ts --domain horariosmissa.com.br --limit 3
```

Expected: 3 churches created in DB, mass schedules created, no errors.

- [ ] **Step 3: Verify in DB**

```bash
npx tsx -e "
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

// Wrapped in an async function so the snippet does not depend on top-level await support in eval mode.
async function main() {
  const pool = new Pool({ connectionString: process.env.DATABASE_URL });
  const prisma = new PrismaClient({ adapter: new PrismaPg(pool) });
  const churches = await prisma.church.findMany({
    where: { source: 'buscarmisas-network' },
    select: { name: true, country: true, buscarmisasNetworkId: true, latitude: true, longitude: true },
    take: 5,
  });
  console.table(churches);
  const massCount = await prisma.massSchedule.count({
    where: { church: { source: 'buscarmisas-network' } },
  });
  console.log('Mass schedules created:', massCount);
  await prisma.\$disconnect();
  await pool.end();
}
main();
"
```

Expected: 3 rows with source `buscarmisas-network`, valid lat/lng, `buscarmisasNetworkId` populated.
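
"Valid lat/lng" in the check above is left to visual inspection. One possible way to make it concrete is sketched below, assuming the columns are nullable numbers; `isPlausibleCoordinate` is a hypothetical helper, and treating `(0, 0)` as invalid is an assumption made here because it more often signals a failed parse than a real church location.

```ts
// Sketch only: a plausibility check for scraped coordinates.
function isPlausibleCoordinate(lat: number | null, lng: number | null): boolean {
  if (lat === null || lng === null) return false;
  if (!Number.isFinite(lat) || !Number.isFinite(lng)) return false;
  if (lat < -90 || lat > 90) return false;
  if (lng < -180 || lng > 180) return false;
  // Assumption: (0, 0) is treated as a parse failure rather than a location.
  return !(lat === 0 && lng === 0);
}

// Hypothetical rows shaped like the findMany result in the verification step.
const sampleRows = [
  { name: 'Example A', latitude: -23.55, longitude: -46.63 },
  { name: 'Example B', latitude: 0, longitude: 0 },
  { name: 'Example C', latitude: null, longitude: -70.65 },
];

for (const row of sampleRows) {
  console.log(row.name, isPlausibleCoordinate(row.latitude, row.longitude));
}
```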
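
- [ ] **Step 4: Test idempotency (re-run should skip already-imported)**

Re-run the same limited test. Expected: the summary line counts the 3 churches from Step 2 as `already imported`, and none of them is fetched or inserted again (all skipped via the `importedIds` Set). Because `--limit` is applied after that filter, the run does not report `0 to process`; it reports `3 to process` and imports the next three URLs from the sitemap.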

- [ ] **Step 5: Dry-run Chile (verifies post-sitemap path)**

```bash
npx tsx scripts/import-buscarmisas-network.ts --domain horariomisa.cl --dry-run
```

Expected: ~935 URLs discovered, Spanish day names parsed correctly.

- [ ] **Step 6: Final commit**

```bash
git add scripts/import-buscarmisas-network.ts package.json scripts/scheduler.ts
git commit -m "feat: complete buscarmisas-network importer — Brazil, Mexico, Argentina, Colombia, Chile"
```