# Brazil + Spain Importers Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add two new church importers — horariodemissa.com.br (8,895 Brazilian churches + 28,523 mass times) and misas.org (17,919 Spanish churches with coordinates).
**Architecture:** Chunk 1 (shared prerequisites) must complete first. Tasks 3-5 (Brazil) and Tasks 6-7 (Spain) are independent of each other and can run in parallel as subagents. All scripts follow the established importer pattern: fetch → regex parse → church-matcher dedup → prisma upsert.
**Tech Stack:** TypeScript, tsx, native `fetch`, regex HTML parsing (matchAll), Prisma + pg, church-matcher
**Spec:** `docs/superpowers/specs/2026-03-10-brazil-spain-importers-design.md`
---
## Chunk 1: Shared Prerequisites (schema + church-matcher)
### Task 1: Schema additions
**Files:**
- Modify: `prisma/schema.prisma`
- [ ] **Step 1: Add two new ID fields to the Church model**
In `prisma/schema.prisma`, find the block of importer ID fields (near `gottesdienstzeitenId`) and add after it:
```prisma
horarioDemissaId String? @unique @map("horario_demissa_id")
misasOrgId String? @unique @map("misas_org_id")
```
Then add two indexes in the `@@index` block at the bottom of the Church model:
```prisma
@@index([horarioDemissaId])
@@index([misasOrgId])
```
- [ ] **Step 2: Regenerate Prisma client**
```bash
npx prisma generate
```
Expected: `✔ Generated Prisma Client` with no errors.
- [ ] **Step 3: Verify the fields exist in generated types**
```bash
grep -n "horarioDemissaId\|misasOrgId" node_modules/.prisma/client/index.d.ts | head -10
```
Expected: both fields appear in the type definitions.
- [ ] **Step 4: Commit**
```bash
git add prisma/schema.prisma
git commit -m "feat: add horarioDemissaId and misasOrgId fields to Church schema"
```
---
### Task 2: church-matcher updates
**Files:**
- Modify: `src/lib/church-matcher.ts`
- [ ] **Step 1: Add new fields to ExistingChurch interface**
In `src/lib/church-matcher.ts`, find `ExistingChurch` interface and add after `gottesdienstzeitenId`:
```typescript
horarioDemissaId: string | null;
misasOrgId: string | null;
```
- [ ] **Step 2: Add new fields to ChurchCandidate type**
Find `ChurchCandidate` type and add after `gottesdienstzeitenId?`:
```typescript
horarioDemissaId?: string;
misasOrgId?: string;
```
- [ ] **Step 3: Add two new exact-match passes in findDuplicateChurch**
After the Thirteenth pass (gottesdienstzeitenId), add before the proximity pass:
```typescript
// Fourteenth pass: exact horarioDemissaId match
if (candidate.horarioDemissaId) {
const match = existingChurches.find(
(church) => church.horarioDemissaId === candidate.horarioDemissaId
);
if (match) return match;
}
// Fifteenth pass: exact misasOrgId match
if (candidate.misasOrgId) {
const match = existingChurches.find(
(church) => church.misasOrgId === candidate.misasOrgId
);
if (match) return match;
}
```
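Both new passes have the same shape: skip when the candidate carries no ID, otherwise return the first existing record with an identical value. A minimal sketch of that shape follows, using a hypothetical `findByExactId` helper and trimmed-down `SketchExisting`/`SketchCandidate` types that are not part of `church-matcher.ts`:

```typescript
// Hypothetical, trimmed-down shapes for illustration only; the real
// ExistingChurch / ChurchCandidate types carry many more fields.
type IdField = 'horarioDemissaId' | 'misasOrgId';
interface SketchExisting { id: string; horarioDemissaId: string | null; misasOrgId: string | null }
interface SketchCandidate { horarioDemissaId?: string; misasOrgId?: string }

// Returns the first existing record whose ID field equals the candidate's,
// or undefined when the candidate has no value for that field.
function findByExactId(
  candidate: SketchCandidate,
  existing: SketchExisting[],
  field: IdField,
): SketchExisting | undefined {
  const wanted = candidate[field];
  if (!wanted) return undefined; // no ID on the candidate: skip the scan
  return existing.find((church) => church[field] === wanted);
}

const sketchRows: SketchExisting[] = [
  { id: 'a', horarioDemissaId: 'dvey2', misasOrgId: null },
  { id: 'b', horarioDemissaId: null, misasOrgId: '1234' },
];
const brazilHit = findByExactId({ horarioDemissaId: 'dvey2' }, sketchRows, 'horarioDemissaId');
const spainMiss = findByExactId({}, sketchRows, 'misasOrgId');
console.log(brazilHit?.id, spainMiss);
```

Whether the real matcher should be refactored into a shared helper like this is a separate decision; the plan keeps the explicit per-pass blocks to match the existing thirteen passes.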
- [ ] **Step 4: Verify TypeScript compiles**
```bash
npx tsc --noEmit
```
Expected: no errors.
- [ ] **Step 5: Commit**
```bash
git add src/lib/church-matcher.ts
git commit -m "feat: add horarioDemissaId and misasOrgId to church-matcher"
```
---
## Chunk 2: Brazil Importer (import-horariodemissa.ts)
> Depends on Chunk 1. Can run in parallel with Chunk 3.
### Task 3: Boilerplate + sitemap enumeration
**Files:**
- Create: `scripts/import-horariodemissa.ts`
- [ ] **Step 1: Create script with boilerplate + types + sitemap parsing**
Create `scripts/import-horariodemissa.ts`:
```typescript
#!/usr/bin/env tsx
/**
* Import Catholic churches and mass schedules from horariodemissa.com.br (Brazil)
*
* horariodemissa.com.br has 8,895 churches across all 26 Brazilian states + DF,
* with 28,523 mass times. All data is server-rendered — one HTTP request per city
* page returns all churches + schedules for that city.
*
* City pages have a split structure:
* - Address/phone: embedded in JS h.push() strings (sidebar/map data)
* - Schedules: in server-rendered .result divs with <table> rows
* Both sets are linked by the same church key (e.g. "dvey2").
*
* Import strategy:
* 1. Fetch sitemap.xml → deduplicate to pt-only city URLs (~3,552 cities)
* 2. For each city: fetch page → parse address/phone from JS + schedules from DOM
* 3. Join by church key, match against existing BR churches, upsert
* 4. Optional --geocode flag for Nominatim pass after import
*
* Usage:
* npx tsx scripts/import-horariodemissa.ts --all
* npx tsx scripts/import-horariodemissa.ts --all --dry-run
* npx tsx scripts/import-horariodemissa.ts --state SP
* npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
* npx tsx scripts/import-horariodemissa.ts --all --geocode
* npx tsx scripts/import-horariodemissa.ts --geocode-only
* npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
connectionString: dbUrl,
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
// ─── Constants ───────────────────────────────────────────────────────────────
const SITE_BASE = 'https://horariodemissa.com.br';
const SITEMAP_URL = `${SITE_BASE}/sitemap.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
// ─── Types ───────────────────────────────────────────────────────────────────
interface CityUrl {
state: string; // e.g. "SP"
city: string; // e.g. "São Paulo"
url: string; // full fetch URL
}
interface ParsedSchedule {
dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
time: string; // "HH:MM"
notes: string | null;
}
interface ParsedConfession {
dayOfWeek: number;
startTime: string;
endTime: string;
notes: string | null;
}
interface ParsedChurch {
key: string; // e.g. "dvey2" (used as horarioDemissaId)
name: string;
address: string | null;
phone: string | null;
city: string;
state: string;
massSchedules: ParsedSchedule[];
confessionSchedules: ParsedConfession[];
}
interface CLIArgs {
all: boolean;
state?: string;
dryRun: boolean;
geocode: boolean;
geocodeOnly: boolean;
resumeFrom?: number;
jobId?: string;
}
interface ImportStats {
citiesProcessed: number;
churchesFound: number;
churchesCreated: number;
churchesUpdated: number;
massSchedulesCreated: number;
geocoded: number;
geocodeFailed: number;
errors: number;
}
// ─── Brazilian Day Name Mapping ───────────────────────────────────────────────
const DAY_MAP: Record<string, number> = {
'domingo': 0,
'segunda-feira': 1, 'segunda': 1,
'terça-feira': 2, 'terca-feira': 2, 'terça': 2, 'terca': 2,
'quarta-feira': 3, 'quarta': 3,
'quinta-feira': 4, 'quinta': 4,
'sexta-feira': 5, 'sexta': 5,
'sábado': 6, 'sabado': 6,
};
const SPECIAL_DAY_MAP: Record<string, { dayOfWeek: number; notes: string }> = {
'primeiro domingo': { dayOfWeek: 0, notes: 'Primeiro Domingo' },
'segundo domingo': { dayOfWeek: 0, notes: 'Segundo Domingo' },
'terceiro domingo': { dayOfWeek: 0, notes: 'Terceiro Domingo' },
'quarto domingo': { dayOfWeek: 0, notes: 'Quarto Domingo' },
'primeiro sábado': { dayOfWeek: 6, notes: 'Primeiro Sábado' },
'primeiro sabado': { dayOfWeek: 6, notes: 'Primeiro Sábado' },
'segundo sábado': { dayOfWeek: 6, notes: 'Segundo Sábado' },
'segundo sabado': { dayOfWeek: 6, notes: 'Segundo Sábado' },
};
// ─── HTTP Client ──────────────────────────────────────────────────────────────
let requestCount = 0;
function delay(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
async function fetchPage(url: string, delayMs: number = REQUEST_DELAY_MS): Promise<string | null> {
if (requestCount > 0) await delay(delayMs);
requestCount++;
try {
const response = await fetch(url, {
headers: {
'User-Agent': USER_AGENT,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'pt-BR,pt;q=0.9',
},
});
if (!response.ok) {
console.error(` HTTP ${response.status} for ${url}`);
return null;
}
return await response.text();
} catch (error) {
console.error(` Fetch error for ${url}: ${error instanceof Error ? error.message : error}`);
return null;
}
}
// ─── Sitemap Parser ───────────────────────────────────────────────────────────
export function parseCityUrlsFromSitemap(sitemapXml: string, filterState?: string): CityUrl[] {
const seen = new Set<string>();
const cities: CityUrl[] = [];
for (const match of sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
const rawUrl = match[1].replace(/&amp;/g, '&');
// Only pt-language city search pages
if (!rawUrl.includes('opcoes=cidade_opcoes') || rawUrl.includes('hl=en')) continue;
const ufMatch = rawUrl.match(/[?&]uf=([A-Z]+)/);
const cidadeMatch = rawUrl.match(/[?&]cidade=([^&]+)/);
if (!ufMatch || !cidadeMatch) continue;
const state = ufMatch[1];
const city = decodeURIComponent(cidadeMatch[1].replace(/\+/g, ' '));
if (filterState && state !== filterState.toUpperCase()) continue;
const key = `${state}:${city}`;
if (seen.has(key)) continue;
seen.add(key);
cities.push({ state, city, url: rawUrl });
}
cities.sort((a, b) => a.state.localeCompare(b.state) || a.city.localeCompare(b.city));
return cities;
}
async function fetchCityUrls(filterState?: string): Promise<CityUrl[]> {
console.log(`Fetching sitemap: ${SITEMAP_URL}`);
const xml = await fetchPage(SITEMAP_URL);
if (!xml) throw new Error('Failed to fetch sitemap');
const cities = parseCityUrlsFromSitemap(xml, filterState);
console.log(`Found ${cities.length} unique cities${filterState ? ` in ${filterState}` : ''}`);
return cities;
}
```
- [ ] **Step 2: Verify sitemap parsing works**
```bash
npx tsx -e "
import dotenv from 'dotenv';
dotenv.config({ path: '.env' });
const { parseCityUrlsFromSitemap } = await import('./scripts/import-horariodemissa.ts');
const xml = await fetch('https://horariodemissa.com.br/sitemap.xml').then(r => r.text());
const cities = parseCityUrlsFromSitemap(xml);
console.log('Total cities:', cities.length);
console.log('Sample:', JSON.stringify(cities.slice(0, 3), null, 2));
const states = [...new Set(cities.map(c => c.state))].sort();
console.log('States:', states.join(', '));
"
```
Expected: ~3,500 cities, states include SP, RJ, MG, RS, BA, DF, etc.
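The live check above depends on the network. An offline spot check against a hand-written fixture can confirm the filtering and deduplication rules first. The sketch below re-implements the core of `parseCityUrlsFromSitemap` in condensed form so that it runs standalone; the `citiesFromSitemap` name and the fixture URLs are invented for illustration and only mimic the query-string shape the parser expects.

```typescript
// Condensed, standalone copy of the filter + dedup rules (illustration only).
interface SketchCity { state: string; city: string; url: string }

function citiesFromSitemap(xml: string): SketchCity[] {
  const seenKeys = new Set<string>();
  const out: SketchCity[] = [];
  for (const m of xml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    const url = m[1].replace(/&amp;/g, '&'); // sitemap XML escapes "&"
    // Keep only city search pages, and drop the English-language twins.
    if (!url.includes('opcoes=cidade_opcoes') || url.includes('hl=en')) continue;
    const uf = url.match(/[?&]uf=([A-Z]+)/);
    const cidade = url.match(/[?&]cidade=([^&]+)/);
    if (!uf || !cidade) continue;
    const city = decodeURIComponent(cidade[1].replace(/\+/g, ' ')); // "+" encodes a space
    const key = `${uf[1]}:${city}`;
    if (seenKeys.has(key)) continue; // same state + city already recorded
    seenKeys.add(key);
    out.push({ state: uf[1], city, url });
  }
  return out;
}

// Invented fixture: one pt URL, its English twin, a duplicate, and a non-city page.
const fixtureXml = `
<urlset>
  <url><loc>https://example.invalid/search.php?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>
  <url><loc>https://example.invalid/search.php?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;opcoes=cidade_opcoes&amp;hl=en</loc></url>
  <url><loc>https://example.invalid/search.php?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>
  <url><loc>https://example.invalid/igreja.php?k=dvey2</loc></url>
</urlset>`;

const fixtureCities = citiesFromSitemap(fixtureXml);
console.log(fixtureCities.map((c) => `${c.state}/${c.city}`));
```

Four `<loc>` entries collapse to a single city here: the English twin and the non-city page are filtered out, and the repeated pt URL is deduplicated by its `state:city` key.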
- [ ] **Step 3: Commit**
```bash
git add scripts/import-horariodemissa.ts
git commit -m "feat: horariodemissa importer scaffold + sitemap enumeration"
```
---
### Task 4: HTML parsing
**Files:**
- Modify: `scripts/import-horariodemissa.ts`
- [ ] **Step 1: Understand the dual-source page structure**
Each city page contains two data sources per church, joined by the same key (e.g. `dvey2`):
**Source A** — JS `h.push()` strings embedded in `<script>` (sidebar/map):
```
h.push('<p><strong><a href="igreja.php?k=dvey2">NAME</a></strong><br/>Rua X, 123</p><p><strong>Telefone:</strong> (11) 1234-5678</p>');
```
Contains: key, name, address, phone.
**Source B** — Server-rendered `.result` divs:
```html
<div class="result">
<a href="igreja.php?k=dvey2" class="result_title">NAME</a>
<p class="blockleft"><table>
<tr><td style="...">Domingo:</td><td>07:30, 10:30</td></tr>
</table></p>
</div>
```
Contains: key + schedule tables (first = masses, optional second = confessions).
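The join between the two sources can be pictured as two maps keyed by the church key, with Source A as the authority for identity. A minimal sketch with invented sample records (the `joinByKey` helper, its types, and the sample values are illustrative and not taken from the site):

```typescript
interface SidebarInfo { title: string; address: string | null }
interface JoinedChurch { key: string; title: string; address: string | null; massTimes: string[] }

// Source A supplies identity; Source B supplies schedules. A key that appears
// only in Source B has no name to store, so it is skipped.
function joinByKey(
  sidebar: Map<string, SidebarInfo>,
  schedules: Map<string, string[]>,
): JoinedChurch[] {
  const joined: JoinedChurch[] = [];
  for (const key of new Set([...sidebar.keys(), ...schedules.keys()])) {
    const info = sidebar.get(key);
    if (!info || !info.title) continue;
    joined.push({
      key,
      title: info.title,
      address: info.address,
      massTimes: schedules.get(key) ?? [], // church without a schedule table
    });
  }
  return joined;
}

const sampleSidebar = new Map<string, SidebarInfo>([
  ['dvey2', { title: 'Paróquia Exemplo', address: 'Rua X, 123' }],
  ['abc12', { title: 'Capela Exemplo', address: null }],
]);
const sampleSchedules = new Map<string, string[]>([
  ['dvey2', ['07:30', '10:30']],
  ['zzz99', ['19:00']], // schedule without a sidebar entry: dropped
]);
const joinedSample = joinByKey(sampleSidebar, sampleSchedules);
console.log(joinedSample.map((c) => `${c.key}:${c.massTimes.length}`));
```

A church that appears only in Source A is kept with an empty schedule list, while one that appears only in Source B is discarded because there is no name to store.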
- [ ] **Step 2: Add parseDayLabel, parseTimeCells, parseMassTable, parseConfessionTable**
```typescript
// ─── HTML Parsers ─────────────────────────────────────────────────────────────
export function parseDayLabel(label: string): { dayOfWeek: number; notes: string | null } | null {
const normalized = label.toLowerCase().replace(/:$/, '').trim();
if (SPECIAL_DAY_MAP[normalized]) {
const s = SPECIAL_DAY_MAP[normalized];
return { dayOfWeek: s.dayOfWeek, notes: s.notes };
}
if (DAY_MAP[normalized] !== undefined) {
return { dayOfWeek: DAY_MAP[normalized], notes: null };
}
return null;
}
export function parseTimeCells(timesText: string): Array<{ time: string; notes: string | null }> {
const results: Array<{ time: string; notes: string | null }> = [];
// Split by comma but not inside parentheses
const parts = timesText.split(/,(?![^(]*\))/);
for (const part of parts) {
const trimmed = part.trim();
if (!trimmed) continue;
const timeMatch = trimmed.match(/\b(\d{1,2}:\d{2})\b/);
if (!timeMatch) continue;
const [h, m] = timeMatch[1].split(':');
const time = `${h.padStart(2, '0')}:${m}`;
const notesMatch = trimmed.match(/\(([^)]+)\)/);
results.push({ time, notes: notesMatch ? notesMatch[1].trim() : null });
}
return results;
}
export function parseMassTable(tableHtml: string): ParsedSchedule[] {
const schedules: ParsedSchedule[] = [];
for (const rowMatch of tableHtml.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)) {
const tds = [...rowMatch[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
.map(m => m[1].replace(/<[^>]+>/g, '').trim());
if (tds.length < 2) continue;
const dayResult = parseDayLabel(tds[0]);
if (!dayResult) continue;
for (const { time, notes } of parseTimeCells(tds[1])) {
schedules.push({
dayOfWeek: dayResult.dayOfWeek,
time,
notes: [dayResult.notes, notes].filter(Boolean).join('; ') || null,
});
}
}
return schedules;
}
export function parseConfessionTable(tableHtml: string): ParsedConfession[] {
const confessions: ParsedConfession[] = [];
for (const rowMatch of tableHtml.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)) {
const tds = [...rowMatch[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
.map(m => m[1].replace(/<[^>]+>/g, '').trim());
if (tds.length < 2) continue;
const dayResult = parseDayLabel(tds[0]);
if (!dayResult) continue;
// "09:00 às 11:00" or "09:00 a 11:00"
const rangeMatch = tds[1].match(/(\d{1,2}:\d{2})\s+(?:às|a)\s+(\d{1,2}:\d{2})/i);
if (!rangeMatch) continue;
const pad = (t: string) => { const [hh, mm] = t.split(':'); return `${hh.padStart(2,'0')}:${mm}`; };
confessions.push({
dayOfWeek: dayResult.dayOfWeek,
startTime: pad(rangeMatch[1]),
endTime: pad(rangeMatch[2]),
notes: dayResult.notes,
});
}
return confessions;
}
/**
* Parse a full city page HTML into church records.
* Joins h.push() JS data (name/address/phone) with .result DOM (schedules) by church key.
*/
export function parseCityPage(html: string, city: string, state: string): ParsedChurch[] {
// Parse Source A: h.push() JS strings → name, address, phone
const jsData = new Map<string, { name: string; address: string | null; phone: string | null }>();
for (const pushMatch of html.matchAll(/h\.push\('([\s\S]*?)'\);/g)) {
const content = pushMatch[1].replace(/\\'/g, "'");
const keyMatch = content.match(/igreja\.php\?k=([a-zA-Z0-9]+)/);
if (!keyMatch) continue;
const nameMatch = content.match(/igreja\.php\?k=[^"]+">([^<]+)<\/a>/);
const addrMatch = content.match(/<br\/>([^<]+)<\/p>/);
const phoneMatch = content.match(/Telefone:<\/strong>\s*([^<]+)/);
jsData.set(keyMatch[1], {
name: nameMatch ? nameMatch[1].trim() : '',
address: addrMatch ? addrMatch[1].trim() || null : null,
phone: phoneMatch ? phoneMatch[1].trim() || null : null,
});
}
// Parse Source B: .result divs → schedules
// Use split() rather than a lookahead regex — lookahead with $ drops the last result div
const scheduleData = new Map<string, { massSchedules: ParsedSchedule[]; confessionSchedules: ParsedConfession[] }>();
const resultParts = html.split('<div class="result">');
for (let i = 1; i < resultParts.length; i++) {
const resultHtml = resultParts[i];
const keyMatch = resultHtml.match(/href="igreja\.php\?k=([a-zA-Z0-9]+)"/);
if (!keyMatch) continue;
const tables = [...resultHtml.matchAll(/<table>([\s\S]*?)<\/table>/g)].map(m => m[1]);
scheduleData.set(keyMatch[1], {
massSchedules: tables[0] ? parseMassTable(tables[0]) : [],
confessionSchedules: tables[1] ? parseConfessionTable(tables[1]) : [],
});
}
// Join both sources by church key — every church in jsData gets its schedules from scheduleData
const allKeys = new Set([...jsData.keys(), ...scheduleData.keys()]);
const churches: ParsedChurch[] = [];
for (const key of allKeys) {
const js = jsData.get(key);
const sched = scheduleData.get(key);
if (!js?.name) continue;
churches.push({
key,
name: js.name,
address: js.address,
phone: js.phone,
city,
state,
massSchedules: sched?.massSchedules ?? [],
confessionSchedules: sched?.confessionSchedules ?? [],
});
}
return churches;
}
```
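The comma split in `parseTimeCells` relies on a negative lookahead so that commas inside a parenthesised note do not separate a time from its note. A standalone sketch of just that split, with invented sample cell text (the `splitOutsideParens` name is illustrative):

```typescript
// Split on commas that are NOT followed by a ")" before the next "(",
// i.e. commas outside parentheses. Assumes parentheses are not nested.
function splitOutsideParens(cellText: string): string[] {
  return cellText
    .split(/,(?![^(]*\))/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

const plainCell = splitOutsideParens('07:30, 10:30');
const notedCell = splitOutsideParens('7:00, 19:00 (Latim, cantada)');
console.log(plainCell, notedCell);
```

In the second sample the comma inside `(Latim, cantada)` is left alone, so the note stays attached to `19:00` and can be picked up by the later `\(([^)]+)\)` match.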
- [ ] **Step 3: Verify parsing against a live city page**
```bash
npx tsx -e "
import dotenv from 'dotenv';
dotenv.config({ path: '.env' });
const { parseCityPage } = await import('./scripts/import-horariodemissa.ts');
const url = 'https://horariodemissa.com.br/search.php?uf=SP&cidade=S%C3%A3o+Paulo&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt';
const html = await fetch(url, { headers: { 'User-Agent': 'NearestMass-Importer/1.0' } }).then(r => r.text());
const churches = parseCityPage(html, 'São Paulo', 'SP');
console.log('Churches found:', churches.length);
console.log('With schedules:', churches.filter(c => c.massSchedules.length > 0).length);
console.log('Sample:', JSON.stringify(churches[0], null, 2));
"
```
Expected: 20+ churches found, majority with mass schedules, first entry shows name/address/phone/schedules.
- [ ] **Step 4: Commit**
```bash
git add scripts/import-horariodemissa.ts
git commit -m "feat: horariodemissa HTML parser (day mapping, schedule tables, dual-source join)"
```
---
### Task 5: DB upsert + main()
**Files:**
- Modify: `scripts/import-horariodemissa.ts`
- [ ] **Step 1: Add geocode helper**
```typescript
// ─── Geocoding ────────────────────────────────────────────────────────────────
async function geocodeAddress(address: string, city: string, state: string): Promise<{ lat: number; lng: number } | null> {
const query = [address, city, state, 'Brasil'].filter(Boolean).join(', ');
const url = `${NOMINATIM_URL}?q=${encodeURIComponent(query)}&format=json&limit=1&countrycodes=br`;
await delay(NOMINATIM_DELAY_MS);
try {
const response = await fetch(url, {
headers: { 'User-Agent': USER_AGENT, 'Accept': 'application/json' },
});
if (!response.ok) return null;
const results = await response.json() as Array<{ lat: string; lon: string }>;
if (!results.length) return null;
return { lat: parseFloat(results[0].lat), lng: parseFloat(results[0].lon) };
} catch {
return null;
}
}
```
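Building the query string by hand works, but `URLSearchParams` encodes every parameter and keeps the parameter list readable. A sketch of an equivalent URL builder follows; the `buildNominatimUrl` helper name is hypothetical, and the four parameters are the same ones used in `geocodeAddress` above.

```typescript
const NOMINATIM_SEARCH = 'https://nominatim.openstreetmap.org/search';

// Builds an equivalent search URL to the one in geocodeAddress(), skipping
// empty address parts. URLSearchParams encodes spaces as "+", not "%20".
function buildNominatimUrl(address: string, city: string, state: string): string {
  const query = [address, city, state, 'Brasil'].filter(Boolean).join(', ');
  const params = new URLSearchParams({
    q: query,
    format: 'json',
    limit: '1',
    countrycodes: 'br',
  });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

const sampleGeocodeUrl = buildNominatimUrl('Rua X, 123', 'São Paulo', 'SP');
console.log(sampleGeocodeUrl);
```

Keeping the URL construction in a pure function also makes it possible to check the query text without sending a request to Nominatim.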
- [ ] **Step 2: Add upsertChurch function**
Note: `latitude`/`longitude` are non-nullable in the schema. Use `0` as the sentinel for "no coordinates yet" (geocode pass will fill these in). The `source` field must be set explicitly — the schema default is `"masstimes"` which would corrupt source-based queries.
```typescript
// ─── DB Upsert ────────────────────────────────────────────────────────────────
async function upsertChurch(
parsed: ParsedChurch,
existingChurches: ExistingChurch[],
args: CLIArgs,
stats: ImportStats
): Promise<void> {
const candidate = { name: parsed.name, lat: 0, lng: 0, horarioDemissaId: parsed.key };
const existing = findDuplicateChurch(candidate, existingChurches);
if (args.dryRun) {
console.log(` [dry-run] ${existing ? 'UPDATE' : 'CREATE'} ${parsed.name} (${parsed.key})`);
if (existing) stats.churchesUpdated++; else stats.churchesCreated++;
return;
}
try {
let churchId: string;
await prisma.$transaction(async (tx) => {
const church = await tx.church.upsert({
where: { horarioDemissaId: parsed.key },
create: {
horarioDemissaId: parsed.key,
name: parsed.name,
address: parsed.address,
city: parsed.city,
state: parsed.state,
country: 'BR',
phone: parsed.phone,
source: 'horario-demissa', // must set explicitly — schema default is "masstimes"
latitude: 0, // sentinel for "no coordinates"; geocode pass fills this in
longitude: 0,
lastScrapedAt: new Date(),
scrapeStrategy: 'horario-demissa',
},
update: {
name: parsed.name,
address: parsed.address ?? undefined,
city: parsed.city,
state: parsed.state,
phone: parsed.phone ?? undefined,
lastScrapedAt: new Date(),
},
});
churchId = church.id;
await tx.massSchedule.deleteMany({ where: { churchId: church.id } });
if (parsed.massSchedules.length > 0) {
// Deduplicate by day+time before inserting
const seen = new Set<string>();
const deduped = parsed.massSchedules.filter((s) => {
const k = `${s.dayOfWeek}:${s.time}`;
return seen.has(k) ? false : (seen.add(k), true);
});
await tx.massSchedule.createMany({
data: deduped.map((s) => ({
churchId: church.id,
dayOfWeek: s.dayOfWeek,
time: s.time,
notes: s.notes,
})),
});
stats.massSchedulesCreated += deduped.length;
}
await tx.confessionSchedule.deleteMany({ where: { churchId: church.id } });
if (parsed.confessionSchedules.length > 0) {
await tx.confessionSchedule.createMany({
data: parsed.confessionSchedules.map((c) => ({
churchId: church.id,
dayOfWeek: c.dayOfWeek,
startTime: c.startTime,
endTime: c.endTime,
notes: c.notes,
})),
});
}
});
if (existing) {
stats.churchesUpdated++;
} else {
stats.churchesCreated++;
// Use real DB UUID (churchId!) not the source key string
existingChurches.push({
id: churchId!, name: parsed.name, latitude: 0, longitude: 0,
osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
bohosluzbyId: null, miserendId: null, kerknetId: null,
gottesdienstzeitenId: null, horarioDemissaId: parsed.key, misasOrgId: null,
source: 'horario-demissa', website: null, phone: parsed.phone,
address: parsed.address, country: 'BR',
});
}
} catch (error) {
console.error(` Error upserting ${parsed.name}: ${error instanceof Error ? error.message : error}`);
stats.errors++;
}
}
```
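The day+time deduplication inside the transaction keeps the first occurrence of each slot and drops later ones, regardless of their notes. As a standalone sketch (the `dedupeBySlot` name and `SlotSketch` type are illustrative):

```typescript
interface SlotSketch { dayOfWeek: number; time: string; notes: string | null }

// Keeps the first schedule seen for each (dayOfWeek, time) pair.
function dedupeBySlot(schedules: SlotSketch[]): SlotSketch[] {
  const seenSlots = new Set<string>();
  return schedules.filter((s) => {
    const slot = `${s.dayOfWeek}:${s.time}`;
    if (seenSlots.has(slot)) return false;
    seenSlots.add(slot);
    return true;
  });
}

const dedupedSample = dedupeBySlot([
  { dayOfWeek: 0, time: '07:30', notes: null },
  { dayOfWeek: 0, time: '07:30', notes: 'Primeiro Domingo' }, // same slot: dropped
  { dayOfWeek: 6, time: '07:30', notes: null },
]);
console.log(dedupedSample.length); // 2
```

One consequence worth knowing: a "Primeiro Domingo" row at the same time as the regular Sunday Mass loses its note, because both rows map to `dayOfWeek: 0` and the first one wins.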
- [ ] **Step 3: Add geocodeOnly pass**
Note: `latitude` is non-nullable (`Float` in schema), so `{ latitude: null }` will never match. Use `{ latitude: 0 }` — that is the sentinel value set on creation for address-only churches.
```typescript
async function runGeocodeOnly(stats: ImportStats): Promise<void> {
console.log('\nGeocoding Brazilian churches without coordinates...');
const churches = await prisma.church.findMany({
where: { horarioDemissaId: { not: null }, latitude: 0, address: { not: null } },
select: { id: true, name: true, address: true, city: true, state: true },
});
console.log(`Found ${churches.length} churches to geocode`);
for (const church of churches) {
const coords = await geocodeAddress(church.address!, church.city ?? '', church.state ?? '');
if (coords) {
await prisma.church.update({ where: { id: church.id }, data: { latitude: coords.lat, longitude: coords.lng } });
stats.geocoded++;
console.log(` Geocoded: ${church.name} → ${coords.lat}, ${coords.lng}`);
} else {
stats.geocodeFailed++;
}
}
}
```
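Because `0` doubles as the missing-coordinates marker, any code that reads coordinates has to treat the sentinel as absent. A small sketch of such a check follows; the `hasRealCoordinates` helper is hypothetical and not part of the codebase, and unlike the queries above it tests both axes rather than latitude alone.

```typescript
// (0, 0) is the "not geocoded yet" sentinel written at creation time.
// Checking both axes avoids misclassifying a point that lies exactly on
// the equator or on the prime meridian.
function hasRealCoordinates(latitude: number, longitude: number): boolean {
  return !(latitude === 0 && longitude === 0);
}

console.log(hasRealCoordinates(0, 0));           // false
console.log(hasRealCoordinates(-23.55, -46.63)); // true
console.log(hasRealCoordinates(0, -51.07));      // true
```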
- [ ] **Step 4: Add CLI arg parser + main()**
```typescript
// ─── CLI + Main ───────────────────────────────────────────────────────────────
function parseArgs(): CLIArgs {
const argv = process.argv.slice(2);
const idx = (flag: string) => argv.indexOf(flag);
return {
all: argv.includes('--all'),
state: idx('--state') >= 0 ? argv[idx('--state') + 1] : undefined,
dryRun: argv.includes('--dry-run'),
geocode: argv.includes('--geocode'),
geocodeOnly: argv.includes('--geocode-only'),
resumeFrom: idx('--resume-from') >= 0 ? parseInt(argv[idx('--resume-from') + 1], 10) : undefined,
jobId: idx('--job-id') >= 0 ? argv[idx('--job-id') + 1] : undefined,
};
}
async function main(): Promise<void> {
const args = parseArgs();
const stats: ImportStats = {
citiesProcessed: 0, churchesFound: 0, churchesCreated: 0,
churchesUpdated: 0, massSchedulesCreated: 0,
geocoded: 0, geocodeFailed: 0, errors: 0,
};
console.log('\n' + '='.repeat(70));
console.log('HORARIO DE MISSA (BRAZIL) IMPORTER');
console.log('='.repeat(70));
console.log(`Mode: ${args.geocodeOnly ? 'geocode-only' : args.dryRun ? 'dry-run' : 'import'}`);
if (args.state) console.log(`State filter: ${args.state}`);
if (args.resumeFrom) console.log(`Resume from: ${args.resumeFrom}`);
console.log(`Time: ${new Date().toISOString()}\n`);
try {
if (args.geocodeOnly) {
await runGeocodeOnly(stats);
} else if (args.all || args.state) {
console.log('Loading existing BR churches...');
const existingChurches = await prisma.church.findMany({
where: { country: 'BR' },
select: {
id: true, name: true, latitude: true, longitude: true,
osmId: true, baiduId: true, masstimesId: true, orarimesseId: true,
massSchedulesPhId: true, philmassId: true, horariosMisasId: true,
mszeInfoId: true, weekdayMassesId: true, messesInfoId: true,
bohosluzbyId: true, miserendId: true, kerknetId: true,
gottesdienstzeitenId: true, horarioDemissaId: true, misasOrgId: true,
source: true, website: true, phone: true, address: true, country: true,
},
}) as ExistingChurch[];
console.log(`Loaded ${existingChurches.length} existing BR churches\n`);
const cities = await fetchCityUrls(args.state);
const startIndex = args.resumeFrom ?? 0;
for (let i = startIndex; i < cities.length; i++) {
const { state, city, url } = cities[i];
console.log(`[${i + 1}/${cities.length}] ${state} / ${city}`);
const html = await fetchPage(url);
if (!html) { stats.errors++; continue; }
const churches = parseCityPage(html, city, state);
stats.churchesFound += churches.length;
stats.citiesProcessed++;
console.log(` ${churches.length} churches`);
for (const church of churches) {
await upsertChurch(church, existingChurches, args, stats);
}
if (args.geocode && !args.dryRun) {
for (const church of churches) {
if (!church.address) continue;
const dbChurch = await prisma.church.findUnique({
where: { horarioDemissaId: church.key },
select: { id: true, latitude: true },
});
// latitude === 0 is the sentinel for "no real coordinates yet"
if (dbChurch && dbChurch.latitude === 0) {
const coords = await geocodeAddress(church.address, church.city, church.state);
if (coords) {
await prisma.church.update({ where: { id: dbChurch.id }, data: { latitude: coords.lat, longitude: coords.lng } });
stats.geocoded++;
} else {
stats.geocodeFailed++;
}
}
}
}
}
} else {
console.error('Usage: --all | --state XX | --geocode-only');
process.exit(1);
}
} finally {
await prisma.$disconnect();
await pool.end();
}
console.log('\n' + '='.repeat(70));
console.log('SUMMARY');
console.log('='.repeat(70));
console.log(`Cities processed: ${stats.citiesProcessed}`);
console.log(`Churches found: ${stats.churchesFound}`);
console.log(` Created: ${stats.churchesCreated}`);
console.log(` Updated: ${stats.churchesUpdated}`);
console.log(` Errors: ${stats.errors}`);
console.log(`Mass schedules: ${stats.massSchedulesCreated}`);
if (args.geocode || args.geocodeOnly) {
console.log(`Geocoded: ${stats.geocoded} / Failed: ${stats.geocodeFailed}`);
}
console.log('='.repeat(70) + '\n');
}
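// Exit non-zero on an unhandled failure so job runners can detect it.
main().catch((error) => {
  console.error(error);
  process.exit(1);
});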
```
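`parseInt` returns `NaN` for a missing or malformed value, and a `NaN` start index makes `i < cities.length` false, so `--resume-from` with a bad value would end the run without processing any city. A sketch of a stricter reader for numeric flags (the `readIntFlag` helper is hypothetical):

```typescript
// Returns the non-negative integer following `flag`, undefined when the flag
// is absent, and throws when the value is missing or not a whole number.
function readIntFlag(argv: string[], flag: string): number | undefined {
  const at = argv.indexOf(flag);
  if (at < 0) return undefined;
  const raw = argv[at + 1];
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new Error(`${flag} expects a non-negative integer, got: ${raw ?? '(nothing)'}`);
  }
  return Number.parseInt(raw, 10);
}

console.log(readIntFlag(['--all', '--resume-from', '500'], '--resume-from')); // 500
console.log(readIntFlag(['--all'], '--resume-from'));                          // undefined
```

In `parseArgs` this could stand in for the inline `parseInt` call, at the cost of the script throwing on bad input instead of silently doing nothing.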
- [ ] **Step 5: Test dry-run on small state**
```bash
npx tsx scripts/import-horariodemissa.ts --state DF --dry-run
```
Expected: Lists churches from Distrito Federal (Brasília) without DB writes.
- [ ] **Step 6: Test real import on smallest state (Roraima)**
```bash
npx tsx scripts/import-horariodemissa.ts --state RR
```
Then verify:
```bash
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
import { prisma } from './src/lib/db.ts';
const count = await prisma.church.count({ where: { country: 'BR' } });
const sched = await prisma.massSchedule.count({ where: { church: { country: 'BR' } } });
console.log('BR churches:', count, '| Mass schedules:', sched);
await prisma.\$disconnect();
"
```
Expected: Some churches from Roraima with mass schedules in DB.
- [ ] **Step 7: Commit**
```bash
git add scripts/import-horariodemissa.ts
git commit -m "feat: complete horariodemissa importer (Brazil, 8895 churches + 28523 mass times)"
```
---
## Chunk 3: Spain Importer (import-misas.ts)
> Depends on Chunk 1. Can run in parallel with Chunk 2.
### Task 6: API pagination + boilerplate
**Files:**
- Create: `scripts/import-misas.ts`
- [ ] **Step 1: Create script with boilerplate + API pagination**
Create `scripts/import-misas.ts`:
```typescript
#!/usr/bin/env tsx
/**
* Import Catholic churches from misas.org (Spain)
*
* misas.org lists 17,919 Spanish parishes with name, address, coordinates,
* and province via a public JSON REST API. Mass schedules are auth-gated
* (401 on detail endpoint), so this importer creates/updates church records
* only — no schedule data.
*
* The listing API accepts offset-based pagination. We use Madrid as the center
* with a large radius (999999m) to cover all of Spain in a single stream.
*
* Import strategy:
* 1. Paginate GET /api/parishsearch?country=es&pos=[...]&offset=N&limit=500
* 2. For each parish: id, name, addr, loc (city), prov (province), zip, lat, long
* 3. Match against existing ES churches by misasOrgId or proximity+name
* 4. Upsert church record (no mass schedules)
*
* Usage:
* npx tsx scripts/import-misas.ts --all
* npx tsx scripts/import-misas.ts --all --dry-run
* npx tsx scripts/import-misas.ts --all --resume-from 5000
* npx tsx scripts/import-misas.ts --all --job-id {uuid}
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
connectionString: dbUrl,
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
// ─── Constants ───────────────────────────────────────────────────────────────
const API_BASE = 'https://misas.org/api/parishsearch';
// Madrid coordinates, large radius covers all of Spain
const SPAIN_POS = encodeURIComponent('[-3.7038,40.4168,999999]');
const PAGE_SIZE = 500;
const REQUEST_DELAY_MS = 500;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
// ─── Types ───────────────────────────────────────────────────────────────────
interface MisasParish {
id: number;
name: string;
uri: string;
addr: string;
loc: string; // city
prov: string; // province
zip: string;
lat: string;
long: string;
}
interface MisasApiResponse {
count: number;
pars: MisasParish[];
}
interface CLIArgs {
all: boolean;
dryRun: boolean;
resumeFrom?: number;
jobId?: string;
}
interface ImportStats {
total: number;
created: number;
updated: number;
errors: number;
}
// ─── HTTP Client ──────────────────────────────────────────────────────────────
let requestCount = 0;

function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function fetchParishes(offset: number): Promise<MisasApiResponse | null> {
  if (requestCount > 0) await delay(REQUEST_DELAY_MS);
  requestCount++;
  const url = `${API_BASE}?country=es&pos=${SPAIN_POS}&offset=${offset}&limit=${PAGE_SIZE}`;
  try {
    const response = await fetch(url, {
      headers: {
        'User-Agent': USER_AGENT,
        'Accept': 'application/json',
        'Referer': 'https://misas.org/',
      },
    });
    if (!response.ok) {
      console.error(` HTTP ${response.status} at offset ${offset}`);
      return null;
    }
    return await response.json() as MisasApiResponse;
  } catch (error) {
    console.error(` Fetch error at offset ${offset}: ${error instanceof Error ? error.message : error}`);
    return null;
  }
}
// ─── Pagination ───────────────────────────────────────────────────────────────
export async function* paginateParishes(startOffset: number = 0): AsyncGenerator<MisasParish> {
  let offset = startOffset;
  let totalKnown = Infinity;
  while (offset < totalKnown) {
    console.log(` Fetching offset ${offset}${totalKnown < Infinity ? `/${totalKnown}` : ''}...`);
    const data = await fetchParishes(offset);
    // A failed request (null) or an empty page ends pagination early;
    // rerun with --resume-from <last logged offset> to continue.
    if (!data || !data.pars || data.pars.length === 0) break;
    totalKnown = data.count;
    for (const parish of data.pars) {
      yield parish;
    }
    // Advance by what was actually returned, not by PAGE_SIZE
    offset += data.pars.length;
  }
}
```
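The termination rules of the pagination loop can be exercised without touching the network. The sketch below is not part of the importer: `paginate`, `Page`, `fakeFetch`, and `collect` are hypothetical names, and the fetcher is injected as a parameter purely so an in-memory fake can stand in for `fetchParishes`.

```typescript
// Hypothetical page shape mirroring MisasApiResponse: `count` is the total, `pars` is one page.
interface Page<T> {
  count: number;
  pars: T[];
}

// Same loop shape as paginateParishes, with the fetcher injected (an assumption for testability).
async function* paginate<T>(
  fetchPage: (offset: number) => Promise<Page<T> | null>,
  startOffset = 0,
): AsyncGenerator<T> {
  let offset = startOffset;
  let totalKnown = Infinity;
  while (offset < totalKnown) {
    const page = await fetchPage(offset);
    // A failed request (null) and an empty page both end iteration.
    if (!page || page.pars.length === 0) break;
    totalKnown = page.count;
    yield* page.pars;
    offset += page.pars.length; // advance by what was actually returned
  }
}

// In-memory fake: 7 items served 3 per page.
const items = [1, 2, 3, 4, 5, 6, 7];
const fakeFetch = async (offset: number): Promise<Page<number>> => ({
  count: items.length,
  pars: items.slice(offset, offset + 3),
});

// Drain an async generator into an array.
async function collect<T>(gen: AsyncGenerator<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of gen) out.push(item);
  return out;
}
```

Because a `null` page ends iteration the same way the last page does, a run that stops early looks like a completed run unless the final logged offset is compared against the reported total.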
- [ ] **Step 2: Verify API returns expected data**
```bash
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
const { paginateParishes } = await import('./scripts/import-misas.ts');
let count = 0;
for await (const p of paginateParishes()) {
  if (count === 0) console.log('First parish:', JSON.stringify(p, null, 2));
  count++;
  if (count >= 5) break;
}
console.log('Fetched:', count, 'from first batch');
"
```
Expected: the first parish printed as JSON with `id`, `name`, `lat`, `long`, `addr`, `loc`, and `prov` fields, followed by `Fetched: 5 from first batch`.
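If eyeballing the printed JSON is not enough, the expected shape can also be checked at runtime. This is a minimal sketch, not something the plan requires: `isMisasParish` is a hypothetical helper, it only checks the fields named above, and the sample object uses made-up placeholder values rather than real API output.

```typescript
interface MisasParish {
  id: number;
  name: string;
  uri: string;
  addr: string;
  loc: string;
  prov: string;
  zip: string;
  lat: string;
  long: string;
}

// Hypothetical type guard: numeric id, string values for the fields listed in the expectation.
function isMisasParish(value: unknown): value is MisasParish {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = ['name', 'addr', 'loc', 'prov', 'lat', 'long'];
  return typeof v.id === 'number' && stringFields.every((f) => typeof v[f] === 'string');
}

// Placeholder values for illustration only (not real misas.org data).
const sample = {
  id: 1,
  name: 'Example Parish',
  uri: '/example',
  addr: 'Example Street 1',
  loc: 'Madrid',
  prov: 'Madrid',
  zip: '28001',
  lat: '40.4168',
  long: '-3.7038',
};
```

A guard like this would flag a changed response format (for example, numeric `lat`/`long`) before any rows are written.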
- [ ] **Step 3: Commit**
```bash
git add scripts/import-misas.ts
git commit -m "feat: misas.org importer scaffold + API pagination"
```
---
### Task 7: DB upsert + main()
**Files:**
- Modify: `scripts/import-misas.ts`
- [ ] **Step 1: Add upsertParish + main()**
Note: `latitude`/`longitude` are non-nullable `Float` columns, so use `0` as a sentinel when coordinates are missing. Set `source` explicitly to `'misas-org'`; the schema default is `"masstimes"`.
```typescript
// ─── DB Upsert ────────────────────────────────────────────────────────────────
async function upsertParish(
  parish: MisasParish,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats
): Promise<void> {
  const lat = parseFloat(parish.lat);
  const lng = parseFloat(parish.long);
  const misasOrgId = String(parish.id);
  // Treat the pair as missing unless both values parse; 0/0 is the sentinel.
  const hasCoords = !isNaN(lat) && !isNaN(lng);
  const resolvedLat = hasCoords ? lat : 0;
  const resolvedLng = hasCoords ? lng : 0;

  const candidate = {
    name: parish.name,
    lat: resolvedLat,
    lng: resolvedLng,
    misasOrgId,
  };
  const existing = findDuplicateChurch(candidate, existingChurches);

  if (args.dryRun) {
    console.log(` [dry-run] ${existing ? 'UPDATE' : 'CREATE'} ${parish.name} (${misasOrgId})`);
    stats.total++;
    if (existing) stats.updated++; else stats.created++;
    return;
  }

  try {
    const church = await prisma.church.upsert({
      where: { misasOrgId },
      create: {
        misasOrgId,
        name: parish.name,
        address: parish.addr || null,
        city: parish.loc || null,
        state: parish.prov || null,
        zip: parish.zip || null,
        country: 'ES',
        source: 'misas-org', // must set explicitly; schema default is "masstimes"
        latitude: resolvedLat, // 0 = no real coordinates; misas.org provides coords for most
        longitude: resolvedLng,
        lastScrapedAt: new Date(),
        scrapeStrategy: 'misas-org',
      },
      update: {
        name: parish.name,
        address: parish.addr || undefined,
        city: parish.loc || undefined,
        state: parish.prov || undefined,
        zip: parish.zip || undefined,
        // Only update coords if we have real values (don't overwrite good data with 0)
        ...(hasCoords && { latitude: resolvedLat, longitude: resolvedLng }),
        // Redundant on this path: the upsert matched on misasOrgId, so a church that
        // findDuplicateChurch matched only by proximity is not the row updated here.
        misasOrgId,
        lastScrapedAt: new Date(),
      },
    });

    if (existing) {
      stats.updated++;
    } else {
      stats.created++;
      // Keep the in-memory list current so later parishes can match this new row
      existingChurches.push({
        id: church.id, name: parish.name,
        latitude: resolvedLat, longitude: resolvedLng,
        osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
        massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
        mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
        bohosluzbyId: null, miserendId: null, kerknetId: null,
        gottesdienstzeitenId: null, horarioDemissaId: null, misasOrgId,
        source: 'misas-org', website: null, phone: null,
        address: parish.addr || null, country: 'ES',
      });
    }
    stats.total++;
  } catch (error) {
    console.error(` Error upserting ${parish.name}: ${error instanceof Error ? error.message : error}`);
    stats.errors++;
    stats.total++; // count errors in total so progress log fires correctly
  }
}

// ─── CLI + Main ───────────────────────────────────────────────────────────────
// Note: --job-id is accepted for scheduler compatibility but BackgroundJob status
// tracking is not wired up in this importer (acceptable for v1 — add later if needed).
function parseArgs(): CLIArgs {
  const argv = process.argv.slice(2);
  const idx = (flag: string) => argv.indexOf(flag);
  return {
    all: argv.includes('--all'),
    dryRun: argv.includes('--dry-run'),
    resumeFrom: idx('--resume-from') >= 0 ? parseInt(argv[idx('--resume-from') + 1], 10) : undefined,
    jobId: idx('--job-id') >= 0 ? argv[idx('--job-id') + 1] : undefined,
  };
}

async function main(): Promise<void> {
  const args = parseArgs();
  const stats: ImportStats = { total: 0, created: 0, updated: 0, errors: 0 };

  console.log('\n' + '='.repeat(70));
  console.log('MISAS.ORG (SPAIN) IMPORTER');
  console.log('='.repeat(70));
  console.log(`Mode: ${args.dryRun ? 'dry-run' : 'import'}`);
  if (args.resumeFrom) console.log(`Resume from offset: ${args.resumeFrom}`);
  console.log(`Time: ${new Date().toISOString()}\n`);

  if (!args.all) {
    console.error('Usage: --all [--dry-run] [--resume-from N]');
    process.exit(1);
  }

  try {
    console.log('Loading existing ES churches...');
    const existingChurches = await prisma.church.findMany({
      where: { country: 'ES' },
      select: {
        id: true, name: true, latitude: true, longitude: true,
        osmId: true, baiduId: true, masstimesId: true, orarimesseId: true,
        massSchedulesPhId: true, philmassId: true, horariosMisasId: true,
        mszeInfoId: true, weekdayMassesId: true, messesInfoId: true,
        bohosluzbyId: true, miserendId: true, kerknetId: true,
        gottesdienstzeitenId: true, horarioDemissaId: true, misasOrgId: true,
        source: true, website: true, phone: true, address: true, country: true,
      },
    }) as ExistingChurch[];
    console.log(`Loaded ${existingChurches.length} existing ES churches\n`);

    for await (const parish of paginateParishes(args.resumeFrom ?? 0)) {
      await upsertParish(parish, existingChurches, args, stats);
      if (stats.total % 500 === 0) {
        console.log(` Progress: ${stats.total} processed (${stats.created} created, ${stats.updated} updated)`);
      }
    }
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }

  console.log('\n' + '='.repeat(70));
  console.log('SUMMARY');
  console.log('='.repeat(70));
  console.log(`Total processed: ${stats.total}`);
  console.log(` Created: ${stats.created}`);
  console.log(` Updated: ${stats.updated}`);
  console.log(` Errors: ${stats.errors}`);
  console.log('='.repeat(70) + '\n');
}

// Exit non-zero on an unhandled failure so the scheduler can detect it
main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
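The coordinate-sentinel rule from the note above can be isolated into a small pure function, which makes it easy to test separately from Prisma. This is a sketch under stated assumptions: `resolveCoords`, `coordUpdate`, and `ResolvedCoords` are hypothetical names, and it assumes a pair should count as missing unless both values parse.

```typescript
interface ResolvedCoords {
  latitude: number;
  longitude: number;
  hasCoords: boolean;
}

// Parse the API's string coordinates; fall back to the 0/0 sentinel if either is unusable.
function resolveCoords(latRaw: string, lngRaw: string): ResolvedCoords {
  const lat = Number.parseFloat(latRaw);
  const lng = Number.parseFloat(lngRaw);
  const hasCoords = Number.isFinite(lat) && Number.isFinite(lng);
  return {
    latitude: hasCoords ? lat : 0,
    longitude: hasCoords ? lng : 0,
    hasCoords,
  };
}

// Fields to spread into an update payload: empty when coordinates are missing,
// so existing good values in the database are never overwritten with the sentinel.
function coordUpdate(c: ResolvedCoords): { latitude?: number; longitude?: number } {
  return c.hasCoords ? { latitude: c.latitude, longitude: c.longitude } : {};
}
```

In an update payload this would be used as `...coordUpdate(coords)`, while the create payload would take `coords.latitude` and `coords.longitude` directly, sentinel included.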
- [ ] **Step 2: Test dry-run end-to-end**
```bash
npx tsx scripts/import-misas.ts --all --dry-run 2>&1 | tail -20
```
Expected: Processes all 17,919 parishes, shows `Total processed: 17919` with created/updated split.
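The created/updated split in the dry run comes from `findDuplicateChurch`, whereas the real write is keyed on `misasOrgId`. One way to keep the two in agreement is to choose the write target from the matcher's result. The sketch below is illustrative only: `planWrite` and `WritePlan` are hypothetical names, and it assumes the matcher's result exposes the matched row's `id` as a string, which this plan does not confirm.

```typescript
// Hypothetical description of which Prisma call to make and with which unique key.
type WritePlan =
  | { op: 'update'; where: { id: string } }
  | { op: 'upsert'; where: { misasOrgId: string } };

// `existing` stands in for the result of findDuplicateChurch (shape assumed, see lead-in).
function planWrite(existing: { id: string } | null, misasOrgId: string): WritePlan {
  return existing
    ? { op: 'update', where: { id: existing.id } } // matched row: update it by primary key
    : { op: 'upsert', where: { misasOrgId } }; // no match: create, or update a prior import
}
```

With a plan like this, a church matched only by name and proximity would be updated in place and stamped with `misasOrgId`, rather than falling through to the create branch of an upsert keyed on an ID it does not yet carry.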
- [ ] **Step 3: Commit**
```bash
git add scripts/import-misas.ts
git commit -m "feat: complete misas.org importer (Spain, 17919 churches with coordinates)"
```
---
## Chunk 4: Integration
### Task 8: package.json + scheduler
**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`
- [ ] **Step 1: Add npm scripts**
In `package.json` `"scripts"` block, add after `"import:masstimes-api"`:
```json
"import:horariodemissa": "tsx scripts/import-horariodemissa.ts",
"import:misas": "tsx scripts/import-misas.ts"
```
- [ ] **Step 2: Add getJobCommand cases in scheduler.ts**
In `scripts/scheduler.ts`, add before `default:` in `getJobCommand()`:
```typescript
case 'horariodemissa-import': {
  const args = ['tsx', 'scripts/import-horariodemissa.ts', '--all'];
  if (config?.state) args.push('--state', String(config.state));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  if (config?.geocode) args.push('--geocode');
  return { command: 'npx', args };
}
case 'misas-import': {
  const args = ['tsx', 'scripts/import-misas.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
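The flag mapping in these cases is simple enough to check in isolation before wiring it into the scheduler. The following sketch restates the `misas-import` case as a standalone function; `buildMisasCommand` and `MisasJobConfig` are hypothetical names and are not part of `scripts/scheduler.ts`.

```typescript
// Hypothetical config shape: only the option the misas-import case reads.
interface MisasJobConfig {
  resumeFrom?: number;
}

// Same mapping as the misas-import case: always --all, --resume-from only when truthy.
function buildMisasCommand(config?: MisasJobConfig): { command: string; args: string[] } {
  const args = ['tsx', 'scripts/import-misas.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```

Note that the truthiness check means a `resumeFrom` of `0` adds no flag, which is equivalent to starting from the first page.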
- [ ] **Step 3: Add to PIPELINE_GROUPS imports sequence**
In `PIPELINE_GROUPS[0].phases`, add after the `masstimes-api-import` entry:
```typescript
{ name: 'horariodemissa-import', type: 'horariodemissa-import', config: {} },
{ name: 'misas-import', type: 'misas-import', config: {} },
```
- [ ] **Step 4: Verify TypeScript**
```bash
npx tsc --noEmit
```
Expected: no errors.
- [ ] **Step 5: Smoke test both npm scripts**
```bash
npm run import:horariodemissa -- --state DF --dry-run 2>&1 | tail -10
npm run import:misas -- --all --dry-run 2>&1 | tail -10
```
- [ ] **Step 6: Commit**
```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add horariodemissa and misas.org to npm scripts and scheduler pipeline"
```
---
## Final Verification
- [ ] **Import small state from Brazil to confirm end-to-end**
```bash
npx tsx scripts/import-horariodemissa.ts --state DF
```
```bash
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
import { prisma } from './src/lib/db.ts';
const churches = await prisma.church.count({ where: { country: 'BR' } });
const schedules = await prisma.massSchedule.count({ where: { church: { country: 'BR' } } });
console.log('BR churches:', churches, '| Mass schedules:', schedules);
await prisma.\$disconnect();
"
```
Expected: non-zero counts for both BR churches and mass schedules, reflecting the Distrito Federal import.
- [ ] **Dry-run Spain importer full pass**
```bash
npx tsx scripts/import-misas.ts --all --dry-run 2>&1 | grep -E "SUMMARY|Total|Created|Updated" | tail -10
```
Expected: ~17,919 total, mix of created vs updated depending on existing ES church overlap.