ScraperControl/docs/superpowers/plans/2026-03-10-brazil-spain-importers.md
albertfj114 0e468bcb94 docs: add Brazil + Spain importers design spec and implementation plan
Two new importers:
- horariodemissa.com.br: 8,895 Brazilian churches + 28,523 mass times
- misas.org: 17,919 Spanish churches with coordinates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:50:54 -04:00

# Brazil + Spain Importers Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add two new church importers — horariodemissa.com.br (8,895 Brazilian churches + 28,523 mass times) and misas.org (17,919 Spanish churches with coordinates).
**Architecture:** Chunk 1 (shared prerequisites) must complete first. Tasks 3-5 (Brazil) and Tasks 6-7 (Spain) are independent and can run in parallel as subagents. All scripts follow the established importer pattern: fetch → regex parse → church-matcher dedup → prisma upsert.
**Tech Stack:** TypeScript, tsx, native `fetch`, regex HTML parsing (matchAll), Prisma + pg, church-matcher
**Spec:** `docs/superpowers/specs/2026-03-10-brazil-spain-importers-design.md`
---
## Chunk 1: Shared Prerequisites (schema + church-matcher)
### Task 1: Schema additions
**Files:**
- Modify: `prisma/schema.prisma`
- [ ] **Step 1: Add two new ID fields to the Church model**
In `prisma/schema.prisma`, find the block of importer ID fields (near `gottesdienstzeitenId`) and add after it:
```prisma
horarioDemissaId String? @unique @map("horario_demissa_id")
misasOrgId String? @unique @map("misas_org_id")
```
Then add two indexes in the `@@index` block at the bottom of the Church model:
```prisma
@@index([horarioDemissaId])
@@index([misasOrgId])
```
- [ ] **Step 2: Regenerate Prisma client**
```bash
npx prisma generate
```
Expected: `✔ Generated Prisma Client` with no errors.
- [ ] **Step 3: Verify the fields exist in generated types**
```bash
grep -n "horarioDemissaId\|misasOrgId" node_modules/.prisma/client/index.d.ts | head -10
```
Expected: both fields appear in the type definitions.
- [ ] **Step 4: Commit**
```bash
git add prisma/schema.prisma
git commit -m "feat: add horarioDemissaId and misasOrgId fields to Church schema"
```
---
### Task 2: church-matcher updates
**Files:**
- Modify: `src/lib/church-matcher.ts`
- [ ] **Step 1: Add new fields to ExistingChurch interface**
In `src/lib/church-matcher.ts`, find `ExistingChurch` interface and add after `gottesdienstzeitenId`:
```typescript
horarioDemissaId: string | null;
misasOrgId: string | null;
```
- [ ] **Step 2: Add new fields to ChurchCandidate type**
Find `ChurchCandidate` type and add after `gottesdienstzeitenId?`:
```typescript
horarioDemissaId?: string;
misasOrgId?: string;
```
- [ ] **Step 3: Add two new exact-match passes in findDuplicateChurch**
After the Thirteenth pass (gottesdienstzeitenId), add before the proximity pass:
```typescript
// Fourteenth pass: exact horarioDemissaId match
if (candidate.horarioDemissaId) {
  const match = existingChurches.find(
    (church) => church.horarioDemissaId === candidate.horarioDemissaId
  );
  if (match) return match;
}
// Fifteenth pass: exact misasOrgId match
if (candidate.misasOrgId) {
  const match = existingChurches.find(
    (church) => church.misasOrgId === candidate.misasOrgId
  );
  if (match) return match;
}
```
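The two passes above share one shape: skip when the candidate lacks the ID, otherwise return the first exact hit. The sketch below shows that shape in isolation. `ExistingChurchLite`, `CandidateLite`, `ID_FIELDS`, and `findByExactId` are hypothetical names for illustration only; the real `ExistingChurch` and `ChurchCandidate` in `src/lib/church-matcher.ts` carry many more fields, and this is not a proposal to restructure `findDuplicateChurch`.

```typescript
// Hypothetical, trimmed-down shapes (assumption: not the real church-matcher types).
interface ExistingChurchLite {
  id: string;
  horarioDemissaId: string | null;
  misasOrgId: string | null;
}

type CandidateLite = {
  horarioDemissaId?: string;
  misasOrgId?: string;
};

// Source-ID fields checked in order; the first exact hit wins.
const ID_FIELDS = ['horarioDemissaId', 'misasOrgId'] as const;

function findByExactId(
  candidate: CandidateLite,
  existing: ExistingChurchLite[],
): ExistingChurchLite | null {
  for (const field of ID_FIELDS) {
    const wanted = candidate[field];
    if (!wanted) continue; // candidate does not carry this ID, so skip the pass
    const match = existing.find((church) => church[field] === wanted);
    if (match) return match;
  }
  return null;
}

const sampleExisting: ExistingChurchLite[] = [
  { id: 'a', horarioDemissaId: 'dvey2', misasOrgId: null },
  { id: 'b', horarioDemissaId: null, misasOrgId: '1234' },
];

console.log(findByExactId({ horarioDemissaId: 'dvey2' }, sampleExisting)?.id);
console.log(findByExactId({ misasOrgId: '1234' }, sampleExisting)?.id);
console.log(findByExactId({ misasOrgId: '9999' }, sampleExisting));
```

The guard on a missing ID matters: without it, a candidate with no `misasOrgId` would compare `undefined` against stored `null` values, which never match but still costs a full scan per candidate.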
- [ ] **Step 4: Verify TypeScript compiles**
```bash
npx tsc --noEmit
```
Expected: no errors.
- [ ] **Step 5: Commit**
```bash
git add src/lib/church-matcher.ts
git commit -m "feat: add horarioDemissaId and misasOrgId to church-matcher"
```
---
## Chunk 2: Brazil Importer (import-horariodemissa.ts)
> Depends on Chunk 1. Can run in parallel with Chunk 3.
### Task 3: Boilerplate + sitemap enumeration
**Files:**
- Create: `scripts/import-horariodemissa.ts`
- [ ] **Step 1: Create script with boilerplate + types + sitemap parsing**
Create `scripts/import-horariodemissa.ts`:
```typescript
#!/usr/bin/env tsx
/**
* Import Catholic churches and mass schedules from horariodemissa.com.br (Brazil)
*
* horariodemissa.com.br has 8,895 churches across all 26 Brazilian states + DF,
* with 28,523 mass times. All data is server-rendered — one HTTP request per city
* page returns all churches + schedules for that city.
*
* City pages have a split structure:
* - Address/phone: embedded in JS h.push() strings (sidebar/map data)
* - Schedules: in server-rendered .result divs with <table> rows
* Both sets are linked by the same church key (e.g. "dvey2").
*
* Import strategy:
* 1. Fetch sitemap.xml → deduplicate to pt-only city URLs (~3,552 cities)
* 2. For each city: fetch page → parse address/phone from JS + schedules from DOM
* 3. Join by church key, match against existing BR churches, upsert
* 4. Optional --geocode flag for Nominatim pass after import
*
* Usage:
* npx tsx scripts/import-horariodemissa.ts --all
* npx tsx scripts/import-horariodemissa.ts --all --dry-run
* npx tsx scripts/import-horariodemissa.ts --state SP
* npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
* npx tsx scripts/import-horariodemissa.ts --all --geocode
* npx tsx scripts/import-horariodemissa.ts --geocode-only
* npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
connectionString: dbUrl,
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
// ─── Constants ───────────────────────────────────────────────────────────────
const SITE_BASE = 'https://horariodemissa.com.br';
const SITEMAP_URL = `${SITE_BASE}/sitemap.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
// ─── Types ───────────────────────────────────────────────────────────────────
interface CityUrl {
state: string; // e.g. "SP"
city: string; // e.g. "São Paulo"
url: string; // full fetch URL
}
interface ParsedSchedule {
dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
time: string; // "HH:MM"
notes: string | null;
}
interface ParsedConfession {
dayOfWeek: number;
startTime: string;
endTime: string;
notes: string | null;
}
interface ParsedChurch {
key: string; // e.g. "dvey2" (used as horarioDemissaId)
name: string;
address: string | null;
phone: string | null;
city: string;
state: string;
massSchedules: ParsedSchedule[];
confessionSchedules: ParsedConfession[];
}
interface CLIArgs {
all: boolean;
state?: string;
dryRun: boolean;
geocode: boolean;
geocodeOnly: boolean;
resumeFrom?: number;
jobId?: string;
}
interface ImportStats {
citiesProcessed: number;
churchesFound: number;
churchesCreated: number;
churchesUpdated: number;
massSchedulesCreated: number;
geocoded: number;
geocodeFailed: number;
errors: number;
}
// ─── Brazilian Day Name Mapping ───────────────────────────────────────────────
const DAY_MAP: Record<string, number> = {
'domingo': 0,
'segunda-feira': 1, 'segunda': 1,
'terça-feira': 2, 'terca-feira': 2, 'terça': 2, 'terca': 2,
'quarta-feira': 3, 'quarta': 3,
'quinta-feira': 4, 'quinta': 4,
'sexta-feira': 5, 'sexta': 5,
'sábado': 6, 'sabado': 6,
};
const SPECIAL_DAY_MAP: Record<string, { dayOfWeek: number; notes: string }> = {
'primeiro domingo': { dayOfWeek: 0, notes: 'Primeiro Domingo' },
'segundo domingo': { dayOfWeek: 0, notes: 'Segundo Domingo' },
'terceiro domingo': { dayOfWeek: 0, notes: 'Terceiro Domingo' },
'quarto domingo': { dayOfWeek: 0, notes: 'Quarto Domingo' },
'primeiro sábado': { dayOfWeek: 6, notes: 'Primeiro Sábado' },
'primeiro sabado': { dayOfWeek: 6, notes: 'Primeiro Sábado' },
'segundo sábado': { dayOfWeek: 6, notes: 'Segundo Sábado' },
'segundo sabado': { dayOfWeek: 6, notes: 'Segundo Sábado' },
};
// ─── HTTP Client ──────────────────────────────────────────────────────────────
let requestCount = 0;
function delay(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
async function fetchPage(url: string, delayMs: number = REQUEST_DELAY_MS): Promise<string | null> {
if (requestCount > 0) await delay(delayMs);
requestCount++;
try {
const response = await fetch(url, {
headers: {
'User-Agent': USER_AGENT,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'pt-BR,pt;q=0.9',
},
});
if (!response.ok) {
console.error(` HTTP ${response.status} for ${url}`);
return null;
}
return await response.text();
} catch (error) {
console.error(` Fetch error for ${url}: ${error instanceof Error ? error.message : error}`);
return null;
}
}
// ─── Sitemap Parser ───────────────────────────────────────────────────────────
export function parseCityUrlsFromSitemap(sitemapXml: string, filterState?: string): CityUrl[] {
const seen = new Set<string>();
const cities: CityUrl[] = [];
for (const match of sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
const rawUrl = match[1].replace(/&amp;/g, '&');
// Only pt-language city search pages
if (!rawUrl.includes('opcoes=cidade_opcoes') || rawUrl.includes('hl=en')) continue;
const ufMatch = rawUrl.match(/[?&]uf=([A-Z]+)/);
const cidadeMatch = rawUrl.match(/[?&]cidade=([^&]+)/);
if (!ufMatch || !cidadeMatch) continue;
const state = ufMatch[1];
const city = decodeURIComponent(cidadeMatch[1].replace(/\+/g, ' '));
if (filterState && state !== filterState.toUpperCase()) continue;
const key = `${state}:${city}`;
if (seen.has(key)) continue;
seen.add(key);
cities.push({ state, city, url: rawUrl });
}
cities.sort((a, b) => a.state.localeCompare(b.state) || a.city.localeCompare(b.city));
return cities;
}
async function fetchCityUrls(filterState?: string): Promise<CityUrl[]> {
console.log(`Fetching sitemap: ${SITEMAP_URL}`);
const xml = await fetchPage(SITEMAP_URL);
if (!xml) throw new Error('Failed to fetch sitemap');
const cities = parseCityUrlsFromSitemap(xml, filterState);
console.log(`Found ${cities.length} unique cities${filterState ? ` in ${filterState}` : ''}`);
return cities;
}
```
- [ ] **Step 2: Verify sitemap parsing works**
```bash
npx tsx -e "
import dotenv from 'dotenv';
dotenv.config({ path: '.env' });
const { parseCityUrlsFromSitemap } = await import('./scripts/import-horariodemissa.ts');
const xml = await fetch('https://horariodemissa.com.br/sitemap.xml').then(r => r.text());
const cities = parseCityUrlsFromSitemap(xml);
console.log('Total cities:', cities.length);
console.log('Sample:', JSON.stringify(cities.slice(0, 3), null, 2));
const states = [...new Set(cities.map(c => c.state))].sort();
console.log('States:', states.join(', '));
"
```
Expected: ~3,500 cities, states include SP, RJ, MG, RS, BA, DF, etc.
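For a check that does not hit the network, the same filtering rules can be exercised against a small inline fixture. This is a minimal sketch: the XML below is hand-written and only assumes the sitemap uses `<loc>` entries with `&amp;`-escaped query strings, and `parseCityUrls` is a condensed copy of the parsing logic (without the state filter), not the script's export.

```typescript
interface CityUrl {
  state: string;
  city: string;
  url: string;
}

// Condensed copy of the sitemap rules: pt-only city pages, deduplicated by state + city.
function parseCityUrls(sitemapXml: string): CityUrl[] {
  const seen = new Set<string>();
  const cities: CityUrl[] = [];
  for (const match of sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    const rawUrl = (match[1] ?? '').replace(/&amp;/g, '&');
    if (!rawUrl.includes('opcoes=cidade_opcoes') || rawUrl.includes('hl=en')) continue;
    const state = rawUrl.match(/[?&]uf=([A-Z]+)/)?.[1];
    const cityRaw = rawUrl.match(/[?&]cidade=([^&]+)/)?.[1];
    if (!state || !cityRaw) continue;
    const city = decodeURIComponent(cityRaw.replace(/\+/g, ' '));
    const key = `${state}:${city}`;
    if (seen.has(key)) continue;
    seen.add(key);
    cities.push({ state, city, url: rawUrl });
  }
  return cities.sort((a, b) => a.state.localeCompare(b.state) || a.city.localeCompare(b.city));
}

// Hand-written fixture (assumption: real sitemap entries may carry extra parameters).
const base = 'https://horariodemissa.com.br/search.php';
const fixtureXml = [
  `<url><loc>${base}?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;bairro=&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>`,
  `<url><loc>${base}?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;bairro=&amp;opcoes=cidade_opcoes&amp;hl=en</loc></url>`,
  `<url><loc>${base}?uf=DF&amp;cidade=Bras%C3%ADlia&amp;bairro=&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>`,
  `<url><loc>https://horariodemissa.com.br/</loc></url>`,
  `<url><loc>${base}?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;bairro=&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>`,
].join('\n');

const fixtureCities = parseCityUrls(fixtureXml);
console.log(fixtureCities.map((c) => `${c.state}/${c.city}`));
```

Of the five entries, the English variant, the home page, and the repeated São Paulo URL are dropped, leaving two cities sorted by state.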
- [ ] **Step 3: Commit**
```bash
git add scripts/import-horariodemissa.ts
git commit -m "feat: horariodemissa importer scaffold + sitemap enumeration"
```
---
### Task 4: HTML parsing
**Files:**
- Modify: `scripts/import-horariodemissa.ts`
- [ ] **Step 1: Understand the dual-source page structure**
Each city page contains two data sources per church, joined by the same key (e.g. `dvey2`):
**Source A** — JS `h.push()` strings embedded in `<script>` (sidebar/map):
```
h.push('<p><strong><a href="igreja.php?k=dvey2">NAME</a></strong><br/>Rua X, 123</p><p><strong>Telefone:</strong> (11) 1234-5678</p>');
```
Contains: key, name, address, phone.
**Source B** — Server-rendered `.result` divs:
```html
<div class="result">
<a href="igreja.php?k=dvey2" class="result_title">NAME</a>
<p class="blockleft"><table>
<tr><td style="...">Domingo:</td><td>07:30, 10:30</td></tr>
</table></p>
</div>
```
Contains: key + schedule tables (first = masses, optional second = confessions).
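To make the join concrete, here is a minimal sketch that runs both extractions over a hand-written fixture and merges them by key. The fixture is modelled on the two snippets above and was not captured from the live site; `joinSources` and `JoinedChurch` are illustrative names, and schedule rows are kept as raw text rather than parsed.

```typescript
interface JoinedChurch {
  key: string;
  name: string;
  address: string | null;
  phone: string | null;
  rows: string[][]; // raw [dayLabel, timesText] pairs from the schedule tables
}

// Hand-written fixture: two churches in Source A, only one of them in Source B.
const fixtureHtml = [
  `<script>`,
  `h.push('<p><strong><a href="igreja.php?k=dvey2">Paróquia Exemplo</a></strong><br/>Rua X, 123</p><p><strong>Telefone:</strong> (11) 1234-5678</p>');`,
  `h.push('<p><strong><a href="igreja.php?k=abc12">Capela Exemplo</a></strong><br/>Av. Y, 45</p>');`,
  `</script>`,
  `<div class="result">`,
  `<a href="igreja.php?k=dvey2" class="result_title">Paróquia Exemplo</a>`,
  `<p class="blockleft"><table>`,
  `<tr><td style="width:90px">Domingo:</td><td>07:30, 10:30</td></tr>`,
  `</table></p>`,
  `</div>`,
].join('\n');

function joinSources(html: string): JoinedChurch[] {
  // Source A: h.push() strings carry name, address and phone.
  const details = new Map<string, { name: string; address: string | null; phone: string | null }>();
  for (const m of html.matchAll(/h\.push\('([\s\S]*?)'\);/g)) {
    const content = m[1] ?? '';
    const key = content.match(/igreja\.php\?k=([a-zA-Z0-9]+)/)?.[1];
    const name = content.match(/igreja\.php\?k=[^"]+">([^<]+)<\/a>/)?.[1]?.trim();
    if (!key || !name) continue;
    details.set(key, {
      name,
      address: content.match(/<br\/>([^<]+)<\/p>/)?.[1]?.trim() ?? null,
      phone: content.match(/Telefone:<\/strong>\s*([^<]+)/)?.[1]?.trim() ?? null,
    });
  }
  // Source B: each .result div carries the schedule rows for one key.
  const rowsByKey = new Map<string, string[][]>();
  for (const part of html.split('<div class="result">').slice(1)) {
    const key = part.match(/href="igreja\.php\?k=([a-zA-Z0-9]+)"/)?.[1];
    if (!key) continue;
    const rows = [...part.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)].map((row) =>
      [...(row[1] ?? '').matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)].map((td) =>
        (td[1] ?? '').replace(/<[^>]+>/g, '').trim(),
      ),
    );
    rowsByKey.set(key, rows);
  }
  // Join: a church without a .result div simply gets no rows.
  return [...details].map(([key, d]) => ({ key, ...d, rows: rowsByKey.get(key) ?? [] }));
}

const joined = joinSources(fixtureHtml);
console.log(JSON.stringify(joined, null, 2));
```

The second fixture church shows the fallback path: it has an address but no phone and no schedule table, so it comes out with `phone: null` and an empty `rows` array instead of being dropped.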
- [ ] **Step 2: Add parseDayLabel, parseTimeCells, parseMassTable, parseConfessionTable**
```typescript
// ─── HTML Parsers ─────────────────────────────────────────────────────────────
export function parseDayLabel(label: string): { dayOfWeek: number; notes: string | null } | null {
const normalized = label.toLowerCase().replace(/:$/, '').trim();
if (SPECIAL_DAY_MAP[normalized]) {
const s = SPECIAL_DAY_MAP[normalized];
return { dayOfWeek: s.dayOfWeek, notes: s.notes };
}
if (DAY_MAP[normalized] !== undefined) {
return { dayOfWeek: DAY_MAP[normalized], notes: null };
}
return null;
}
export function parseTimeCells(timesText: string): Array<{ time: string; notes: string | null }> {
const results: Array<{ time: string; notes: string | null }> = [];
// Split by comma but not inside parentheses
const parts = timesText.split(/,(?![^(]*\))/);
for (const part of parts) {
const trimmed = part.trim();
if (!trimmed) continue;
const timeMatch = trimmed.match(/\b(\d{1,2}:\d{2})\b/);
if (!timeMatch) continue;
const [h, m] = timeMatch[1].split(':');
const time = `${h.padStart(2, '0')}:${m}`;
const notesMatch = trimmed.match(/\(([^)]+)\)/);
results.push({ time, notes: notesMatch ? notesMatch[1].trim() : null });
}
return results;
}
export function parseMassTable(tableHtml: string): ParsedSchedule[] {
const schedules: ParsedSchedule[] = [];
for (const rowMatch of tableHtml.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)) {
const tds = [...rowMatch[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
.map(m => m[1].replace(/<[^>]+>/g, '').trim());
if (tds.length < 2) continue;
const dayResult = parseDayLabel(tds[0]);
if (!dayResult) continue;
for (const { time, notes } of parseTimeCells(tds[1])) {
schedules.push({
dayOfWeek: dayResult.dayOfWeek,
time,
notes: [dayResult.notes, notes].filter(Boolean).join('; ') || null,
});
}
}
return schedules;
}
export function parseConfessionTable(tableHtml: string): ParsedConfession[] {
const confessions: ParsedConfession[] = [];
for (const rowMatch of tableHtml.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)) {
const tds = [...rowMatch[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
.map(m => m[1].replace(/<[^>]+>/g, '').trim());
if (tds.length < 2) continue;
const dayResult = parseDayLabel(tds[0]);
if (!dayResult) continue;
// "09:00 às 11:00" or "09:00 a 11:00"
const rangeMatch = tds[1].match(/(\d{1,2}:\d{2})\s+(?:às|a)\s+(\d{1,2}:\d{2})/i);
if (!rangeMatch) continue;
const pad = (t: string) => { const [hh, mm] = t.split(':'); return `${hh.padStart(2,'0')}:${mm}`; };
confessions.push({
dayOfWeek: dayResult.dayOfWeek,
startTime: pad(rangeMatch[1]),
endTime: pad(rangeMatch[2]),
notes: dayResult.notes,
});
}
return confessions;
}
/**
* Parse a full city page HTML into church records.
* Joins h.push() JS data (name/address/phone) with .result DOM (schedules) by church key.
*/
export function parseCityPage(html: string, city: string, state: string): ParsedChurch[] {
// Parse Source A: h.push() JS strings → name, address, phone
const jsData = new Map<string, { name: string; address: string | null; phone: string | null }>();
for (const pushMatch of html.matchAll(/h\.push\('([\s\S]*?)'\);/g)) {
const content = pushMatch[1].replace(/\\'/g, "'");
const keyMatch = content.match(/igreja\.php\?k=([a-zA-Z0-9]+)/);
if (!keyMatch) continue;
const nameMatch = content.match(/igreja\.php\?k=[^"]+">([^<]+)<\/a>/);
const addrMatch = content.match(/<br\/>([^<]+)<\/p>/);
const phoneMatch = content.match(/Telefone:<\/strong>\s*([^<]+)/);
jsData.set(keyMatch[1], {
name: nameMatch ? nameMatch[1].trim() : '',
address: addrMatch ? addrMatch[1].trim() || null : null,
phone: phoneMatch ? phoneMatch[1].trim() || null : null,
});
}
// Parse Source B: .result divs → schedules
// Use split() rather than a lookahead regex — lookahead with $ drops the last result div
const scheduleData = new Map<string, { massSchedules: ParsedSchedule[]; confessionSchedules: ParsedConfession[] }>();
const resultParts = html.split('<div class="result">');
for (let i = 1; i < resultParts.length; i++) {
const resultHtml = resultParts[i];
const keyMatch = resultHtml.match(/href="igreja\.php\?k=([a-zA-Z0-9]+)"/);
if (!keyMatch) continue;
const tables = [...resultHtml.matchAll(/<table[^>]*>([\s\S]*?)<\/table>/gi)].map(m => m[1]);
scheduleData.set(keyMatch[1], {
massSchedules: tables[0] ? parseMassTable(tables[0]) : [],
confessionSchedules: tables[1] ? parseConfessionTable(tables[1]) : [],
});
}
// Join both sources by church key — every church in jsData gets its schedules from scheduleData
const allKeys = new Set([...jsData.keys(), ...scheduleData.keys()]);
const churches: ParsedChurch[] = [];
for (const key of allKeys) {
const js = jsData.get(key);
const sched = scheduleData.get(key);
if (!js?.name) continue;
churches.push({
key,
name: js.name,
address: js.address,
phone: js.phone,
city,
state,
massSchedules: sched?.massSchedules ?? [],
confessionSchedules: sched?.confessionSchedules ?? [],
});
}
return churches;
}
```
- [ ] **Step 3: Verify parsing against a live city page**
```bash
npx tsx -e "
import dotenv from 'dotenv';
dotenv.config({ path: '.env' });
const { parseCityPage } = await import('./scripts/import-horariodemissa.ts');
const url = 'https://horariodemissa.com.br/search.php?uf=SP&cidade=S%C3%A3o+Paulo&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt';
const html = await fetch(url, { headers: { 'User-Agent': 'NearestMass-Importer/1.0' } }).then(r => r.text());
const churches = parseCityPage(html, 'São Paulo', 'SP');
console.log('Churches found:', churches.length);
console.log('With schedules:', churches.filter(c => c.massSchedules.length > 0).length);
console.log('Sample:', JSON.stringify(churches[0], null, 2));
"
```
Expected: 20+ churches found, majority with mass schedules, first entry shows name/address/phone/schedules.
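The time-cell rules are easy to check offline as well. The sketch below is a condensed, self-contained copy of the `parseTimeCells` logic (named `parseTimeCell` here to mark it as illustrative) applied to a made-up cell; the cell text is an assumption about what the site may contain, not a captured sample.

```typescript
interface TimeEntry {
  time: string; // zero-padded "HH:MM"
  notes: string | null; // text found in parentheses, if any
}

function parseTimeCell(timesText: string): TimeEntry[] {
  const entries: TimeEntry[] = [];
  // Split on commas, except commas that sit inside parentheses.
  for (const part of timesText.split(/,(?![^(]*\))/)) {
    const timeMatch = part.match(/\b(\d{1,2}):(\d{2})\b/);
    if (!timeMatch) continue; // e.g. "19h" has no colon and is skipped
    const notesMatch = part.match(/\(([^)]+)\)/);
    entries.push({
      time: `${(timeMatch[1] ?? '').padStart(2, '0')}:${timeMatch[2] ?? ''}`,
      notes: notesMatch ? (notesMatch[1] ?? '').trim() : null,
    });
  }
  return entries;
}

const sampleCell = '7:30, 10:30 (crianças, com coral), 19h, 20:00';
const sampleEntries = parseTimeCell(sampleCell);
console.log(sampleEntries);
```

Two behaviours are worth noticing: the comma inside the parentheses does not split the entry, and a time written without a colon (`19h`) is silently dropped. If the live data uses that notation often, the regex would need extending.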
- [ ] **Step 4: Commit**
```bash
git add scripts/import-horariodemissa.ts
git commit -m "feat: horariodemissa HTML parser (day mapping, schedule tables, dual-source join)"
```
---
### Task 5: DB upsert + main()
**Files:**
- Modify: `scripts/import-horariodemissa.ts`
- [ ] **Step 1: Add geocode helper**
```typescript
// ─── Geocoding ────────────────────────────────────────────────────────────────
async function geocodeAddress(address: string, city: string, state: string): Promise<{ lat: number; lng: number } | null> {
const query = [address, city, state, 'Brasil'].filter(Boolean).join(', ');
const url = `${NOMINATIM_URL}?q=${encodeURIComponent(query)}&format=json&limit=1&countrycodes=br`;
await delay(NOMINATIM_DELAY_MS);
try {
const response = await fetch(url, {
headers: { 'User-Agent': USER_AGENT, 'Accept': 'application/json' },
});
if (!response.ok) return null;
const results = await response.json() as Array<{ lat: string; lon: string }>;
if (!results.length) return null;
return { lat: parseFloat(results[0].lat), lng: parseFloat(results[0].lon) };
} catch {
return null;
}
}
```
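The URL construction and the result parsing can be tested without calling Nominatim. The following sketch separates them into two pure functions; `buildGeocodeUrl` and `parseCoords` are hypothetical helper names, and it uses `URLSearchParams` for encoding instead of the manual `encodeURIComponent` call above.

```typescript
const NOMINATIM_SEARCH_URL = 'https://nominatim.openstreetmap.org/search';

// Build the search URL; empty or missing parts are left out of the query string.
function buildGeocodeUrl(address: string | null, city: string, state: string): string {
  const query = [address, city, state, 'Brasil'].filter(Boolean).join(', ');
  const params = new URLSearchParams({
    q: query,
    format: 'json',
    limit: '1',
    countrycodes: 'br',
  });
  return `${NOMINATIM_SEARCH_URL}?${params.toString()}`;
}

// Convert the first result into numbers, rejecting anything non-numeric.
function parseCoords(results: Array<{ lat: string; lon: string }>): { lat: number; lng: number } | null {
  const first = results[0];
  if (!first) return null;
  const lat = Number.parseFloat(first.lat);
  const lng = Number.parseFloat(first.lon);
  if (Number.isNaN(lat) || Number.isNaN(lng)) return null;
  return { lat, lng };
}

const sampleUrl = buildGeocodeUrl('Rua X, 123', 'São Paulo', 'SP');
console.log(sampleUrl);
console.log(parseCoords([{ lat: '-23.5505', lon: '-46.6333' }]));
console.log(parseCoords([]));
```

The `NaN` check is the one behavioural difference from the helper above, which would return `NaN` coordinates if the response ever contained a non-numeric string.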
- [ ] **Step 2: Add upsertChurch function**
Note: `latitude`/`longitude` are non-nullable in the schema. Use `0` as the sentinel for "no coordinates yet" (geocode pass will fill these in). The `source` field must be set explicitly — the schema default is `"masstimes"` which would corrupt source-based queries.
```typescript
// ─── DB Upsert ────────────────────────────────────────────────────────────────
async function upsertChurch(
parsed: ParsedChurch,
existingChurches: ExistingChurch[],
args: CLIArgs,
stats: ImportStats
): Promise<void> {
const candidate = { name: parsed.name, lat: 0, lng: 0, horarioDemissaId: parsed.key };
const existing = findDuplicateChurch(candidate, existingChurches);
if (args.dryRun) {
console.log(` [dry-run] ${existing ? 'UPDATE' : 'CREATE'} ${parsed.name} (${parsed.key})`);
if (existing) stats.churchesUpdated++; else stats.churchesCreated++;
return;
}
try {
let churchId: string;
await prisma.$transaction(async (tx) => {
const church = await tx.church.upsert({
where: { horarioDemissaId: parsed.key },
create: {
horarioDemissaId: parsed.key,
name: parsed.name,
address: parsed.address,
city: parsed.city,
state: parsed.state,
country: 'BR',
phone: parsed.phone,
source: 'horario-demissa', // must set explicitly — schema default is "masstimes"
latitude: 0, // sentinel for "no coordinates"; geocode pass fills this in
longitude: 0,
lastScrapedAt: new Date(),
scrapeStrategy: 'horario-demissa',
},
update: {
name: parsed.name,
address: parsed.address ?? undefined,
city: parsed.city,
state: parsed.state,
phone: parsed.phone ?? undefined,
lastScrapedAt: new Date(),
},
});
churchId = church.id;
await tx.massSchedule.deleteMany({ where: { churchId: church.id } });
if (parsed.massSchedules.length > 0) {
// Deduplicate by day+time before inserting
const seen = new Set<string>();
const deduped = parsed.massSchedules.filter((s) => {
const k = `${s.dayOfWeek}:${s.time}`;
return seen.has(k) ? false : (seen.add(k), true);
});
await tx.massSchedule.createMany({
data: deduped.map((s) => ({
churchId: church.id,
dayOfWeek: s.dayOfWeek,
time: s.time,
notes: s.notes,
})),
});
stats.massSchedulesCreated += deduped.length;
}
await tx.confessionSchedule.deleteMany({ where: { churchId: church.id } });
if (parsed.confessionSchedules.length > 0) {
await tx.confessionSchedule.createMany({
data: parsed.confessionSchedules.map((c) => ({
churchId: church.id,
dayOfWeek: c.dayOfWeek,
startTime: c.startTime,
endTime: c.endTime,
notes: c.notes,
})),
});
}
});
if (existing) {
stats.churchesUpdated++;
} else {
stats.churchesCreated++;
// Use real DB UUID (churchId!) not the source key string
existingChurches.push({
id: churchId!, name: parsed.name, latitude: 0, longitude: 0,
osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
bohosluzbyId: null, miserendId: null, kerknetId: null,
gottesdienstzeitenId: null, horarioDemissaId: parsed.key, misasOrgId: null,
source: 'horario-demissa', website: null, phone: parsed.phone,
address: parsed.address, country: 'BR',
});
}
} catch (error) {
console.error(` Error upserting ${parsed.name}: ${error instanceof Error ? error.message : error}`);
stats.errors++;
}
}
```
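Two small rules inside `upsertChurch` are worth isolating: schedules are deduplicated by day and time with the first occurrence winning, and `0` marks missing coordinates. A minimal sketch of both follows. `dedupeSlots` and `needsGeocoding` are hypothetical helpers, and `needsGeocoding` checks both axes, which is slightly stricter than the `latitude === 0` test used elsewhere in this plan.

```typescript
interface Slot {
  dayOfWeek: number; // 0=Sun ... 6=Sat
  time: string; // "HH:MM"
  notes: string | null;
}

// Keep the first slot for each day+time pair; later duplicates (and their notes) are dropped.
function dedupeSlots(slots: Slot[]): Slot[] {
  const seen = new Set<string>();
  return slots.filter((slot) => {
    const key = `${slot.dayOfWeek}:${slot.time}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Assumption: (0, 0) is never a real church location, so it can act as "not geocoded yet".
function needsGeocoding(latitude: number, longitude: number): boolean {
  return latitude === 0 && longitude === 0;
}

const sampleSlots: Slot[] = [
  { dayOfWeek: 0, time: '07:30', notes: null },
  { dayOfWeek: 0, time: '07:30', notes: 'Primeiro Domingo' },
  { dayOfWeek: 0, time: '10:30', notes: null },
  { dayOfWeek: 6, time: '07:30', notes: null },
];

const uniqueSlots = dedupeSlots(sampleSlots);
console.log(uniqueSlots.length);
console.log(needsGeocoding(0, 0), needsGeocoding(-23.55, -46.63));
```

Because the first occurrence wins, the `Primeiro Domingo` note on the duplicate Sunday 07:30 entry is lost. Whether that is acceptable depends on how often the source lists a regular and a special mass at the same time.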
- [ ] **Step 3: Add geocodeOnly pass**
Note: `latitude` is non-nullable (`Float` in schema), so `{ latitude: null }` will never match. Use `{ latitude: 0 }` — that is the sentinel value set on creation for address-only churches.
```typescript
async function runGeocodeOnly(stats: ImportStats): Promise<void> {
console.log('\nGeocoding Brazilian churches without coordinates...');
const churches = await prisma.church.findMany({
where: { horarioDemissaId: { not: null }, latitude: 0, address: { not: null } },
select: { id: true, name: true, address: true, city: true, state: true },
});
console.log(`Found ${churches.length} churches to geocode`);
for (const church of churches) {
const coords = await geocodeAddress(church.address!, church.city ?? '', church.state ?? '');
if (coords) {
await prisma.church.update({ where: { id: church.id }, data: { latitude: coords.lat, longitude: coords.lng } });
stats.geocoded++;
console.log(` Geocoded: ${church.name} → ${coords.lat}, ${coords.lng}`);
} else {
stats.geocodeFailed++;
}
}
}
```
- [ ] **Step 4: Add CLI arg parser + main()**
```typescript
// ─── CLI + Main ───────────────────────────────────────────────────────────────
function parseArgs(): CLIArgs {
const argv = process.argv.slice(2);
const idx = (flag: string) => argv.indexOf(flag);
return {
all: argv.includes('--all'),
state: idx('--state') >= 0 ? argv[idx('--state') + 1] : undefined,
dryRun: argv.includes('--dry-run'),
geocode: argv.includes('--geocode'),
geocodeOnly: argv.includes('--geocode-only'),
resumeFrom: idx('--resume-from') >= 0 ? parseInt(argv[idx('--resume-from') + 1], 10) : undefined,
jobId: idx('--job-id') >= 0 ? argv[idx('--job-id') + 1] : undefined,
};
}
async function main(): Promise<void> {
const args = parseArgs();
const stats: ImportStats = {
citiesProcessed: 0, churchesFound: 0, churchesCreated: 0,
churchesUpdated: 0, massSchedulesCreated: 0,
geocoded: 0, geocodeFailed: 0, errors: 0,
};
console.log('\n' + '='.repeat(70));
console.log('HORARIO DE MISSA (BRAZIL) IMPORTER');
console.log('='.repeat(70));
console.log(`Mode: ${args.geocodeOnly ? 'geocode-only' : args.dryRun ? 'dry-run' : 'import'}`);
if (args.state) console.log(`State filter: ${args.state}`);
if (args.resumeFrom) console.log(`Resume from: ${args.resumeFrom}`);
console.log(`Time: ${new Date().toISOString()}\n`);
try {
if (args.geocodeOnly) {
await runGeocodeOnly(stats);
} else if (args.all || args.state) {
console.log('Loading existing BR churches...');
const existingChurches = await prisma.church.findMany({
where: { country: 'BR' },
select: {
id: true, name: true, latitude: true, longitude: true,
osmId: true, baiduId: true, masstimesId: true, orarimesseId: true,
massSchedulesPhId: true, philmassId: true, horariosMisasId: true,
mszeInfoId: true, weekdayMassesId: true, messesInfoId: true,
bohosluzbyId: true, miserendId: true, kerknetId: true,
gottesdienstzeitenId: true, horarioDemissaId: true, misasOrgId: true,
source: true, website: true, phone: true, address: true, country: true,
},
}) as ExistingChurch[];
console.log(`Loaded ${existingChurches.length} existing BR churches\n`);
const cities = await fetchCityUrls(args.state);
const startIndex = args.resumeFrom ?? 0;
for (let i = startIndex; i < cities.length; i++) {
const { state, city, url } = cities[i];
console.log(`[${i + 1}/${cities.length}] ${state} / ${city}`);
const html = await fetchPage(url);
if (!html) { stats.errors++; continue; }
const churches = parseCityPage(html, city, state);
stats.churchesFound += churches.length;
stats.citiesProcessed++;
console.log(` ${churches.length} churches`);
for (const church of churches) {
await upsertChurch(church, existingChurches, args, stats);
}
if (args.geocode && !args.dryRun) {
for (const church of churches) {
if (!church.address) continue;
const dbChurch = await prisma.church.findUnique({
where: { horarioDemissaId: church.key },
select: { id: true, latitude: true },
});
// latitude === 0 is the sentinel for "no real coordinates yet"
if (dbChurch && dbChurch.latitude === 0) {
const coords = await geocodeAddress(church.address, church.city, church.state);
if (coords) {
await prisma.church.update({ where: { id: dbChurch.id }, data: { latitude: coords.lat, longitude: coords.lng } });
stats.geocoded++;
} else {
stats.geocodeFailed++;
}
}
}
}
}
} else {
console.error('Usage: --all | --state XX | --geocode-only');
process.exit(1);
}
} finally {
await prisma.$disconnect();
await pool.end();
}
console.log('\n' + '='.repeat(70));
console.log('SUMMARY');
console.log('='.repeat(70));
console.log(`Cities processed: ${stats.citiesProcessed}`);
console.log(`Churches found: ${stats.churchesFound}`);
console.log(` Created: ${stats.churchesCreated}`);
console.log(` Updated: ${stats.churchesUpdated}`);
console.log(` Errors: ${stats.errors}`);
console.log(`Mass schedules: ${stats.massSchedulesCreated}`);
if (args.geocode || args.geocodeOnly) {
console.log(`Geocoded: ${stats.geocoded} / Failed: ${stats.geocodeFailed}`);
}
console.log('='.repeat(70) + '\n');
}
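main().catch((error) => {
  console.error(error);
  process.exitCode = 1; // report failure to the caller instead of exiting with 0
});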
```
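`parseArgs` above accepts whatever follows a flag, so `--resume-from abc` yields `NaN` and the city loop silently runs zero iterations. A stricter variant could look like the sketch below; `parseFlags` and `ParsedFlags` are hypothetical names, only a subset of the flags is covered, and the error messages are illustrative.

```typescript
interface ParsedFlags {
  all: boolean;
  state?: string;
  dryRun: boolean;
  resumeFrom?: number;
}

function parseFlags(argv: string[]): ParsedFlags {
  // Return the token after `flag`, undefined when the flag is absent,
  // and throw when the flag is present but has no value.
  const valueOf = (flag: string): string | undefined => {
    const i = argv.indexOf(flag);
    if (i < 0) return undefined;
    const value = argv[i + 1];
    if (value === undefined || value.startsWith('--')) {
      throw new Error(`${flag} requires a value`);
    }
    return value;
  };

  const resumeRaw = valueOf('--resume-from');
  let resumeFrom: number | undefined;
  if (resumeRaw !== undefined) {
    resumeFrom = Number.parseInt(resumeRaw, 10);
    if (Number.isNaN(resumeFrom) || resumeFrom < 0) {
      throw new Error(`--resume-from expects a non-negative integer, got "${resumeRaw}"`);
    }
  }

  return {
    all: argv.includes('--all'),
    state: valueOf('--state')?.toUpperCase(),
    dryRun: argv.includes('--dry-run'),
    resumeFrom,
  };
}

console.log(parseFlags(['--state', 'sp', '--dry-run']));
console.log(parseFlags(['--all', '--resume-from', '500']));
```

In a script this would be called as `parseFlags(process.argv.slice(2))`, with the thrown error caught by the top-level `main().catch(...)` handler.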
- [ ] **Step 5: Test dry-run on small state**
```bash
npx tsx scripts/import-horariodemissa.ts --state DF --dry-run
```
Expected: Lists churches from Distrito Federal (Brasília) without DB writes.
- [ ] **Step 6: Test real import on the least populous state (Roraima)**
```bash
npx tsx scripts/import-horariodemissa.ts --state RR
```
Then verify:
```bash
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
import { prisma } from './src/lib/db.ts';
const count = await prisma.church.count({ where: { country: 'BR' } });
const sched = await prisma.massSchedule.count({ where: { church: { country: 'BR' } } });
console.log('BR churches:', count, '| Mass schedules:', sched);
await prisma.\$disconnect();
"
```
Expected: Some churches from Roraima with mass schedules in DB.
- [ ] **Step 7: Commit**
```bash
git add scripts/import-horariodemissa.ts
git commit -m "feat: complete horariodemissa importer (Brazil, 8895 churches + 28523 mass times)"
```
---
## Chunk 3: Spain Importer (import-misas.ts)
> Depends on Chunk 1. Can run in parallel with Chunk 2.
### Task 6: API pagination + boilerplate
**Files:**
- Create: `scripts/import-misas.ts`
- [ ] **Step 1: Create script with boilerplate + API pagination**
Create `scripts/import-misas.ts`:
```typescript
#!/usr/bin/env tsx
/**
* Import Catholic churches from misas.org (Spain)
*
* misas.org lists 17,919 Spanish parishes with name, address, coordinates,
* and province via a public JSON REST API. Mass schedules are auth-gated
* (401 on detail endpoint), so this importer creates/updates church records
* only — no schedule data.
*
* The listing API accepts offset-based pagination. We use Madrid as the center
* with a large radius (999999m) to cover all of Spain in a single stream.
*
* Import strategy:
* 1. Paginate GET /api/parishsearch?country=es&pos=[...]&offset=N&limit=500
* 2. For each parish: id, name, addr, loc (city), prov (province), zip, lat, long
* 3. Match against existing ES churches by misasOrgId or proximity+name
* 4. Upsert church record (no mass schedules)
*
* Usage:
* npx tsx scripts/import-misas.ts --all
* npx tsx scripts/import-misas.ts --all --dry-run
* npx tsx scripts/import-misas.ts --all --resume-from 5000
* npx tsx scripts/import-misas.ts --all --job-id {uuid}
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
connectionString: dbUrl,
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
// ─── Constants ───────────────────────────────────────────────────────────────
const API_BASE = 'https://misas.org/api/parishsearch';
// Madrid coordinates, large radius covers all of Spain
const SPAIN_POS = encodeURIComponent('[-3.7038,40.4168,999999]');
const PAGE_SIZE = 500;
const REQUEST_DELAY_MS = 500;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
// ─── Types ───────────────────────────────────────────────────────────────────
interface MisasParish {
id: number;
name: string;
uri: string;
addr: string;
loc: string; // city
prov: string; // province
zip: string;
lat: string;
long: string;
}
interface MisasApiResponse {
count: number;
pars: MisasParish[];
}
interface CLIArgs {
all: boolean;
dryRun: boolean;
resumeFrom?: number;
jobId?: string;
}
interface ImportStats {
total: number;
created: number;
updated: number;
errors: number;
}
// ─── HTTP Client ──────────────────────────────────────────────────────────────
let requestCount = 0;
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetches one page; returns null on an HTTP or network failure.
async function fetchParishes(offset: number): Promise<MisasApiResponse | null> {
  // Rate-limit every request after the first.
  if (requestCount > 0) await delay(REQUEST_DELAY_MS);
  requestCount++;
  const url = `${API_BASE}?country=es&pos=${SPAIN_POS}&offset=${offset}&limit=${PAGE_SIZE}`;
  try {
    const response = await fetch(url, {
      headers: {
        'User-Agent': USER_AGENT,
        'Accept': 'application/json',
        'Referer': 'https://misas.org/',
      },
    });
    if (!response.ok) {
      console.error(`  HTTP ${response.status} at offset ${offset}`);
      return null;
    }
    return (await response.json()) as MisasApiResponse;
  } catch (error) {
    console.error(`  Fetch error at offset ${offset}: ${error instanceof Error ? error.message : error}`);
    return null;
  }
}
// ─── Pagination ───────────────────────────────────────────────────────────────
export async function* paginateParishes(startOffset: number = 0): AsyncGenerator<MisasParish> {
  let offset = startOffset;
  let totalKnown = Infinity;
  while (offset < totalKnown) {
    console.log(`  Fetching offset ${offset}${totalKnown < Infinity ? `/${totalKnown}` : ''}...`);
    const data = await fetchParishes(offset);
    if (!data) {
      // Fetch failed: stop here and tell the operator where to pick up again.
      console.warn(`  Stopping early at offset ${offset}; re-run with --resume-from ${offset}`);
      break;
    }
    // An empty page means the API has no more results.
    if (!data.pars || data.pars.length === 0) break;
    totalKnown = data.count;
    for (const parish of data.pars) {
      yield parish;
    }
    offset += data.pars.length;
  }
}
```
- [ ] **Step 2: Verify API returns expected data**
```bash
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
const { paginateParishes } = await import('./scripts/import-misas.ts');
let count = 0;
for await (const p of paginateParishes()) {
if (count === 0) console.log('First parish:', JSON.stringify(p, null, 2));
count++;
if (count >= 5) break;
}
console.log('Fetched:', count, 'from first batch');
"
```
Expected: Parish objects with id, name, lat, long, addr, loc, prov fields.
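The shape check in this step can also be done in code rather than by eye. The guard below is a minimal sketch: `isMisasParish` and `MisasParishShape` are hypothetical names that are not part of the importer, and the field types are assumed from the `MisasParish` interface in the script rather than from any published misas.org schema.

```typescript
// Hypothetical shape, copied from the fields this step expects to see.
interface MisasParishShape {
  id: number;
  name: string;
  addr: string;
  loc: string;
  prov: string;
  lat: string;
  long: string;
}

// Hypothetical helper: true only if the decoded item has every expected field
// with the assumed primitive type.
function isMisasParish(value: unknown): value is MisasParishShape {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = ['name', 'addr', 'loc', 'prov', 'lat', 'long'];
  return typeof v.id === 'number' && stringFields.every((f) => typeof v[f] === 'string');
}
```

Calling `isMisasParish(p)` on the first yielded parish and logging the result would flag an API shape change before any rows are written. If the API turns out to return `null` for optional fields, the guard would need loosening.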
- [ ] **Step 3: Commit**
```bash
git add scripts/import-misas.ts
git commit -m "feat: misas.org importer scaffold + API pagination"
```
---
### Task 7: DB upsert + main()
**Files:**
- Modify: `scripts/import-misas.ts`
- [ ] **Step 1: Add upsertParish + main()**
Note: `latitude`/`longitude` are non-nullable `Float` columns, so use `0` as a sentinel when coordinates are missing. Set `source` explicitly to `'misas-org'`, because the schema default is `"masstimes"`.
```typescript
// ─── DB Upsert ────────────────────────────────────────────────────────────────
async function upsertParish(
  parish: MisasParish,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats
): Promise<void> {
  const lat = parseFloat(parish.lat);
  const lng = parseFloat(parish.long);
  const misasOrgId = String(parish.id);
  // Treat coordinates as a pair: if either value fails to parse, use the 0 sentinel for both.
  const hasCoords = !Number.isNaN(lat) && !Number.isNaN(lng);
  const resolvedLat = hasCoords ? lat : 0;
  const resolvedLng = hasCoords ? lng : 0;
  const candidate = {
    name: parish.name,
    lat: resolvedLat,
    lng: resolvedLng,
    misasOrgId,
  };
  const existing = findDuplicateChurch(candidate, existingChurches);

  if (args.dryRun) {
    console.log(`  [dry-run] ${existing ? 'UPDATE' : 'CREATE'} ${parish.name} (${misasOrgId})`);
    stats.total++;
    if (existing) stats.updated++; else stats.created++;
    return;
  }

  try {
    // NOTE: the upsert is keyed on misasOrgId. A church that findDuplicateChurch matched
    // only by name/proximity (no misasOrgId yet) is therefore not updated here: a new row
    // is created while the stats below still count it as "updated".
    const church = await prisma.church.upsert({
      where: { misasOrgId },
      create: {
        misasOrgId,
        name: parish.name,
        address: parish.addr || null,
        city: parish.loc || null,
        state: parish.prov || null,
        zip: parish.zip || null,
        country: 'ES',
        source: 'misas-org', // must set explicitly; schema default is "masstimes"
        latitude: resolvedLat, // 0 = no real coordinates; misas.org provides coords for most
        longitude: resolvedLng,
        lastScrapedAt: new Date(),
        scrapeStrategy: 'misas-org',
      },
      update: {
        name: parish.name,
        address: parish.addr || undefined,
        city: parish.loc || undefined,
        state: parish.prov || undefined,
        zip: parish.zip || undefined,
        // Only update coords when both values parsed (don't overwrite good data with 0)
        ...(hasCoords && { latitude: resolvedLat, longitude: resolvedLng }),
        misasOrgId, // unchanged here: the where clause already matched on this ID
        lastScrapedAt: new Date(),
      },
    });
    if (existing) {
      stats.updated++;
    } else {
      stats.created++;
      // Keep the in-memory list current so later parishes can match against this one.
      existingChurches.push({
        id: church.id, name: parish.name,
        latitude: resolvedLat, longitude: resolvedLng,
        osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
        massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
        mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
        bohosluzbyId: null, miserendId: null, kerknetId: null,
        gottesdienstzeitenId: null, horarioDemissaId: null, misasOrgId,
        source: 'misas-org', website: null, phone: null,
        address: parish.addr || null, country: 'ES',
      });
    }
    stats.total++;
  } catch (error) {
    console.error(`  Error upserting ${parish.name}: ${error instanceof Error ? error.message : error}`);
    stats.errors++;
    stats.total++; // count errors in total so progress log fires correctly
  }
}
// ─── CLI + Main ───────────────────────────────────────────────────────────────
// Note: --job-id is accepted for scheduler compatibility but BackgroundJob status
// tracking is not wired up in this importer (acceptable for v1 — add later if needed).
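function parseArgs(): CLIArgs {
  const argv = process.argv.slice(2);
  const resumeIdx = argv.indexOf('--resume-from');
  const jobIdx = argv.indexOf('--job-id');
  const resumeFrom = resumeIdx >= 0 ? parseInt(argv[resumeIdx + 1], 10) : undefined;
  // A missing or non-numeric value would otherwise become NaN and silently skip the whole import.
  if (resumeFrom !== undefined && (Number.isNaN(resumeFrom) || resumeFrom < 0)) {
    console.error('--resume-from requires a non-negative integer offset');
    process.exit(1);
  }
  return {
    all: argv.includes('--all'),
    dryRun: argv.includes('--dry-run'),
    resumeFrom,
    jobId: jobIdx >= 0 ? argv[jobIdx + 1] : undefined,
  };
}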
async function main(): Promise<void> {
  const args = parseArgs();
  const stats: ImportStats = { total: 0, created: 0, updated: 0, errors: 0 };

  console.log('\n' + '='.repeat(70));
  console.log('MISAS.ORG (SPAIN) IMPORTER');
  console.log('='.repeat(70));
  console.log(`Mode: ${args.dryRun ? 'dry-run' : 'import'}`);
  if (args.resumeFrom) console.log(`Resume from offset: ${args.resumeFrom}`);
  console.log(`Time: ${new Date().toISOString()}\n`);

  if (!args.all) {
    console.error('Usage: npx tsx scripts/import-misas.ts --all [--dry-run] [--resume-from N] [--job-id UUID]');
    process.exit(1);
  }

  try {
    console.log('Loading existing ES churches...');
    const existingChurches = await prisma.church.findMany({
      where: { country: 'ES' },
      select: {
        id: true, name: true, latitude: true, longitude: true,
        osmId: true, baiduId: true, masstimesId: true, orarimesseId: true,
        massSchedulesPhId: true, philmassId: true, horariosMisasId: true,
        mszeInfoId: true, weekdayMassesId: true, messesInfoId: true,
        bohosluzbyId: true, miserendId: true, kerknetId: true,
        gottesdienstzeitenId: true, horarioDemissaId: true, misasOrgId: true,
        source: true, website: true, phone: true, address: true, country: true,
      },
    }) as ExistingChurch[];
    console.log(`Loaded ${existingChurches.length} existing ES churches\n`);

    for await (const parish of paginateParishes(args.resumeFrom ?? 0)) {
      await upsertParish(parish, existingChurches, args, stats);
      if (stats.total % 500 === 0) {
        console.log(`  Progress: ${stats.total} processed (${stats.created} created, ${stats.updated} updated)`);
      }
    }
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }

  console.log('\n' + '='.repeat(70));
  console.log('SUMMARY');
  console.log('='.repeat(70));
  console.log(`Total processed: ${stats.total}`);
  console.log(`  Created: ${stats.created}`);
  console.log(`  Updated: ${stats.updated}`);
  console.log(`  Errors: ${stats.errors}`);
  console.log('='.repeat(70) + '\n');
}

main().catch((error) => {
  console.error(error);
  process.exit(1); // non-zero exit so callers can detect a failed run
});
```
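The `0` sentinel rule from the note above is easy to get subtly wrong when only one of the two values fails to parse. As a sketch only (`resolveCoords` and `ResolvedCoords` are hypothetical names, not part of the script), the pair can be resolved in one place:

```typescript
interface ResolvedCoords {
  latitude: number;
  longitude: number;
  hasCoords: boolean; // false means the 0 values are placeholders, not a real location
}

// Hypothetical helper: parses the string coordinates from the API and falls
// back to the (0, 0) sentinel unless BOTH values are finite numbers.
function resolveCoords(latRaw: string, longRaw: string): ResolvedCoords {
  const lat = parseFloat(latRaw);
  const lng = parseFloat(longRaw);
  const hasCoords = Number.isFinite(lat) && Number.isFinite(lng);
  return hasCoords
    ? { latitude: lat, longitude: lng, hasCoords }
    : { latitude: 0, longitude: 0, hasCoords };
}
```

The `hasCoords` flag can then gate the coordinate fields in the `update` branch, so a half-parsed pair never overwrites good data.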
- [ ] **Step 2: Test dry-run end-to-end**
```bash
npx tsx scripts/import-misas.ts --all --dry-run 2>&1 | tail -20
```
Expected: Processes all parishes (about 17,919 at the time of writing) and prints a `Total processed:` line with the created/updated split.
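For a sense of how long the dry run spends waiting between requests, the request count follows from the `PAGE_SIZE` and `REQUEST_DELAY_MS` constants defined in the script. The sketch below assumes the API honours `limit=500` and returns full pages, which the plan does not confirm; `estimateRun` is a hypothetical helper.

```typescript
// Hypothetical helper: estimates the request count and the minimum time spent
// in inter-request delays. Assumes every page except the last is full.
function estimateRun(totalParishes: number, pageSize: number, delayMs: number) {
  const requests = Math.ceil(totalParishes / pageSize);
  // The importer skips the delay before the first request only.
  const minDelayMs = Math.max(0, requests - 1) * delayMs;
  return { requests, minDelayMs };
}

const estimate = estimateRun(17919, 500, 500);
console.log(estimate); // { requests: 36, minDelayMs: 17500 }
```

Under those assumptions the delays alone add about 17.5 seconds; network and database time come on top.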
- [ ] **Step 3: Commit**
```bash
git add scripts/import-misas.ts
git commit -m "feat: complete misas.org importer (Spain, 17919 churches with coordinates)"
```
---
## Chunk 4: Integration
### Task 8: package.json + scheduler
**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`
- [ ] **Step 1: Add npm scripts**
In the `"scripts"` block of `package.json`, add after `"import:masstimes-api"` (that entry needs a trailing comma if it was previously the last one):
```json
"import:horariodemissa": "tsx scripts/import-horariodemissa.ts",
"import:misas": "tsx scripts/import-misas.ts"
```
- [ ] **Step 2: Add getJobCommand cases in scheduler.ts**
In `scripts/scheduler.ts`, add before `default:` in `getJobCommand()`:
```typescript
    case 'horariodemissa-import': {
      const args = ['tsx', 'scripts/import-horariodemissa.ts', '--all'];
      if (config?.state) args.push('--state', String(config.state));
      if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
      if (config?.geocode) args.push('--geocode');
      return { command: 'npx', args };
    }
    case 'misas-import': {
      const args = ['tsx', 'scripts/import-misas.ts', '--all'];
      if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
      return { command: 'npx', args };
    }
```
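One way to keep these cases testable is to build the argument list in a pure function. The sketch below mirrors the `misas-import` case; `buildMisasArgs` and the `MisasJobConfig` type are hypothetical, since the plan does not show the scheduler's actual config type.

```typescript
// Hypothetical config shape; the real scheduler config type is not shown in this plan.
interface MisasJobConfig {
  resumeFrom?: number;
}

// Mirrors the 'misas-import' case: always passes --all, adds --resume-from when set.
function buildMisasArgs(config?: MisasJobConfig): { command: string; args: string[] } {
  const args = ['tsx', 'scripts/import-misas.ts', '--all'];
  // A resumeFrom of 0 is falsy and is skipped, which is the same as starting from the beginning.
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```

With `{ resumeFrom: 5000 }` this yields the same arguments as the documented `--all --resume-from 5000` invocation.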
- [ ] **Step 3: Add to PIPELINE_GROUPS imports sequence**
In `PIPELINE_GROUPS[0].phases`, add after the `masstimes-api-import` entry:
```typescript
{ name: 'horariodemissa-import', type: 'horariodemissa-import', config: {} },
{ name: 'misas-import', type: 'misas-import', config: {} },
```
- [ ] **Step 4: Verify TypeScript**
```bash
npx tsc --noEmit
```
Expected: no errors.
- [ ] **Step 5: Smoke test both npm scripts**
```bash
npm run import:horariodemissa -- --state DF --dry-run 2>&1 | tail -10
npm run import:misas -- --all --dry-run 2>&1 | tail -10
```
- [ ] **Step 6: Commit**
```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add horariodemissa and misas.org to npm scripts and scheduler pipeline"
```
---
## Final Verification
- [ ] **Import small state from Brazil to confirm end-to-end**
```bash
npx tsx scripts/import-horariodemissa.ts --state DF
```
```bash
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
import { prisma } from './src/lib/db.ts';
const churches = await prisma.church.count({ where: { country: 'BR' } });
const schedules = await prisma.massSchedule.count({ where: { church: { country: 'BR' } } });
console.log('BR churches:', churches, '| Mass schedules:', schedules);
await prisma.\$disconnect();
"
```
Expected: Both counts are non-zero, i.e. Distrito Federal churches are in the DB with mass schedules.
- [ ] **Dry-run Spain importer full pass**
```bash
npx tsx scripts/import-misas.ts --all --dry-run 2>&1 | grep -E "SUMMARY|Total|Created|Updated" | tail -10
```
Expected: ~17,919 total, mix of created vs updated depending on existing ES church overlap.
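The summary counters can also be checked mechanically. The following sketch parses the SUMMARY block printed by `main()` and verifies that the counters add up; `parseSummary` is a hypothetical helper and the sample text is invented for illustration, not real importer output.

```typescript
interface Summary {
  total: number;
  created: number;
  updated: number;
  errors: number;
}

// Hypothetical helper: extracts the four counters from the importer's SUMMARY output.
// Returns null if any of the four lines is missing.
function parseSummary(output: string): Summary | null {
  const read = (label: string): number | null => {
    const match = output.match(new RegExp(`${label}:\\s*(\\d+)`));
    return match ? parseInt(match[1], 10) : null;
  };
  const total = read('Total processed');
  const created = read('Created');
  const updated = read('Updated');
  const errors = read('Errors');
  if (total === null || created === null || updated === null || errors === null) return null;
  return { total, created, updated, errors };
}

// Illustrative input only (numbers invented for the example).
const sample = ['Total processed: 10', '  Created: 6', '  Updated: 3', '  Errors: 1'].join('\n');
const summary = parseSummary(sample);
const consistent =
  summary !== null && summary.total === summary.created + summary.updated + summary.errors;
console.log(consistent); // true
```

Note that the `grep` in the command above filters out the `Errors:` line, so this check would need the unfiltered output.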