ScraperControl/docs/superpowers/plans/2026-03-10-brazil-spain-importers.md
albertfj114 0e468bcb94 docs: add Brazil + Spain importers design spec and implementation plan
Two new importers:
- horariodemissa.com.br: 8,895 Brazilian churches + 28,523 mass times
- misas.org: 17,919 Spanish churches with coordinates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:50:54 -04:00


Brazil + Spain Importers Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add two new church importers — horariodemissa.com.br (8,895 Brazilian churches + 28,523 mass times) and misas.org (17,919 Spanish churches with coordinates).

Architecture: Chunk 1 (shared prerequisites) must complete first. Tasks 3-5 (Brazil) and Tasks 6-7 (Spain) are independent of each other and can run in parallel as subagents. All scripts follow the established importer pattern: fetch → regex parse → church-matcher dedup → prisma upsert.
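
As a rough picture of that four-stage pattern, here is a minimal sketch with the stages injected as functions so the flow can be exercised without network or database access. Every name in it (`runImporter`, `ImporterStages`, `Candidate`, `RunStats`) is hypothetical and chosen for illustration; the real importers below wire the stages together directly.

```typescript
// Hypothetical sketch of the fetch → parse → dedup → upsert shape.
// None of these names come from the repository.

interface Candidate { sourceId: string; name: string; }

interface ImporterStages {
  fetch: (url: string) => Promise<string | null>;
  parse: (body: string) => Candidate[];
  findDuplicate: (c: Candidate) => Candidate | undefined;
  upsert: (c: Candidate, isUpdate: boolean) => Promise<void>;
}

interface RunStats { created: number; updated: number; errors: number; }

async function runImporter(urls: string[], stages: ImporterStages): Promise<RunStats> {
  const stats: RunStats = { created: 0, updated: 0, errors: 0 };
  for (const url of urls) {
    const body = await stages.fetch(url);
    if (body === null) { stats.errors++; continue; } // failed fetch: count it and move on
    for (const candidate of stages.parse(body)) {
      const existing = stages.findDuplicate(candidate);
      await stages.upsert(candidate, existing !== undefined);
      if (existing) stats.updated++; else stats.created++;
    }
  }
  return stats;
}
```

Passing the stages in keeps each one replaceable with a stub, which is one way to dry-run an importer before it touches the database.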

Tech Stack: TypeScript, tsx, native fetch, regex HTML parsing (matchAll), Prisma + pg, church-matcher

Spec: docs/superpowers/specs/2026-03-10-brazil-spain-importers-design.md


Chunk 1: Shared Prerequisites (schema + church-matcher)

Task 1: Schema additions

Files:

  • Modify: prisma/schema.prisma

  • Step 1: Add two new ID fields to the Church model

In prisma/schema.prisma, find the block of importer ID fields (near gottesdienstzeitenId) and add after it:

horarioDemissaId      String?   @unique @map("horario_demissa_id")
misasOrgId            String?   @unique @map("misas_org_id")

Then add two indexes in the @@index block at the bottom of the Church model:

@@index([horarioDemissaId])
@@index([misasOrgId])
  • Step 2: Regenerate Prisma client
npx prisma generate

Expected: ✔ Generated Prisma Client with no errors.

  • Step 3: Verify the fields exist in generated types
grep -n "horarioDemissaId\|misasOrgId" node_modules/.prisma/client/index.d.ts | head -10

Expected: both fields appear in the type definitions.

  • Step 4: Commit
git add prisma/schema.prisma
git commit -m "feat: add horarioDemissaId and misasOrgId fields to Church schema"

Task 2: church-matcher updates

Files:

  • Modify: src/lib/church-matcher.ts

  • Step 1: Add new fields to ExistingChurch interface

In src/lib/church-matcher.ts, find ExistingChurch interface and add after gottesdienstzeitenId:

horarioDemissaId: string | null;
misasOrgId: string | null;
  • Step 2: Add new fields to ChurchCandidate type

Find ChurchCandidate type and add after gottesdienstzeitenId?:

horarioDemissaId?: string;
misasOrgId?: string;
  • Step 3: Add two new exact-match passes in findDuplicateChurch

After the Thirteenth pass (gottesdienstzeitenId), add before the proximity pass:

  // Fourteenth pass: exact horarioDemissaId match
  if (candidate.horarioDemissaId) {
    const match = existingChurches.find(
      (church) => church.horarioDemissaId === candidate.horarioDemissaId
    );
    if (match) return match;
  }

  // Fifteenth pass: exact misasOrgId match
  if (candidate.misasOrgId) {
    const match = existingChurches.find(
      (church) => church.misasOrgId === candidate.misasOrgId
    );
    if (match) return match;
  }
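
To see how the two exact-ID passes behave in isolation, the following sketch models only the new fields. `ChurchRow`, `IdCandidate`, and `findByImporterId` are simplified, hypothetical stand-ins for `ExistingChurch`, `ChurchCandidate`, and the relevant slice of `findDuplicateChurch`; the sample IDs are invented.

```typescript
// Simplified stand-ins: only the two new ID fields are modelled here.
interface ChurchRow {
  id: string;
  horarioDemissaId: string | null;
  misasOrgId: string | null;
}

interface IdCandidate {
  horarioDemissaId?: string;
  misasOrgId?: string;
}

// Hypothetical helper: runs the two exact-ID passes in order and returns the
// first hit, or undefined so a caller could fall through to proximity matching.
function findByImporterId(candidate: IdCandidate, rows: ChurchRow[]): ChurchRow | undefined {
  const idFields = ['horarioDemissaId', 'misasOrgId'] as const;
  for (const field of idFields) {
    const wanted = candidate[field];
    if (!wanted) continue; // candidate does not carry this ID: skip the pass
    const match = rows.find((row) => row[field] === wanted);
    if (match) return match;
  }
  return undefined;
}

const sampleRows: ChurchRow[] = [
  { id: 'a', horarioDemissaId: 'dvey2', misasOrgId: null },
  { id: 'b', horarioDemissaId: null, misasOrgId: '1234' },
];
```

With these rows, a candidate carrying `horarioDemissaId: 'dvey2'` resolves to row `a`, while a candidate with neither ID returns `undefined`. The guard on `wanted` mirrors the `if (candidate.…)` checks above: a missing ID on the candidate never matches a `null` ID on a row.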
  • Step 4: Verify TypeScript compiles
npx tsc --noEmit

Expected: no errors.

  • Step 5: Commit
git add src/lib/church-matcher.ts
git commit -m "feat: add horarioDemissaId and misasOrgId to church-matcher"

Chunk 2: Brazil Importer (import-horariodemissa.ts)

Depends on Chunk 1. Can run in parallel with Chunk 3.

Task 3: Boilerplate + sitemap enumeration

Files:

  • Create: scripts/import-horariodemissa.ts

  • Step 1: Create script with boilerplate + types + sitemap parsing

Create scripts/import-horariodemissa.ts:

#!/usr/bin/env tsx
/**
 * Import Catholic churches and mass schedules from horariodemissa.com.br (Brazil)
 *
 * horariodemissa.com.br has 8,895 churches across all 26 Brazilian states + DF,
 * with 28,523 mass times. All data is server-rendered — one HTTP request per city
 * page returns all churches + schedules for that city.
 *
 * City pages have a split structure:
 *   - Address/phone: embedded in JS h.push() strings (sidebar/map data)
 *   - Schedules: in server-rendered .result divs with <table> rows
 *   Both sets are linked by the same church key (e.g. "dvey2").
 *
 * Import strategy:
 *   1. Fetch sitemap.xml → deduplicate to pt-only city URLs (~3,552 cities)
 *   2. For each city: fetch page → parse address/phone from JS + schedules from DOM
 *   3. Join by church key, match against existing BR churches, upsert
 *   4. Optional --geocode flag for Nominatim pass after import
 *
 * Usage:
 *   npx tsx scripts/import-horariodemissa.ts --all
 *   npx tsx scripts/import-horariodemissa.ts --all --dry-run
 *   npx tsx scripts/import-horariodemissa.ts --state SP
 *   npx tsx scripts/import-horariodemissa.ts --all --resume-from 500
 *   npx tsx scripts/import-horariodemissa.ts --all --geocode
 *   npx tsx scripts/import-horariodemissa.ts --geocode-only
 *   npx tsx scripts/import-horariodemissa.ts --all --job-id {uuid}
 */

import dotenv from 'dotenv';
import path from 'path';

dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
  connectionString: dbUrl,
  ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';

// ─── Constants ───────────────────────────────────────────────────────────────

const SITE_BASE = 'https://horariodemissa.com.br';
const SITEMAP_URL = `${SITE_BASE}/sitemap.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';

// ─── Types ───────────────────────────────────────────────────────────────────

interface CityUrl {
  state: string;  // e.g. "SP"
  city: string;   // e.g. "São Paulo"
  url: string;    // full fetch URL
}

interface ParsedSchedule {
  dayOfWeek: number;    // 0=Sun, 1=Mon, ..., 6=Sat
  time: string;         // "HH:MM"
  notes: string | null;
}

interface ParsedConfession {
  dayOfWeek: number;
  startTime: string;
  endTime: string;
  notes: string | null;
}

interface ParsedChurch {
  key: string;      // e.g. "dvey2" (used as horarioDemissaId)
  name: string;
  address: string | null;
  phone: string | null;
  city: string;
  state: string;
  massSchedules: ParsedSchedule[];
  confessionSchedules: ParsedConfession[];
}

interface CLIArgs {
  all: boolean;
  state?: string;
  dryRun: boolean;
  geocode: boolean;
  geocodeOnly: boolean;
  resumeFrom?: number;
  jobId?: string;
}

interface ImportStats {
  citiesProcessed: number;
  churchesFound: number;
  churchesCreated: number;
  churchesUpdated: number;
  massSchedulesCreated: number;
  geocoded: number;
  geocodeFailed: number;
  errors: number;
}

// ─── Brazilian Day Name Mapping ───────────────────────────────────────────────

const DAY_MAP: Record<string, number> = {
  'domingo': 0,
  'segunda-feira': 1, 'segunda': 1,
  'terça-feira': 2, 'terca-feira': 2, 'terça': 2,
  'quarta-feira': 3, 'quarta': 3,
  'quinta-feira': 4, 'quinta': 4,
  'sexta-feira': 5, 'sexta': 5,
  'sábado': 6, 'sabado': 6,
};

const SPECIAL_DAY_MAP: Record<string, { dayOfWeek: number; notes: string }> = {
  'primeiro domingo': { dayOfWeek: 0, notes: 'Primeiro Domingo' },
  'segundo domingo': { dayOfWeek: 0, notes: 'Segundo Domingo' },
  'terceiro domingo': { dayOfWeek: 0, notes: 'Terceiro Domingo' },
  'quarto domingo': { dayOfWeek: 0, notes: 'Quarto Domingo' },
  'primeiro sábado': { dayOfWeek: 6, notes: 'Primeiro Sábado' },
  'primeiro sabado': { dayOfWeek: 6, notes: 'Primeiro Sábado' },
  'segundo sábado': { dayOfWeek: 6, notes: 'Segundo Sábado' },
  'segundo sabado': { dayOfWeek: 6, notes: 'Segundo Sábado' },
};

// ─── HTTP Client ──────────────────────────────────────────────────────────────

let requestCount = 0;

function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function fetchPage(url: string, delayMs: number = REQUEST_DELAY_MS): Promise<string | null> {
  if (requestCount > 0) await delay(delayMs);
  requestCount++;

  try {
    const response = await fetch(url, {
      headers: {
        'User-Agent': USER_AGENT,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'pt-BR,pt;q=0.9',
      },
    });

    if (!response.ok) {
      console.error(`  HTTP ${response.status} for ${url}`);
      return null;
    }

    return await response.text();
  } catch (error) {
    console.error(`  Fetch error for ${url}: ${error instanceof Error ? error.message : error}`);
    return null;
  }
}

// ─── Sitemap Parser ───────────────────────────────────────────────────────────

export function parseCityUrlsFromSitemap(sitemapXml: string, filterState?: string): CityUrl[] {
  const seen = new Set<string>();
  const cities: CityUrl[] = [];

  for (const match of sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    const rawUrl = match[1].replace(/&amp;/g, '&');

    // Only pt-language city search pages
    if (!rawUrl.includes('opcoes=cidade_opcoes') || rawUrl.includes('hl=en')) continue;

    const ufMatch = rawUrl.match(/[?&]uf=([A-Z]+)/);
    const cidadeMatch = rawUrl.match(/[?&]cidade=([^&]+)/);
    if (!ufMatch || !cidadeMatch) continue;

    const state = ufMatch[1];
    const city = decodeURIComponent(cidadeMatch[1].replace(/\+/g, ' '));

    if (filterState && state !== filterState.toUpperCase()) continue;

    const key = `${state}:${city}`;
    if (seen.has(key)) continue;
    seen.add(key);

    cities.push({ state, city, url: rawUrl });
  }

  cities.sort((a, b) => a.state.localeCompare(b.state) || a.city.localeCompare(b.city));
  return cities;
}

async function fetchCityUrls(filterState?: string): Promise<CityUrl[]> {
  console.log(`Fetching sitemap: ${SITEMAP_URL}`);
  const xml = await fetchPage(SITEMAP_URL);
  if (!xml) throw new Error('Failed to fetch sitemap');

  const cities = parseCityUrlsFromSitemap(xml, filterState);
  console.log(`Found ${cities.length} unique cities${filterState ? ` in ${filterState}` : ''}`);
  return cities;
}
  • Step 2: Verify sitemap parsing works
npx tsx -e "
import dotenv from 'dotenv';
dotenv.config({ path: '.env' });
const { parseCityUrlsFromSitemap } = await import('./scripts/import-horariodemissa.ts');
const xml = await fetch('https://horariodemissa.com.br/sitemap.xml').then(r => r.text());
const cities = parseCityUrlsFromSitemap(xml);
console.log('Total cities:', cities.length);
console.log('Sample:', JSON.stringify(cities.slice(0, 3), null, 2));
const states = [...new Set(cities.map(c => c.state))].sort();
console.log('States:', states.join(', '));
"

Expected: ~3,500 cities, states include SP, RJ, MG, RS, BA, DF, etc.
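
The live check depends on the network. For an offline check of the same filter and dedup rules, a sketch along these lines may help. The fixture `<loc>` values are assumptions modelled on the search URL shape used elsewhere in this plan, not copied from the real sitemap, and `listCities` is a hypothetical condensed restatement of `parseCityUrlsFromSitemap` without the state filter.

```typescript
// Offline fixture: assumed URL shapes, not real sitemap content.
const SAMPLE_SITEMAP = `
<urlset>
  <url><loc>https://horariodemissa.com.br/search.php?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>
  <url><loc>https://horariodemissa.com.br/search.php?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;opcoes=cidade_opcoes&amp;hl=en</loc></url>
  <url><loc>https://horariodemissa.com.br/search.php?uf=SP&amp;cidade=S%C3%A3o+Paulo&amp;bairro=&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>
  <url><loc>https://horariodemissa.com.br/search.php?uf=DF&amp;cidade=Bras%C3%ADlia&amp;opcoes=cidade_opcoes&amp;hl=pt</loc></url>
  <url><loc>https://horariodemissa.com.br/igreja.php?k=dvey2</loc></url>
</urlset>`;

interface CityRef { state: string; city: string; }

// Condensed restatement of the filter + dedup rules (hypothetical helper name).
function listCities(sitemapXml: string): CityRef[] {
  const seen = new Set<string>();
  const cities: CityRef[] = [];
  for (const match of sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    const url = match[1].replace(/&amp;/g, '&');
    // Keep only city search pages, and drop the English-language duplicates
    if (!url.includes('opcoes=cidade_opcoes') || url.includes('hl=en')) continue;
    const uf = url.match(/[?&]uf=([A-Z]+)/);
    const cidade = url.match(/[?&]cidade=([^&]+)/);
    if (!uf || !cidade) continue;
    const city = decodeURIComponent(cidade[1].replace(/\+/g, ' '));
    const key = `${uf[1]}:${city}`;
    if (seen.has(key)) continue; // same city reached through another URL variant
    seen.add(key);
    cities.push({ state: uf[1], city });
  }
  return cities.sort((a, b) => a.state.localeCompare(b.state) || a.city.localeCompare(b.city));
}

const fixtureCities = listCities(SAMPLE_SITEMAP);
```

On this fixture the five `<loc>` entries collapse to two cities (Brasília in DF, then São Paulo in SP): the `hl=en` variant and the non-search page are filtered out, and the second pt URL for São Paulo is dropped by the `seen` set.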

  • Step 3: Commit
git add scripts/import-horariodemissa.ts
git commit -m "feat: horariodemissa importer scaffold + sitemap enumeration"

Task 4: HTML parsing

Files:

  • Modify: scripts/import-horariodemissa.ts

  • Step 1: Understand the dual-source page structure

Each city page contains two data sources per church, joined by the same key (e.g. dvey2):

Source A — JS h.push() strings embedded in <script> (sidebar/map):

h.push('<p><strong><a href="igreja.php?k=dvey2">NAME</a></strong><br/>Rua X, 123</p><p><strong>Telefone:</strong> (11) 1234-5678</p>');

Contains: key, name, address, phone.

Source B — Server-rendered .result divs:

<div class="result">
  <a href="igreja.php?k=dvey2" class="result_title">NAME</a>
  <p class="blockleft"><table>
    <tr><td style="...">Domingo:</td><td>07:30, 10:30</td></tr>
  </table></p>
</div>

Contains: key + schedule tables (first = masses, optional second = confessions).
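
The join between the two sources can be sketched as a lookup from one map into the other. The shapes below (`Contact`, `Joined`, `joinByKey`) are reduced, hypothetical versions of what the parser in Step 2 builds, and the sample keys other than `dvey2` are invented.

```typescript
// Hypothetical, reduced shapes: the plan's ParsedChurch has more fields.
interface Contact { name: string; address: string | null; }
interface Joined extends Contact { key: string; massTimes: string[]; }

function joinByKey(contacts: Map<string, Contact>, schedules: Map<string, string[]>): Joined[] {
  const joined: Joined[] = [];
  for (const [key, contact] of contacts) {
    if (!contact.name) continue; // no usable name: skip, as the parser does
    // Source A drives the join. A church with no schedule table still gets an
    // empty list; a schedule whose key has no Source A entry is dropped.
    joined.push({ key, ...contact, massTimes: schedules.get(key) ?? [] });
  }
  return joined;
}

const contactsByKey = new Map<string, Contact>([
  ['dvey2', { name: 'NAME', address: 'Rua X, 123' }],
  ['abc12', { name: 'OTHER', address: null }],
]);
const schedulesByKey = new Map<string, string[]>([
  ['dvey2', ['07:30', '10:30']],
  ['zzz99', ['08:00']], // present only in Source B
]);
const joinedChurches = joinByKey(contactsByKey, schedulesByKey);
```

Here `dvey2` keeps both mass times, `abc12` is kept with an empty schedule, and `zzz99` does not appear in the output because nothing in Source A names it.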

  • Step 2: Add parseDayLabel, parseTimeCells, parseMassTable, parseConfessionTable
// ─── HTML Parsers ─────────────────────────────────────────────────────────────

export function parseDayLabel(label: string): { dayOfWeek: number; notes: string | null } | null {
  const normalized = label.toLowerCase().replace(/:$/, '').trim();

  if (SPECIAL_DAY_MAP[normalized]) {
    const s = SPECIAL_DAY_MAP[normalized];
    return { dayOfWeek: s.dayOfWeek, notes: s.notes };
  }

  if (DAY_MAP[normalized] !== undefined) {
    return { dayOfWeek: DAY_MAP[normalized], notes: null };
  }

  return null;
}

export function parseTimeCells(timesText: string): Array<{ time: string; notes: string | null }> {
  const results: Array<{ time: string; notes: string | null }> = [];

  // Split by comma but not inside parentheses
  const parts = timesText.split(/,(?![^(]*\))/);

  for (const part of parts) {
    const trimmed = part.trim();
    if (!trimmed) continue;

    const timeMatch = trimmed.match(/\b(\d{1,2}:\d{2})\b/);
    if (!timeMatch) continue;

    const [h, m] = timeMatch[1].split(':');
    const time = `${h.padStart(2, '0')}:${m}`;

    const notesMatch = trimmed.match(/\(([^)]+)\)/);
    results.push({ time, notes: notesMatch ? notesMatch[1].trim() : null });
  }

  return results;
}

export function parseMassTable(tableHtml: string): ParsedSchedule[] {
  const schedules: ParsedSchedule[] = [];

  for (const rowMatch of tableHtml.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)) {
    const tds = [...rowMatch[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
      .map(m => m[1].replace(/<[^>]+>/g, '').trim());

    if (tds.length < 2) continue;

    const dayResult = parseDayLabel(tds[0]);
    if (!dayResult) continue;

    for (const { time, notes } of parseTimeCells(tds[1])) {
      schedules.push({
        dayOfWeek: dayResult.dayOfWeek,
        time,
        notes: [dayResult.notes, notes].filter(Boolean).join('; ') || null,
      });
    }
  }

  return schedules;
}

export function parseConfessionTable(tableHtml: string): ParsedConfession[] {
  const confessions: ParsedConfession[] = [];

  for (const rowMatch of tableHtml.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)) {
    const tds = [...rowMatch[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
      .map(m => m[1].replace(/<[^>]+>/g, '').trim());

    if (tds.length < 2) continue;

    const dayResult = parseDayLabel(tds[0]);
    if (!dayResult) continue;

    // "09:00 às 11:00" or "09:00 a 11:00"
    const rangeMatch = tds[1].match(/(\d{1,2}:\d{2})\s+(?:às|a)\s+(\d{1,2}:\d{2})/i);
    if (!rangeMatch) continue;

    const pad = (t: string) => { const [hh, mm] = t.split(':'); return `${hh.padStart(2,'0')}:${mm}`; };
    confessions.push({
      dayOfWeek: dayResult.dayOfWeek,
      startTime: pad(rangeMatch[1]),
      endTime: pad(rangeMatch[2]),
      notes: dayResult.notes,
    });
  }

  return confessions;
}

/**
 * Parse a full city page HTML into church records.
 * Joins h.push() JS data (name/address/phone) with .result DOM (schedules) by church key.
 */
export function parseCityPage(html: string, city: string, state: string): ParsedChurch[] {
  // Parse Source A: h.push() JS strings → name, address, phone
  const jsData = new Map<string, { name: string; address: string | null; phone: string | null }>();

  for (const pushMatch of html.matchAll(/h\.push\('([\s\S]*?)'\);/g)) {
    const content = pushMatch[1].replace(/\\'/g, "'");

    const keyMatch = content.match(/igreja\.php\?k=([a-zA-Z0-9]+)/);
    if (!keyMatch) continue;

    const nameMatch = content.match(/igreja\.php\?k=[^"]+">([^<]+)<\/a>/);
    const addrMatch = content.match(/<br\/>([^<]+)<\/p>/);
    const phoneMatch = content.match(/Telefone:<\/strong>\s*([^<]+)/);

    jsData.set(keyMatch[1], {
      name: nameMatch ? nameMatch[1].trim() : '',
      address: addrMatch ? addrMatch[1].trim() || null : null,
      phone: phoneMatch ? phoneMatch[1].trim() || null : null,
    });
  }

  // Parse Source B: .result divs → schedules
  // Use split() rather than a lookahead regex — lookahead with $ drops the last result div
  const scheduleData = new Map<string, { massSchedules: ParsedSchedule[]; confessionSchedules: ParsedConfession[] }>();

  const resultParts = html.split('<div class="result">');
  for (let i = 1; i < resultParts.length; i++) {
    const resultHtml = resultParts[i];

    const keyMatch = resultHtml.match(/href="igreja\.php\?k=([a-zA-Z0-9]+)"/);
    if (!keyMatch) continue;

    const tables = [...resultHtml.matchAll(/<table>([\s\S]*?)<\/table>/g)].map(m => m[1]);
    scheduleData.set(keyMatch[1], {
      massSchedules: tables[0] ? parseMassTable(tables[0]) : [],
      confessionSchedules: tables[1] ? parseConfessionTable(tables[1]) : [],
    });
  }

  // Join both sources by church key — every church in jsData gets its schedules from scheduleData
  const allKeys = new Set([...jsData.keys(), ...scheduleData.keys()]);
  const churches: ParsedChurch[] = [];

  for (const key of allKeys) {
    const js = jsData.get(key);
    const sched = scheduleData.get(key);
    if (!js?.name) continue;

    churches.push({
      key,
      name: js.name,
      address: js.address,
      phone: js.phone,
      city,
      state,
      massSchedules: sched?.massSchedules ?? [],
      confessionSchedules: sched?.confessionSchedules ?? [],
    });
  }

  return churches;
}
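
The comma-splitting rule in `parseTimeCells` is the least obvious part of the parser, so it may be worth trying on a sample cell. The sketch below restates that rule and the zero-padding step in isolation; `splitOutsideParens` and `normalizeTime` are hypothetical helper names, and the sample cell text is invented rather than taken from the site.

```typescript
// Restates the comma-splitting rule in isolation so it can be tried on sample cells.
function splitOutsideParens(cell: string): string[] {
  return cell
    .split(/,(?![^(]*\))/) // a comma that reaches ")" before any "(" sits inside a note
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Extracts the first H:MM or HH:MM token and zero-pads the hour.
function normalizeTime(raw: string): string | null {
  const match = raw.match(/\b(\d{1,2}):(\d{2})\b/);
  if (!match) return null;
  return `${match[1].padStart(2, '0')}:${match[2]}`;
}

// Invented sample: three times, one with a parenthesised note containing a comma.
const sampleParts = splitOutsideParens('7:30, 10:30 (crianças, jovens), 19:00');
const sampleTimes = sampleParts.map(normalizeTime);
```

The comma inside the parentheses is left alone, so the cell yields three parts rather than four, and `7:30` is normalized to `07:30`.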
  • Step 3: Verify parsing against a live city page
npx tsx -e "
import dotenv from 'dotenv';
dotenv.config({ path: '.env' });
const { parseCityPage } = await import('./scripts/import-horariodemissa.ts');
const url = 'https://horariodemissa.com.br/search.php?uf=SP&cidade=S%C3%A3o+Paulo&bairro=&opcoes=cidade_opcoes&submit=12345678&hl=pt';
const html = await fetch(url, { headers: { 'User-Agent': 'NearestMass-Importer/1.0' } }).then(r => r.text());
const churches = parseCityPage(html, 'São Paulo', 'SP');
console.log('Churches found:', churches.length);
console.log('With schedules:', churches.filter(c => c.massSchedules.length > 0).length);
console.log('Sample:', JSON.stringify(churches[0], null, 2));
"

Expected: 20+ churches found, majority with mass schedules, first entry shows name/address/phone/schedules.

  • Step 4: Commit
git add scripts/import-horariodemissa.ts
git commit -m "feat: horariodemissa HTML parser (day mapping, schedule tables, dual-source join)"

Task 5: DB upsert + main()

Files:

  • Modify: scripts/import-horariodemissa.ts

  • Step 1: Add geocode helper

// ─── Geocoding ────────────────────────────────────────────────────────────────

async function geocodeAddress(address: string, city: string, state: string): Promise<{ lat: number; lng: number } | null> {
  const query = [address, city, state, 'Brasil'].filter(Boolean).join(', ');
  const url = `${NOMINATIM_URL}?q=${encodeURIComponent(query)}&format=json&limit=1&countrycodes=br`;
  await delay(NOMINATIM_DELAY_MS);

  try {
    const response = await fetch(url, {
      headers: { 'User-Agent': USER_AGENT, 'Accept': 'application/json' },
    });
    if (!response.ok) return null;
    const results = await response.json() as Array<{ lat: string; lon: string }>;
    if (!results.length) return null;
    return { lat: parseFloat(results[0].lat), lng: parseFloat(results[0].lon) };
  } catch {
    return null;
  }
}
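
`geocodeAddress` casts the response body and trusts it. As a possible hardening step, the sketch below validates a Nominatim-style body before using it. `firstCoords` is a hypothetical helper, not part of the plan; it assumes only what the cast above already assumes, namely an array of objects whose `lat` and `lon` are numeric strings.

```typescript
interface Coords { lat: number; lng: number; }

// Hypothetical helper: returns coordinates from the first result, or null if
// the body is not shaped like a non-empty Nominatim JSON result list.
function firstCoords(body: unknown): Coords | null {
  if (!Array.isArray(body) || body.length === 0) return null;
  const first: unknown = body[0];
  if (typeof first !== 'object' || first === null) return null;
  const { lat: rawLat, lon: rawLon } = first as { lat?: unknown; lon?: unknown };
  const lat = typeof rawLat === 'string' ? Number.parseFloat(rawLat) : NaN;
  const lng = typeof rawLon === 'string' ? Number.parseFloat(rawLon) : NaN;
  // Reject unparsable values instead of writing NaN into the database
  if (!Number.isFinite(lat) || !Number.isFinite(lng)) return null;
  return { lat, lng };
}
```

A caller could replace the `results[0]` access with `firstCoords(await response.json())` and keep the same `null` handling.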
  • Step 2: Add upsertChurch function

Note: latitude/longitude are non-nullable in the schema. Use 0 as the sentinel for "no coordinates yet" (geocode pass will fill these in). The source field must be set explicitly — the schema default is "masstimes" which would corrupt source-based queries.

// ─── DB Upsert ────────────────────────────────────────────────────────────────

async function upsertChurch(
  parsed: ParsedChurch,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats
): Promise<void> {
  const candidate = { name: parsed.name, lat: 0, lng: 0, horarioDemissaId: parsed.key };
  const existing = findDuplicateChurch(candidate, existingChurches);

  if (args.dryRun) {
    console.log(`  [dry-run] ${existing ? 'UPDATE' : 'CREATE'} ${parsed.name} (${parsed.key})`);
    if (existing) stats.churchesUpdated++; else stats.churchesCreated++;
    return;
  }

  try {
    let churchId: string;

    await prisma.$transaction(async (tx) => {
      const church = await tx.church.upsert({
        where: { horarioDemissaId: parsed.key },
        create: {
          horarioDemissaId: parsed.key,
          name: parsed.name,
          address: parsed.address,
          city: parsed.city,
          state: parsed.state,
          country: 'BR',
          phone: parsed.phone,
          source: 'horario-demissa',  // must set explicitly — schema default is "masstimes"
          latitude: 0,                // sentinel for "no coordinates"; geocode pass fills this in
          longitude: 0,
          lastScrapedAt: new Date(),
          scrapeStrategy: 'horario-demissa',
        },
        update: {
          name: parsed.name,
          address: parsed.address ?? undefined,
          city: parsed.city,
          state: parsed.state,
          phone: parsed.phone ?? undefined,
          lastScrapedAt: new Date(),
        },
      });
      churchId = church.id;

      await tx.massSchedule.deleteMany({ where: { churchId: church.id } });

      if (parsed.massSchedules.length > 0) {
        // Deduplicate by day+time before inserting
        const seen = new Set<string>();
        const deduped = parsed.massSchedules.filter((s) => {
          const k = `${s.dayOfWeek}:${s.time}`;
          return seen.has(k) ? false : (seen.add(k), true);
        });

        await tx.massSchedule.createMany({
          data: deduped.map((s) => ({
            churchId: church.id,
            dayOfWeek: s.dayOfWeek,
            time: s.time,
            notes: s.notes,
          })),
        });
        stats.massSchedulesCreated += deduped.length;
      }

      await tx.confessionSchedule.deleteMany({ where: { churchId: church.id } });
      if (parsed.confessionSchedules.length > 0) {
        await tx.confessionSchedule.createMany({
          data: parsed.confessionSchedules.map((c) => ({
            churchId: church.id,
            dayOfWeek: c.dayOfWeek,
            startTime: c.startTime,
            endTime: c.endTime,
            notes: c.notes,
          })),
        });
      }
    });

    if (existing) {
      stats.churchesUpdated++;
    } else {
      stats.churchesCreated++;
      // Use real DB UUID (churchId!) not the source key string
      existingChurches.push({
        id: churchId!, name: parsed.name, latitude: 0, longitude: 0,
        osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
        massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
        mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
        bohosluzbyId: null, miserendId: null, kerknetId: null,
        gottesdienstzeitenId: null, horarioDemissaId: parsed.key, misasOrgId: null,
        source: 'horario-demissa', website: null, phone: parsed.phone,
        address: parsed.address, country: 'BR',
      });
    }
  } catch (error) {
    console.error(`  Error upserting ${parsed.name}: ${error instanceof Error ? error.message : error}`);
    stats.errors++;
  }
}
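
The day+time dedup inside the transaction keeps the first occurrence of each slot. A loop-based sketch of the same rule makes the consequence easier to see; `Slot` and `dedupeSlots` are hypothetical names, and the sample rows are invented.

```typescript
interface Slot { dayOfWeek: number; time: string; notes: string | null; }

// Same rule as the filter in upsertChurch: key on day + time, first one wins.
function dedupeSlots(slots: Slot[]): Slot[] {
  const seen = new Set<string>();
  const unique: Slot[] = [];
  for (const slot of slots) {
    const key = `${slot.dayOfWeek}:${slot.time}`;
    if (seen.has(key)) continue; // later duplicate dropped, along with its notes
    seen.add(key);
    unique.push(slot);
  }
  return unique;
}

const sampleSlots: Slot[] = [
  { dayOfWeek: 0, time: '10:30', notes: null },
  { dayOfWeek: 0, time: '10:30', notes: 'Primeiro Domingo' }, // collides with the row above
  { dayOfWeek: 6, time: '10:30', notes: null },
];
const uniqueSlots = dedupeSlots(sampleSlots);
```

Because notes are not part of the key, a "Primeiro Domingo" row at the same time as a regular Sunday row is discarded. If that distinction matters, including `notes` in the key would be one option, at the cost of near-duplicate rows.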
  • Step 3: Add geocodeOnly pass

Note: latitude is non-nullable (Float in schema), so { latitude: null } will never match. Use { latitude: 0 } — that is the sentinel value set on creation for address-only churches.

async function runGeocodeOnly(stats: ImportStats): Promise<void> {
  console.log('\nGeocoding Brazilian churches without coordinates...');
  const churches = await prisma.church.findMany({
    where: { horarioDemissaId: { not: null }, latitude: 0, address: { not: null } },
    select: { id: true, name: true, address: true, city: true, state: true },
  });
  console.log(`Found ${churches.length} churches to geocode`);

  for (const church of churches) {
    const coords = await geocodeAddress(church.address!, church.city ?? '', church.state ?? '');
    if (coords) {
      await prisma.church.update({ where: { id: church.id }, data: { latitude: coords.lat, longitude: coords.lng } });
      stats.geocoded++;
      console.log(`  Geocoded: ${church.name} → ${coords.lat}, ${coords.lng}`);
    } else {
      stats.geocodeFailed++;
    }
  }
}
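
The sentinel convention can also be written as a small predicate, which is handy when filtering rows in memory. This is a sketch under one assumption that differs from the query above: it checks both latitude and longitude, whereas the `findMany` filter looks at latitude only. `GeoRow` and `needsGeocoding` are hypothetical names.

```typescript
interface GeoRow { latitude: number; longitude: number; address: string | null; }

// True when the row still carries the (0, 0) placeholder and has an address
// that could be sent to the geocoder.
function needsGeocoding(row: GeoRow): boolean {
  const atSentinel = row.latitude === 0 && row.longitude === 0;
  return atSentinel && row.address !== null && row.address.trim() !== '';
}
```

Checking both fields avoids re-geocoding a church whose real latitude happens to be exactly 0, which cannot occur inside Brazil's bounds for longitude but costs nothing to guard against.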
  • Step 4: Add CLI arg parser + main()
// ─── CLI + Main ───────────────────────────────────────────────────────────────

function parseArgs(): CLIArgs {
  const argv = process.argv.slice(2);
  const idx = (flag: string) => argv.indexOf(flag);
  return {
    all: argv.includes('--all'),
    state: idx('--state') >= 0 ? argv[idx('--state') + 1] : undefined,
    dryRun: argv.includes('--dry-run'),
    geocode: argv.includes('--geocode'),
    geocodeOnly: argv.includes('--geocode-only'),
    resumeFrom: idx('--resume-from') >= 0 ? parseInt(argv[idx('--resume-from') + 1], 10) : undefined,
    jobId: idx('--job-id') >= 0 ? argv[idx('--job-id') + 1] : undefined,
  };
}

async function main(): Promise<void> {
  const args = parseArgs();
  const stats: ImportStats = {
    citiesProcessed: 0, churchesFound: 0, churchesCreated: 0,
    churchesUpdated: 0, massSchedulesCreated: 0,
    geocoded: 0, geocodeFailed: 0, errors: 0,
  };

  console.log('\n' + '='.repeat(70));
  console.log('HORARIO DE MISSA (BRAZIL) IMPORTER');
  console.log('='.repeat(70));
  console.log(`Mode: ${args.geocodeOnly ? 'geocode-only' : args.dryRun ? 'dry-run' : 'import'}`);
  if (args.state) console.log(`State filter: ${args.state}`);
  if (args.resumeFrom) console.log(`Resume from: ${args.resumeFrom}`);
  console.log(`Time: ${new Date().toISOString()}\n`);

  try {
    if (args.geocodeOnly) {
      await runGeocodeOnly(stats);
    } else if (args.all || args.state) {
      console.log('Loading existing BR churches...');
      const existingChurches = await prisma.church.findMany({
        where: { country: 'BR' },
        select: {
          id: true, name: true, latitude: true, longitude: true,
          osmId: true, baiduId: true, masstimesId: true, orarimesseId: true,
          massSchedulesPhId: true, philmassId: true, horariosMisasId: true,
          mszeInfoId: true, weekdayMassesId: true, messesInfoId: true,
          bohosluzbyId: true, miserendId: true, kerknetId: true,
          gottesdienstzeitenId: true, horarioDemissaId: true, misasOrgId: true,
          source: true, website: true, phone: true, address: true, country: true,
        },
      }) as ExistingChurch[];
      console.log(`Loaded ${existingChurches.length} existing BR churches\n`);

      const cities = await fetchCityUrls(args.state);
      const startIndex = args.resumeFrom ?? 0;

      for (let i = startIndex; i < cities.length; i++) {
        const { state, city, url } = cities[i];
        console.log(`[${i + 1}/${cities.length}] ${state} / ${city}`);

        const html = await fetchPage(url);
        if (!html) { stats.errors++; continue; }

        const churches = parseCityPage(html, city, state);
        stats.churchesFound += churches.length;
        stats.citiesProcessed++;
        console.log(`  ${churches.length} churches`);

        for (const church of churches) {
          await upsertChurch(church, existingChurches, args, stats);
        }

        if (args.geocode && !args.dryRun) {
          for (const church of churches) {
            if (!church.address) continue;
            const dbChurch = await prisma.church.findUnique({
              where: { horarioDemissaId: church.key },
              select: { id: true, latitude: true },
            });
            // latitude === 0 is the sentinel for "no real coordinates yet"
            if (dbChurch && dbChurch.latitude === 0) {
              const coords = await geocodeAddress(church.address, church.city, church.state);
              if (coords) {
                await prisma.church.update({ where: { id: dbChurch.id }, data: { latitude: coords.lat, longitude: coords.lng } });
                stats.geocoded++;
              } else {
                stats.geocodeFailed++;
              }
            }
          }
        }
      }
    } else {
      console.error('Usage: --all | --state XX | --geocode-only');
      process.exit(1);
    }
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }

  console.log('\n' + '='.repeat(70));
  console.log('SUMMARY');
  console.log('='.repeat(70));
  console.log(`Cities processed:    ${stats.citiesProcessed}`);
  console.log(`Churches found:      ${stats.churchesFound}`);
  console.log(`  Created:           ${stats.churchesCreated}`);
  console.log(`  Updated:           ${stats.churchesUpdated}`);
  console.log(`  Errors:            ${stats.errors}`);
  console.log(`Mass schedules:      ${stats.massSchedulesCreated}`);
  if (args.geocode || args.geocodeOnly) {
    console.log(`Geocoded:            ${stats.geocoded} / Failed: ${stats.geocodeFailed}`);
  }
  console.log('='.repeat(70) + '\n');
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1; // signal failure with a non-zero exit code
});
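
`parseArgs` reads `process.argv` directly and accepts whatever follows a flag, so `--resume-from` with a missing or non-numeric value becomes `NaN`. A stricter variant might look like the sketch below. `parseFlags` and `ParsedFlags` are hypothetical names, and only three of the flags are modelled.

```typescript
interface ParsedFlags { state?: string; resumeFrom?: number; dryRun: boolean; }

// Hypothetical variant of parseArgs: takes argv as a parameter so it can be
// tested, and rejects a missing or non-numeric --resume-from value.
function parseFlags(argv: string[]): ParsedFlags {
  const valueOf = (flag: string): string | undefined => {
    const i = argv.indexOf(flag);
    if (i < 0) return undefined;
    const value = argv[i + 1];
    if (value === undefined || value.startsWith('--')) {
      throw new Error(`${flag} requires a value`);
    }
    return value;
  };

  const rawResume = valueOf('--resume-from');
  let resumeFrom: number | undefined;
  if (rawResume !== undefined) {
    if (!/^\d+$/.test(rawResume)) {
      throw new Error(`--resume-from must be a non-negative integer, got "${rawResume}"`);
    }
    resumeFrom = Number(rawResume);
  }

  return {
    state: valueOf('--state')?.toUpperCase(),
    resumeFrom,
    dryRun: argv.includes('--dry-run'),
  };
}
```

Calling `parseFlags(process.argv.slice(2))` at the top of `main()` would fail fast on a malformed resume index instead of starting the city loop at `NaN`.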
  • Step 5: Test dry-run on small state
npx tsx scripts/import-horariodemissa.ts --state DF --dry-run

Expected: Lists churches from Distrito Federal (Brasília) without DB writes.

  • Step 6: Test real import on the least populous state (Roraima)
npx tsx scripts/import-horariodemissa.ts --state RR

Then verify:

npx tsx -e "
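import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
// Dynamic import so dotenv.config() runs before db.ts is evaluated (static imports are hoisted)
const { prisma } = await import('./src/lib/db.ts');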
const count = await prisma.church.count({ where: { country: 'BR' } });
const sched = await prisma.massSchedule.count({ where: { church: { country: 'BR' } } });
console.log('BR churches:', count, '| Mass schedules:', sched);
await prisma.\$disconnect();
"

Expected: Some churches from Roraima with mass schedules in DB.

  • Step 7: Commit
git add scripts/import-horariodemissa.ts
git commit -m "feat: complete horariodemissa importer (Brazil, 8895 churches + 28523 mass times)"

Chunk 3: Spain Importer (import-misas.ts)

Depends on Chunk 1. Can run in parallel with Chunk 2.

Task 6: API pagination + boilerplate

Files:

  • Create: scripts/import-misas.ts

  • Step 1: Create script with boilerplate + API pagination

Create scripts/import-misas.ts:

#!/usr/bin/env tsx
/**
 * Import Catholic churches from misas.org (Spain)
 *
 * misas.org lists 17,919 Spanish parishes with name, address, coordinates,
 * and province via a public JSON REST API. Mass schedules are auth-gated
 * (401 on detail endpoint), so this importer creates/updates church records
 * only — no schedule data.
 *
 * The listing API accepts offset-based pagination. We use Madrid as the center
 * with a large radius (999999m) to cover all of Spain in a single stream.
 *
 * Import strategy:
 *   1. Paginate GET /api/parishsearch?country=es&pos=[...]&offset=N&limit=500
 *   2. For each parish: id, name, addr, loc (city), prov (province), zip, lat, long
 *   3. Match against existing ES churches by misasOrgId or proximity+name
 *   4. Upsert church record (no mass schedules)
 *
 * Usage:
 *   npx tsx scripts/import-misas.ts --all
 *   npx tsx scripts/import-misas.ts --all --dry-run
 *   npx tsx scripts/import-misas.ts --all --resume-from 5000
 *   npx tsx scripts/import-misas.ts --all --job-id {uuid}
 */

import dotenv from 'dotenv';
import path from 'path';

dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
  connectionString: dbUrl,
  ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';

// ─── Constants ───────────────────────────────────────────────────────────────

const API_BASE = 'https://misas.org/api/parishsearch';
// Madrid coordinates, large radius covers all of Spain
const SPAIN_POS = encodeURIComponent('[-3.7038,40.4168,999999]');
const PAGE_SIZE = 500;
const REQUEST_DELAY_MS = 500;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';

// ─── Types ───────────────────────────────────────────────────────────────────

interface MisasParish {
  id: number;
  name: string;
  uri: string;
  addr: string;
  loc: string;   // city
  prov: string;  // province
  zip: string;
  lat: string;
  long: string;
}

interface MisasApiResponse {
  count: number;
  pars: MisasParish[];
}

interface CLIArgs {
  all: boolean;
  dryRun: boolean;
  resumeFrom?: number;
  jobId?: string;
}

interface ImportStats {
  total: number;
  created: number;
  updated: number;
  errors: number;
}

// ─── HTTP Client ──────────────────────────────────────────────────────────────

let requestCount = 0;

function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function fetchParishes(offset: number): Promise<MisasApiResponse | null> {
  if (requestCount > 0) await delay(REQUEST_DELAY_MS);
  requestCount++;

  const url = `${API_BASE}?country=es&pos=${SPAIN_POS}&offset=${offset}&limit=${PAGE_SIZE}`;

  try {
    const response = await fetch(url, {
      headers: {
        'User-Agent': USER_AGENT,
        'Accept': 'application/json',
        'Referer': 'https://misas.org/',
      },
    });

    if (!response.ok) {
      console.error(`  HTTP ${response.status} at offset ${offset}`);
      return null;
    }

    return await response.json() as MisasApiResponse;
  } catch (error) {
    console.error(`  Fetch error at offset ${offset}: ${error instanceof Error ? error.message : error}`);
    return null;
  }
}

// ─── Pagination ───────────────────────────────────────────────────────────────

export async function* paginateParishes(startOffset: number = 0): AsyncGenerator<MisasParish> {
  let offset = startOffset;
  let totalKnown = Infinity;

  while (offset < totalKnown) {
    console.log(`  Fetching offset ${offset}${totalKnown < Infinity ? `/${totalKnown}` : ''}...`);
    const data = await fetchParishes(offset);

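    if (!data) {
      // Fetch failed: stop, but report the offset so the run can be resumed rather than look complete
      console.error(`  Stopping early at offset ${offset}; rerun with --resume-from ${offset}`);
      break;
    }
    if (!data.pars || data.pars.length === 0) break;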

    totalKnown = data.count;
    for (const parish of data.pars) {
      yield parish;
    }

    offset += data.pars.length;
  }
}
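
The generator above ends the stream as soon as one page fails to load, so a single transient HTTP error cuts the import short. A minimal sketch of the same offset loop with bounded retries follows. `Page`, `FetchPage`, and `paginateWithRetry` are hypothetical names that are not part of the planned script, and the retry count and backoff values are assumptions, not tested against misas.org.

```typescript
// Sketch only: offset pagination that retries a failed page before giving up.
// The fetcher is injected so the loop can be exercised without network access.
interface Page<T> {
  count: number; // total number of records reported by the API
  pars: T[];     // records in this page
}
type FetchPage<T> = (offset: number) => Promise<Page<T> | null>;

async function* paginateWithRetry<T>(
  fetchPage: FetchPage<T>,
  startOffset = 0,
  maxRetries = 3,     // assumed value
  baseDelayMs = 1000, // assumed value; doubles on each retry
): AsyncGenerator<T> {
  let offset = startOffset;
  let totalKnown = Infinity;

  while (offset < totalKnown) {
    let page: Page<T> | null = null;
    for (let attempt = 0; attempt <= maxRetries && page === null; attempt++) {
      if (attempt > 0) {
        const waitMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
      }
      page = await fetchPage(offset);
    }
    if (page === null) {
      // Fail loudly instead of ending the stream as if the import were complete
      throw new Error(`Giving up at offset ${offset}; rerun with --resume-from ${offset}`);
    }
    if (page.pars.length === 0) return;

    totalKnown = page.count;
    yield* page.pars;
    offset += page.pars.length;
  }
}
```

Throwing on exhaustion is a design choice: the caller's `finally` block still runs, and the error message carries the offset to pass to `--resume-from`.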
  • Step 2: Verify API returns expected data
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
const { paginateParishes } = await import('./scripts/import-misas.ts');
let count = 0;
for await (const p of paginateParishes()) {
  if (count === 0) console.log('First parish:', JSON.stringify(p, null, 2));
  count++;
  if (count >= 5) break;
}
console.log('Fetched:', count, 'from first batch');
"

Expected: Parish objects with id, name, lat, long, addr, loc, prov fields.
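
Because `response.json()` is cast to `MisasApiResponse` without validation, the shape check in this step can also be expressed as a runtime type guard. The following is a sketch under the assumption that the API returns the field types declared in the `MisasParish` interface of this plan (numeric `id`, string coordinates). `ParishShape` and `isParishShape` are hypothetical names.

```typescript
// Sketch only: runtime check that an unknown JSON value has the fields this
// plan expects. The field types mirror the MisasParish interface in the plan;
// whether the live API always matches them is an assumption.
interface ParishShape {
  id: number;
  name: string;
  addr: string;
  loc: string;
  prov: string;
  lat: string;
  long: string;
}

function isParishShape(value: unknown): value is ParishShape {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = ['name', 'addr', 'loc', 'prov', 'lat', 'long'];
  return typeof v.id === 'number' && stringFields.every((f) => typeof v[f] === 'string');
}
```

A guard like this could be applied to the first page only, as a cheap early warning if the API format changes.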

  • Step 3: Commit
git add scripts/import-misas.ts
git commit -m "feat: misas.org importer scaffold + API pagination"

Task 7: DB upsert + main()

Files:

  • Modify: scripts/import-misas.ts

  • Step 1: Add upsertParish + main()

Note: latitude/longitude are Float (non-nullable) — use 0 as sentinel when coordinates are missing. Set source explicitly to 'misas-org' — the schema default is "masstimes".
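
The sentinel rule in this note can be isolated as a small pure helper. The following is a minimal sketch, not part of the planned script: `resolveCoords` and `ResolvedCoords` are hypothetical names, and the range check on latitude and longitude is an added assumption that the plan does not require.

```typescript
// Sketch only: parse string coordinates and fall back to the (0, 0) sentinel
// when either value is missing, non-numeric, or out of range.
interface ResolvedCoords {
  latitude: number;
  longitude: number;
  hasCoords: boolean; // false means the (0, 0) pair is a sentinel, not a real location
}

function resolveCoords(latRaw: string, lngRaw: string): ResolvedCoords {
  const lat = parseFloat(latRaw);
  const lng = parseFloat(lngRaw);
  const valid =
    Number.isFinite(lat) && Number.isFinite(lng) &&
    Math.abs(lat) <= 90 && Math.abs(lng) <= 180;
  return valid
    ? { latitude: lat, longitude: lng, hasCoords: true }
    : { latitude: 0, longitude: 0, hasCoords: false };
}
```

Carrying an explicit `hasCoords` flag avoids inferring "no coordinates" from a latitude of exactly 0, and treats both values as a pair so that one bad field never produces a half-valid location.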

// ─── DB Upsert ────────────────────────────────────────────────────────────────

async function upsertParish(
  parish: MisasParish,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats
): Promise<void> {
  const lat = parseFloat(parish.lat);
  const lng = parseFloat(parish.long);
  const misasOrgId = String(parish.id);
  const resolvedLat = isNaN(lat) ? 0 : lat;
  const resolvedLng = isNaN(lng) ? 0 : lng;

  const candidate = {
    name: parish.name,
    lat: resolvedLat,
    lng: resolvedLng,
    misasOrgId,
  };

  const existing = findDuplicateChurch(candidate, existingChurches);

  if (args.dryRun) {
    console.log(`  [dry-run] ${existing ? 'UPDATE' : 'CREATE'} ${parish.name} (${misasOrgId})`);
    stats.total++;
    if (existing) stats.updated++; else stats.created++;
    return;
  }

  try {
    const church = await prisma.church.upsert({
      where: { misasOrgId },
      create: {
        misasOrgId,
        name: parish.name,
        address: parish.addr || null,
        city: parish.loc || null,
        state: parish.prov || null,
        zip: parish.zip || null,
        country: 'ES',
        source: 'misas-org',   // must set explicitly — schema default is "masstimes"
        latitude: resolvedLat, // 0 = no real coordinates; misas.org provides coords for most
        longitude: resolvedLng,
        lastScrapedAt: new Date(),
        scrapeStrategy: 'misas-org',
      },
      update: {
        name: parish.name,
        address: parish.addr || undefined,
        city: parish.loc || undefined,
        state: parish.prov || undefined,
        zip: parish.zip || undefined,
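        // Only update coords when both values parsed (don't overwrite good data with the 0 sentinel)
        ...(!isNaN(lat) && !isNaN(lng) && { latitude: resolvedLat, longitude: resolvedLng }),
        // Caveat: this update branch only runs when a row with this misasOrgId already exists.
        // A church matched by proximity alone has no misasOrgId yet, so the upsert takes the
        // create branch for it and the ID is not stamped onto the matched row.
        misasOrgId,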
        lastScrapedAt: new Date(),
      },
    });

    if (existing) {
      stats.updated++;
    } else {
      stats.created++;
      existingChurches.push({
        id: church.id, name: parish.name,
        latitude: resolvedLat, longitude: resolvedLng,
        osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
        massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
        mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
        bohosluzbyId: null, miserendId: null, kerknetId: null,
        gottesdienstzeitenId: null, horarioDemissaId: null, misasOrgId,
        source: 'misas-org', website: null, phone: null,
        address: parish.addr || null, country: 'ES',
      });
    }
    stats.total++;
  } catch (error) {
    console.error(`  Error upserting ${parish.name}: ${error instanceof Error ? error.message : error}`);
    stats.errors++;
    stats.total++;  // count errors in total so progress log fires correctly
  }
}

// ─── CLI + Main ───────────────────────────────────────────────────────────────

// Note: --job-id is accepted for scheduler compatibility but BackgroundJob status
// tracking is not wired up in this importer (acceptable for v1 — add later if needed).
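function parseArgs(): CLIArgs {
  const argv = process.argv.slice(2);
  const idx = (flag: string) => argv.indexOf(flag);
  // A missing or non-numeric --resume-from would parse to NaN, and paginateParishes(NaN) yields nothing
  let resumeFrom: number | undefined;
  if (idx('--resume-from') >= 0) {
    resumeFrom = parseInt(argv[idx('--resume-from') + 1] ?? '', 10);
    if (isNaN(resumeFrom) || resumeFrom < 0) {
      console.error('--resume-from requires a non-negative integer offset');
      process.exit(1);
    }
  }
  return {
    all: argv.includes('--all'),
    dryRun: argv.includes('--dry-run'),
    resumeFrom,
    jobId: idx('--job-id') >= 0 ? argv[idx('--job-id') + 1] : undefined,
  };
}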

async function main(): Promise<void> {
  const args = parseArgs();
  const stats: ImportStats = { total: 0, created: 0, updated: 0, errors: 0 };

  console.log('\n' + '='.repeat(70));
  console.log('MISAS.ORG (SPAIN) IMPORTER');
  console.log('='.repeat(70));
  console.log(`Mode: ${args.dryRun ? 'dry-run' : 'import'}`);
  if (args.resumeFrom) console.log(`Resume from offset: ${args.resumeFrom}`);
  console.log(`Time: ${new Date().toISOString()}\n`);

  if (!args.all) {
    console.error('Usage: --all [--dry-run] [--resume-from N]');
    process.exit(1);
  }

  try {
    console.log('Loading existing ES churches...');
    const existingChurches = await prisma.church.findMany({
      where: { country: 'ES' },
      select: {
        id: true, name: true, latitude: true, longitude: true,
        osmId: true, baiduId: true, masstimesId: true, orarimesseId: true,
        massSchedulesPhId: true, philmassId: true, horariosMisasId: true,
        mszeInfoId: true, weekdayMassesId: true, messesInfoId: true,
        bohosluzbyId: true, miserendId: true, kerknetId: true,
        gottesdienstzeitenId: true, horarioDemissaId: true, misasOrgId: true,
        source: true, website: true, phone: true, address: true, country: true,
      },
    }) as ExistingChurch[];
    console.log(`Loaded ${existingChurches.length} existing ES churches\n`);

    for await (const parish of paginateParishes(args.resumeFrom ?? 0)) {
      await upsertParish(parish, existingChurches, args, stats);

      if (stats.total % 500 === 0) {
        console.log(`  Progress: ${stats.total} processed (${stats.created} created, ${stats.updated} updated)`);
      }
    }
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }

  console.log('\n' + '='.repeat(70));
  console.log('SUMMARY');
  console.log('='.repeat(70));
  console.log(`Total processed: ${stats.total}`);
  console.log(`  Created:       ${stats.created}`);
  console.log(`  Updated:       ${stats.updated}`);
  console.log(`  Errors:        ${stats.errors}`);
  console.log('='.repeat(70) + '\n');
}

main().catch((error) => {
  console.error(error);
  process.exit(1);  // exit non-zero so callers can tell the run failed
});
  • Step 2: Test dry-run end-to-end
npx tsx scripts/import-misas.ts --all --dry-run 2>&1 | tail -20

Expected: Processes all 17,919 parishes, shows Total processed: 17919 with created/updated split.

  • Step 3: Commit
git add scripts/import-misas.ts
git commit -m "feat: complete misas.org importer (Spain, 17919 churches with coordinates)"

Chunk 4: Integration

Task 8: package.json + scheduler

Files:

  • Modify: package.json

  • Modify: scripts/scheduler.ts

  • Step 1: Add npm scripts

In package.json "scripts" block, add after "import:masstimes-api":

"import:horariodemissa": "tsx scripts/import-horariodemissa.ts",
"import:misas": "tsx scripts/import-misas.ts"
  • Step 2: Add getJobCommand cases in scheduler.ts

In scripts/scheduler.ts, add before default: in getJobCommand():

case 'horariodemissa-import': {
  const args = ['tsx', 'scripts/import-horariodemissa.ts', '--all'];
  if (config?.state) args.push('--state', String(config.state));
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  if (config?.geocode) args.push('--geocode');
  return { command: 'npx', args };
}
case 'misas-import': {
  const args = ['tsx', 'scripts/import-misas.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
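
Both cases above map a config object to CLI flags in the same way. As a hedged sketch, that mapping can be written as a pure function that is easy to unit-test. `ImporterConfig`, `JobCommand`, and `buildImporterCommand` are hypothetical names; the real `getJobCommand()` signature and config type in `scripts/scheduler.ts` are not shown in this plan and may differ.

```typescript
// Sketch only: build the npx/tsx command for an importer from its config.
interface ImporterConfig {
  state?: string;
  resumeFrom?: number;
  geocode?: boolean;
}
interface JobCommand {
  command: string;
  args: string[];
}

function buildImporterCommand(script: string, config: ImporterConfig = {}): JobCommand {
  const args = ['tsx', script, '--all'];
  if (config.state) args.push('--state', String(config.state));
  // Skip offsets of 0 or less: they would be equivalent to a fresh run
  if (config.resumeFrom !== undefined && config.resumeFrom > 0) {
    args.push('--resume-from', String(config.resumeFrom));
  }
  if (config.geocode) args.push('--geocode');
  return { command: 'npx', args };
}
```

For example, `buildImporterCommand('scripts/import-misas.ts', { resumeFrom: 5000 })` produces the arguments `tsx scripts/import-misas.ts --all --resume-from 5000`.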
  • Step 3: Add to PIPELINE_GROUPS imports sequence

In PIPELINE_GROUPS[0].phases, add after the masstimes-api-import entry:

{ name: 'horariodemissa-import', type: 'horariodemissa-import', config: {} },
{ name: 'misas-import', type: 'misas-import', config: {} },
  • Step 4: Verify TypeScript
npx tsc --noEmit

Expected: no errors.

  • Step 5: Smoke test both npm scripts
npm run import:horariodemissa -- --state DF --dry-run 2>&1 | tail -10
npm run import:misas -- --all --dry-run 2>&1 | tail -10
  • Step 6: Commit
git add package.json scripts/scheduler.ts
git commit -m "feat: add horariodemissa and misas.org to npm scripts and scheduler pipeline"

Final Verification

  • Import small state from Brazil to confirm end-to-end
npx tsx scripts/import-horariodemissa.ts --state DF
npx tsx -e "
import dotenv from 'dotenv'; dotenv.config({ path: '.env' });
import { prisma } from './src/lib/db.ts';
const churches = await prisma.church.count({ where: { country: 'BR' } });
const schedules = await prisma.massSchedule.count({ where: { church: { country: 'BR' } } });
console.log('BR churches:', churches, '| Mass schedules:', schedules);
await prisma.\$disconnect();
"

Expected: Distrito Federal churches in DB with mass schedules.

  • Dry-run Spain importer full pass
npx tsx scripts/import-misas.ts --all --dry-run 2>&1 | grep -E "SUMMARY|Total|Created|Updated" | tail -10

Expected: ~17,919 total, mix of created vs updated depending on existing ES church overlap.