ScraperControl/docs/superpowers/plans/2026-03-10-discovermass-importer.md
albertfj114 6e9ada7fdf fix: harden discovermass plan against coord validation and regex slowdown
- Validate lat/lng from daddr= (bounds check + isFinite) before storing
- Cap HTML to 100KB before regex matching to prevent backtracking on large pages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 22:34:51 -04:00


DiscoverMass.com Importer Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Import 20,284 US Catholic churches with mass/confession/adoration schedules from discovermass.com into the NearestMass database.

Architecture: Enumerate 11 WordPress sitemaps → fetch each church page at 10s intervals (respecting Crawl-delay) → parse server-rendered HTML for name/address/coordinates/schedules → match against existing US churches via church-matcher → upsert with full schedule data.

Tech Stack: TypeScript/tsx, Prisma 7 + PrismaPg adapter, pg Pool, Node.js fetch, regex HTML parsing (no DOM library needed — HTML is server-rendered and predictable).
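
For planning purposes, the crawl delay dominates total runtime. The following back-of-the-envelope estimate assumes exactly one request per church page and ignores fetch and database time (both are assumptions, so a real run will take somewhat longer):

```typescript
// Rough runtime estimate for the full crawl.
// Assumes one request per church page; network and DB time are ignored.
const CHURCH_COUNT = 20_284;     // from the Goal above
const CRAWL_DELAY_SECONDS = 10;  // Crawl-delay from robots.txt

function estimateCrawlHours(pages: number, delaySeconds: number): number {
  return (pages * delaySeconds) / 3600;
}

const hours = estimateCrawlHours(CHURCH_COUNT, CRAWL_DELAY_SECONDS);
console.log(`${hours.toFixed(1)} hours (~${(hours / 24).toFixed(1)} days)`);
// prints "56.3 hours (~2.3 days)"
```

At more than two days of wall-clock time, an interrupted run is likely, which is why the --resume-from flag matters.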


Chunk 1: Schema + church-matcher

Task 1: Add discovermassId to schema

Files:

  • Modify: prisma/schema.prisma

The schema lives in this repo but migrations run in BethelGuide. After editing schema.prisma here, run npx prisma generate to regenerate the Prisma client. Do NOT run prisma migrate.

  • Step 1: Find the right place in schema.prisma

Open prisma/schema.prisma. Find the block of source ID fields — they look like:

gottesdienstzeitenId  String?   @unique @map("gottesdienstzeiten_id")

This is inside the model Church { ... } block, after kerknetId and before claimed.

  • Step 2: Add discovermassId field

After gottesdienstzeitenId:

discovermassId        String?   @unique @map("discovermass_id")

Also find the @@index block near the bottom of the Church model (it groups all the index definitions). Add:

@@index([discovermassId])
  • Step 3: Regenerate Prisma client
cd /home/albert/Documents/ScraperControl
npx prisma generate

Expected output: ✔ Generated Prisma Client (no errors). This does NOT touch the database — it only updates the TypeScript client.

  • Step 4: Apply migration to database

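The schema source of truth is BethelGuide, so migrations normally run there and are synced back. Because both repos share the same dev database server, the column can instead be applied directly with psql. First check whether it already exists: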

# Check if discovermass_id column already exists (it shouldn't yet)
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass

If the column doesn't exist, apply it directly:

psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "
ALTER TABLE churches ADD COLUMN IF NOT EXISTS discovermass_id VARCHAR UNIQUE;
CREATE INDEX IF NOT EXISTS churches_discovermass_id_idx ON churches(discovermass_id);
"

Expected output: ALTER TABLE and CREATE INDEX

  • Step 5: Verify column exists
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass

Expected output: discovermass_id | character varying | ...

  • Step 6: Commit
cd /home/albert/Documents/ScraperControl
git add prisma/schema.prisma
git commit -m "feat: add discovermassId field to Church schema"

Task 2: Update church-matcher

Files:

  • Modify: src/lib/church-matcher.ts

The ExistingChurch interface (line ~11) lists all source IDs. The ChurchCandidate type (line ~122) lists optional source IDs for the candidate. The findDuplicateChurch function has sequential passes checking each ID before falling back to proximity+name.

  • Step 1: Add discovermassId to ExistingChurch interface

Find the export interface ExistingChurch { block. After the gottesdienstzeitenId line, add:

discovermassId: string | null;
  • Step 2: Add discovermassId to ChurchCandidate type

Find export type ChurchCandidate = {. After gottesdienstzeitenId?: string;, add:

discovermassId?: string;
  • Step 3: Add discovermassId matching pass in findDuplicateChurch

Find the findDuplicateChurch function. It has a series of passes like:

if (candidate.gottesdienstzeitenId) {
  const match = existingChurches.find(c => c.gottesdienstzeitenId === candidate.gottesdienstzeitenId);
  if (match) return match;
}
// Proximity + name similarity

Add a new pass BEFORE the proximity+name pass (after gottesdienstzeitenId):

if (candidate.discovermassId) {
  const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
  if (match) return match;
}
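
To illustrate why the ID pass sits before the proximity fallback, here is a minimal, self-contained sketch. The `MiniChurch` type, `findById`, and the sample records are hypothetical stand-ins, not the real `ExistingChurch` or `findDuplicateChurch`:

```typescript
// Hypothetical, trimmed-down stand-in for ExistingChurch (illustration only).
interface MiniChurch {
  id: string;
  name: string;
  discovermassId: string | null;
}

// Exact source-ID pass: returns a match only when the candidate carries an ID
// and some existing record holds the same one.
function findById(
  candidate: { discovermassId?: string },
  existing: MiniChurch[],
): MiniChurch | undefined {
  if (!candidate.discovermassId) return undefined;
  return existing.find(c => c.discovermassId === candidate.discovermassId);
}

const existing: MiniChurch[] = [
  { id: 'a1', name: 'St. Paul the Apostle', discovermassId: 'st-paul-the-apostle-chino-hills-ca' },
  { id: 'b2', name: 'St. Mary', discovermassId: null },
];

// Re-import of an already linked church: the slug match short-circuits.
const rerun = findById({ discovermassId: 'st-paul-the-apostle-chino-hills-ca' }, existing);
// First-time import: no ID match, so the real function would fall through to proximity + name.
const firstTime = findById({ discovermassId: 'some-new-slug' }, existing);
console.log(rerun?.id, firstTime);
```

On a re-run, every church imported earlier resolves through this exact pass, so name or coordinate changes on the source site should not create duplicates.
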
  • Step 4: Update all callers that construct ExistingChurch objects

Search for places that build ExistingChurch objects (the in-memory push after creating a new church). Each importer has a block like:

existingChurches.push({
  id: newChurch.id,
  ...
  gottesdienstzeitenId: null,
  ...
});

Run:

grep -rn "gottesdienstzeitenId: null" scripts/

For each file found: add discovermassId: null, after gottesdienstzeitenId: null,. These are the in-memory dedup arrays — they need the new field or TypeScript will complain.

Also update the loadExistingChurches select queries if any importer has one (check with grep -rn "gottesdienstzeitenId: true" scripts/).

  • Step 5: Verify TypeScript compiles
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit

Expected: no errors. Fix any type errors (they'll be missing discovermassId fields).

  • Step 6: Commit
# Stage church-matcher AND all importer scripts that were updated in Step 4
git add src/lib/church-matcher.ts
git add scripts/
git commit -m "feat: add discovermassId to church-matcher ExistingChurch and ChurchCandidate"

Chunk 2: import-discovermass.ts — utilities and parsing

Task 3: Create file skeleton + utilities

Files:

  • Create: scripts/import-discovermass.ts

  • Step 1: Create the file with header, imports, constants, types

Create scripts/import-discovermass.ts with this content:

#!/usr/bin/env tsx
/**
 * Import Catholic churches and mass schedules from discovermass.com (USA)
 *
 * discovermass.com is a US Catholic church directory with 20,284 churches.
 * Data includes name, address, phone, website, coordinates, mass times,
 * confessions, and adoration schedules.
 *
 * robots.txt specifies Crawl-delay: 10 — this importer follows that rule.
 *
 * Usage:
 *   npx tsx scripts/import-discovermass.ts --all
 *   npx tsx scripts/import-discovermass.ts --all --dry-run
 *   npx tsx scripts/import-discovermass.ts --all --resume-from 5000
 *   npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
 */

import dotenv from 'dotenv';
import path from 'path';

dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
  connectionString: dbUrl,
  ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';

// ─── Constants ───────────────────────────────────────────────────────────────

const SITE_BASE = 'https://discovermass.com';
const SITEMAP_COUNT = 11;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 10_000; // Crawl-delay: 10 from robots.txt

// ─── Types ───────────────────────────────────────────────────────────────────

interface ParsedChurch {
  name: string;
  address: string | null;
  city: string | null;
  state: string | null;
  zip: string | null;
  phone: string | null;
  website: string | null;
  lat: number;
  lng: number;
}

interface ParsedMass {
  dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
  time: string;      // HH:MM 24-hour
  language: string;
  notes?: string;
}

interface ParsedConf {
  dayOfWeek: number;
  startTime: string; // HH:MM 24-hour
  endTime: string;   // HH:MM 24-hour
  notes?: string;
}

interface ParsedAdoration {
  dayOfWeek: number;
  startTime: string; // HH:MM 24-hour
  endTime: string;   // HH:MM 24-hour
  notes?: string;
}

interface ImportStats {
  total: number;
  created: number;
  updated: number;
  skipped: number;
  errors: number;
  massSchedulesCreated: number;
  confessionSchedulesCreated: number;
  adorationSchedulesCreated: number;
}

interface CLIArgs {
  all: boolean;
  dryRun: boolean;
  resumeFrom?: number;
  jobId?: string;
}
  • Step 2: Add day mappings and time utilities

Append to the file:

// ─── Day Mappings ─────────────────────────────────────────────────────────────

// Full day names used in mass schedule <li> labels
const FULL_DAY_NAMES: Record<string, number> = {
  Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3,
  Thursday: 4, Friday: 5, Saturday: 6,
};

// Abbreviated day prefixes used in confession/adoration serviceTime text
const ABBREV_DAY_NAMES: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3],
  Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

// ─── Time Utilities ───────────────────────────────────────────────────────────

/**
 * Convert "5:00pm", "11:00am", "12:00pm", "12:00am" to "HH:MM" 24-hour format.
 * Input that doesn't match that format (e.g. "5pm", "Noon") is returned
 * trimmed and lowercased, not converted.
 */
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let hours = parseInt(m[1], 10);
  const mins = m[2];
  const meridiem = m[3];
  if (meridiem === 'pm' && hours !== 12) hours += 12;
  if (meridiem === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${mins}`;
}

/**
 * Parse "8:30am-9:00am" → ["08:30", "09:00"].
 * Both sides must carry an explicit am/pm suffix (e.g. "9:00am-6:00pm");
 * a side without one is passed through by convertTo24h unconverted.
 * A string with no range hyphen is treated as a single time: [t, t].
 */
function parseTimeRange(rangeStr: string): [string, string] {
  // Look for the first '-' after the first ':' so the search starts inside the start time.
  // Pattern: "8:30am-9:00am" or "3:30pm-4:30pm"
  const hyphenIdx = rangeStr.indexOf('-', rangeStr.indexOf(':') + 1);
  if (hyphenIdx === -1) {
    const t = convertTo24h(rangeStr.trim());
    return [t, t];
  }
  const start = convertTo24h(rangeStr.slice(0, hyphenIdx).trim());
  const end = convertTo24h(rangeStr.slice(hyphenIdx + 1).trim());
  return [start, end];
}

/**
 * Expand abbreviated day prefix to array of dayOfWeek integers.
 * Returns empty array if prefix is not recognized.
 */
function expandDayAbbrev(prefix: string): number[] {
  return ABBREV_DAY_NAMES[prefix] ?? [];
}

// ─── Address Parsing ──────────────────────────────────────────────────────────

/**
 * Parse "14085 Peyton Drive, Chino Hills, CA 91709" into components.
 * Returns partial result on malformed input.
 */
function parseAddress(raw: string): { address: string | null; city: string | null; state: string | null; zip: string | null } {
  const parts = raw.split(', ');
  if (parts.length < 3) return { address: raw, city: null, state: null, zip: null };
  const last = parts[parts.length - 1].trim();
  const stateZipMatch = last.match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/);
  if (!stateZipMatch) return { address: raw, city: null, state: null, zip: null };
  return {
    address: parts.slice(0, parts.length - 2).join(', ').trim(),
    city: parts[parts.length - 2].trim(),
    state: stateZipMatch[1],
    zip: stateZipMatch[2],
  };
}
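
A few worked inputs help confirm the time helpers before wiring them into the parsers. The sketch below repeats `convertTo24h` and `parseTimeRange` from Step 2 (slightly condensed) so it runs on its own; the `'Noon'` input is an assumption about what the site might contain, not something observed:

```typescript
// Condensed copies of the Step 2 helpers so this snippet is self-contained.
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let hours = parseInt(m[1], 10);
  if (m[3] === 'pm' && hours !== 12) hours += 12;
  if (m[3] === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${m[2]}`;
}

function parseTimeRange(rangeStr: string): [string, string] {
  const hyphenIdx = rangeStr.indexOf('-', rangeStr.indexOf(':') + 1);
  if (hyphenIdx === -1) {
    const t = convertTo24h(rangeStr);
    return [t, t];
  }
  return [
    convertTo24h(rangeStr.slice(0, hyphenIdx)),
    convertTo24h(rangeStr.slice(hyphenIdx + 1)),
  ];
}

console.log(convertTo24h('5:00pm'));          // "17:00"
console.log(convertTo24h('12:00pm'));         // "12:00" (noon)
console.log(convertTo24h('12:00am'));         // "00:00" (midnight)
console.log(parseTimeRange('9:00am-6:00pm')); // [ '09:00', '18:00' ]
// Unrecognised formats fall through lowercased, not converted:
console.log(convertTo24h('Noon'));            // "noon"
```

The last case matters downstream: a value such as "noon" would end up in a field documented as HH:MM, so it may be worth logging or skipping times that fail to match.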
  • Step 3: Verify utilities compile
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit

Expected: no errors related to import-discovermass.ts. Other files may have pre-existing errors — focus only on this file's errors.


Task 4: Add HTML parsing functions

Files:

  • Modify: scripts/import-discovermass.ts

The HTML is server-rendered. The page structure has:

  • <meta property="og:title"> for church name

  • US address embedded as text in a known pattern

  • <div id="sidebar-info"> for phone/website/coordinates

  • Two <ul> blocks: one containing <h5>Mass Times</h5>, another containing <h5>Other Services</h5>

  • Step 1: Add parseChurch function

Append to the file:

// ─── HTML Parsing ─────────────────────────────────────────────────────────────

/**
 * Parse church metadata from page HTML.
 * Returns null if the page doesn't look like a valid church listing.
 */
function parseChurch(html: string): ParsedChurch | null {
  // Name from OpenGraph meta tag
  const nameMatch = html.match(/<meta property="og:title" content="([^"]+)"/);
  if (!nameMatch) return null;
  const name = nameMatch[1].trim();
  if (!name || name === 'Discover Mass') return null;

  // Address: match US address pattern (number + street, city, STATE ZIP)
  let address: string | null = null;
  let city: string | null = null;
  let state: string | null = null;
  let zip: string | null = null;
  const addrMatch = html.match(/(\d+[^<\n,]+),\s*([^<,\n]+),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)/);
  if (addrMatch) {
    const raw = `${addrMatch[1].trim()}, ${addrMatch[2].trim()}, ${addrMatch[3]} ${addrMatch[4]}`;
    const parsed = parseAddress(raw);
    address = parsed.address;
    city = parsed.city;
    state = parsed.state;
    zip = parsed.zip;
  }

  // Phone from sidebar
  const phoneMatch = html.match(/<span class='side-phone attribute'>([^<]+)<\/span>/);
  const phone = phoneMatch ? phoneMatch[1].trim() : null;

  // Website from sidebar
  const websiteMatch = html.match(/<span class='side-website attribute'><a href='([^']+)'/);
  const website = websiteMatch ? websiteMatch[1].trim() : null;

  // Coordinates from Google Maps daddr parameter
  let lat = 0;
  let lng = 0;
  const coordMatch = html.match(/daddr=([-\d.]+),([-\d.]+)/);
  if (coordMatch) {
    const rawLat = parseFloat(coordMatch[1]);
    const rawLng = parseFloat(coordMatch[2]);
    // Validate: reject NaN, Infinity, and out-of-range values; fall back to 0 sentinel
    if (isFinite(rawLat) && isFinite(rawLng) && Math.abs(rawLat) <= 90 && Math.abs(rawLng) <= 180) {
      lat = rawLat;
      lng = rawLng;
    }
  }

  return { name, address, city, state, zip, phone, website, lat, lng };
}
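
The coordinate check is the part of `parseChurch` most worth testing in isolation. A minimal sketch follows, using a hypothetical `parseDaddrCoords` helper that returns `null` instead of the `0` sentinel (an illustrative choice, not what the importer above does):

```typescript
// Hypothetical helper: extracts and validates the daddr=lat,lng pair.
// Returns null when the pair is missing, non-numeric, or out of range.
function parseDaddrCoords(html: string): { lat: number; lng: number } | null {
  const m = html.match(/daddr=([-\d.]+),([-\d.]+)/);
  if (!m) return null;
  const lat = parseFloat(m[1]);
  const lng = parseFloat(m[2]);
  const valid =
    Number.isFinite(lat) && Number.isFinite(lng) &&
    Math.abs(lat) <= 90 && Math.abs(lng) <= 180;
  return valid ? { lat, lng } : null;
}

// Sample fragments made up for illustration; only the daddr= shape comes from the plan.
console.log(parseDaddrCoords("<a href='https://maps.google.com/maps?daddr=33.996887,-117.732407'>"));
console.log(parseDaddrCoords('daddr=133.5,-117.7')); // null: latitude out of range
console.log(parseDaddrCoords('daddr=-,-'));          // null: parseFloat gives NaN
console.log(parseDaddrCoords('no link here'));       // null: no daddr parameter
```

Returning `null` makes a missing location distinguishable from a real point; the importer as written uses `0` as the sentinel, which works here because US coordinates are far from (0, 0).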
  • Step 2: Add parseMassTimes function

Append to the file:

/**
 * Parse the mass schedule from the "Mass Times" <ul> block.
 *
 * HTML structure:
 * <ul><li><h5>Mass Times</h5></li>
 *   <li class=""><span class="label">Saturday</span>
 *     <span class='serviceTime'><span class='time'>5:00pm</span></span>,
 *     <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
 *   </li>
 * </ul>
 */
function parseMassTimes(html: string): ParsedMass[] {
  // Cap input to the first 100,000 characters to bound regex work on oversized or malformed pages
  const safeHtml = html.length > 100_000 ? html.slice(0, 100_000) : html;
  const massUlMatch = safeHtml.match(/<ul>\s*<li>\s*<h5>Mass Times<\/h5>[\s\S]*?<\/ul>/);
  if (!massUlMatch) return [];
  const massUl = massUlMatch[0];

  const results: ParsedMass[] = [];

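  // Each day is in a <li ...>. Splitting on the opening tag leaves the text before the
  // first <li> at index 0; the header <li> (h5, no label) is skipped by the label check below.
  // A regex split handles any class value on the li, not just an empty class.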
  const liParts = massUl.split(/<li[^>]*>/);
  for (let i = 1; i < liParts.length; i++) {
    const li = liParts[i];

    const labelMatch = li.match(/<span class="label">([^<]+)<\/span>/);
    if (!labelMatch) continue;
    const dayLabel = labelMatch[1].trim();
    const dayOfWeek = FULL_DAY_NAMES[dayLabel];
    if (dayOfWeek === undefined) continue;

    // Each time entry is in a <span class='serviceTime'>
    const serviceTimeParts = li.split("<span class='serviceTime'>");
    for (let j = 1; j < serviceTimeParts.length; j++) {
      const st = serviceTimeParts[j];
      const timeMatch = st.match(/<span class='time'>([^<]+)<\/span>/);
      if (!timeMatch) continue;
      const time = convertTo24h(timeMatch[1].trim());

      const langMatch = st.match(/<span class='language'>\(([^)]+)\)<\/span>/);
      const language = langMatch ? langMatch[1].trim() : 'English';

      const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
      const notes = commentMatch ? commentMatch[1].trim() : undefined;

      results.push({ dayOfWeek, time, language, notes });
    }
  }

  return results;
}
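
The split-based extraction can be exercised on the sample markup from the doc comment without fetching a page. The sketch below handles a single `<li>`; `extractServiceTimes` is a hypothetical helper that keeps the raw time text rather than converting it:

```typescript
interface RawServiceTime {
  time: string;      // raw text, e.g. "5:00pm"
  language: string;  // defaults to "English" when no language span is present
}

// Hypothetical helper mirroring the inner loop of parseMassTimes.
function extractServiceTimes(li: string): RawServiceTime[] {
  const out: RawServiceTime[] = [];
  // Index 0 is the text before the first serviceTime span (the day label), so start at 1.
  const parts = li.split("<span class='serviceTime'>");
  for (let i = 1; i < parts.length; i++) {
    const timeMatch = parts[i].match(/<span class='time'>([^<]+)<\/span>/);
    if (!timeMatch) continue;
    const langMatch = parts[i].match(/<span class='language'>\(([^)]+)\)<\/span>/);
    out.push({
      time: timeMatch[1].trim(),
      language: langMatch ? langMatch[1].trim() : 'English',
    });
  }
  return out;
}

// Markup copied from the structure documented above.
const sampleLi = [
  `<li class=""><span class="label">Saturday</span>`,
  `  <span class='serviceTime'><span class='time'>5:00pm</span></span>,`,
  `  <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>`,
  `</li>`,
].join('\n');

console.log(extractServiceTimes(sampleLi));
```

The match depends on single-quoted attributes exactly as shown in the sample. If the markup ever switched to double quotes, the split would yield a single part and the function would return an empty array, so a sudden drop to zero masses is a useful signal to watch for.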
  • Step 3: Add parseOtherServices function

Append to the file:

/**
 * Parse confessions and adoration from the "Other Services" <ul> block.
 *
 * HTML structure:
 * <ul><li><h5>Other Services</h5></li>
 *   <li class="Confessions"><span class="label">Confessions</span>
 *     <span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
 *     <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
 *   </li>
 *   <li class="Adoration"><span class="label">Adoration</span>
 *     <span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
 *   </li>
 * </ul>
 */
function parseOtherServices(html: string): { confessions: ParsedConf[]; adorations: ParsedAdoration[] } {
  // Cap input to the first 100,000 characters to bound regex work on oversized or malformed pages
  const safeHtml = html.length > 100_000 ? html.slice(0, 100_000) : html;
  const otherUlMatch = safeHtml.match(/<ul>\s*<li>\s*<h5>Other Services<\/h5>[\s\S]*?<\/ul>/);
  if (!otherUlMatch) return { confessions: [], adorations: [] };
  const otherUl = otherUlMatch[0];

  function parseServiceItems(liHtml: string): Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> {
    const items: Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> = [];
    const stParts = liHtml.split("<span class='serviceTime'>");
    for (let i = 1; i < stParts.length; i++) {
      const st = stParts[i];
      // Text before <span class='time'> contains the day abbreviation and colon
      const dayTimeMatch = st.match(/^([A-Za-z]+):\s*<span class='time'>([^<]+)<\/span>/);
      if (!dayTimeMatch) continue;
      const days = expandDayAbbrev(dayTimeMatch[1].trim());
      if (days.length === 0) continue;
      const [startTime, endTime] = parseTimeRange(dayTimeMatch[2]);
      const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
      const notes = commentMatch ? commentMatch[1].trim() : undefined;
      for (const dayOfWeek of days) {
        items.push({ dayOfWeek, startTime, endTime, notes });
      }
    }
    return items;
  }

  const confessions: ParsedConf[] = [];
  const adorations: ParsedAdoration[] = [];

  const confMatch = otherUl.match(/<li class="Confessions">[\s\S]*?<\/li>/);
  if (confMatch) confessions.push(...parseServiceItems(confMatch[0]));

  const adorMatch = otherUl.match(/<li class="Adoration">[\s\S]*?<\/li>/);
  if (adorMatch) adorations.push(...parseServiceItems(adorMatch[0]));

  return { confessions, adorations };
}
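
Day-prefix expansion is why one line of markup can yield several rows. A minimal sketch, with a trimmed copy of the abbreviation map and a hypothetical `expandService` helper:

```typescript
// Trimmed copy of ABBREV_DAY_NAMES (0=Sun ... 6=Sat).
const DAY_PREFIXES: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

interface ServiceRow {
  dayOfWeek: number;
  startTime: string;
  endTime: string;
}

// Hypothetical helper: one "Prefix: start-end" entry becomes one row per day.
function expandService(prefix: string, startTime: string, endTime: string): ServiceRow[] {
  const days = DAY_PREFIXES[prefix] ?? [];
  return days.map(dayOfWeek => ({ dayOfWeek, startTime, endTime }));
}

const weekdayAdoration = expandService('Weekdays', '09:00', '18:00');
console.log(weekdayAdoration.length);                // 5
console.log(weekdayAdoration.map(r => r.dayOfWeek)); // [ 1, 2, 3, 4, 5 ]
console.log(expandService('Holy Days', '09:00', '10:00').length); // 0 (unknown prefix is dropped)
```

As in `parseServiceItems`, an unrecognised prefix produces no rows rather than an error, so entries with unexpected labels disappear silently unless they are logged.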
  • Step 4: Smoke-test parsing on a real page

Temporarily add a quick test at the end of the file. It is wrapped in an async function so it also works if the script is loaded as CommonJS, where top-level await is not available:

// TEMP: smoke test; remove before committing
if (process.argv[2] === '--test-parse') {
  (async () => {
    const testUrl = 'https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/';
    const html = await (await fetch(testUrl, { headers: { 'User-Agent': USER_AGENT } })).text();
    const church = parseChurch(html);
    const masses = parseMassTimes(html);
    const { confessions, adorations } = parseOtherServices(html);
    console.log('Church:', church);
    console.log('Masses:', masses.length, masses.slice(0, 3));
    console.log('Confessions:', confessions.length, confessions);
    console.log('Adorations:', adorations.length, adorations);
    process.exit(0);
  })().catch(err => {
    // Surface fetch or parse failures instead of an unhandled rejection
    console.error(err);
    process.exit(1);
  });
}

Run it:

cd /home/albert/Documents/ScraperControl
npx tsx scripts/import-discovermass.ts --test-parse

Expected output:

Church: { name: 'St. Paul the Apostle', address: '14085 Peyton Drive', city: 'Chino Hills', state: 'CA', zip: '91709', phone: '(909) 465-5503', website: 'http://www.sptacc.org', lat: 33.996887, lng: -117.732407 }
Masses: 14 [ { dayOfWeek: 6, time: '17:00', language: 'English', notes: undefined }, ... ]
Confessions: 4 [...]
Adorations: 6 [...]

If the counts don't match, debug the regex patterns against the raw HTML:

curl -s https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/ > /tmp/dm-test.html
# Then inspect the relevant sections manually
  • Step 5: Remove the temp test block and commit

Remove the if (process.argv[2] === '--test-parse') block from the file.

git add scripts/import-discovermass.ts
git commit -m "feat: add discovermass parsing utilities (church, mass, confession, adoration)"

Chunk 3: import-discovermass.ts — main import loop

Task 5: Add HTTP helpers + sitemap enumeration

Files:

  • Modify: scripts/import-discovermass.ts

  • Step 1: Add HTTP helpers and loadExistingChurches

Append to the file:

// ─── HTTP Helpers ─────────────────────────────────────────────────────────────

async function fetchHtml(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { 'User-Agent': USER_AGENT },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// ─── Sitemap Enumeration ──────────────────────────────────────────────────────

/**
 * Fetch all 11 WordPress item sitemaps and return every church URL.
 * No rate limiting needed — only 11 sitemap requests.
 */
async function getAllChurchUrls(): Promise<string[]> {
  const urls: string[] = [];
  for (let i = 1; i <= SITEMAP_COUNT; i++) {
    const sitemapUrl = `${SITE_BASE}/wp-sitemap-posts-item-${i}.xml`;
    console.log(`Fetching sitemap ${i}/${SITEMAP_COUNT}...`);
    const xml = await fetchHtml(sitemapUrl);
    const matches = xml.matchAll(/<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g);
    for (const match of matches) {
      urls.push(match[1]);
    }
  }
  console.log(`Total church URLs: ${urls.length}`);
  return urls;
}

// ─── DB Helpers ───────────────────────────────────────────────────────────────

/**
 * Load all US churches from DB into memory for dedup matching.
 * Only loads US churches to keep the array manageable.
 */
async function loadExistingChurches(): Promise<ExistingChurch[]> {
  console.log('Loading existing US churches from DB...');
  const churches = await prisma.church.findMany({
    where: { country: 'US' },
    select: {
      id: true, name: true, latitude: true, longitude: true,
      osmId: true, baiduId: true, masstimesId: true,
      orarimesseId: true, massSchedulesPhId: true, philmassId: true,
      horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
      messesInfoId: true, bohosluzbyId: true, miserendId: true,
      kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
      source: true, website: true, phone: true, address: true, country: true,
    },
  });
  console.log(`Loaded ${churches.length} existing US churches`);
  return churches as ExistingChurch[];
}
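
The `<loc>` extraction can be checked offline against a hand-written sitemap fragment. In this sketch `extractChurchUrls` is a hypothetical pure function wrapping the same regex, and the XML is made up apart from the church URL, which is the one used in the smoke test:

```typescript
// Hypothetical pure version of the extraction inside getAllChurchUrls.
function extractChurchUrls(xml: string): string[] {
  const re = /<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g;
  return Array.from(xml.matchAll(re), m => m[1]);
}

// Hand-written fragment for illustration; a real sitemap holds many more entries.
const sampleXml = `
<urlset>
  <url><loc>https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/</loc></url>
  <url><loc>https://discovermass.com/about/</loc></url>
</urlset>`;

const found = extractChurchUrls(sampleXml);
console.log(found);
```

Because the pattern only accepts URLs under /church/, any other entries in the sitemaps are ignored without a warning.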

Task 6: Add processChurch + main() + CLI parsing

Files:

  • Modify: scripts/import-discovermass.ts

  • Step 1: Add processChurch function

Append to the file:

// ─── Church Processing ────────────────────────────────────────────────────────

async function processChurch(
  url: string,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats,
): Promise<void> {
  const slug = url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
  stats.total++;

  try {
    const html = await fetchHtml(url);
    const parsed = parseChurch(html);
    if (!parsed) {
      console.log(`  [skip] Could not parse: ${slug}`);
      stats.skipped++;
      return;
    }

    const masses = parseMassTimes(html);
    const { confessions, adorations } = parseOtherServices(html);

    if (args.dryRun) {
      console.log(`  [dry-run] ${parsed.name}: ${masses.length} masses, ${confessions.length} confessions, ${adorations.length} adorations`);
      return;
    }

    const candidate = {
      name: parsed.name,
      lat: parsed.lat,
      lng: parsed.lng,
      discovermassId: slug,
    };
    const duplicate = findDuplicateChurch(candidate, existingChurches);

    if (duplicate) {
      // Update existing church — only fill blank fields, always replace schedules
      const updateData: Record<string, unknown> = { discovermassId: slug };
      if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
      if (!duplicate.website && parsed.website) {
        updateData.website = parsed.website;
        updateData.hasWebsite = true;
      }
      if (parsed.lat !== 0 && duplicate.latitude === 0) {
        updateData.latitude = parsed.lat;
        updateData.longitude = parsed.lng;
      }

      try {
        await prisma.$transaction(async (tx) => {
          await tx.church.update({ where: { id: duplicate.id }, data: updateData });

          if (masses.length > 0) {
            await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
            await tx.massSchedule.createMany({
              data: masses.map(m => ({
                churchId: duplicate.id,
                dayOfWeek: m.dayOfWeek,
                time: m.time,
                language: m.language,
                notes: m.notes ?? null,
              })),
            });
          }

          if (confessions.length > 0) {
            await tx.confessionSchedule.deleteMany({ where: { churchId: duplicate.id } });
            await tx.confessionSchedule.createMany({
              data: confessions.map(c => ({
                churchId: duplicate.id,
                dayOfWeek: c.dayOfWeek,
                startTime: c.startTime,
                endTime: c.endTime,
                notes: c.notes ?? null,
              })),
            });
          }

          if (adorations.length > 0) {
            await tx.adorationSchedule.deleteMany({ where: { churchId: duplicate.id } });
            await tx.adorationSchedule.createMany({
              data: adorations.map(a => ({
                churchId: duplicate.id,
                dayOfWeek: a.dayOfWeek,
                startTime: a.startTime,
                endTime: a.endTime,
                notes: a.notes ?? null,
              })),
            });
          }

          await tx.church.update({
            where: { id: duplicate.id },
            data: { lastScrapedAt: new Date() },
          });
        });

        // Update in-memory entry for within-run dedup
        duplicate.discovermassId = slug;
        stats.updated++;
      } catch (err) {
        if (err instanceof Error && err.message.includes('Unique constraint')) {
          stats.skipped++;
          return;
        }
        throw err;
      }
    } else {
      // Create new church
      try {
        const church = await prisma.church.create({
          data: {
            name: parsed.name,
            address: parsed.address,
            city: parsed.city,
            state: parsed.state,
            zip: parsed.zip,
            country: 'US',
            phone: parsed.phone,
            website: parsed.website,
            hasWebsite: !!parsed.website,
            latitude: parsed.lat,
            longitude: parsed.lng,
            discovermassId: slug,
            source: 'discovermass',
          },
        });

        // Add to in-memory array for within-run dedup
        existingChurches.push({
          id: church.id,
          name: parsed.name,
          latitude: parsed.lat,
          longitude: parsed.lng,
          osmId: null,
          baiduId: null,
          masstimesId: null,
          orarimesseId: null,
          massSchedulesPhId: null,
          philmassId: null,
          horariosMisasId: null,
          mszeInfoId: null,
          weekdayMassesId: null,
          messesInfoId: null,
          bohosluzbyId: null,
          miserendId: null,
          kerknetId: null,
          gottesdienstzeitenId: null,
          discovermassId: slug,
          source: 'discovermass',
          website: parsed.website,
          phone: parsed.phone,
          address: parsed.address,
          country: 'US',
        });

        if (masses.length > 0) {
          await prisma.massSchedule.createMany({
            data: masses.map(m => ({
              churchId: church.id,
              dayOfWeek: m.dayOfWeek,
              time: m.time,
              language: m.language,
              notes: m.notes ?? null,
            })),
          });
        }

        if (confessions.length > 0) {
          await prisma.confessionSchedule.createMany({
            data: confessions.map(c => ({
              churchId: church.id,
              dayOfWeek: c.dayOfWeek,
              startTime: c.startTime,
              endTime: c.endTime,
              notes: c.notes ?? null,
            })),
          });
        }

        if (adorations.length > 0) {
          await prisma.adorationSchedule.createMany({
            data: adorations.map(a => ({
              churchId: church.id,
              dayOfWeek: a.dayOfWeek,
              startTime: a.startTime,
              endTime: a.endTime,
              notes: a.notes ?? null,
            })),
          });
        }

        await prisma.church.update({
          where: { id: church.id },
          data: { lastScrapedAt: new Date() },
        });

        stats.created++;
      } catch (err) {
        if (err instanceof Error && err.message.includes('Unique constraint')) {
          stats.skipped++;
          return;
        }
        throw err;
      }
    }

    stats.massSchedulesCreated += masses.length;
    stats.confessionSchedulesCreated += confessions.length;
    stats.adorationSchedulesCreated += adorations.length;

    console.log(
      `  [${duplicate ? 'update' : 'create'}] ${parsed.name} — ` +
      `${masses.length}M ${confessions.length}C ${adorations.length}A — ` +
      `${stats.total} total (${stats.created} new, ${stats.updated} upd, ${stats.errors} err)`
    );
  } catch (err) {
    stats.errors++;
    console.error(`  [error] ${slug}: ${err instanceof Error ? err.message : err}`);
  }
}
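
Matching on the error message text is brittle if the wording ever changes. Prisma's known request errors also carry a `code` property, and `P2002` is the documented code for a unique constraint violation. A minimal duck-typed sketch follows (the `isUniqueViolation` name is hypothetical, and the check deliberately avoids importing Prisma so it runs standalone):

```typescript
// Hypothetical helper: true when the error looks like Prisma's P2002
// (unique constraint failed). Duck-typed so no Prisma import is needed here.
function isUniqueViolation(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    'code' in err &&
    (err as { code: unknown }).code === 'P2002'
  );
}

// Simulated errors for illustration only.
const simulated = Object.assign(new Error('Unique constraint failed'), { code: 'P2002' });
console.log(isUniqueViolation(simulated));            // true
console.log(isUniqueViolation(new Error('timeout'))); // false
console.log(isUniqueViolation('P2002'));              // false (not an object)
```

In the importer this could stand in for the two `err.message.includes('Unique constraint')` checks; whether that is worthwhile depends on how the other importers in the repo handle the same case.
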
  • Step 2: Add parseCLIArgs + main()

Append to the file:

// ─── CLI Parsing ──────────────────────────────────────────────────────────────

function parseCLIArgs(): CLIArgs {
  const args = process.argv.slice(2);
  const result: CLIArgs = { all: false, dryRun: false };
  for (let i = 0; i < args.length; i++) {
    switch (args[i]) {
      case '--all': result.all = true; break;
      case '--dry-run': result.dryRun = true; break;
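      case '--resume-from': {
        // Ignore a missing, non-numeric, or negative value instead of storing NaN
        const n = parseInt(args[++i], 10);
        if (Number.isInteger(n) && n >= 0) result.resumeFrom = n;
        break;
      }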
      case '--job-id': result.jobId = args[++i]; break;
    }
  }
  return result;
}

// ─── Main ─────────────────────────────────────────────────────────────────────

async function main() {
  const args = parseCLIArgs();

  if (!args.all) {
    console.error('Usage: npx tsx scripts/import-discovermass.ts --all [--dry-run] [--resume-from N] [--job-id UUID]');
    process.exit(1);
  }

  // Update job status to 'running' if job-id provided
  if (args.jobId) {
    try {
      await prisma.backgroundJob.update({
        where: { id: args.jobId },
        data: { status: 'running', startedAt: new Date() },
      });
    } catch { /* Job might not exist yet */ }
  }

  const stats: ImportStats = {
    total: 0, created: 0, updated: 0, skipped: 0, errors: 0,
    massSchedulesCreated: 0, confessionSchedulesCreated: 0, adorationSchedulesCreated: 0,
  };

  try {
    // Step 1: Enumerate all church URLs from sitemaps
    const urls = await getAllChurchUrls();

    // Step 2: Load existing US churches for dedup
    const existingChurches = await loadExistingChurches();

    // Step 3: Apply resume-from offset
    const startIdx = args.resumeFrom ?? 0;
    const churchUrls = urls.slice(startIdx);
    console.log(`\nProcessing ${churchUrls.length} churches (starting from index ${startIdx})...\n`);

    // Step 4: Process each church with 10s delay between requests
    for (let i = 0; i < churchUrls.length; i++) {
      const url = churchUrls[i];
      const overallIdx = startIdx + i;
      console.log(`[${overallIdx + 1}/${urls.length}] ${url}`);

      await processChurch(url, existingChurches, args, stats);

      // Rate limit: 10s delay between church pages (robots.txt Crawl-delay: 10)
      if (i < churchUrls.length - 1) {
        await sleep(REQUEST_DELAY_MS);
      }
    }
  } finally {
    // Always print stats and update job status
    console.log('\n─── Import Complete ───────────────────────────────────────');
    console.log(`Total processed:    ${stats.total}`);
    console.log(`Created:            ${stats.created}`);
    console.log(`Updated:            ${stats.updated}`);
    console.log(`Skipped:            ${stats.skipped}`);
    console.log(`Errors:             ${stats.errors}`);
    console.log(`Mass schedules:     ${stats.massSchedulesCreated}`);
    console.log(`Confession sched:   ${stats.confessionSchedulesCreated}`);
    console.log(`Adoration sched:    ${stats.adorationSchedulesCreated}`);

    if (args.jobId) {
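      // A run that processed nothing (e.g. the sitemap fetch or DB load threw) counts as failed
      const status = stats.total === 0 || stats.errors > stats.total * 0.1 ? 'failed' : 'completed';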
      try {
        await prisma.backgroundJob.update({
          where: { id: args.jobId },
          data: {
            status,
            completedAt: new Date(),
            processed: stats.total,
            succeeded: stats.created + stats.updated,
            failed: stats.errors,
            itemsFound: stats.massSchedulesCreated,
          },
        });
      } catch { /* Ignore */ }
    }

    await prisma.$disconnect();
    await pool.end();
  }
}

main().catch((err) => {
  console.error('Fatal error:', err);
  process.exit(1);
});
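
The 10% error threshold applied in the `finally` block can be pulled out into a pure function so it is easy to unit test. This is a minimal sketch; `jobStatusFor` and `maxErrorRatio` are hypothetical names, not part of the script as written.

```typescript
type JobStatus = 'completed' | 'failed';

// Hypothetical helper, not part of import-discovermass.ts as written.
// Mirrors the threshold in main(): the run fails when errors exceed
// maxErrorRatio (10% by default) of the churches processed.
function jobStatusFor(total: number, errors: number, maxErrorRatio = 0.1): JobStatus {
  return errors > total * maxErrorRatio ? 'failed' : 'completed';
}

console.log(jobStatusFor(20284, 50)); // completed (well under 10%)
console.log(jobStatusFor(100, 10));   // completed (exactly 10% is not "more than")
console.log(jobStatusFor(100, 11));   // failed
```

The comparison is strict, so a run with exactly 10% errors still counts as completed.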
  • Step 3: Run dry-run on first 3 churches
cd /home/albert/Documents/ScraperControl
# Get a few URLs from the first sitemap to test with
curl -s https://discovermass.com/wp-sitemap-posts-item-1.xml | grep -o '<loc>[^<]*</loc>' | head -3

Then test dry-run:

npx tsx scripts/import-discovermass.ts --all --dry-run --resume-from 0

Wait ~30 seconds (3 churches × 10s delay). Expected output:

Connecting to database: postgresql://...
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
Loaded XXXX existing US churches

Processing 20284 churches (starting from index 0)...

[1/20284] https://discovermass.com/church/some-church/
  [dry-run] Some Church — 8 masses, 2 confessions, 3 adorations
[2/20284] ...

Stop with Ctrl+C after a few churches.

  • Step 4: Verify TypeScript compiles
npx tsc --noEmit

Fix any type errors before committing.

  • Step 5: Commit
git add scripts/import-discovermass.ts
git commit -m "feat: add import-discovermass.ts — USA church importer with 10s crawl delay"

Chunk 4: Integration + Docker deployment

Task 7: package.json + scheduler integration

Files:

  • Modify: package.json

  • Modify: scripts/scheduler.ts

  • Step 1: Add import:discovermass to package.json

Open package.json. Find the "scripts" section. After "import:masstimes-api" (or whichever is last in the import group), add:

"import:discovermass": "tsx scripts/import-discovermass.ts",
  • Step 2: Add discovermass-import case to getJobCommand in scheduler.ts

Open scripts/scheduler.ts. Find the getJobCommand function. After the masstimes-api-import case block, add:

case 'discovermass-import': {
  const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
  // Note: --job-id is appended by startJobProcess() in the scheduler, not here.
}
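
To check that the scheduler's argument list lines up with what `parseCLIArgs` expects, a standalone round-trip sketch can help. `buildImportArgs` and `parseArgs` are hypothetical stand-ins: the first mimics the `discovermass-import` case plus the `--job-id` that `startJobProcess()` is said to append, the second is the same switch as `parseCLIArgs` but takes `argv` as a parameter.

```typescript
interface CLIArgs { all: boolean; dryRun: boolean; resumeFrom?: number; jobId?: string }

// Hypothetical stand-in for the scheduler side.
function buildImportArgs(config: { resumeFrom?: number } | undefined, jobId?: string): string[] {
  const args = ['--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  // Assumption: startJobProcess() appends the flag in this form.
  if (jobId) args.push('--job-id', jobId);
  return args;
}

// Same switch as parseCLIArgs, parameterised on argv so it can be tested.
function parseArgs(argv: string[]): CLIArgs {
  const result: CLIArgs = { all: false, dryRun: false };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case '--all': result.all = true; break;
      case '--dry-run': result.dryRun = true; break;
      case '--resume-from': result.resumeFrom = parseInt(argv[++i], 10); break;
      case '--job-id': result.jobId = argv[++i]; break;
    }
  }
  return result;
}

const parsed = parseArgs(buildImportArgs({ resumeFrom: 5000 }, 'job-123'));
console.log(parsed);
```

Because the scheduler case tests `config?.resumeFrom` for truthiness, a `resumeFrom` of `0` adds no flag, which is equivalent to starting from the beginning.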
  • Step 3: Add discovermass-import to PIPELINE_GROUPS

In scripts/scheduler.ts, find PIPELINE_GROUPS. In the first group's phases array, after the masstimes-api-import entry, add:

{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
  • Step 4: Verify TypeScript compiles
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
  • Step 5: Commit
git add package.json scripts/scheduler.ts
git commit -m "feat: add discovermass-import to scheduler pipeline and package.json"

Task 8: Deploy to Docker and run

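The importer will run for at least ~56 hours in the Docker scheduler container: 20,284 pages at a 10 s crawl delay is about 56.3 hours of sleeping alone, before fetch and database time. The scheduler picks it up as part of the PIPELINE_GROUPS sequence.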
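
The runtime figure follows directly from the page count and the crawl delay. A small sketch of the arithmetic, assuming `REQUEST_DELAY_MS` is 10,000 ms (the plan states a 10 s delay; the exact constant value is an assumption) and ignoring fetch and database time:

```typescript
// Inputs taken from the plan: 20,284 church pages, 10 s between requests.
const CHURCH_COUNT = 20284;
const REQUEST_DELAY_MS = 10_000; // assumed value of the script's constant

// Lower bound on wall-clock time: the loop sleeps between pages,
// but not after the last one, so there are count - 1 sleeps.
function minRuntimeHours(count: number, delayMs: number): number {
  const sleeps = Math.max(count - 1, 0);
  return (sleeps * delayMs) / 3_600_000;
}

console.log(minRuntimeHours(CHURCH_COUNT, REQUEST_DELAY_MS).toFixed(1)); // "56.3"
```

The real run will take longer, since each page fetch and each upsert adds time on top of the sleeps.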

  • Step 1: Deploy to Docker directory
bash /home/albert/Documents/ScraperControl/scripts/deploy-local.sh

Or manually:

rsync -avz --exclude node_modules --exclude .next --exclude '.env*' \
  --exclude .git --exclude .claude --exclude .playwright-mcp \
  ~/Documents/ScraperControl/ /opt/docker/scraper-control/
  • Step 2: Rebuild Docker images to pick up new script
cd /opt/docker/scraper-control
docker compose build scheduler

Expected: build completes without errors.

  • Step 3: Create a manual job via the admin API to trigger the import immediately

Manual jobs take priority over pipeline jobs, so creating one starts the import without waiting for its pipeline slot:

curl -X POST http://localhost:3001/api/admin/jobs \
  -H "Content-Type: application/json" \
  -H "X-Admin-Key: $(grep '^ADMIN_API_KEY=' /opt/docker/scraper-control/.env | cut -d= -f2-)" \
  -d '{"type": "discovermass-import", "config": {}}'

Expected: {"id": "...", "status": "pending", ...}
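
If job creation is ever scripted rather than run by hand, the key lookup from `.env` can be done in TypeScript instead of a shell pipeline. The sketch below is an assumption-laden illustration: `readEnvValue` is a hypothetical helper, it matches the key name exactly, and it keeps any `=` characters inside the value. It does not handle `export` prefixes or multi-line values.

```typescript
// Hypothetical helper: returns the value for `key` from dotenv-style text,
// or undefined when the key is absent.
function readEnvValue(envText: string, key: string): string | undefined {
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks and comments
    const eq = line.indexOf('=');
    if (eq === -1) continue;
    if (line.slice(0, eq).trim() !== key) continue;    // exact key match only
    // Keep everything after the first '=' and drop one pair of surrounding quotes.
    return line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, '$2');
  }
  return undefined;
}

// Sample content for illustration only; not a real key.
const sample = [
  '# example only',
  'OLD_ADMIN_API_KEY=nope',
  'ADMIN_API_KEY="abc=123=="',
].join('\n');

console.log(readEnvValue(sample, 'ADMIN_API_KEY')); // abc=123==
```

In a real script the text would come from `fs.readFileSync('/opt/docker/scraper-control/.env', 'utf8')`.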

  • Step 4: Recreate the scheduler container so it runs the rebuilt image and picks up the new job
cd /opt/docker/scraper-control
docker compose up -d --force-recreate scheduler
  • Step 5: Monitor the job

Check logs:

docker compose logs -f scheduler --since 1m

Expected output:

Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
[1/20284] https://discovermass.com/church/...
  [update] St. Paul the Apostle — 14M 4C 6A — 1 total (0 new, 1 upd, 0 err)

The St. Paul the Apostle church we seeded earlier should show as [update] (matched by name+proximity), linking the discovermassId to the existing record.
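
The real matching rule lives in church-matcher and is not reproduced here. As a generic illustration of what "name + proximity" matching can look like, the sketch below compares normalised names and a haversine distance. The normalisation, the 500 m threshold, and the coordinates are all made up for the example and are not the values church-matcher uses.

```typescript
interface ChurchPoint { name: string; lat: number; lng: number }

// Great-circle distance in metres (haversine formula, mean Earth radius 6,371 km).
function haversineMeters(a: ChurchPoint, b: ChurchPoint): number {
  const R = 6_371_000;
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Assumed normalisation: lowercase, punctuation to spaces, collapse whitespace.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9 ]+/g, ' ').replace(/\s+/g, ' ').trim();
}

// Illustrative threshold only.
const MAX_DISTANCE_M = 500;

function looksLikeSameChurch(a: ChurchPoint, b: ChurchPoint): boolean {
  return normalizeName(a.name) === normalizeName(b.name) &&
    haversineMeters(a, b) <= MAX_DISTANCE_M;
}

// Made-up coordinates about 111 m apart (0.001 degrees of latitude).
const seeded: ChurchPoint = { name: 'St. Paul the Apostle', lat: 34.0, lng: -117.75 };
const scraped: ChurchPoint = { name: 'St Paul the Apostle', lat: 34.001, lng: -117.75 };
console.log(looksLikeSameChurch(seeded, scraped)); // true
```

A pair that matches this way should be logged as `[update]`, and the existing row gains a `discovermassId` instead of a duplicate being created.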

  • Step 6: Verify St. Paul the Apostle was matched (not duplicated)

Once the importer has reached the Chino Hills area (a few hours into the run, since URLs are processed alphabetically by state/slug), run:

psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass \
  -c "SELECT name, city, state, discovermass_id, created_at FROM churches WHERE name ILIKE '%Paul%' AND city = 'Chino Hills';"

Expected: 1 row with discovermass_id = 'st-paul-the-apostle-chino-hills-ca' and created_at from the earlier seed. An unchanged created_at confirms the row was updated rather than created.

  • Step 7: Let the full run complete (~56 hours)

The scheduler will log progress. You can check status anytime:

docker compose logs scheduler --since 10m | tail -20

Or via the admin dashboard at http://192.168.0.241:3001, where the job appears in the Jobs tab with status running. As written, main() writes the processed/succeeded/failed fields only in its finally block, so use the logs for live progress.
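
If the run is interrupted, the `[N/total]` prefix that `main()` prints before each church can be turned into a `--resume-from` value. The sketch below uses a hypothetical helper, `resumeIndexFromLogLine`. It assumes the sitemap order is the same on the next run, and it restarts at the logged church itself, because that church may not have finished; reprocessing it is harmless given the upsert.

```typescript
// Hypothetical helper: derives a --resume-from index from a progress line such as
// "[5000/20284] https://discovermass.com/church/some-church/".
// The regex is unanchored so a Docker log prefix before the bracket is tolerated.
function resumeIndexFromLogLine(line: string): number | undefined {
  const m = /\[(\d+)\/(\d+)\]/.exec(line);
  if (!m) return undefined;
  const position = Number(m[1]); // 1-based position printed by main()
  const total = Number(m[2]);
  if (position < 1 || position > total) return undefined;
  // Convert to the 0-based index that --resume-from expects.
  return position - 1;
}

console.log(resumeIndexFromLogLine('[5000/20284] https://discovermass.com/church/some-church/')); // 4999
```

The result is then passed through the job config, for example `{"type": "discovermass-import", "config": {"resumeFrom": 4999}}`.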

Final expected stats after completion:

Total processed:    20284
Created:            ~17000-19000 (new US churches)
Updated:            ~1000-3000 (matched against OSM/MassTimes churches)
Errors:             < 50 (network blips)
Mass schedules:     ~70000-90000
Confession sched:   ~30000-50000
Adoration sched:    ~10000-20000