ScraperControl/docs/superpowers/plans/2026-03-10-discovermass-importer.md
albertfj114 6e9ada7fdf fix: harden discovermass plan against coord validation and regex slowdown
- Validate lat/lng from daddr= (bounds check + isFinite) before storing
- Cap HTML to 100KB before regex matching to prevent backtracking on large pages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 22:34:51 -04:00

# DiscoverMass.com Importer Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Import 20,284 US Catholic churches with mass/confession/adoration schedules from discovermass.com into the NearestMass database.
**Architecture:** Enumerate 11 WordPress sitemaps → fetch each church page at 10s intervals (respecting Crawl-delay) → parse server-rendered HTML for name/address/coordinates/schedules → match against existing US churches via church-matcher → upsert with full schedule data.
**Tech Stack:** TypeScript/tsx, Prisma 7 + PrismaPg adapter, pg Pool, Node.js `fetch`, regex HTML parsing (no DOM library needed — HTML is server-rendered and predictable).
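The crawl delay dominates the runtime, so it is worth estimating a full pass before scheduling one. The sketch below gives a rough lower bound that ignores fetch and database time. `estimateCrawlSeconds` and `formatDuration` are illustrative names and are not part of the importer.

```typescript
// Rough lower bound on crawl time: one delay between each pair of consecutive pages.
// Both helpers are hypothetical and exist only for this estimate.
function estimateCrawlSeconds(pageCount: number, delayMs: number): number {
  if (pageCount <= 1) return 0;
  return ((pageCount - 1) * delayMs) / 1000;
}

function formatDuration(totalSeconds: number): string {
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  return `${hours}h ${minutes}m`;
}

const crawlSeconds = estimateCrawlSeconds(20_284, 10_000);
console.log(crawlSeconds, formatDuration(crawlSeconds));
```

For 20,284 pages at 10 seconds each this comes to a little over 56 hours, which is why `--resume-from` matters for a run that may be interrupted.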
---
## Chunk 1: Schema + church-matcher
### Task 1: Add discovermassId to schema
**Files:**
- Modify: `prisma/schema.prisma`
The schema lives in this repo but migrations run in BethelGuide. After editing schema.prisma here, run `npx prisma generate` to regenerate the Prisma client. Do NOT run `prisma migrate`.
- [ ] **Step 1: Find the right place in schema.prisma**
Open `prisma/schema.prisma`. Find the block of source ID fields — they look like:
```prisma
gottesdienstzeitenId String? @unique @map("gottesdienstzeiten_id")
```
This is inside the `model Church { ... }` block, after `kerknetId` and before `claimed`.
- [ ] **Step 2: Add discovermassId field**
After `gottesdienstzeitenId`:
```prisma
discovermassId String? @unique @map("discovermass_id")
```
Also find the `@@index` block near the bottom of the Church model (it groups all the index definitions). Add:
```prisma
@@index([discovermassId])
```
- [ ] **Step 3: Regenerate Prisma client**
```bash
cd /home/albert/Documents/ScraperControl
npx prisma generate
```
Expected output: `✔ Generated Prisma Client` (no errors). This does NOT touch the database — it only updates the TypeScript client.
- [ ] **Step 4: Apply migration to database**
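The schema source of truth is BethelGuide, so the formal migration belongs there and is synced back afterwards. Because both projects use the same dev database server, the column can be added here directly. First check whether it already exists: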
```bash
# Check if discovermass_id column already exists (it shouldn't yet)
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass
```
If the column doesn't exist, apply it directly:
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "
ALTER TABLE churches ADD COLUMN IF NOT EXISTS discovermass_id VARCHAR UNIQUE;
CREATE INDEX IF NOT EXISTS churches_discovermass_id_idx ON churches(discovermass_id);
"
```
Expected output: `ALTER TABLE` and `CREATE INDEX`
- [ ] **Step 5: Verify column exists**
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass
```
Expected output: `discovermass_id | character varying | ...`
- [ ] **Step 6: Commit**
```bash
cd /home/albert/Documents/ScraperControl
git add prisma/schema.prisma
git commit -m "feat: add discovermassId field to Church schema"
```
---
### Task 2: Update church-matcher
**Files:**
- Modify: `src/lib/church-matcher.ts`
The `ExistingChurch` interface (line ~11) lists all source IDs. The `ChurchCandidate` type (line ~122) lists optional source IDs for the candidate. The `findDuplicateChurch` function has sequential passes checking each ID before falling back to proximity+name.
- [ ] **Step 1: Add discovermassId to ExistingChurch interface**
Find the `export interface ExistingChurch {` block. After the `gottesdienstzeitenId` line, add:
```typescript
discovermassId: string | null;
```
- [ ] **Step 2: Add discovermassId to ChurchCandidate type**
Find `export type ChurchCandidate = {`. After `gottesdienstzeitenId?: string;`, add:
```typescript
discovermassId?: string;
```
- [ ] **Step 3: Add discovermassId matching pass in findDuplicateChurch**
Find the `findDuplicateChurch` function. It has a series of passes like:
```typescript
if (candidate.gottesdienstzeitenId) {
  const match = existingChurches.find(c => c.gottesdienstzeitenId === candidate.gottesdienstzeitenId);
  if (match) return match;
}
// Proximity + name similarity
```
Add a new pass BEFORE the proximity+name pass (after gottesdienstzeitenId):
```typescript
if (candidate.discovermassId) {
  const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
  if (match) return match;
}
```
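A minimal sketch of how an ID pass short-circuits before any fuzzy matching. The `Existing` and `Candidate` shapes are trimmed-down stand-ins for `ExistingChurch` and `ChurchCandidate`, `matchByDiscovermassId` is a hypothetical name, and the slugs are made up. The other passes of the real `findDuplicateChurch` are not reproduced here.

```typescript
// Trimmed-down stand-ins for ExistingChurch and ChurchCandidate (assumed shapes).
interface Existing { id: string; name: string; discovermassId: string | null; }
interface Candidate { name: string; discovermassId?: string; }

// Hypothetical helper: exact ID match only, no name or distance comparison.
function matchByDiscovermassId(candidate: Candidate, existing: Existing[]): Existing | null {
  // No ID on the candidate: leave the decision to later passes
  if (!candidate.discovermassId) return null;
  return existing.find(c => c.discovermassId === candidate.discovermassId) ?? null;
}

const knownChurches: Existing[] = [
  { id: 'a', name: 'St. Mary', discovermassId: 'st-mary-springfield-il' },
  { id: 'b', name: 'St. Mary', discovermassId: null },
];

// The names differ, but the ID alone is enough to match.
const idHit = matchByDiscovermassId(
  { name: 'Saint Mary Parish', discovermassId: 'st-mary-springfield-il' },
  knownChurches,
);
console.log(idHit?.id);
```

Because the ID pass runs first, a church imported on an earlier run is matched exactly on the next run even if its name or coordinates changed on the site.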
- [ ] **Step 4: Update all callers that construct ExistingChurch objects**
Search for places that build ExistingChurch objects (the in-memory push after creating a new church). Each importer has a block like:
```typescript
existingChurches.push({
id: newChurch.id,
...
gottesdienstzeitenId: null,
...
});
```
Run:
```bash
grep -rn "gottesdienstzeitenId: null" scripts/
```
For each file found: add `discovermassId: null,` after `gottesdienstzeitenId: null,`. These are the in-memory dedup arrays — they need the new field or TypeScript will complain.
Also update the `loadExistingChurches` select queries if any importer has one (check with `grep -rn "gottesdienstzeitenId: true" scripts/`).
- [ ] **Step 5: Verify TypeScript compiles**
```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```
Expected: no errors. Any type errors at this point will most likely be object literals missing the new `discovermassId` field; fix them before continuing.
- [ ] **Step 6: Commit**
```bash
# Stage church-matcher AND all importer scripts that were updated in Step 4
git add src/lib/church-matcher.ts
git add scripts/
git commit -m "feat: add discovermassId to church-matcher ExistingChurch and ChurchCandidate"
```
---
## Chunk 2: import-discovermass.ts — utilities and parsing
### Task 3: Create file skeleton + utilities
**Files:**
- Create: `scripts/import-discovermass.ts`
- [ ] **Step 1: Create the file with header, imports, constants, types**
Create `scripts/import-discovermass.ts` with this content:
```typescript
#!/usr/bin/env tsx
/**
* Import Catholic churches and mass schedules from discovermass.com (USA)
*
* discovermass.com is a US Catholic church directory with 20,284 churches.
* Data includes name, address, phone, website, coordinates, mass times,
* confessions, and adoration schedules.
*
* robots.txt specifies Crawl-delay: 10 — this importer follows that rule.
*
* Usage:
* npx tsx scripts/import-discovermass.ts --all
* npx tsx scripts/import-discovermass.ts --all --dry-run
* npx tsx scripts/import-discovermass.ts --all --resume-from 5000
* npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
connectionString: dbUrl,
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
// ─── Constants ───────────────────────────────────────────────────────────────
const SITE_BASE = 'https://discovermass.com';
const SITEMAP_COUNT = 11;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 10_000; // Crawl-delay: 10 from robots.txt
// ─── Types ───────────────────────────────────────────────────────────────────
interface ParsedChurch {
name: string;
address: string | null;
city: string | null;
state: string | null;
zip: string | null;
phone: string | null;
website: string | null;
lat: number;
lng: number;
}
interface ParsedMass {
dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
time: string; // HH:MM 24-hour
language: string;
notes?: string;
}
interface ParsedConf {
dayOfWeek: number;
startTime: string; // HH:MM 24-hour
endTime: string; // HH:MM 24-hour
notes?: string;
}
interface ParsedAdoration {
dayOfWeek: number;
startTime: string; // HH:MM 24-hour
endTime: string; // HH:MM 24-hour
notes?: string;
}
interface ImportStats {
total: number;
created: number;
updated: number;
skipped: number;
errors: number;
massSchedulesCreated: number;
confessionSchedulesCreated: number;
adorationSchedulesCreated: number;
}
interface CLIArgs {
all: boolean;
dryRun: boolean;
resumeFrom?: number;
jobId?: string;
}
```
- [ ] **Step 2: Add day mappings and time utilities**
Append to the file:
```typescript
// ─── Day Mappings ─────────────────────────────────────────────────────────────
// Full day names used in mass schedule <li> labels
const FULL_DAY_NAMES: Record<string, number> = {
Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3,
Thursday: 4, Friday: 5, Saturday: 6,
};
// Abbreviated day prefixes used in confession/adoration serviceTime text
const ABBREV_DAY_NAMES: Record<string, number[]> = {
Sun: [0], Mon: [1], Tue: [2], Wed: [3],
Thr: [4], Thu: [4], Fri: [5], Sat: [6],
Weekdays: [1, 2, 3, 4, 5],
Daily: [0, 1, 2, 3, 4, 5, 6],
};
// ─── Time Utilities ───────────────────────────────────────────────────────────
/**
 * Convert "5:00pm", "11:00am", "12:00pm", "12:00am" to "HH:MM" 24-hour format.
 * Input that doesn't match the expected format is returned trimmed and lowercased, not converted.
 */
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let hours = parseInt(m[1], 10);
  const mins = m[2];
  const meridiem = m[3];
  if (meridiem === 'pm' && hours !== 12) hours += 12;
  if (meridiem === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${mins}`;
}
/**
 * Parse "8:30am-9:00am" → ["08:30", "09:00"].
 * Each side must carry its own am/pm suffix; nothing is inferred from the other side.
 * A string with no range separator yields the same time for start and end.
 */
function parseTimeRange(rangeStr: string): [string, string] {
  // Find the first '-' after the first ':' so the split lands between start and end
  // Pattern: "8:30am-9:00am" or "3:30pm-4:30pm"
  const hyphenIdx = rangeStr.indexOf('-', rangeStr.indexOf(':') + 1);
  if (hyphenIdx === -1) {
    const t = convertTo24h(rangeStr.trim());
    return [t, t];
  }
  const start = convertTo24h(rangeStr.slice(0, hyphenIdx).trim());
  const end = convertTo24h(rangeStr.slice(hyphenIdx + 1).trim());
  return [start, end];
}
/**
* Expand abbreviated day prefix to array of dayOfWeek integers.
* Returns empty array if prefix is not recognized.
*/
function expandDayAbbrev(prefix: string): number[] {
return ABBREV_DAY_NAMES[prefix] ?? [];
}
// ─── Address Parsing ──────────────────────────────────────────────────────────
/**
* Parse "14085 Peyton Drive, Chino Hills, CA 91709" into components.
* Returns partial result on malformed input.
*/
function parseAddress(raw: string): { address: string | null; city: string | null; state: string | null; zip: string | null } {
const parts = raw.split(', ');
if (parts.length < 3) return { address: raw, city: null, state: null, zip: null };
const last = parts[parts.length - 1].trim();
const stateZipMatch = last.match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/);
if (!stateZipMatch) return { address: raw, city: null, state: null, zip: null };
return {
address: parts.slice(0, parts.length - 2).join(', ').trim(),
city: parts[parts.length - 2].trim(),
state: stateZipMatch[1],
zip: stateZipMatch[2],
};
}
```
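Noon and midnight are the easiest places for a 12-hour conversion to go wrong. The snippet below repeats the two time helpers from this step in condensed form so it runs on its own, then prints a handful of inputs. It is a spot check, not a test suite.

```typescript
// Condensed copies of the helpers above so this snippet is self-contained.
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let hours = parseInt(m[1], 10);
  if (m[3] === 'pm' && hours !== 12) hours += 12;
  if (m[3] === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${m[2]}`;
}

function parseTimeRange(rangeStr: string): [string, string] {
  const hyphenIdx = rangeStr.indexOf('-', rangeStr.indexOf(':') + 1);
  if (hyphenIdx === -1) {
    const t = convertTo24h(rangeStr);
    return [t, t];
  }
  return [
    convertTo24h(rangeStr.slice(0, hyphenIdx)),
    convertTo24h(rangeStr.slice(hyphenIdx + 1)),
  ];
}

for (const input of ['12:00am', '12:00pm', '5:00pm', '11:59PM', 'noon']) {
  console.log(input, '->', convertTo24h(input));
}
console.log(parseTimeRange('8:30am-9:00am'));
console.log(parseTimeRange('9:00am - 6:00pm'));
```

Note that `'noon'` passes through unconverted, so a caller may want to check the `HH:MM` shape before writing a value to the database.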
- [ ] **Step 3: Verify utilities compile**
```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```
Expected: no errors related to import-discovermass.ts. Other files may have pre-existing errors — focus only on this file's errors.
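Compiling will not catch parsing mistakes, so a quick run of `parseAddress` on a few inputs can help. The snippet repeats the function so it runs on its own. Apart from the Chino Hills address taken from the doc comment, the inputs are made up for illustration.

```typescript
interface AddressParts {
  address: string | null;
  city: string | null;
  state: string | null;
  zip: string | null;
}

// Copy of parseAddress from this step, kept here so the snippet is self-contained.
function parseAddress(raw: string): AddressParts {
  const parts = raw.split(', ');
  if (parts.length < 3) return { address: raw, city: null, state: null, zip: null };
  const stateZip = parts[parts.length - 1].trim().match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/);
  if (!stateZip) return { address: raw, city: null, state: null, zip: null };
  return {
    address: parts.slice(0, parts.length - 2).join(', ').trim(),
    city: parts[parts.length - 2].trim(),
    state: stateZip[1],
    zip: stateZip[2],
  };
}

console.log(parseAddress('14085 Peyton Drive, Chino Hills, CA 91709'));
// A suite number adds an extra comma-separated part; it stays in the street address
console.log(parseAddress('100 Main St, Suite 4, Springfield, IL 62701-1234'));
// Too few parts: the raw string is kept as the address and the rest is null
console.log(parseAddress('PO Box 12'));
```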
---
### Task 4: Add HTML parsing functions
**Files:**
- Modify: `scripts/import-discovermass.ts`
The HTML is server-rendered. The page structure has:
- `<meta property="og:title">` for church name
- US address embedded as text in a known pattern
- `<div id="sidebar-info">` for phone/website/coordinates
- Two `<ul>` blocks: one containing `<h5>Mass Times</h5>`, another containing `<h5>Other Services</h5>`
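To make the structure above concrete, the sketch below runs single-capture patterns over a small fixture. The fixture markup is an assumption reconstructed from the patterns used in this plan, not a copy of a real page. `firstGroup` is a hypothetical helper, and the parish name, phone number, and URL are placeholders.

```typescript
// Fixture markup is assumed; it mirrors the attribute quoting the plan's patterns expect.
const samplePage = `
<meta property="og:title" content="St. Example Parish" />
<div id="sidebar-info">
<span class='side-phone attribute'>(555) 010-0000</span>
<span class='side-website attribute'><a href='https://example.org'>Website</a></span>
</div>`;

// Hypothetical helper: return the first capture group, trimmed, or null when nothing matches.
function firstGroup(html: string, pattern: RegExp): string | null {
  const m = html.match(pattern);
  return m ? m[1].trim() : null;
}

const churchName = firstGroup(samplePage, /<meta property="og:title" content="([^"]+)"/);
const phoneText = firstGroup(samplePage, /<span class='side-phone attribute'>([^<]+)<\/span>/);
const websiteUrl = firstGroup(samplePage, /<span class='side-website attribute'><a href='([^']+)'/);
console.log({ churchName, phoneText, websiteUrl });
```

These patterns depend on the exact quoting in the markup (double quotes on the meta tag, single quotes in the sidebar), so a change in the site's templates would make them return null without raising an error.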
- [ ] **Step 1: Add parseChurch function**
Append to the file:
```typescript
// ─── HTML Parsing ─────────────────────────────────────────────────────────────
/**
* Parse church metadata from page HTML.
* Returns null if the page doesn't look like a valid church listing.
*/
function parseChurch(html: string): ParsedChurch | null {
// Name from OpenGraph meta tag
const nameMatch = html.match(/<meta property="og:title" content="([^"]+)"/);
if (!nameMatch) return null;
const name = nameMatch[1].trim();
if (!name || name === 'Discover Mass') return null;
// Address: match US address pattern (number + street, city, STATE ZIP)
let address: string | null = null;
let city: string | null = null;
let state: string | null = null;
let zip: string | null = null;
const addrMatch = html.match(/(\d+[^<\n,]+),\s*([^<,\n]+),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)/);
if (addrMatch) {
const raw = `${addrMatch[1].trim()}, ${addrMatch[2].trim()}, ${addrMatch[3]} ${addrMatch[4]}`;
const parsed = parseAddress(raw);
address = parsed.address;
city = parsed.city;
state = parsed.state;
zip = parsed.zip;
}
// Phone from sidebar
const phoneMatch = html.match(/<span class='side-phone attribute'>([^<]+)<\/span>/);
const phone = phoneMatch ? phoneMatch[1].trim() : null;
// Website from sidebar
const websiteMatch = html.match(/<span class='side-website attribute'><a href='([^']+)'/);
const website = websiteMatch ? websiteMatch[1].trim() : null;
// Coordinates from Google Maps daddr parameter
let lat = 0;
let lng = 0;
const coordMatch = html.match(/daddr=([-\d.]+),([-\d.]+)/);
if (coordMatch) {
const rawLat = parseFloat(coordMatch[1]);
const rawLng = parseFloat(coordMatch[2]);
// Validate: reject NaN, Infinity, and out-of-range values; fall back to 0 sentinel
if (isFinite(rawLat) && isFinite(rawLng) && Math.abs(rawLat) <= 90 && Math.abs(rawLng) <= 180) {
lat = rawLat;
lng = rawLng;
}
}
return { name, address, city, state, zip, phone, website, lat, lng };
}
```
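The coordinate check can also be read in isolation. The sketch below is a variant that returns `null` for a missing or invalid pair instead of the `0` sentinel that `parseChurch` uses. `parseDaddr` is a hypothetical name, and whether the schema could store a null coordinate is not something this plan states.

```typescript
// Hypothetical variant of the coordinate extraction: null instead of a 0 sentinel.
function parseDaddr(html: string): { lat: number; lng: number } | null {
  const m = html.match(/daddr=([-\d.]+),([-\d.]+)/);
  if (!m) return null;
  const lat = parseFloat(m[1]);
  const lng = parseFloat(m[2]);
  // Reject NaN, Infinity, and anything outside valid latitude/longitude ranges
  const valid =
    Number.isFinite(lat) && Number.isFinite(lng) &&
    Math.abs(lat) <= 90 && Math.abs(lng) <= 180;
  return valid ? { lat, lng } : null;
}

console.log(parseDaddr('maps?daddr=33.996887,-117.732407')); // valid pair
console.log(parseDaddr('maps?daddr=133.9,-117.7'));          // latitude out of range
console.log(parseDaddr('maps?daddr=-.-,..'));                // matches the pattern but parses to NaN
console.log(parseDaddr('no coordinates here'));              // no daddr parameter at all
```

The third input shows why the `isFinite` check matters: the character class `[-\d.]+` accepts strings that are not numbers.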
- [ ] **Step 2: Add parseMassTimes function**
Append to the file:
```typescript
/**
* Parse the mass schedule from the "Mass Times" <ul> block.
*
* HTML structure:
* <ul><li><h5>Mass Times</h5></li>
* <li class=""><span class="label">Saturday</span>
* <span class='serviceTime'><span class='time'>5:00pm</span></span>,
* <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
* </li>
* </ul>
*/
function parseMassTimes(html: string): ParsedMass[] {
// Cap HTML to first 100KB to prevent slow regex backtracking on malformed pages
const safeHtml = html.length > 100_000 ? html.slice(0, 100_000) : html;
const massUlMatch = safeHtml.match(/<ul>\s*<li>\s*<h5>Mass Times<\/h5>[\s\S]*?<\/ul>/);
if (!massUlMatch) return [];
const massUl = massUlMatch[0];
const results: ParsedMass[] = [];
// Each day is in a <li ...> (the first li has the h5 header, skip it).
// Use regex split to handle any class value on the li, not just empty class.
const liParts = massUl.split(/<li[^>]*>/);
for (let i = 1; i < liParts.length; i++) {
const li = liParts[i];
const labelMatch = li.match(/<span class="label">([^<]+)<\/span>/);
if (!labelMatch) continue;
const dayLabel = labelMatch[1].trim();
const dayOfWeek = FULL_DAY_NAMES[dayLabel];
if (dayOfWeek === undefined) continue;
// Each time entry is in a <span class='serviceTime'>
const serviceTimeParts = li.split("<span class='serviceTime'>");
for (let j = 1; j < serviceTimeParts.length; j++) {
const st = serviceTimeParts[j];
const timeMatch = st.match(/<span class='time'>([^<]+)<\/span>/);
if (!timeMatch) continue;
const time = convertTo24h(timeMatch[1].trim());
const langMatch = st.match(/<span class='language'>\(([^)]+)\)<\/span>/);
const language = langMatch ? langMatch[1].trim() : 'English';
const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
const notes = commentMatch ? commentMatch[1].trim() : undefined;
results.push({ dayOfWeek, time, language, notes });
}
}
return results;
}
```
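The split-based approach can be exercised offline against the markup shown in the doc comment. The sketch below is a condensed, self-contained version (it omits the `notes` field and the 100KB cap). `parseMassList` and `to24h` are illustrative names, and the fixture is the structure documented above rather than a live page.

```typescript
interface MassEntry { dayOfWeek: number; time: string; language: string; }

const DAY_INDEX: Record<string, number> = {
  Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3, Thursday: 4, Friday: 5, Saturday: 6,
};

function to24h(t: string): string {
  const cleaned = t.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let h = parseInt(m[1], 10);
  if (m[3] === 'pm' && h !== 12) h += 12;
  if (m[3] === 'am' && h === 12) h = 0;
  return `${String(h).padStart(2, '0')}:${m[2]}`;
}

// Condensed version of parseMassTimes operating on an already-isolated <ul> block.
function parseMassList(ul: string): MassEntry[] {
  const out: MassEntry[] = [];
  // slice(1) drops the text before the first <li>; the header <li> has no label and is skipped
  for (const li of ul.split(/<li[^>]*>/).slice(1)) {
    const label = li.match(/<span class="label">([^<]+)<\/span>/);
    const dayOfWeek = label ? DAY_INDEX[label[1].trim()] : undefined;
    if (dayOfWeek === undefined) continue;
    for (const st of li.split("<span class='serviceTime'>").slice(1)) {
      const time = st.match(/<span class='time'>([^<]+)<\/span>/);
      if (!time) continue;
      const lang = st.match(/<span class='language'>\(([^)]+)\)<\/span>/);
      out.push({ dayOfWeek, time: to24h(time[1]), language: lang ? lang[1].trim() : 'English' });
    }
  }
  return out;
}

const massFixture = `<ul><li><h5>Mass Times</h5></li>
<li class=""><span class="label">Saturday</span>
<span class='serviceTime'><span class='time'>5:00pm</span></span>,
<span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
</li>
</ul>`;

console.log(parseMassList(massFixture));
```

The fixture yields two Saturday entries: 17:00 with the default language and 19:00 in Vietnamese.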
- [ ] **Step 3: Add parseOtherServices function**
Append to the file:
```typescript
/**
* Parse confessions and adoration from the "Other Services" <ul> block.
*
* HTML structure:
* <ul><li><h5>Other Services</h5></li>
* <li class="Confessions"><span class="label">Confessions</span>
* <span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
* <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
* </li>
* <li class="Adoration"><span class="label">Adoration</span>
* <span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
* </li>
* </ul>
*/
function parseOtherServices(html: string): { confessions: ParsedConf[]; adorations: ParsedAdoration[] } {
// Cap HTML to first 100KB to prevent slow regex backtracking on malformed pages
const safeHtml = html.length > 100_000 ? html.slice(0, 100_000) : html;
const otherUlMatch = safeHtml.match(/<ul>\s*<li>\s*<h5>Other Services<\/h5>[\s\S]*?<\/ul>/);
if (!otherUlMatch) return { confessions: [], adorations: [] };
const otherUl = otherUlMatch[0];
function parseServiceItems(liHtml: string): Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> {
const items: Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> = [];
const stParts = liHtml.split("<span class='serviceTime'>");
for (let i = 1; i < stParts.length; i++) {
const st = stParts[i];
// Text before <span class='time'> contains the day abbreviation and colon
const dayTimeMatch = st.match(/^([A-Za-z]+):\s*<span class='time'>([^<]+)<\/span>/);
if (!dayTimeMatch) continue;
const days = expandDayAbbrev(dayTimeMatch[1].trim());
if (days.length === 0) continue;
const [startTime, endTime] = parseTimeRange(dayTimeMatch[2]);
const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
const notes = commentMatch ? commentMatch[1].trim() : undefined;
for (const dayOfWeek of days) {
items.push({ dayOfWeek, startTime, endTime, notes });
}
}
return items;
}
const confessions: ParsedConf[] = [];
const adorations: ParsedAdoration[] = [];
const confMatch = otherUl.match(/<li class="Confessions">[\s\S]*?<\/li>/);
if (confMatch) confessions.push(...parseServiceItems(confMatch[0]));
const adorMatch = otherUl.match(/<li class="Adoration">[\s\S]*?<\/li>/);
if (adorMatch) adorations.push(...parseServiceItems(adorMatch[0]));
return { confessions, adorations };
}
```
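One detail worth seeing on its own is how a single label such as `Weekdays` fans out into several rows. The sketch below isolates that step. `expandSlots` and `Slot` are hypothetical names, and the map is a copy of `ABBREV_DAY_NAMES`.

```typescript
// Copy of the abbreviation map from the utilities step.
const DAY_ABBREVS: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3],
  Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

interface Slot { dayOfWeek: number; startTime: string; endTime: string; }

// Hypothetical helper: one output row per day covered by the prefix.
function expandSlots(prefix: string, startTime: string, endTime: string): Slot[] {
  return (DAY_ABBREVS[prefix] ?? []).map(dayOfWeek => ({ dayOfWeek, startTime, endTime }));
}

console.log(expandSlots('Weekdays', '09:00', '18:00')); // five rows, Monday through Friday
console.log(expandSlots('Thr', '08:30', '09:00'));      // one row for Thursday
console.log(expandSlots('Holidays', '10:00', '11:00')); // unrecognized prefix: empty array
```

Unrecognized prefixes are dropped without any message, so logging them during a dry run may reveal labels the map does not yet cover.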
- [ ] **Step 4: Smoke-test parsing on a real page**
Create a quick test at the end of the file temporarily:
```typescript
// TEMP: smoke test — remove before committing
if (process.argv[2] === '--test-parse') {
const testUrl = 'https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/';
const html = await (await fetch(testUrl, { headers: { 'User-Agent': USER_AGENT } })).text();
const church = parseChurch(html);
const masses = parseMassTimes(html);
const { confessions, adorations } = parseOtherServices(html);
console.log('Church:', church);
console.log('Masses:', masses.length, masses.slice(0, 3));
console.log('Confessions:', confessions.length, confessions);
console.log('Adorations:', adorations.length, adorations);
process.exit(0);
}
```
Run it:
```bash
cd /home/albert/Documents/ScraperControl
npx tsx scripts/import-discovermass.ts --test-parse
```
Expected output:
```
Church: { name: 'St. Paul the Apostle', address: '14085 Peyton Drive', city: 'Chino Hills', state: 'CA', zip: '91709', phone: '(909) 465-5503', website: 'http://www.sptacc.org', lat: 33.996887, lng: -117.732407 }
Masses: 14 [ { dayOfWeek: 6, time: '17:00', language: 'English', notes: undefined }, ... ]
Confessions: 4 [...]
Adorations: 6 [...]
```
If the counts don't match, debug the regex patterns against the raw HTML:
```bash
curl -s https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/ > /tmp/dm-test.html
# Then inspect the relevant sections manually
```
- [ ] **Step 5: Remove the temp test block and commit**
Remove the `if (process.argv[2] === '--test-parse')` block from the file.
```bash
git add scripts/import-discovermass.ts
git commit -m "feat: add discovermass parsing utilities (church, mass, confession, adoration)"
```
---
## Chunk 3: import-discovermass.ts — main import loop
### Task 5: Add HTTP helpers + sitemap enumeration
**Files:**
- Modify: `scripts/import-discovermass.ts`
- [ ] **Step 1: Add HTTP helpers and loadExistingChurches**
Append to the file:
```typescript
// ─── HTTP Helpers ─────────────────────────────────────────────────────────────
async function fetchHtml(url: string): Promise<string> {
const res = await fetch(url, {
headers: { 'User-Agent': USER_AGENT },
});
if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
return res.text();
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
// ─── Sitemap Enumeration ──────────────────────────────────────────────────────
/**
* Fetch all 11 WordPress item sitemaps and return every church URL.
* No rate limiting needed — only 11 sitemap requests.
*/
async function getAllChurchUrls(): Promise<string[]> {
const urls: string[] = [];
for (let i = 1; i <= SITEMAP_COUNT; i++) {
const sitemapUrl = `${SITE_BASE}/wp-sitemap-posts-item-${i}.xml`;
console.log(`Fetching sitemap ${i}/${SITEMAP_COUNT}...`);
const xml = await fetchHtml(sitemapUrl);
const matches = xml.matchAll(/<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g);
for (const match of matches) {
urls.push(match[1]);
}
}
console.log(`Total church URLs: ${urls.length}`);
return urls;
}
// ─── DB Helpers ───────────────────────────────────────────────────────────────
/**
* Load all US churches from DB into memory for dedup matching.
* Only loads US churches to keep the array manageable.
*/
async function loadExistingChurches(): Promise<ExistingChurch[]> {
console.log('Loading existing US churches from DB...');
const churches = await prisma.church.findMany({
where: { country: 'US' },
select: {
id: true, name: true, latitude: true, longitude: true,
osmId: true, baiduId: true, masstimesId: true,
orarimesseId: true, massSchedulesPhId: true, philmassId: true,
horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
messesInfoId: true, bohosluzbyId: true, miserendId: true,
kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
source: true, website: true, phone: true, address: true, country: true,
},
});
console.log(`Loaded ${churches.length} existing US churches`);
return churches as ExistingChurch[];
}
```
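The `<loc>` extraction can be checked without touching the network. The sketch below applies the same pattern to an inline sitemap fragment. `extractChurchUrls` is an illustrative name and the URLs in the fixture are made up.

```typescript
// Same pattern as getAllChurchUrls, applied to a string instead of a fetched sitemap.
function extractChurchUrls(xml: string): string[] {
  const urls: string[] = [];
  for (const m of xml.matchAll(/<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g)) {
    urls.push(m[1]);
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://discovermass.com/church/example-one/</loc></url>
<url><loc>https://discovermass.com/about/</loc></url>
<url><loc>https://discovermass.com/church/example-two/</loc></url>
</urlset>`;

// Only the two /church/ entries are returned; the /about/ page is filtered out
console.log(extractChurchUrls(sampleSitemap));
```

The pattern is anchored on the bare `https://discovermass.com` host, so entries listed under a different host form (for example with a `www.` prefix) would be skipped.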
---
### Task 6: Add processChurch + main() + CLI parsing
**Files:**
- Modify: `scripts/import-discovermass.ts`
- [ ] **Step 1: Add processChurch function**
Append to the file:
```typescript
// ─── Church Processing ────────────────────────────────────────────────────────
async function processChurch(
url: string,
existingChurches: ExistingChurch[],
args: CLIArgs,
stats: ImportStats,
): Promise<void> {
const slug = url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
stats.total++;
try {
const html = await fetchHtml(url);
const parsed = parseChurch(html);
if (!parsed) {
console.log(` [skip] Could not parse: ${slug}`);
stats.skipped++;
return;
}
const masses = parseMassTimes(html);
const { confessions, adorations } = parseOtherServices(html);
if (args.dryRun) {
console.log(` [dry-run] ${parsed.name} — ${masses.length} masses, ${confessions.length} confessions, ${adorations.length} adorations`);
return;
}
const candidate = {
name: parsed.name,
lat: parsed.lat,
lng: parsed.lng,
discovermassId: slug,
};
const duplicate = findDuplicateChurch(candidate, existingChurches);
if (duplicate) {
// Update existing church: fill only blank fields; replace a schedule type only when the page yields entries for it
const updateData: Record<string, unknown> = { discovermassId: slug };
if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
if (!duplicate.website && parsed.website) {
updateData.website = parsed.website;
updateData.hasWebsite = true;
}
if (parsed.lat !== 0 && duplicate.latitude === 0) {
updateData.latitude = parsed.lat;
updateData.longitude = parsed.lng;
}
try {
await prisma.$transaction(async (tx) => {
await tx.church.update({ where: { id: duplicate.id }, data: updateData });
if (masses.length > 0) {
await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
await tx.massSchedule.createMany({
data: masses.map(m => ({
churchId: duplicate.id,
dayOfWeek: m.dayOfWeek,
time: m.time,
language: m.language,
notes: m.notes ?? null,
})),
});
}
if (confessions.length > 0) {
await tx.confessionSchedule.deleteMany({ where: { churchId: duplicate.id } });
await tx.confessionSchedule.createMany({
data: confessions.map(c => ({
churchId: duplicate.id,
dayOfWeek: c.dayOfWeek,
startTime: c.startTime,
endTime: c.endTime,
notes: c.notes ?? null,
})),
});
}
if (adorations.length > 0) {
await tx.adorationSchedule.deleteMany({ where: { churchId: duplicate.id } });
await tx.adorationSchedule.createMany({
data: adorations.map(a => ({
churchId: duplicate.id,
dayOfWeek: a.dayOfWeek,
startTime: a.startTime,
endTime: a.endTime,
notes: a.notes ?? null,
})),
});
}
await tx.church.update({
where: { id: duplicate.id },
data: { lastScrapedAt: new Date() },
});
});
// Update in-memory entry for within-run dedup
duplicate.discovermassId = slug;
stats.updated++;
} catch (err) {
if (err instanceof Error && err.message.includes('Unique constraint')) {
stats.skipped++;
return;
}
throw err;
}
} else {
// Create new church
try {
const church = await prisma.church.create({
data: {
name: parsed.name,
address: parsed.address,
city: parsed.city,
state: parsed.state,
zip: parsed.zip,
country: 'US',
phone: parsed.phone,
website: parsed.website,
hasWebsite: !!parsed.website,
latitude: parsed.lat,
longitude: parsed.lng,
discovermassId: slug,
source: 'discovermass',
},
});
// Add to in-memory array for within-run dedup
existingChurches.push({
id: church.id,
name: parsed.name,
latitude: parsed.lat,
longitude: parsed.lng,
osmId: null,
baiduId: null,
masstimesId: null,
orarimesseId: null,
massSchedulesPhId: null,
philmassId: null,
horariosMisasId: null,
mszeInfoId: null,
weekdayMassesId: null,
messesInfoId: null,
bohosluzbyId: null,
miserendId: null,
kerknetId: null,
gottesdienstzeitenId: null,
discovermassId: slug,
source: 'discovermass',
website: parsed.website,
phone: parsed.phone,
address: parsed.address,
country: 'US',
});
if (masses.length > 0) {
await prisma.massSchedule.createMany({
data: masses.map(m => ({
churchId: church.id,
dayOfWeek: m.dayOfWeek,
time: m.time,
language: m.language,
notes: m.notes ?? null,
})),
});
}
if (confessions.length > 0) {
await prisma.confessionSchedule.createMany({
data: confessions.map(c => ({
churchId: church.id,
dayOfWeek: c.dayOfWeek,
startTime: c.startTime,
endTime: c.endTime,
notes: c.notes ?? null,
})),
});
}
if (adorations.length > 0) {
await prisma.adorationSchedule.createMany({
data: adorations.map(a => ({
churchId: church.id,
dayOfWeek: a.dayOfWeek,
startTime: a.startTime,
endTime: a.endTime,
notes: a.notes ?? null,
})),
});
}
await prisma.church.update({
where: { id: church.id },
data: { lastScrapedAt: new Date() },
});
stats.created++;
} catch (err) {
if (err instanceof Error && err.message.includes('Unique constraint')) {
stats.skipped++;
return;
}
throw err;
}
}
stats.massSchedulesCreated += masses.length;
stats.confessionSchedulesCreated += confessions.length;
stats.adorationSchedulesCreated += adorations.length;
console.log(
` [${duplicate ? 'update' : 'create'}] ${parsed.name} — ` +
`${masses.length}M ${confessions.length}C ${adorations.length}A — ` +
`${stats.total} total (${stats.created} new, ${stats.updated} upd, ${stats.errors} err)`
);
} catch (err) {
stats.errors++;
console.error(` [error] ${slug}: ${err instanceof Error ? err.message : err}`);
}
}
```
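The fill-blank-fields rule in the update branch is easier to reason about as a pure function. The sketch below extracts that rule for illustration. `buildUpdate`, `Stored`, and `Scraped` are hypothetical names, and the sample values are placeholders.

```typescript
interface Stored { phone: string | null; website: string | null; latitude: number; longitude: number; }
interface Scraped { phone: string | null; website: string | null; lat: number; lng: number; }

// Hypothetical extraction of the update rule: existing non-blank values are never overwritten.
function buildUpdate(stored: Stored, scraped: Scraped, slug: string): Record<string, unknown> {
  const data: Record<string, unknown> = { discovermassId: slug };
  if (!stored.phone && scraped.phone) data.phone = scraped.phone;
  if (!stored.website && scraped.website) {
    data.website = scraped.website;
    data.hasWebsite = true;
  }
  // Coordinates are filled only when the stored latitude is the 0 sentinel
  if (scraped.lat !== 0 && stored.latitude === 0) {
    data.latitude = scraped.lat;
    data.longitude = scraped.lng;
  }
  return data;
}

const storedRow: Stored = { phone: '(555) 010-0000', website: null, latitude: 41.5, longitude: -87.25 };
const scrapedRow: Scraped = { phone: '(555) 010-9999', website: 'https://example.org', lat: 41.6, lng: -87.3 };

// Phone and coordinates already exist, so only the website (and the ID) are written
console.log(buildUpdate(storedRow, scrapedRow, 'example-slug'));
```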
- [ ] **Step 2: Add parseCLIArgs + main()**
Append to the file:
```typescript
// ─── CLI Parsing ──────────────────────────────────────────────────────────────
function parseCLIArgs(): CLIArgs {
const args = process.argv.slice(2);
const result: CLIArgs = { all: false, dryRun: false };
for (let i = 0; i < args.length; i++) {
switch (args[i]) {
case '--all': result.all = true; break;
case '--dry-run': result.dryRun = true; break;
case '--resume-from': result.resumeFrom = parseInt(args[++i], 10); break;
case '--job-id': result.jobId = args[++i]; break;
}
}
return result;
}
// ─── Main ─────────────────────────────────────────────────────────────────────
async function main() {
const args = parseCLIArgs();
if (!args.all) {
console.error('Usage: npx tsx scripts/import-discovermass.ts --all [--dry-run] [--resume-from N] [--job-id UUID]');
process.exit(1);
}
// Update job status to 'running' if job-id provided
if (args.jobId) {
try {
await prisma.backgroundJob.update({
where: { id: args.jobId },
data: { status: 'running', startedAt: new Date() },
});
} catch { /* Job might not exist yet */ }
}
const stats: ImportStats = {
total: 0, created: 0, updated: 0, skipped: 0, errors: 0,
massSchedulesCreated: 0, confessionSchedulesCreated: 0, adorationSchedulesCreated: 0,
};
try {
// Step 1: Enumerate all church URLs from sitemaps
const urls = await getAllChurchUrls();
// Step 2: Load existing US churches for dedup
const existingChurches = await loadExistingChurches();
// Step 3: Apply resume-from offset
const startIdx = args.resumeFrom ?? 0;
const churchUrls = urls.slice(startIdx);
console.log(`\nProcessing ${churchUrls.length} churches (starting from index ${startIdx})...\n`);
// Step 4: Process each church with 10s delay between requests
for (let i = 0; i < churchUrls.length; i++) {
const url = churchUrls[i];
const overallIdx = startIdx + i;
console.log(`[${overallIdx + 1}/${urls.length}] ${url}`);
await processChurch(url, existingChurches, args, stats);
// Rate limit: 10s delay between church pages (robots.txt Crawl-delay: 10)
if (i < churchUrls.length - 1) {
await sleep(REQUEST_DELAY_MS);
}
}
} finally {
// Always print stats and update job status
console.log('\n─── Import Complete ───────────────────────────────────────');
console.log(`Total processed: ${stats.total}`);
console.log(`Created: ${stats.created}`);
console.log(`Updated: ${stats.updated}`);
console.log(`Skipped: ${stats.skipped}`);
console.log(`Errors: ${stats.errors}`);
console.log(`Mass schedules: ${stats.massSchedulesCreated}`);
console.log(`Confession sched: ${stats.confessionSchedulesCreated}`);
console.log(`Adoration sched: ${stats.adorationSchedulesCreated}`);
if (args.jobId) {
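// A run that processed nothing (for example, after a failed sitemap fetch) counts as failed
const status = stats.total === 0 || stats.errors > stats.total * 0.1 ? 'failed' : 'completed';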
try {
await prisma.backgroundJob.update({
where: { id: args.jobId },
data: {
status,
completedAt: new Date(),
processed: stats.total,
succeeded: stats.created + stats.updated,
failed: stats.errors,
itemsFound: stats.massSchedulesCreated,
},
});
} catch { /* Ignore */ }
}
await prisma.$disconnect();
await pool.end();
}
}
main().catch((err) => {
console.error('Fatal error:', err);
process.exit(1);
});
```
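`parseCLIArgs` passes the `--resume-from` value straight through `parseInt`, so a missing or non-numeric value becomes `NaN`. The sketch below shows one way to normalize the offset before slicing. `parseResumeFrom` is a hypothetical helper, not something the importer defines.

```typescript
// Hypothetical normalizer: invalid or negative input falls back to 0, large values clamp to the end.
function parseResumeFrom(raw: string | undefined, total: number): number {
  const n = Number.parseInt(raw ?? '', 10);
  if (!Number.isInteger(n) || n < 0) return 0;
  return Math.min(n, total);
}

const sampleUrls = ['a', 'b', 'c', 'd', 'e'];
const resumeIdx = parseResumeFrom('3', sampleUrls.length);
console.log(sampleUrls.slice(resumeIdx));               // the last two entries
console.log(parseResumeFrom('abc', sampleUrls.length)); // not a number: start from 0
console.log(parseResumeFrom('99', sampleUrls.length));  // past the end: clamped to 5
```

The offset is a position in the sitemap enumeration, so a resumed run lines up with the earlier one only if the sitemaps list URLs in the same order both times.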
- [ ] **Step 3: Run dry-run on first 3 churches**
```bash
cd /home/albert/Documents/ScraperControl
# Get a few URLs from the first sitemap to test with
curl -s https://discovermass.com/wp-sitemap-posts-item-1.xml | grep -o '<loc>[^<]*</loc>' | head -3
```
Then test dry-run:
```bash
npx tsx scripts/import-discovermass.ts --all --dry-run --resume-from 0
```
Wait ~30 seconds (3 churches × 10s delay). Expected output:
```
Connecting to database: postgresql://...
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
Loaded XXXX existing US churches
Processing 20284 churches (starting from index 0)...
[1/20284] https://discovermass.com/church/some-church/
[dry-run] Some Church — 8 masses, 2 confessions, 3 adorations
[2/20284] ...
```
Stop with Ctrl+C after a few churches.
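If a run is interrupted, note that the `[N/total]` prefix printed by the loop is 1-based, while `--resume-from` is a 0-based index into the URL list. Passing `N` resumes at the church after the last one logged; passing `N - 1` retries the logged one. Because the line is printed before the page is fetched, retrying is the safer choice after an abrupt stop. The sketch below shows the arithmetic; `resumeIndexFromLogLine` is a hypothetical helper, not something the importer provides.

```typescript
// Hypothetical helper for choosing a --resume-from value from a log line.
// Assumes the "[position/total] url" format printed by the main loop.
function resumeIndexFromLogLine(line: string, retryLogged = false): number | null {
  const match = /^\[(\d+)\/(\d+)\]/.exec(line.trim());
  if (!match) return null;
  const position = Number(match[1]); // 1-based: overallIdx + 1
  // The logged church sits at 0-based index position - 1.
  return retryLogged ? position - 1 : position;
}

const lastLine = '[4213/20284] https://discovermass.com/church/some-church/';
console.log(resumeIndexFromLogLine(lastLine));       // 4213
console.log(resumeIndexFromLogLine(lastLine, true)); // 4212
```

With the example line above, retrying the logged church would mean running `npx tsx scripts/import-discovermass.ts --all --resume-from 4212`.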
- [ ] **Step 4: Verify TypeScript compiles**
```bash
npx tsc --noEmit
```
Fix any type errors before committing.
- [ ] **Step 5: Commit**
```bash
git add scripts/import-discovermass.ts
git commit -m "feat: add import-discovermass.ts — USA church importer with 10s crawl delay"
```
---
## Chunk 4: Integration + Docker deployment
### Task 7: package.json + scheduler integration
**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`
- [ ] **Step 1: Add import:discovermass to package.json**
Open `package.json`. Find the `"scripts"` section. After `"import:masstimes-api"` (or whichever is last in the import group), add:
```json
"import:discovermass": "tsx scripts/import-discovermass.ts",
```
- [ ] **Step 2: Add discovermass-import case to getJobCommand in scheduler.ts**
Open `scripts/scheduler.ts`. Find the `getJobCommand` function. After the `masstimes-api-import` case block, add:
```typescript
case 'discovermass-import': {
const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
// Note: --job-id is appended by startJobProcess() in the scheduler, not here.
return { command: 'npx', args };
}
```
- [ ] **Step 3: Add discovermass-import to PIPELINE_GROUPS**
In `scripts/scheduler.ts`, find `PIPELINE_GROUPS`. In the first group's `phases` array, after the `masstimes-api-import` entry, add:
```typescript
{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
```
- [ ] **Step 4: Verify TypeScript compiles**
```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```
- [ ] **Step 5: Commit**
```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add discovermass-import to scheduler pipeline and package.json"
```
---
### Task 8: Deploy to Docker and run
The importer will run for at least ~56 hours (20,284 pages × 10 s crawl delay, plus fetch and database time) in the Docker scheduler container. The scheduler picks it up as part of the `PIPELINE_GROUPS` sequence.
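The runtime figure can be reproduced, and adjusted for per-page overhead, with a short calculation. This is a rough sketch: the function name and the `perPageOverheadMs` parameter are assumptions for illustration, and the 1-second overhead in the second call is a made-up allowance, not a measurement.

```typescript
// Rough runtime estimate. perPageOverheadMs is an assumed allowance for
// fetch + parse + DB time per page, not a measured value.
function estimateRunHours(
  pageCount: number,
  delayMs: number,
  perPageOverheadMs = 0,
): number {
  // The loop sleeps between pages, so there are pageCount - 1 delays.
  const delays = Math.max(pageCount - 1, 0) * delayMs;
  const overhead = pageCount * perPageOverheadMs;
  return (delays + overhead) / 3600000;
}

console.log(estimateRunHours(20284, 10000).toFixed(1));       // "56.3"
console.log(estimateRunHours(20284, 10000, 1000).toFixed(1)); // "62.0"
```

Resuming with `--resume-from` shortens the run proportionally, since only the remaining URLs incur the delay.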
- [ ] **Step 1: Deploy to Docker directory**
```bash
bash /home/albert/Documents/ScraperControl/scripts/deploy-local.sh
```
Or manually:
```bash
rsync -avz --exclude node_modules --exclude .next --exclude '.env*' \
--exclude .git --exclude .claude --exclude .playwright-mcp \
~/Documents/ScraperControl/ /opt/docker/scraper-control/
```
- [ ] **Step 2: Rebuild Docker images to pick up new script**
```bash
cd /opt/docker/scraper-control
docker compose build scheduler
```
Expected: build completes without errors.
- [ ] **Step 3: Create a manual job via the admin API to trigger the import immediately**
The scheduler can run imports as manual jobs, which take priority over pipeline jobs:
```bash
curl -X POST http://localhost:3001/api/admin/jobs \
-H "Content-Type: application/json" \
-H "X-Admin-Key: $(grep '^ADMIN_API_KEY=' /opt/docker/scraper-control/.env | cut -d= -f2-)" \
-d '{"type": "discovermass-import", "config": {}}'
```
Expected: `{"id": "...", "status": "pending", ...}`
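- [ ] **Step 4: Recreate the scheduler container so it runs the rebuilt image and picks up the new job**
```bash
cd /opt/docker/scraper-control
docker compose up -d scheduler
```
`docker compose restart` only restarts the existing container, which would keep running the old image without the new script; `up -d` recreates the container from the image built in Step 2.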
- [ ] **Step 5: Monitor the job**
Check logs:
```bash
docker compose logs -f scheduler --since 1m
```
Expected output:
```
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
[1/20284] https://discovermass.com/church/...
[update] St. Paul the Apostle — 14M 4C 6A — 1 total (0 new, 1 upd, 0 err)
```
Once the importer reaches it, the St. Paul the Apostle church we seeded earlier should show as `[update]` (matched by name + proximity), which links the `discovermassId` to the existing record.
- [ ] **Step 6: Verify St. Paul the Apostle was matched (not duplicated)**
After the importer has processed the Chino Hills area (a few hours in, since URLs are processed alphabetically by state/slug), run:
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass \
-c "SELECT name, city, state, discovermass_id, created_at FROM churches WHERE name ILIKE '%Paul%' AND city = 'Chino Hills';"
```
Expected: 1 row with `discovermass_id = 'st-paul-the-apostle-chino-hills-ca'` and `created_at` from the earlier seed (not a new timestamp — it's an update, not a create).
- [ ] **Step 7: Let the full run complete (~56 hours)**
The scheduler will log progress. You can check status anytime:
```bash
docker compose logs scheduler --since 10m | tail -20
```
Alternatively, use the admin dashboard at `http://192.168.0.241:3001`: the job appears in the Jobs tab with status `running`, and progress is tracked in the `processed`/`succeeded`/`failed` fields.
Final expected stats after completion:
```
Total processed: 20284
Created: ~17000-19000 (new US churches)
Updated: ~1000-3000 (matched against OSM/MassTimes churches)
Errors: < 50 (network blips)
Mass schedules: ~70000-90000
Confession sched: ~30000-50000
Adoration sched: ~10000-20000
```