# DiscoverMass.com Importer Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Import 20,284 US Catholic churches with mass/confession/adoration schedules from discovermass.com into the NearestMass database.
**Architecture:** Enumerate 11 WordPress sitemaps → fetch each church page at 10s intervals (respecting Crawl-delay) → parse server-rendered HTML for name/address/coordinates/schedules → match against existing US churches via church-matcher → upsert with full schedule data.
**Tech Stack:** TypeScript/tsx, Prisma 7 + PrismaPg adapter, pg Pool, Node.js `fetch`, regex HTML parsing (no DOM library needed — HTML is server-rendered and predictable).
---
## Chunk 1: Schema + church-matcher
### Task 1: Add discovermassId to schema
**Files:**
- Modify: `prisma/schema.prisma`
The schema lives in this repo but migrations run in BethelGuide. After editing schema.prisma here, run `npx prisma generate` to regenerate the Prisma client. Do NOT run `prisma migrate`.
- [ ] **Step 1: Find the right place in schema.prisma**
Open `prisma/schema.prisma`. Find the block of source ID fields — they look like:
```prisma
gottesdienstzeitenId String? @unique @map("gottesdienstzeiten_id")
```
This is inside the `model Church { ... }` block, after `kerknetId` and before `claimed`.
- [ ] **Step 2: Add discovermassId field**
After `gottesdienstzeitenId`:
```prisma
discovermassId String? @unique @map("discovermass_id")
```
Also find the `@@index` block near the bottom of the Church model (it groups all the index definitions). Add:
```prisma
@@index([discovermassId])
```
- [ ] **Step 3: Regenerate Prisma client**
```bash
cd /home/albert/Documents/ScraperControl
npx prisma generate
```
Expected output: `✔ Generated Prisma Client` (no errors). This does NOT touch the database — it only updates the TypeScript client.
- [ ] **Step 4: Apply migration to database**
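BethelGuide is the schema's source of truth, so the canonical migration is created there and synced back. Since both projects share this dev server, you can add the column directly with SQL here instead of running `prisma migrate`. First check whether it already exists: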
```bash
# Check if discovermass_id column already exists (it shouldn't yet)
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass
```
If the column doesn't exist, apply it directly:
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "
ALTER TABLE churches ADD COLUMN IF NOT EXISTS discovermass_id VARCHAR UNIQUE;
CREATE INDEX IF NOT EXISTS churches_discovermass_id_idx ON churches(discovermass_id);
"
```
Expected output: `ALTER TABLE` and `CREATE INDEX`
- [ ] **Step 5: Verify column exists**
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass
```
Expected output: `discovermass_id | character varying | ...`
- [ ] **Step 6: Commit**
```bash
cd /home/albert/Documents/ScraperControl
git add prisma/schema.prisma
git commit -m "feat: add discovermassId field to Church schema"
```
---
### Task 2: Update church-matcher
**Files:**
- Modify: `src/lib/church-matcher.ts`
The `ExistingChurch` interface (line ~11) lists all source IDs. The `ChurchCandidate` type (line ~122) lists optional source IDs for the candidate. The `findDuplicateChurch` function has sequential passes checking each ID before falling back to proximity+name.
- [ ] **Step 1: Add discovermassId to ExistingChurch interface**
Find the `export interface ExistingChurch {` block. After the `gottesdienstzeitenId` line, add:
```typescript
discovermassId: string | null;
```
- [ ] **Step 2: Add discovermassId to ChurchCandidate type**
Find `export type ChurchCandidate = {`. After `gottesdienstzeitenId?: string;`, add:
```typescript
discovermassId?: string;
```
- [ ] **Step 3: Add discovermassId matching pass in findDuplicateChurch**
Find the `findDuplicateChurch` function. It has a series of passes like:
```typescript
if (candidate.gottesdienstzeitenId) {
  const match = existingChurches.find(c => c.gottesdienstzeitenId === candidate.gottesdienstzeitenId);
  if (match) return match;
}
// Proximity + name similarity
```
Add a new pass BEFORE the proximity+name pass (after gottesdienstzeitenId):
```typescript
if (candidate.discovermassId) {
  const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
  if (match) return match;
}
```
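Why the exact-ID pass runs before the fuzzy one can be seen in a small, self-contained sketch. The types `SketchChurch` and `SketchCandidate` and the helper `matchByDiscovermassId` are hypothetical, trimmed-down stand-ins, not the real `church-matcher` API; the point is only that a stored slug identifies a church on re-import even when its name is spelled differently.

```typescript
// Hypothetical, trimmed-down types: the real ExistingChurch and ChurchCandidate
// in src/lib/church-matcher.ts carry many more fields.
interface SketchChurch {
  id: string;
  name: string;
  discovermassId: string | null;
}
interface SketchCandidate {
  name: string;
  discovermassId?: string;
}

// Exact source-ID pass: returns a match only when the candidate carries an ID
// and an existing row stores the same one. Fuzzy passes would run after this.
function matchByDiscovermassId(
  candidate: SketchCandidate,
  existing: SketchChurch[],
): SketchChurch | null {
  if (!candidate.discovermassId) return null;
  for (const c of existing) {
    if (c.discovermassId === candidate.discovermassId) return c;
  }
  return null;
}

const existingSketch: SketchChurch[] = [
  { id: 'a', name: 'St. Mary', discovermassId: null },
  { id: 'b', name: 'St. Paul the Apostle', discovermassId: 'st-paul-the-apostle-chino-hills-ca' },
];

// Same slug, differently spelled name: still resolves to row 'b'.
const idHit = matchByDiscovermassId(
  { name: 'Saint Paul Apostle Parish', discovermassId: 'st-paul-the-apostle-chino-hills-ca' },
  existingSketch,
);
// No slug on the candidate: this pass declines and leaves it to later passes.
const idMiss = matchByDiscovermassId({ name: 'St. Mary' }, existingSketch);

console.log(idHit ? idHit.id : null, idMiss);
```

On a second import run every church created by the first run already stores its slug, so this pass should resolve it without touching name or coordinate comparison.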
- [ ] **Step 4: Update all callers that construct ExistingChurch objects**
Search for places that build ExistingChurch objects (the in-memory push after creating a new church). Each importer has a block like:
```typescript
existingChurches.push({
  id: newChurch.id,
  ...
  gottesdienstzeitenId: null,
  ...
});
```
Run:
```bash
grep -rn "gottesdienstzeitenId: null" scripts/
```
For each file found: add `discovermassId: null,` after `gottesdienstzeitenId: null,`. These are the in-memory dedup arrays — they need the new field or TypeScript will complain.
Also update the `loadExistingChurches` select queries if any importer has one (check with `grep -rn "gottesdienstzeitenId: true" scripts/`).
- [ ] **Step 5: Verify TypeScript compiles**
```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```
Expected: no errors. If errors appear, they are most likely object literals that are missing the new `discovermassId` field; add it and re-run.
- [ ] **Step 6: Commit**
```bash
# Stage church-matcher AND all importer scripts that were updated in Step 4
git add src/lib/church-matcher.ts
git add scripts/
git commit -m "feat: add discovermassId to church-matcher ExistingChurch and ChurchCandidate"
```
---
## Chunk 2: import-discovermass.ts — utilities and parsing
### Task 3: Create file skeleton + utilities
**Files:**
- Create: `scripts/import-discovermass.ts`
- [ ] **Step 1: Create the file with header, imports, constants, types**
Create `scripts/import-discovermass.ts` with this content:
```typescript
#!/usr/bin/env tsx
/**
* Import Catholic churches and mass schedules from discovermass.com (USA)
*
* discovermass.com is a US Catholic church directory with 20,284 churches.
* Data includes name, address, phone, website, coordinates, mass times,
* confessions, and adoration schedules.
*
* robots.txt specifies Crawl-delay: 10 — this importer follows that rule.
*
* Usage:
* npx tsx scripts/import-discovermass.ts --all
* npx tsx scripts/import-discovermass.ts --all --dry-run
* npx tsx scripts/import-discovermass.ts --all --resume-from 5000
* npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
*/
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
  connectionString: dbUrl,
  ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });
import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';
// ─── Constants ───────────────────────────────────────────────────────────────
const SITE_BASE = 'https://discovermass.com';
const SITEMAP_COUNT = 11;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 10_000; // Crawl-delay: 10 from robots.txt
// ─── Types ───────────────────────────────────────────────────────────────────
interface ParsedChurch {
  name: string;
  address: string | null;
  city: string | null;
  state: string | null;
  zip: string | null;
  phone: string | null;
  website: string | null;
  lat: number;
  lng: number;
}
interface ParsedMass {
  dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
  time: string; // HH:MM 24-hour
  language: string;
  notes?: string;
}
interface ParsedConf {
  dayOfWeek: number;
  startTime: string; // HH:MM 24-hour
  endTime: string; // HH:MM 24-hour
  notes?: string;
}
interface ParsedAdoration {
  dayOfWeek: number;
  startTime: string; // HH:MM 24-hour
  endTime: string; // HH:MM 24-hour
  notes?: string;
}
interface ImportStats {
  total: number;
  created: number;
  updated: number;
  skipped: number;
  errors: number;
  massSchedulesCreated: number;
  confessionSchedulesCreated: number;
  adorationSchedulesCreated: number;
}
interface CLIArgs {
  all: boolean;
  dryRun: boolean;
  resumeFrom?: number;
  jobId?: string;
}
```
- [ ] **Step 2: Add day mappings and time utilities**
Append to the file:
```typescript
// ─── Day Mappings ─────────────────────────────────────────────────────────────
// Full day names used in mass schedule <li> labels
const FULL_DAY_NAMES: Record<string, number> = {
  Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3,
  Thursday: 4, Friday: 5, Saturday: 6,
};
// Abbreviated day prefixes used in confession/adoration serviceTime text
const ABBREV_DAY_NAMES: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3],
  Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};
// ─── Time Utilities ───────────────────────────────────────────────────────────
/**
 * Convert "5:00pm", "11:00am", "12:00pm", "12:00am" to "HH:MM" 24-hour format.
 * Input that doesn't match (e.g. "5pm", "noon") is returned trimmed and
 * lowercased, not converted.
 */
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let hours = parseInt(m[1], 10);
  const mins = m[2];
  const meridiem = m[3];
  if (meridiem === 'pm' && hours !== 12) hours += 12;
  if (meridiem === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${mins}`;
}
/**
 * Parse "8:30am-9:00am" → ["08:30", "09:00"].
 * Each side must carry its own am/pm suffix; nothing is inferred from the
 * other side, so "8:30-9:00am" yields ["8:30", "09:00"].
 * A string with no '-' returns the same time for start and end.
 */
function parseTimeRange(rangeStr: string): [string, string] {
  // Find the first '-' after the first ':' so the split lands between the
  // start and end times, e.g. "8:30am-9:00am" or "3:30pm-4:30pm"
  const hyphenIdx = rangeStr.indexOf('-', rangeStr.indexOf(':') + 1);
  if (hyphenIdx === -1) {
    const t = convertTo24h(rangeStr.trim());
    return [t, t];
  }
  const start = convertTo24h(rangeStr.slice(0, hyphenIdx).trim());
  const end = convertTo24h(rangeStr.slice(hyphenIdx + 1).trim());
  return [start, end];
}
/**
 * Expand abbreviated day prefix to array of dayOfWeek integers.
 * Returns empty array if prefix is not recognized.
 */
function expandDayAbbrev(prefix: string): number[] {
  return ABBREV_DAY_NAMES[prefix] ?? [];
}
// ─── Address Parsing ──────────────────────────────────────────────────────────
/**
 * Parse "14085 Peyton Drive, Chino Hills, CA 91709" into components.
 * Returns partial result on malformed input.
 */
function parseAddress(raw: string): { address: string | null; city: string | null; state: string | null; zip: string | null } {
  const parts = raw.split(', ');
  if (parts.length < 3) return { address: raw, city: null, state: null, zip: null };
  const last = parts[parts.length - 1].trim();
  const stateZipMatch = last.match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/);
  if (!stateZipMatch) return { address: raw, city: null, state: null, zip: null };
  return {
    address: parts.slice(0, parts.length - 2).join(', ').trim(),
    city: parts[parts.length - 2].trim(),
    state: stateZipMatch[1],
    zip: stateZipMatch[2],
  };
}
```
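A few worked conversions make the 12-hour edge cases concrete. The snippet below is a sketch that re-declares a compact copy of the `convertTo24h` rules under the hypothetical name `to24hSketch`, so it runs on its own; the sample inputs are illustrative, not taken from the site.

```typescript
// Compact copy of the convertTo24h rules above, re-declared so this snippet
// runs on its own. The name to24hSketch is hypothetical.
function to24hSketch(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = /^(\d{1,2}):(\d{2})(am|pm)$/.exec(cleaned);
  if (!m) return cleaned; // unrecognized input passes through, lowercased
  let hours = parseInt(m[1] ?? '0', 10);
  const mins = m[2] ?? '00';
  if (m[3] === 'pm' && hours !== 12) hours += 12; // 1pm..11pm -> 13..23
  if (m[3] === 'am' && hours === 12) hours = 0; // 12am -> 00
  return (hours < 10 ? '0' : '') + hours + ':' + mins;
}

// Illustrative inputs: four well-formed times, then two that lack the
// "H:MMam/pm" shape and therefore are not converted.
const timeSamples = ['5:00pm', '11:00am', '12:00pm', '12:00am', '5pm', 'Noon'];
const convertedSamples = timeSamples.map(to24hSketch);

for (let i = 0; i < timeSamples.length; i++) {
  console.log(timeSamples[i] + ' -> ' + convertedSamples[i]);
}
```

The last two inputs show the pass-through: values such as `5pm` or `Noon` would be stored unconverted, so it may be worth logging or skipping any result that does not look like `HH:MM`.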
- [ ] **Step 3: Verify utilities compile**
```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```
Expected: no errors related to import-discovermass.ts. Other files may have pre-existing errors — focus only on this file's errors.
---
### Task 4: Add HTML parsing functions
**Files:**
- Modify: `scripts/import-discovermass.ts`
The HTML is server-rendered. The page structure has:
- `<meta property="og:title">` for church name
- US address embedded as text in a known pattern
- `<div id="sidebar-info">` for phone/website/coordinates
- Two `<ul>` blocks: one containing `<h5>Mass Times</h5>`, another containing `<h5>Other Services</h5>`
- [ ] **Step 1: Add parseChurch function**
Append to the file:
```typescript
// ─── HTML Parsing ─────────────────────────────────────────────────────────────
/**
 * Parse church metadata from page HTML.
 * Returns null if the page doesn't look like a valid church listing.
 */
function parseChurch(html: string): ParsedChurch | null {
  // Name from OpenGraph meta tag
  const nameMatch = html.match(/<meta property="og:title" content="([^"]+)"/);
  if (!nameMatch) return null;
  const name = nameMatch[1].trim();
  if (!name || name === 'Discover Mass') return null;
  // Address: match US address pattern (number + street, city, STATE ZIP)
  let address: string | null = null;
  let city: string | null = null;
  let state: string | null = null;
  let zip: string | null = null;
  const addrMatch = html.match(/(\d+[^<\n,]+),\s*([^<,\n]+),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)/);
  if (addrMatch) {
    const raw = `${addrMatch[1].trim()}, ${addrMatch[2].trim()}, ${addrMatch[3]} ${addrMatch[4]}`;
    const parsed = parseAddress(raw);
    address = parsed.address;
    city = parsed.city;
    state = parsed.state;
    zip = parsed.zip;
  }
  // Phone from sidebar
  const phoneMatch = html.match(/<span class='side-phone attribute'>([^<]+)<\/span>/);
  const phone = phoneMatch ? phoneMatch[1].trim() : null;
  // Website from sidebar
  const websiteMatch = html.match(/<span class='side-website attribute'><a href='([^']+)'/);
  const website = websiteMatch ? websiteMatch[1].trim() : null;
  // Coordinates from Google Maps daddr parameter
  let lat = 0;
  let lng = 0;
  const coordMatch = html.match(/daddr=([-\d.]+),([-\d.]+)/);
  if (coordMatch) {
    const rawLat = parseFloat(coordMatch[1]);
    const rawLng = parseFloat(coordMatch[2]);
    // Validate: reject NaN, Infinity, and out-of-range values; fall back to 0 sentinel
    if (isFinite(rawLat) && isFinite(rawLng) && Math.abs(rawLat) <= 90 && Math.abs(rawLng) <= 180) {
      lat = rawLat;
      lng = rawLng;
    }
  }
  return { name, address, city, state, zip, phone, website, lat, lng };
}
```
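The coordinate step is the part of `parseChurch` most worth checking in isolation, since a bad value here feeds the proximity matcher. The sketch below applies the same `daddr=` regex and range checks to made-up link fragments. It is a variation, not the plan's code: the hypothetical `extractCoordsSketch` returns `null` where `parseChurch` falls back to the `0, 0` sentinel, and the anchor markup is assumed for illustration.

```typescript
// Extract "lat,lng" from a Google Maps directions link (daddr=...).
// Variation on parseChurch: returns null instead of the 0/0 sentinel so the
// "no usable coordinates" case is explicit.
function extractCoordsSketch(html: string): { lat: number; lng: number } | null {
  const m = /daddr=([-\d.]+),([-\d.]+)/.exec(html);
  if (!m) return null;
  const lat = parseFloat(m[1] ?? '');
  const lng = parseFloat(m[2] ?? '');
  // The character class also admits junk like "-.-", which parses to NaN.
  if (!isFinite(lat) || !isFinite(lng)) return null;
  if (Math.abs(lat) > 90 || Math.abs(lng) > 180) return null;
  return { lat, lng };
}

// Assumed markup, for illustration only.
const goodLink = "<a href='https://maps.google.com/maps?daddr=33.996887,-117.732407'>Directions</a>";
const outOfRangeLink = "<a href='https://maps.google.com/maps?daddr=133.9,-117.7'>Directions</a>";
const junkLink = "<a href='https://maps.google.com/maps?daddr=-.-,1.0'>Directions</a>";

const coordsGood = extractCoordsSketch(goodLink);
const coordsOutOfRange = extractCoordsSketch(outOfRangeLink);
const coordsJunk = extractCoordsSketch(junkLink);
const coordsMissing = extractCoordsSketch('<p>No map link here</p>');

console.log(coordsGood, coordsOutOfRange, coordsJunk, coordsMissing);
```

Whichever representation is used for "missing", the caller has to treat it specially: a church stored at `0, 0` would otherwise take part in distance comparisons as if it were a real location.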
- [ ] **Step 2: Add parseMassTimes function**
Append to the file:
```typescript
/**
 * Parse the mass schedule from the "Mass Times" <ul> block.
 *
 * HTML structure:
 *   <ul><li><h5>Mass Times</h5></li>
 *     <li class=""><span class="label">Saturday</span>
 *       <span class='serviceTime'><span class='time'>5:00pm</span></span>,
 *       <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
 *     </li>
 *   </ul>
 */
function parseMassTimes(html: string): ParsedMass[] {
  // Cap HTML to first 100KB to prevent slow regex backtracking on malformed pages
  const safeHtml = html.length > 100_000 ? html.slice(0, 100_000) : html;
  const massUlMatch = safeHtml.match(/<ul>\s*<li>\s*<h5>Mass Times<\/h5>[\s\S]*?<\/ul>/);
  if (!massUlMatch) return [];
  const massUl = massUlMatch[0];
  const results: ParsedMass[] = [];
  // Each day is in a <li ...> (the first li has the h5 header, skip it).
  // Use regex split to handle any class value on the li, not just empty class.
  const liParts = massUl.split(/<li[^>]*>/);
  for (let i = 1; i < liParts.length; i++) {
    const li = liParts[i];
    const labelMatch = li.match(/<span class="label">([^<]+)<\/span>/);
    if (!labelMatch) continue;
    const dayLabel = labelMatch[1].trim();
    const dayOfWeek = FULL_DAY_NAMES[dayLabel];
    if (dayOfWeek === undefined) continue;
    // Each time entry is in a <span class='serviceTime'>
    const serviceTimeParts = li.split("<span class='serviceTime'>");
    for (let j = 1; j < serviceTimeParts.length; j++) {
      const st = serviceTimeParts[j];
      const timeMatch = st.match(/<span class='time'>([^<]+)<\/span>/);
      if (!timeMatch) continue;
      const time = convertTo24h(timeMatch[1].trim());
      const langMatch = st.match(/<span class='language'>\(([^)]+)\)<\/span>/);
      const language = langMatch ? langMatch[1].trim() : 'English';
      const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
      const notes = commentMatch ? commentMatch[1].trim() : undefined;
      results.push({ dayOfWeek, time, language, notes });
    }
  }
  return results;
}
```
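To see how the split-based approach handles one day's row, here is a reduced, self-contained sketch run against the sample markup from the doc comment above. The names `parseMassLiSketch`, `MassRowSketch`, and `DAY_INDEX_SKETCH` are hypothetical, and the sketch keeps the raw time string instead of converting it, so it shows only the label, time, and language extraction.

```typescript
const DAY_INDEX_SKETCH: Record<string, number> = {
  Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3,
  Thursday: 4, Friday: 5, Saturday: 6,
};

interface MassRowSketch {
  dayOfWeek: number;
  rawTime: string;
  language: string;
}

// Parse the inner HTML of a single day <li>: one label, then one
// <span class='serviceTime'> per mass.
function parseMassLiSketch(li: string): MassRowSketch[] {
  const label = /<span class="label">([^<]+)<\/span>/.exec(li);
  if (!label) return [];
  const day = DAY_INDEX_SKETCH[(label[1] ?? '').trim()];
  if (day === undefined) return [];
  const rows: MassRowSketch[] = [];
  // Index 0 is everything before the first serviceTime span (the label).
  const parts = li.split("<span class='serviceTime'>");
  for (let i = 1; i < parts.length; i++) {
    const part = parts[i] ?? '';
    const time = /<span class='time'>([^<]+)<\/span>/.exec(part);
    if (!time) continue;
    const lang = /<span class='language'>\(([^)]+)\)<\/span>/.exec(part);
    rows.push({
      dayOfWeek: day,
      rawTime: (time[1] ?? '').trim(),
      language: lang ? (lang[1] ?? '').trim() : 'English', // default when untagged
    });
  }
  return rows;
}

// Sample row taken from the doc comment on parseMassTimes.
const sampleMassLi =
  '<span class="label">Saturday</span>' +
  "<span class='serviceTime'><span class='time'>5:00pm</span></span>, " +
  "<span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>";

const massRowsSketch = parseMassLiSketch(sampleMassLi);
console.log(massRowsSketch);
```

Both rows come back as Saturday (`6`); the first defaults to `English` because it has no language span, and the second carries `Vietnamese`.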
- [ ] **Step 3: Add parseOtherServices function**
Append to the file:
```typescript
/**
 * Parse confessions and adoration from the "Other Services" <ul> block.
 *
 * HTML structure:
 *   <ul><li><h5>Other Services</h5></li>
 *     <li class="Confessions"><span class="label">Confessions</span>
 *       <span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
 *       <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
 *     </li>
 *     <li class="Adoration"><span class="label">Adoration</span>
 *       <span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
 *     </li>
 *   </ul>
 */
function parseOtherServices(html: string): { confessions: ParsedConf[]; adorations: ParsedAdoration[] } {
  // Cap HTML to first 100KB to prevent slow regex backtracking on malformed pages
  const safeHtml = html.length > 100_000 ? html.slice(0, 100_000) : html;
  const otherUlMatch = safeHtml.match(/<ul>\s*<li>\s*<h5>Other Services<\/h5>[\s\S]*?<\/ul>/);
  if (!otherUlMatch) return { confessions: [], adorations: [] };
  const otherUl = otherUlMatch[0];
  function parseServiceItems(liHtml: string): Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> {
    const items: Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> = [];
    const stParts = liHtml.split("<span class='serviceTime'>");
    for (let i = 1; i < stParts.length; i++) {
      const st = stParts[i];
      // Text before <span class='time'> contains the day abbreviation and colon
      const dayTimeMatch = st.match(/^([A-Za-z]+):\s*<span class='time'>([^<]+)<\/span>/);
      if (!dayTimeMatch) continue;
      const days = expandDayAbbrev(dayTimeMatch[1].trim());
      if (days.length === 0) continue;
      const [startTime, endTime] = parseTimeRange(dayTimeMatch[2]);
      const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
      const notes = commentMatch ? commentMatch[1].trim() : undefined;
      for (const dayOfWeek of days) {
        items.push({ dayOfWeek, startTime, endTime, notes });
      }
    }
    return items;
  }
  const confessions: ParsedConf[] = [];
  const adorations: ParsedAdoration[] = [];
  const confMatch = otherUl.match(/<li class="Confessions">[\s\S]*?<\/li>/);
  if (confMatch) confessions.push(...parseServiceItems(confMatch[0]));
  const adorMatch = otherUl.match(/<li class="Adoration">[\s\S]*?<\/li>/);
  if (adorMatch) adorations.push(...parseServiceItems(adorMatch[0]));
  return { confessions, adorations };
}
```
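The day-prefix expansion determines how many rows one `serviceTime` entry produces, which in turn explains the confession and adoration counts. The sketch below works on the visible text of an entry rather than its HTML. `expandServiceTextSketch` and `DAY_GROUPS_SKETCH` are hypothetical names, and the `Mon-Fri` and `1st Fri` inputs are invented to show what a single-word prefix pattern does not cover.

```typescript
// Same prefix table as ABBREV_DAY_NAMES, re-declared so the snippet is standalone.
const DAY_GROUPS_SKETCH: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3],
  Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

interface ServiceRowSketch {
  dayOfWeek: number;
  range: string;
}

// Takes the visible text of one serviceTime span, e.g. "Sat: 3:30pm-4:30pm",
// and emits one row per day the prefix stands for.
function expandServiceTextSketch(text: string): ServiceRowSketch[] {
  const m = /^([A-Za-z]+):\s*(\S+)/.exec(text.trim());
  if (!m) return []; // prefix is not a single word followed by ':'
  const days = DAY_GROUPS_SKETCH[m[1] ?? ''] ?? [];
  const range = m[2] ?? '';
  const rows: ServiceRowSketch[] = [];
  for (const d of days) rows.push({ dayOfWeek: d, range });
  return rows;
}

const weekdayRows = expandServiceTextSketch('Weekdays: 9:00am-6:00pm');
const saturdayRows = expandServiceTextSketch('Sat: 3:30pm-4:30pm');
// Invented inputs: neither starts with a single alphabetic word plus ':'.
const hyphenRangeRows = expandServiceTextSketch('Mon-Fri: 8:00am-9:00am');
const ordinalRows = expandServiceTextSketch('1st Fri: 7:00pm-8:00pm');

console.log(weekdayRows.length, saturdayRows.length, hyphenRangeRows.length, ordinalRows.length);
```

One `Weekdays` entry becomes five rows and `Daily` would become seven, so a schedule count can exceed the number of entries on the page. Entries whose prefix falls outside the table are dropped silently by this pattern; if the site uses such prefixes, counting the skipped entries during the dry run would show how much is lost.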
- [ ] **Step 4: Smoke-test parsing on a real page**
Create a quick test at the end of the file temporarily:
```typescript
// TEMP: smoke test; remove before committing
if (process.argv[2] === '--test-parse') {
  // Async IIFE instead of top-level await, which can fail when tsx treats the script as CommonJS
  (async () => {
    const testUrl = 'https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/';
    const html = await (await fetch(testUrl, { headers: { 'User-Agent': USER_AGENT } })).text();
    const church = parseChurch(html);
    const masses = parseMassTimes(html);
    const { confessions, adorations } = parseOtherServices(html);
    console.log('Church:', church);
    console.log('Masses:', masses.length, masses.slice(0, 3));
    console.log('Confessions:', confessions.length, confessions);
    console.log('Adorations:', adorations.length, adorations);
    process.exit(0);
  })().catch((err) => {
    console.error(err);
    process.exit(1);
  });
}
```
Run it:
```bash
cd /home/albert/Documents/ScraperControl
npx tsx scripts/import-discovermass.ts --test-parse
```
Expected output:
```
Church: { name: 'St. Paul the Apostle', address: '14085 Peyton Drive', city: 'Chino Hills', state: 'CA', zip: '91709', phone: '(909) 465-5503', website: 'http://www.sptacc.org', lat: 33.996887, lng: -117.732407 }
Masses: 14 [ { dayOfWeek: 6, time: '17:00', language: 'English', notes: undefined }, ... ]
Confessions: 4 [...]
Adorations: 6 [...]
```
If the counts don't match, debug the regex patterns against the raw HTML:
```bash
curl -s https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/ > /tmp/dm-test.html
# Then inspect the relevant sections manually
```
- [ ] **Step 5: Remove the temp test block and commit**
Remove the `if (process.argv[2] === '--test-parse')` block from the file.
```bash
git add scripts/import-discovermass.ts
git commit -m "feat: add discovermass parsing utilities (church, mass, confession, adoration)"
```
---
## Chunk 3: import-discovermass.ts — main import loop
### Task 5: Add HTTP helpers + sitemap enumeration
**Files:**
- Modify: `scripts/import-discovermass.ts`
- [ ] **Step 1: Add HTTP helpers and loadExistingChurches**
Append to the file:
```typescript
// ─── HTTP Helpers ─────────────────────────────────────────────────────────────
async function fetchHtml(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { 'User-Agent': USER_AGENT },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
// ─── Sitemap Enumeration ──────────────────────────────────────────────────────
/**
 * Fetch all 11 WordPress item sitemaps and return every church URL.
 * No rate limiting needed — only 11 sitemap requests.
 */
async function getAllChurchUrls(): Promise<string[]> {
  const urls: string[] = [];
  for (let i = 1; i <= SITEMAP_COUNT; i++) {
    const sitemapUrl = `${SITE_BASE}/wp-sitemap-posts-item-${i}.xml`;
    console.log(`Fetching sitemap ${i}/${SITEMAP_COUNT}...`);
    const xml = await fetchHtml(sitemapUrl);
    const matches = xml.matchAll(/<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g);
    for (const match of matches) {
      urls.push(match[1]);
    }
  }
  console.log(`Total church URLs: ${urls.length}`);
  return urls;
}
// ─── DB Helpers ───────────────────────────────────────────────────────────────
/**
 * Load all US churches from DB into memory for dedup matching.
 * Only loads US churches to keep the array manageable.
 */
async function loadExistingChurches(): Promise<ExistingChurch[]> {
  console.log('Loading existing US churches from DB...');
  const churches = await prisma.church.findMany({
    where: { country: 'US' },
    select: {
      id: true, name: true, latitude: true, longitude: true,
      osmId: true, baiduId: true, masstimesId: true,
      orarimesseId: true, massSchedulesPhId: true, philmassId: true,
      horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
      messesInfoId: true, bohosluzbyId: true, miserendId: true,
      kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
      source: true, website: true, phone: true, address: true, country: true,
    },
  });
  console.log(`Loaded ${churches.length} existing US churches`);
  return churches as ExistingChurch[];
}
```
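The `<loc>` pattern in `getAllChurchUrls` doubles as a filter: only URLs under `/church/` are kept. A minimal sketch on an inline XML string shows that behaviour without any network access. The function name `extractChurchUrlsSketch`, the sample document, and the `/about/` and `example-parish` URLs are assumptions for illustration, not real sitemap content.

```typescript
// Pull every /church/ URL out of a sitemap XML string using the same pattern
// as getAllChurchUrls, with an exec loop instead of matchAll.
function extractChurchUrlsSketch(xml: string): string[] {
  const re = /<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g;
  const urls: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    if (m[1] !== undefined) urls.push(m[1]);
  }
  return urls;
}

// Made-up sitemap fragment: two church pages and one non-church page.
const sampleSitemapXml =
  '<urlset>' +
  '<url><loc>https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/</loc></url>' +
  '<url><loc>https://discovermass.com/about/</loc></url>' +
  '<url><loc>https://discovermass.com/church/example-parish-springfield-il/</loc></url>' +
  '</urlset>';

const sketchUrls = extractChurchUrlsSketch(sampleSitemapXml);
console.log(sketchUrls.length, sketchUrls);
```

Because `--resume-from` is an index into this list, resuming is only reliable if the sitemaps return the same URLs in the same order on the next run; that ordering is an assumption worth verifying before relying on it.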
---
### Task 6: Add processChurch + main() + CLI parsing
**Files:**
- Modify: `scripts/import-discovermass.ts`
- [ ] **Step 1: Add processChurch function**
Append to the file:
```typescript
// ─── Church Processing ────────────────────────────────────────────────────────
async function processChurch(
  url: string,
  existingChurches: ExistingChurch[],
  args: CLIArgs,
  stats: ImportStats,
): Promise<void> {
  const slug = url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
  stats.total++;
  try {
    const html = await fetchHtml(url);
    const parsed = parseChurch(html);
    if (!parsed) {
      console.log(` [skip] Could not parse: ${slug}`);
      stats.skipped++;
      return;
    }
    const masses = parseMassTimes(html);
    const { confessions, adorations } = parseOtherServices(html);
    if (args.dryRun) {
      console.log(` [dry-run] ${parsed.name} — ${masses.length} masses, ${confessions.length} confessions, ${adorations.length} adorations`);
      return;
    }
    const candidate = {
      name: parsed.name,
      lat: parsed.lat,
      lng: parsed.lng,
      discovermassId: slug,
    };
    const duplicate = findDuplicateChurch(candidate, existingChurches);
    if (duplicate) {
      // Update existing church: fill only blank fields; replace a schedule type
      // only when this page yielded entries for it (existing rows are kept otherwise)
      const updateData: Record<string, unknown> = { discovermassId: slug };
      if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
      if (!duplicate.website && parsed.website) {
        updateData.website = parsed.website;
        updateData.hasWebsite = true;
      }
      if (parsed.lat !== 0 && duplicate.latitude === 0) {
        updateData.latitude = parsed.lat;
        updateData.longitude = parsed.lng;
      }
      try {
        await prisma.$transaction(async (tx) => {
          await tx.church.update({ where: { id: duplicate.id }, data: updateData });
          if (masses.length > 0) {
            await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
            await tx.massSchedule.createMany({
              data: masses.map(m => ({
                churchId: duplicate.id,
                dayOfWeek: m.dayOfWeek,
                time: m.time,
                language: m.language,
                notes: m.notes ?? null,
              })),
            });
          }
          if (confessions.length > 0) {
            await tx.confessionSchedule.deleteMany({ where: { churchId: duplicate.id } });
            await tx.confessionSchedule.createMany({
              data: confessions.map(c => ({
                churchId: duplicate.id,
                dayOfWeek: c.dayOfWeek,
                startTime: c.startTime,
                endTime: c.endTime,
                notes: c.notes ?? null,
              })),
            });
          }
          if (adorations.length > 0) {
            await tx.adorationSchedule.deleteMany({ where: { churchId: duplicate.id } });
            await tx.adorationSchedule.createMany({
              data: adorations.map(a => ({
                churchId: duplicate.id,
                dayOfWeek: a.dayOfWeek,
                startTime: a.startTime,
                endTime: a.endTime,
                notes: a.notes ?? null,
              })),
            });
          }
          await tx.church.update({
            where: { id: duplicate.id },
            data: { lastScrapedAt: new Date() },
          });
        });
        // Update in-memory entry for within-run dedup
        duplicate.discovermassId = slug;
        stats.updated++;
      } catch (err) {
        if (err instanceof Error && err.message.includes('Unique constraint')) {
          stats.skipped++;
          return;
        }
        throw err;
      }
    } else {
      // Create new church
      try {
        const church = await prisma.church.create({
          data: {
            name: parsed.name,
            address: parsed.address,
            city: parsed.city,
            state: parsed.state,
            zip: parsed.zip,
            country: 'US',
            phone: parsed.phone,
            website: parsed.website,
            hasWebsite: !!parsed.website,
            latitude: parsed.lat,
            longitude: parsed.lng,
            discovermassId: slug,
            source: 'discovermass',
          },
        });
        // Add to in-memory array for within-run dedup
        existingChurches.push({
          id: church.id,
          name: parsed.name,
          latitude: parsed.lat,
          longitude: parsed.lng,
          osmId: null,
          baiduId: null,
          masstimesId: null,
          orarimesseId: null,
          massSchedulesPhId: null,
          philmassId: null,
          horariosMisasId: null,
          mszeInfoId: null,
          weekdayMassesId: null,
          messesInfoId: null,
          bohosluzbyId: null,
          miserendId: null,
          kerknetId: null,
          gottesdienstzeitenId: null,
          discovermassId: slug,
          source: 'discovermass',
          website: parsed.website,
          phone: parsed.phone,
          address: parsed.address,
          country: 'US',
        });
        if (masses.length > 0) {
          await prisma.massSchedule.createMany({
            data: masses.map(m => ({
              churchId: church.id,
              dayOfWeek: m.dayOfWeek,
              time: m.time,
              language: m.language,
              notes: m.notes ?? null,
            })),
          });
        }
        if (confessions.length > 0) {
          await prisma.confessionSchedule.createMany({
            data: confessions.map(c => ({
              churchId: church.id,
              dayOfWeek: c.dayOfWeek,
              startTime: c.startTime,
              endTime: c.endTime,
              notes: c.notes ?? null,
            })),
          });
        }
        if (adorations.length > 0) {
          await prisma.adorationSchedule.createMany({
            data: adorations.map(a => ({
              churchId: church.id,
              dayOfWeek: a.dayOfWeek,
              startTime: a.startTime,
              endTime: a.endTime,
              notes: a.notes ?? null,
            })),
          });
        }
        await prisma.church.update({
          where: { id: church.id },
          data: { lastScrapedAt: new Date() },
        });
        stats.created++;
      } catch (err) {
        if (err instanceof Error && err.message.includes('Unique constraint')) {
          stats.skipped++;
          return;
        }
        throw err;
      }
    }
    stats.massSchedulesCreated += masses.length;
    stats.confessionSchedulesCreated += confessions.length;
    stats.adorationSchedulesCreated += adorations.length;
    console.log(
      ` [${duplicate ? 'update' : 'create'}] ${parsed.name} — ` +
      `${masses.length}M ${confessions.length}C ${adorations.length}A — ` +
      `${stats.total} total (${stats.created} new, ${stats.updated} upd, ${stats.errors} err)`
    );
  } catch (err) {
    stats.errors++;
    console.error(` [error] ${slug}: ${err instanceof Error ? err.message : err}`);
  }
}
```
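The update branch of `processChurch` follows a "fill blanks only" rule that is easier to reason about as a pure function. The sketch below isolates slug derivation and the merge rule. `slugFromUrlSketch`, `buildUpdateSketch`, and both interfaces are hypothetical and not part of the planned file, and the existing row's phone number is a made-up value.

```typescript
interface ExistingSketch {
  phone: string | null;
  website: string | null;
  latitude: number;
}
interface ParsedSketch {
  phone: string | null;
  website: string | null;
  lat: number;
  lng: number;
}

// Same string operations processChurch uses to derive the slug from a URL.
function slugFromUrlSketch(url: string): string {
  return url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
}

// Build the update payload: always stamp the slug, fill only blank fields.
function buildUpdateSketch(
  existing: ExistingSketch,
  parsed: ParsedSketch,
  slug: string,
): Record<string, unknown> {
  const data: Record<string, unknown> = { discovermassId: slug };
  if (!existing.phone && parsed.phone) data['phone'] = parsed.phone;
  if (!existing.website && parsed.website) {
    data['website'] = parsed.website;
    data['hasWebsite'] = true;
  }
  // 0 is the "no coordinates" sentinel on both sides.
  if (parsed.lat !== 0 && existing.latitude === 0) {
    data['latitude'] = parsed.lat;
    data['longitude'] = parsed.lng;
  }
  return data;
}

const slugSketch = slugFromUrlSketch(
  'https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/',
);
// Existing row (made-up values) already has a phone and coordinates but no website.
const updateSketch = buildUpdateSketch(
  { phone: '(909) 555-0100', website: null, latitude: 33.9969 },
  { phone: '(909) 465-5503', website: 'http://www.sptacc.org', lat: 33.996887, lng: -117.732407 },
  slugSketch,
);

console.log(slugSketch, Object.keys(updateSketch));
```

Only the slug and the website fields end up in the payload here: the existing phone and coordinates win over the scraped ones, which keeps data from other sources intact.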
- [ ] **Step 2: Add parseCLIArgs + main()**
Append to the file:
```typescript
// ─── CLI Parsing ──────────────────────────────────────────────────────────────
function parseCLIArgs(): CLIArgs {
  const args = process.argv.slice(2);
  const result: CLIArgs = { all: false, dryRun: false };
  for (let i = 0; i < args.length; i++) {
    switch (args[i]) {
      case '--all': result.all = true; break;
      case '--dry-run': result.dryRun = true; break;
      case '--resume-from': {
        // Ignore a missing or non-numeric value instead of storing NaN
        const n = parseInt(args[++i], 10);
        if (Number.isInteger(n) && n >= 0) result.resumeFrom = n;
        break;
      }
      case '--job-id': result.jobId = args[++i]; break;
    }
  }
  return result;
}
// ─── Main ─────────────────────────────────────────────────────────────────────
async function main() {
  const args = parseCLIArgs();
  if (!args.all) {
    console.error('Usage: npx tsx scripts/import-discovermass.ts --all [--dry-run] [--resume-from N] [--job-id UUID]');
    process.exit(1);
  }
  // Update job status to 'running' if job-id provided
  if (args.jobId) {
    try {
      await prisma.backgroundJob.update({
        where: { id: args.jobId },
        data: { status: 'running', startedAt: new Date() },
      });
    } catch { /* Job might not exist yet */ }
  }
  const stats: ImportStats = {
    total: 0, created: 0, updated: 0, skipped: 0, errors: 0,
    massSchedulesCreated: 0, confessionSchedulesCreated: 0, adorationSchedulesCreated: 0,
  };
  try {
    // Step 1: Enumerate all church URLs from sitemaps
    const urls = await getAllChurchUrls();
    // Step 2: Load existing US churches for dedup
    const existingChurches = await loadExistingChurches();
    // Step 3: Apply resume-from offset
    const startIdx = args.resumeFrom ?? 0;
    const churchUrls = urls.slice(startIdx);
    console.log(`\nProcessing ${churchUrls.length} churches (starting from index ${startIdx})...\n`);
    // Step 4: Process each church with 10s delay between requests
    for (let i = 0; i < churchUrls.length; i++) {
      const url = churchUrls[i];
      const overallIdx = startIdx + i;
      console.log(`[${overallIdx + 1}/${urls.length}] ${url}`);
      await processChurch(url, existingChurches, args, stats);
      // Rate limit: 10s delay between church pages (robots.txt Crawl-delay: 10)
      if (i < churchUrls.length - 1) {
        await sleep(REQUEST_DELAY_MS);
      }
    }
  } finally {
    // Always print stats and update job status
    console.log('\n─── Import Complete ───────────────────────────────────────');
    console.log(`Total processed: ${stats.total}`);
    console.log(`Created: ${stats.created}`);
    console.log(`Updated: ${stats.updated}`);
    console.log(`Skipped: ${stats.skipped}`);
    console.log(`Errors: ${stats.errors}`);
    console.log(`Mass schedules: ${stats.massSchedulesCreated}`);
    console.log(`Confession sched: ${stats.confessionSchedulesCreated}`);
    console.log(`Adoration sched: ${stats.adorationSchedulesCreated}`);
    if (args.jobId) {
      const status = stats.errors > stats.total * 0.1 ? 'failed' : 'completed';
      try {
        await prisma.backgroundJob.update({
          where: { id: args.jobId },
          data: {
            status,
            completedAt: new Date(),
            processed: stats.total,
            succeeded: stats.created + stats.updated,
            failed: stats.errors,
            itemsFound: stats.massSchedulesCreated,
          },
        });
      } catch { /* Ignore */ }
    }
    await prisma.$disconnect();
    await pool.end();
  }
}
main().catch((err) => {
  console.error('Fatal error:', err);
  process.exit(1);
});
```
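Two small pieces of arithmetic in `main()` decide what an operator sees: the final job status and the progress label after a resume. The sketch below restates both as pure functions under the hypothetical names `finalStatusSketch` and `resumePlanSketch`, assuming `0 <= resumeFrom <= totalUrls`.

```typescript
type JobStatusSketch = 'failed' | 'completed';

// Mirrors the expression in main(): fail only when errors exceed 10% of the
// pages processed.
function finalStatusSketch(errors: number, total: number): JobStatusSketch {
  return errors > total * 0.1 ? 'failed' : 'completed';
}

// Mirrors urls.slice(resumeFrom) and the "[n/total]" label: how many pages
// remain and what the first progress line shows.
// Assumes 0 <= resumeFrom <= totalUrls.
function resumePlanSketch(
  totalUrls: number,
  resumeFrom: number,
): { remaining: number; firstLabel: string } {
  return {
    remaining: totalUrls - resumeFrom,
    firstLabel: '[' + (resumeFrom + 1) + '/' + totalUrls + ']',
  };
}

const statusAtThreshold = finalStatusSketch(100, 1000); // exactly 10%: not above it
const statusOverThreshold = finalStatusSketch(101, 1000);
const statusNothingProcessed = finalStatusSketch(0, 0);
const resumePlan = resumePlanSketch(20284, 5000);

console.log(statusAtThreshold, statusOverThreshold, statusNothingProcessed, resumePlan);
```

The zero-pages case is worth noticing: if sitemap fetching or the database load throws before any church is processed, both counters are still zero and the `finally` block would record the job as `completed`. Tracking a separate fatal-error flag would be one way to distinguish that outcome.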
- [ ] **Step 3: Run dry-run on first 3 churches**
```bash
cd /home/albert/Documents/ScraperControl
# Sanity check: confirm the first sitemap is reachable and lists church URLs
curl -s https://discovermass.com/wp-sitemap-posts-item-1.xml | grep -o '<loc>[^<]*</loc>' | head -3
```
Then test dry-run:
```bash
npx tsx scripts/import-discovermass.ts --all --dry-run --resume-from 0
```
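Let it run for about 30 seconds (roughly 3 churches at one page per 10s delay); without a limit flag it would otherwise continue through all 20,284 URLs. Expected output: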
```
Connecting to database: postgresql://...
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
Loaded XXXX existing US churches
Processing 20284 churches (starting from index 0)...
[1/20284] https://discovermass.com/church/some-church/
[dry-run] Some Church — 8 masses, 2 confessions, 3 adorations
[2/20284] ...
```
Stop with Ctrl+C after a few churches.
- [ ] **Step 4: Verify TypeScript compiles**
```bash
npx tsc --noEmit
```
Fix any type errors before committing.
- [ ] **Step 5: Commit**
```bash
git add scripts/import-discovermass.ts
git commit -m "feat: add import-discovermass.ts — USA church importer with 10s crawl delay"
```
---
## Chunk 4: Integration + Docker deployment
### Task 7: package.json + scheduler integration
**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`
- [ ] **Step 1: Add import:discovermass to package.json**
Open `package.json`. Find the `"scripts"` section. After `"import:masstimes-api"` (or whichever is last in the import group), add:
```json
"import:discovermass": "tsx scripts/import-discovermass.ts",
```
- [ ] **Step 2: Add discovermass-import case to getJobCommand in scheduler.ts**
Open `scripts/scheduler.ts`. Find the `getJobCommand` function. After the `masstimes-api-import` case block, add:
```typescript
case 'discovermass-import': {
// Note: --job-id is appended by startJobProcess() in the scheduler, not here.
const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
```
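To make the resulting command line concrete, here is a standalone sketch of the same argument-building logic. `buildDiscoverMassCommand`, `JobConfig`, and `JobCommand` are hypothetical names introduced for illustration; the real logic lives inside `getJobCommand`, and `--job-id` is still appended separately by `startJobProcess()`.

```typescript
// Hypothetical types: only the fields this case reads or returns.
interface JobConfig {
  resumeFrom?: number;
}

interface JobCommand {
  command: string;
  args: string[];
}

function buildDiscoverMassCommand(config?: JobConfig): JobCommand {
  const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
  // A resumeFrom of 0 (or undefined) is falsy, so the flag is omitted
  // and the importer starts from the beginning.
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

const resumed = buildDiscoverMassCommand({ resumeFrom: 5000 });
console.log([resumed.command, ...resumed.args].join(' '));
// npx tsx scripts/import-discovermass.ts --all --resume-from 5000
```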
- [ ] **Step 3: Add discovermass-import to PIPELINE_GROUPS**
In `scripts/scheduler.ts`, find `PIPELINE_GROUPS`. In the first group's `phases` array, after the `masstimes-api-import` entry, add:
```typescript
{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
```
- [ ] **Step 4: Verify TypeScript compiles**
```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```
- [ ] **Step 5: Commit**
```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add discovermass-import to scheduler pipeline and package.json"
```
---
### Task 8: Deploy to Docker and run
The importer will run for ~56 hours in the Docker scheduler container. The scheduler picks it up as part of the PIPELINE_GROUPS sequence.
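The ~56 hour figure follows from the URL count and the crawl delay. The sketch below reproduces the arithmetic; it ignores per-request fetch and database time, so treat the result as a lower bound. `estimateRunHours` and `CHURCH_COUNT` are hypothetical names, while the delay value comes from the `Crawl-delay: 10` noted in the importer.

```typescript
const CHURCH_COUNT = 20284;     // total church URLs reported by the sitemaps
const REQUEST_DELAY_MS = 10000; // robots.txt Crawl-delay: 10

function estimateRunHours(count: number, delayMs: number): number {
  // The loop sleeps between pages but not after the last one,
  // so there are count - 1 delays in total.
  const totalMs = Math.max(count - 1, 0) * delayMs;
  return totalMs / (1000 * 60 * 60);
}

console.log(estimateRunHours(CHURCH_COUNT, REQUEST_DELAY_MS).toFixed(1)); // "56.3"
```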
- [ ] **Step 1: Deploy to Docker directory**
```bash
bash /home/albert/Documents/ScraperControl/scripts/deploy-local.sh
```
Or manually:
```bash
rsync -avz --exclude node_modules --exclude .next --exclude '.env*' \
--exclude .git --exclude .claude --exclude .playwright-mcp \
~/Documents/ScraperControl/ /opt/docker/scraper-control/
```
- [ ] **Step 2: Rebuild Docker images to pick up new script**
```bash
cd /opt/docker/scraper-control
docker compose build scheduler
```
Expected: build completes without errors.
- [ ] **Step 3: Create a manual job via the admin API to trigger the import immediately**
The scheduler can run imports as manual jobs, which take priority over pipeline jobs:
```bash
curl -X POST http://localhost:3001/api/admin/jobs \
-H "Content-Type: application/json" \
-H "X-Admin-Key: $(grep ADMIN_API_KEY /opt/docker/scraper-control/.env | cut -d= -f2)" \
-d '{"type": "discovermass-import", "config": {}}'
```
Expected: `{"id": "...", "status": "pending", ...}`
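If you would rather trigger the job from a script than from `curl`, the same request can be sketched with Node's built-in `fetch`. The endpoint path, header name, and payload are taken from the `curl` command above; `buildCreateJobRequest` is a hypothetical helper, and reading the key from `process.env.ADMIN_API_KEY` is an assumption.

```typescript
// Hypothetical helper: assembles the same POST that the curl command sends.
interface CreateJobRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildCreateJobRequest(baseUrl: string, adminKey: string): CreateJobRequest {
  return {
    url: `${baseUrl}/api/admin/jobs`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'X-Admin-Key': adminKey },
      body: JSON.stringify({ type: 'discovermass-import', config: {} }),
    },
  };
}

// Usage, inside an async function (not executed here):
//   const req = buildCreateJobRequest('http://localhost:3001', process.env.ADMIN_API_KEY ?? '');
//   const res = await fetch(req.url, req.init);
//   console.log(await res.json());
```

Separating request construction from the network call keeps the payload checkable without a running admin API.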
- [ ] **Step 4: Restart the scheduler to pick up the new job**
```bash
cd /opt/docker/scraper-control
docker compose restart scheduler
```
Note that `docker compose restart` reuses the existing container and does not switch to a rebuilt image. If the scripts are baked into the image rather than bind-mounted, run `docker compose up -d scheduler` instead so the container is recreated from the image built in Step 2.
- [ ] **Step 5: Monitor the job**
Check logs:
```bash
docker compose logs -f scheduler --since 1m
```
Expected output:
```
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
[1/20284] https://discovermass.com/church/...
[update] St. Paul the Apostle — 14M 4C 6A — 1 total (0 new, 1 upd, 0 err)
```
The St. Paul the Apostle church we seeded earlier should show as `[update]` (matched by name+proximity), linking the `discovermassId` to the existing record.
- [ ] **Step 6: Verify St. Paul the Apostle was matched (not duplicated)**
After the importer reaches the Chino Hills area (a few hours into the run, since URLs are processed alphabetically by state/slug), run:
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass \
-c "SELECT name, city, state, discovermass_id, created_at FROM churches WHERE name ILIKE '%Paul%' AND city = 'Chino Hills';"
```
Expected: 1 row with `discovermass_id = 'st-paul-the-apostle-chino-hills-ca'` and a `created_at` from the earlier seed, not a new timestamp, because the record is updated rather than created.
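The expected `discovermass_id` suggests the ID is the URL slug of the church page. Assuming that is how the importer derives it (the derivation is not shown in this section), a sketch of the extraction could look like this. `slugFromChurchUrl` is a hypothetical name, and the example URL is composed from the `/church/<slug>/` pattern and the slug above.

```typescript
// Hypothetical helper: returns the slug of a /church/<slug>/ page, or null.
function slugFromChurchUrl(churchUrl: string): string | null {
  const { pathname } = new URL(churchUrl);
  // Accept the path with or without a trailing slash.
  const match = /^\/church\/([^/]+)\/?$/.exec(pathname);
  return match?.[1] ?? null;
}

console.log(
  slugFromChurchUrl('https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/'),
);
// st-paul-the-apostle-chino-hills-ca
```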
- [ ] **Step 7: Let the full run complete (~56 hours)**
The scheduler will log progress. You can check status anytime:
```bash
docker compose logs scheduler --since 10m | tail -20
```
Or via the admin dashboard at `http://192.168.0.241:3001` — the job will appear in the Jobs tab with status `running` and progress tracked in the `processed`/`succeeded`/`failed` fields.
Final expected stats after completion:
```
Total processed: 20284
Created: ~17000-19000 (new US churches)
Updated: ~1000-3000 (matched against OSM/MassTimes churches)
Errors: < 50 (network blips)
Mass schedules: ~70000-90000
Confession sched: ~30000-50000
Adoration sched: ~10000-20000
```