20,284 US churches with mass/confession/adoration schedules. 10s crawl delay (robots.txt), Docker deployment via scheduler. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# DiscoverMass.com Importer Implementation Plan

> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Import 20,284 US Catholic churches with mass/confession/adoration schedules from discovermass.com into the NearestMass database.

**Architecture:** Enumerate 11 WordPress sitemaps → fetch each church page at 10s intervals (respecting Crawl-delay) → parse server-rendered HTML for name/address/coordinates/schedules → match against existing US churches via church-matcher → upsert with full schedule data.

**Tech Stack:** TypeScript/tsx, Prisma 7 + PrismaPg adapter, pg Pool, Node.js `fetch`, regex HTML parsing (no DOM library needed because the HTML is server-rendered and predictable).

---

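The 10-second crawl delay dominates the runtime of a full import, so it helps to estimate the duration before scheduling a run. The sketch below is only a back-of-the-envelope calculation: `estimateCrawlHours` is a hypothetical helper, it ignores per-request latency (so the result is a lower bound), and it assumes `--resume-from N` skips the first N URLs.

```typescript
// Lower-bound runtime for a crawl throttled to one request per delay window.
// Assumption: network and parsing time are ignored; only the delay is counted.
function estimateCrawlHours(pageCount: number, delayMs: number, resumeFrom = 0): number {
  const remaining = Math.max(0, pageCount - resumeFrom);
  return (remaining * delayMs) / 3_600_000; // ms per hour
}

const fullRun = estimateCrawlHours(20_284, 10_000);
const resumed = estimateCrawlHours(20_284, 10_000, 5_000);

console.log(`full run: ~${fullRun.toFixed(1)} h`);        // full run: ~56.3 h
console.log(`resumed at 5000: ~${resumed.toFixed(1)} h`); // resumed at 5000: ~42.5 h
```

A full pass therefore takes a little over two days, which is why the plan includes a resume option and runs the job through a scheduler.
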
## Chunk 1: Schema + church-matcher

### Task 1: Add discovermassId to schema

**Files:**
- Modify: `prisma/schema.prisma`

The schema lives in this repo but migrations run in BethelGuide. After editing schema.prisma here, run `npx prisma generate` to regenerate the Prisma client. Do NOT run `prisma migrate`.

- [ ] **Step 1: Find the right place in schema.prisma**

Open `prisma/schema.prisma`. Find the block of source ID fields; they look like:
```prisma
  gottesdienstzeitenId String? @unique @map("gottesdienstzeiten_id")
```
This is inside the `model Church { ... }` block, after `kerknetId` and before `claimed`.

- [ ] **Step 2: Add discovermassId field**

After `gottesdienstzeitenId`:
```prisma
  discovermassId String? @unique @map("discovermass_id")
```

Also find the `@@index` block near the bottom of the Church model (it groups all the index definitions). Add:
```prisma
  @@index([discovermassId])
```

- [ ] **Step 3: Regenerate Prisma client**

```bash
cd /home/albert/Documents/ScraperControl
npx prisma generate
```

Expected output: `✔ Generated Prisma Client` (no errors). This does NOT touch the database; it only updates the TypeScript client.

- [ ] **Step 4: Apply migration to database**

The schema source of truth is BethelGuide: run the migration there, then sync back. Because this project shares the same dev database server, the column can also be checked and applied directly:

```bash
# Check if discovermass_id column already exists (it shouldn't yet)
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass
```

If the column doesn't exist, apply it directly:
```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "
ALTER TABLE churches ADD COLUMN IF NOT EXISTS discovermass_id VARCHAR UNIQUE;
CREATE INDEX IF NOT EXISTS churches_discovermass_id_idx ON churches(discovermass_id);
"
```

Expected output: `ALTER TABLE` and `CREATE INDEX`

- [ ] **Step 5: Verify column exists**

```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass -c "\d churches" | grep discovermass
```

Expected output: `discovermass_id | character varying | ...`

- [ ] **Step 6: Commit**

```bash
cd /home/albert/Documents/ScraperControl
git add prisma/schema.prisma
git commit -m "feat: add discovermassId field to Church schema"
```

---

### Task 2: Update church-matcher

**Files:**
- Modify: `src/lib/church-matcher.ts`

The `ExistingChurch` interface (line ~11) lists all source IDs. The `ChurchCandidate` type (line ~122) lists optional source IDs for the candidate. The `findDuplicateChurch` function has sequential passes checking each ID before falling back to proximity+name.

- [ ] **Step 1: Add discovermassId to ExistingChurch interface**

Find the `export interface ExistingChurch {` block. After the `gottesdienstzeitenId` line, add:
```typescript
  discovermassId: string | null;
```

- [ ] **Step 2: Add discovermassId to ChurchCandidate type**

Find `export type ChurchCandidate = {`. After `gottesdienstzeitenId?: string;`, add:
```typescript
  discovermassId?: string;
```

- [ ] **Step 3: Add discovermassId matching pass in findDuplicateChurch**

Find the `findDuplicateChurch` function. It has a series of passes like:
```typescript
  if (candidate.gottesdienstzeitenId) {
    const match = existingChurches.find(c => c.gottesdienstzeitenId === candidate.gottesdienstzeitenId);
    if (match) return match;
  }
  // Proximity + name similarity
```

Add a new pass BEFORE the proximity+name pass (after gottesdienstzeitenId):
```typescript
  if (candidate.discovermassId) {
    const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
    if (match) return match;
  }
```

- [ ] **Step 4: Update all callers that construct ExistingChurch objects**

Search for places that build ExistingChurch objects (the in-memory push after creating a new church). Each importer has a block like:
```typescript
existingChurches.push({
  id: newChurch.id,
  ...
  gottesdienstzeitenId: null,
  ...
});
```

Run:
```bash
grep -rn "gottesdienstzeitenId: null" scripts/
```

For each file found: add `discovermassId: null,` after `gottesdienstzeitenId: null,`. These are the in-memory dedup arrays; they need the new field or TypeScript will complain.

Also update the `loadExistingChurches` select queries if any importer has one (check with `grep -rn "gottesdienstzeitenId: true" scripts/`).

- [ ] **Step 5: Verify TypeScript compiles**

```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```

Expected: no errors. Fix any type errors (they'll be missing `discovermassId` fields).

- [ ] **Step 6: Commit**

```bash
# Stage church-matcher AND all importer scripts that were updated in Step 4
git add src/lib/church-matcher.ts
git add scripts/
git commit -m "feat: add discovermassId to church-matcher ExistingChurch and ChurchCandidate"
```

---

## Chunk 2: import-discovermass.ts — utilities and parsing

### Task 3: Create file skeleton + utilities

**Files:**
- Create: `scripts/import-discovermass.ts`

- [ ] **Step 1: Create the file with header, imports, constants, types**

Create `scripts/import-discovermass.ts` with this content:

```typescript
#!/usr/bin/env tsx
/**
 * Import Catholic churches and mass schedules from discovermass.com (USA)
 *
 * discovermass.com is a US Catholic church directory with 20,284 churches.
 * Data includes name, address, phone, website, coordinates, mass times,
 * confessions, and adoration schedules.
 *
 * robots.txt specifies Crawl-delay: 10; this importer follows that rule.
 *
 * Usage:
 *   npx tsx scripts/import-discovermass.ts --all
 *   npx tsx scripts/import-discovermass.ts --all --dry-run
 *   npx tsx scripts/import-discovermass.ts --all --resume-from 5000
 *   npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
 */

import dotenv from 'dotenv';
import path from 'path';

dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
const pool = new Pool({
  connectionString: dbUrl,
  ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
});
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

import { findDuplicateChurch } from '../src/lib/church-matcher';
import type { ExistingChurch } from '../src/lib/church-matcher';

// ─── Constants ───────────────────────────────────────────────────────────────

const SITE_BASE = 'https://discovermass.com';
const SITEMAP_COUNT = 11;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 10_000; // Crawl-delay: 10 from robots.txt

// ─── Types ───────────────────────────────────────────────────────────────────

interface ParsedChurch {
  name: string;
  address: string | null;
  city: string | null;
  state: string | null;
  zip: string | null;
  phone: string | null;
  website: string | null;
  lat: number;
  lng: number;
}

interface ParsedMass {
  dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
  time: string; // HH:MM 24-hour
  language: string;
  notes?: string;
}

interface ParsedConf {
  dayOfWeek: number;
  startTime: string; // HH:MM 24-hour
  endTime: string; // HH:MM 24-hour
  notes?: string;
}

interface ParsedAdoration {
  dayOfWeek: number;
  startTime: string; // HH:MM 24-hour
  endTime: string; // HH:MM 24-hour
  notes?: string;
}

interface ImportStats {
  total: number;
  created: number;
  updated: number;
  skipped: number;
  errors: number;
  massSchedulesCreated: number;
  confessionSchedulesCreated: number;
  adorationSchedulesCreated: number;
}

interface CLIArgs {
  all: boolean;
  dryRun: boolean;
  resumeFrom?: number;
  jobId?: string;
}
```

- [ ] **Step 2: Add day mappings and time utilities**

Append to the file:

```typescript
// ─── Day Mappings ─────────────────────────────────────────────────────────────

// Full day names used in mass schedule <li> labels
const FULL_DAY_NAMES: Record<string, number> = {
  Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3,
  Thursday: 4, Friday: 5, Saturday: 6,
};

// Abbreviated day prefixes used in confession/adoration serviceTime text
const ABBREV_DAY_NAMES: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3],
  Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

// ─── Time Utilities ───────────────────────────────────────────────────────────

/**
 * Convert "5:00pm", "11:00am", "12:00pm", "12:00am" to "HH:MM" 24-hour format.
 * Returns the trimmed, lower-cased input if it doesn't match the expected format.
 */
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned;
  let hours = parseInt(m[1], 10);
  const mins = m[2];
  const meridiem = m[3];
  if (meridiem === 'pm' && hours !== 12) hours += 12;
  if (meridiem === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${mins}`;
}

/**
 * Parse "8:30am-9:00am" → ["08:30", "09:00"].
 * Both sides must carry an explicit am/pm suffix (e.g. "9:00am-6:00pm");
 * a side without one is passed through unconverted by convertTo24h.
 * A string with no range separator yields the same time for start and end.
 */
function parseTimeRange(rangeStr: string): [string, string] {
  // Look for the '-' that follows the first ':' so the split lands between
  // start and end. Pattern: "8:30am-9:00am" or "3:30pm-4:30pm"
  const hyphenIdx = rangeStr.indexOf('-', rangeStr.indexOf(':') + 1);
  if (hyphenIdx === -1) {
    const t = convertTo24h(rangeStr.trim());
    return [t, t];
  }
  const start = convertTo24h(rangeStr.slice(0, hyphenIdx).trim());
  const end = convertTo24h(rangeStr.slice(hyphenIdx + 1).trim());
  return [start, end];
}

/**
 * Expand abbreviated day prefix to array of dayOfWeek integers.
 * Returns empty array if prefix is not recognized.
 */
function expandDayAbbrev(prefix: string): number[] {
  return ABBREV_DAY_NAMES[prefix] ?? [];
}

// ─── Address Parsing ──────────────────────────────────────────────────────────

/**
 * Parse "14085 Peyton Drive, Chino Hills, CA 91709" into components.
 * Falls back to { address: raw } with null city/state/zip on malformed input.
 */
function parseAddress(raw: string): { address: string | null; city: string | null; state: string | null; zip: string | null } {
  const parts = raw.split(', ');
  if (parts.length < 3) return { address: raw, city: null, state: null, zip: null };
  const last = parts[parts.length - 1].trim();
  const stateZipMatch = last.match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/);
  if (!stateZipMatch) return { address: raw, city: null, state: null, zip: null };
  return {
    address: parts.slice(0, parts.length - 2).join(', ').trim(),
    city: parts[parts.length - 2].trim(),
    state: stateZipMatch[1],
    zip: stateZipMatch[2],
  };
}
```

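Because `convertTo24h` returns unmatched input instead of throwing, a malformed time can reach the database as free text. The sketch below repeats a compact copy of the function so it runs on its own and adds `isHHMM`, a hypothetical guard that is not part of the plan, to show which inputs convert and which fall through.

```typescript
// Compact copy of convertTo24h from the step above, so this snippet is self-contained.
function convertTo24h(timeStr: string): string {
  const cleaned = timeStr.trim().toLowerCase();
  const m = cleaned.match(/^(\d{1,2}):(\d{2})(am|pm)$/);
  if (!m) return cleaned; // unmatched input falls through unchanged
  const [, h = '', mins = '', meridiem = ''] = m;
  let hours = parseInt(h, 10);
  if (meridiem === 'pm' && hours !== 12) hours += 12;
  if (meridiem === 'am' && hours === 12) hours = 0;
  return `${String(hours).padStart(2, '0')}:${mins}`;
}

// Hypothetical guard (an assumption, not in the plan): true only for zero-padded HH:MM.
function isHHMM(value: string): boolean {
  return /^([01]\d|2[0-3]):[0-5]\d$/.test(value);
}

const samples = ['12:00am', '12:00pm', '5:00pm', ' 11:00AM ', '7 pm', 'Noon'];
for (const s of samples) {
  const out = convertTo24h(s);
  console.log(JSON.stringify(s), '->', out, isHHMM(out) ? '' : '(unparsed)');
}
```

The first four samples convert to `00:00`, `12:00`, `17:00` and `11:00`; `'7 pm'` and `'Noon'` come back as `7 pm` and `noon`. Whether to skip such rows or store them as notes is a design decision the plan leaves open.
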
- [ ] **Step 3: Verify utilities compile**

```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```

Expected: no errors related to import-discovermass.ts. Other files may have pre-existing errors; focus only on this file's errors.

---

### Task 4: Add HTML parsing functions

**Files:**
- Modify: `scripts/import-discovermass.ts`

The HTML is server-rendered. The page structure has:
- `<meta property="og:title">` for church name
- US address embedded as text in a known pattern
- `<div id="sidebar-info">` for phone/website/coordinates
- Two `<ul>` blocks: one containing `<h5>Mass Times</h5>`, another containing `<h5>Other Services</h5>`

- [ ] **Step 1: Add parseChurch function**

Append to the file:

```typescript
// ─── HTML Parsing ─────────────────────────────────────────────────────────────

/**
 * Parse church metadata from page HTML.
 * Returns null if the page doesn't look like a valid church listing.
 */
function parseChurch(html: string): ParsedChurch | null {
  // Name from OpenGraph meta tag
  const nameMatch = html.match(/<meta property="og:title" content="([^"]+)"/);
  if (!nameMatch) return null;
  const name = nameMatch[1].trim();
  if (!name || name === 'Discover Mass') return null;

  // Address: match US address pattern (number + street, city, STATE ZIP)
  let address: string | null = null;
  let city: string | null = null;
  let state: string | null = null;
  let zip: string | null = null;
  const addrMatch = html.match(/(\d+[^<\n,]+),\s*([^<,\n]+),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)/);
  if (addrMatch) {
    const raw = `${addrMatch[1].trim()}, ${addrMatch[2].trim()}, ${addrMatch[3]} ${addrMatch[4]}`;
    const parsed = parseAddress(raw);
    address = parsed.address;
    city = parsed.city;
    state = parsed.state;
    zip = parsed.zip;
  }

  // Phone from sidebar
  const phoneMatch = html.match(/<span class='side-phone attribute'>([^<]+)<\/span>/);
  const phone = phoneMatch ? phoneMatch[1].trim() : null;

  // Website from sidebar
  const websiteMatch = html.match(/<span class='side-website attribute'><a href='([^']+)'/);
  const website = websiteMatch ? websiteMatch[1].trim() : null;

  // Coordinates from Google Maps daddr parameter.
  // Both stay at 0 when no daddr link is found, so 0,0 means "no coordinates".
  let lat = 0;
  let lng = 0;
  const coordMatch = html.match(/daddr=([-\d.]+),([-\d.]+)/);
  if (coordMatch) {
    lat = parseFloat(coordMatch[1]);
    lng = parseFloat(coordMatch[2]);
  }

  return { name, address, city, state, zip, phone, website, lat, lng };
}
```

- [ ] **Step 2: Add parseMassTimes function**

Append to the file:

```typescript
/**
 * Parse the mass schedule from the "Mass Times" <ul> block.
 *
 * HTML structure:
 * <ul><li><h5>Mass Times</h5></li>
 *   <li class=""><span class="label">Saturday</span>
 *     <span class='serviceTime'><span class='time'>5:00pm</span></span>,
 *     <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
 *   </li>
 * </ul>
 */
function parseMassTimes(html: string): ParsedMass[] {
  const massUlMatch = html.match(/<ul>\s*<li>\s*<h5>Mass Times<\/h5>[\s\S]*?<\/ul>/);
  if (!massUlMatch) return [];
  const massUl = massUlMatch[0];

  const results: ParsedMass[] = [];

  // Each day is in a <li ...> (the first li has the h5 header, skip it).
  // Use regex split to handle any class value on the li, not just empty class.
  const liParts = massUl.split(/<li[^>]*>/);
  for (let i = 1; i < liParts.length; i++) {
    const li = liParts[i];

    const labelMatch = li.match(/<span class="label">([^<]+)<\/span>/);
    if (!labelMatch) continue;
    const dayLabel = labelMatch[1].trim();
    const dayOfWeek = FULL_DAY_NAMES[dayLabel];
    if (dayOfWeek === undefined) continue;

    // Each time entry is in a <span class='serviceTime'>
    const serviceTimeParts = li.split("<span class='serviceTime'>");
    for (let j = 1; j < serviceTimeParts.length; j++) {
      const st = serviceTimeParts[j];
      const timeMatch = st.match(/<span class='time'>([^<]+)<\/span>/);
      if (!timeMatch) continue;
      const time = convertTo24h(timeMatch[1].trim());

      const langMatch = st.match(/<span class='language'>\(([^)]+)\)<\/span>/);
      const language = langMatch ? langMatch[1].trim() : 'English';

      const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
      const notes = commentMatch ? commentMatch[1].trim() : undefined;

      results.push({ dayOfWeek, time, language, notes });
    }
  }

  return results;
}
```

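`parseMassTimes` locates its block with a lazy regex anchored on the `<h5>` title, and `parseOtherServices` repeats the same pattern with a different title. A minimal sketch of that technique in isolation follows. `extractUlByHeading` is a hypothetical helper, and the sample markup is modelled on the structure described in this task; real pages may differ.

```typescript
// Sample markup modelled on the structure described in this task (an assumption).
const sampleHtml = `
<ul><li><h5>Mass Times</h5></li>
<li class=""><span class="label">Sunday</span> <span class='serviceTime'><span class='time'>9:00am</span></span></li>
</ul>
<ul><li><h5>Other Services</h5></li>
<li class="Confessions"><span class="label">Confessions</span> <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span></span></li>
</ul>`;

// Hypothetical helper: return the <ul>...</ul> block whose first <li> holds the given <h5> title.
function extractUlByHeading(html: string, heading: string): string | null {
  // Escape regex metacharacters so the title is matched literally.
  const escaped = heading.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // [\s\S]*? is lazy, so the match stops at the first closing </ul>.
  const re = new RegExp(`<ul>\\s*<li>\\s*<h5>${escaped}</h5>[\\s\\S]*?</ul>`);
  const m = html.match(re);
  return m ? m[0] : null;
}

const massBlock = extractUlByHeading(sampleHtml, 'Mass Times');
const otherBlock = extractUlByHeading(sampleHtml, 'Other Services');
console.log(massBlock?.includes('Sunday'), otherBlock?.includes('Confessions'));
```

The lazy match keeps the two blocks apart, but it would also cut a block short if a `<ul>` were ever nested inside a list item, which is one limit of regex parsing to keep in mind.
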
- [ ] **Step 3: Add parseOtherServices function**

Append to the file:

```typescript
/**
 * Parse confessions and adoration from the "Other Services" <ul> block.
 *
 * HTML structure:
 * <ul><li><h5>Other Services</h5></li>
 *   <li class="Confessions"><span class="label">Confessions</span>
 *     <span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
 *     <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
 *   </li>
 *   <li class="Adoration"><span class="label">Adoration</span>
 *     <span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
 *   </li>
 * </ul>
 */
function parseOtherServices(html: string): { confessions: ParsedConf[]; adorations: ParsedAdoration[] } {
  const otherUlMatch = html.match(/<ul>\s*<li>\s*<h5>Other Services<\/h5>[\s\S]*?<\/ul>/);
  if (!otherUlMatch) return { confessions: [], adorations: [] };
  const otherUl = otherUlMatch[0];

  function parseServiceItems(liHtml: string): Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> {
    const items: Array<{ dayOfWeek: number; startTime: string; endTime: string; notes?: string }> = [];
    const stParts = liHtml.split("<span class='serviceTime'>");
    for (let i = 1; i < stParts.length; i++) {
      const st = stParts[i];
      // Text before <span class='time'> contains the day abbreviation and colon
      const dayTimeMatch = st.match(/^([A-Za-z]+):\s*<span class='time'>([^<]+)<\/span>/);
      if (!dayTimeMatch) continue;
      const days = expandDayAbbrev(dayTimeMatch[1].trim());
      if (days.length === 0) continue;
      const [startTime, endTime] = parseTimeRange(dayTimeMatch[2]);
      const commentMatch = st.match(/<span class='comment'>([^<]+)<\/span>/);
      const notes = commentMatch ? commentMatch[1].trim() : undefined;
      for (const dayOfWeek of days) {
        items.push({ dayOfWeek, startTime, endTime, notes });
      }
    }
    return items;
  }

  const confessions: ParsedConf[] = [];
  const adorations: ParsedAdoration[] = [];

  const confMatch = otherUl.match(/<li class="Confessions">[\s\S]*?<\/li>/);
  if (confMatch) confessions.push(...parseServiceItems(confMatch[0]));

  const adorMatch = otherUl.match(/<li class="Adoration">[\s\S]*?<\/li>/);
  if (adorMatch) adorations.push(...parseServiceItems(adorMatch[0]));

  return { confessions, adorations };
}
```

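`parseServiceItems` fans one `serviceTime` entry out into one row per day, so a single "Weekdays" line becomes five schedule rows. The sketch below isolates that expansion on plain text. `expandEntry` and `Slot` are hypothetical names, and the "Mon-Fri" input is an assumed edge case; the plan does not say whether the site ever emits it.

```typescript
// Same prefix table as ABBREV_DAY_NAMES in the plan, repeated so the snippet runs alone.
const ABBREV: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3], Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

interface Slot { dayOfWeek: number; range: string }

// Expand one "<prefix>: <range>" entry into one slot per matching day.
// Uses the same letters-only prefix rule as the plan's regex.
function expandEntry(entry: string): Slot[] {
  const m = entry.match(/^([A-Za-z]+):\s*(.+)$/);
  if (!m) return [];
  const [, prefix = '', range = ''] = m;
  return (ABBREV[prefix] ?? []).map(dayOfWeek => ({ dayOfWeek, range: range.trim() }));
}

console.log(expandEntry('Weekdays: 9:00am-6:00pm').map(s => s.dayOfWeek)); // [ 1, 2, 3, 4, 5 ]
console.log(expandEntry('Sat: 3:30pm-4:30pm').length);                     // 1
console.log(expandEntry('Mon-Fri: 9:00am-5:00pm').length);                 // 0
```

The last call shows that a hyphenated prefix fails the letters-only pattern and is dropped silently. If such prefixes show up in the smoke test, logging unmatched entries would make the gap visible.
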
- [ ] **Step 4: Smoke-test parsing on a real page**

Create a quick test at the end of the file temporarily:

```typescript
// TEMP: smoke test, remove before committing
if (process.argv[2] === '--test-parse') {
  // Async IIFE instead of top-level await, so this also runs if the file is treated as CommonJS
  void (async () => {
    const testUrl = 'https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/';
    const html = await (await fetch(testUrl, { headers: { 'User-Agent': USER_AGENT } })).text();
    const church = parseChurch(html);
    const masses = parseMassTimes(html);
    const { confessions, adorations } = parseOtherServices(html);
    console.log('Church:', church);
    console.log('Masses:', masses.length, masses.slice(0, 3));
    console.log('Confessions:', confessions.length, confessions);
    console.log('Adorations:', adorations.length, adorations);
    process.exit(0);
  })();
}
```

Run it:
```bash
cd /home/albert/Documents/ScraperControl
npx tsx scripts/import-discovermass.ts --test-parse
```

Expected output:
```
Church: { name: 'St. Paul the Apostle', address: '14085 Peyton Drive', city: 'Chino Hills', state: 'CA', zip: '91709', phone: '(909) 465-5503', website: 'http://www.sptacc.org', lat: 33.996887, lng: -117.732407 }
Masses: 14 [ { dayOfWeek: 6, time: '17:00', language: 'English', notes: undefined }, ... ]
Confessions: 4 [...]
Adorations: 6 [...]
```

If the counts don't match, debug the regex patterns against the raw HTML:
```bash
curl -s https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/ > /tmp/dm-test.html
# Then inspect the relevant sections manually
```

- [ ] **Step 5: Remove the temp test block and commit**

Remove the `if (process.argv[2] === '--test-parse')` block from the file.

```bash
git add scripts/import-discovermass.ts
git commit -m "feat: add discovermass parsing utilities (church, mass, confession, adoration)"
```

---

## Chunk 3: import-discovermass.ts — main import loop

### Task 5: Add HTTP helpers + sitemap enumeration

**Files:**
- Modify: `scripts/import-discovermass.ts`

- [ ] **Step 1: Add HTTP helpers and loadExistingChurches**

Append to the file:

```typescript
// ─── HTTP Helpers ─────────────────────────────────────────────────────────────

async function fetchHtml(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { 'User-Agent': USER_AGENT },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// ─── Sitemap Enumeration ──────────────────────────────────────────────────────

/**
 * Fetch all 11 WordPress item sitemaps and return every church URL.
 * No rate limiting is applied here because there are only 11 sitemap requests.
 */
async function getAllChurchUrls(): Promise<string[]> {
  const urls: string[] = [];
  for (let i = 1; i <= SITEMAP_COUNT; i++) {
    const sitemapUrl = `${SITE_BASE}/wp-sitemap-posts-item-${i}.xml`;
    console.log(`Fetching sitemap ${i}/${SITEMAP_COUNT}...`);
    const xml = await fetchHtml(sitemapUrl);
    const matches = xml.matchAll(/<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g);
    for (const match of matches) {
      urls.push(match[1]);
    }
  }
  console.log(`Total church URLs: ${urls.length}`);
  return urls;
}

// ─── DB Helpers ───────────────────────────────────────────────────────────────

/**
 * Load all US churches from DB into memory for dedup matching.
 * Only loads US churches to keep the array manageable.
 */
async function loadExistingChurches(): Promise<ExistingChurch[]> {
  console.log('Loading existing US churches from DB...');
  const churches = await prisma.church.findMany({
    where: { country: 'US' },
    select: {
      id: true, name: true, latitude: true, longitude: true,
      osmId: true, baiduId: true, masstimesId: true,
      orarimesseId: true, massSchedulesPhId: true, philmassId: true,
      horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
      messesInfoId: true, bohosluzbyId: true, miserendId: true,
      kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
      source: true, website: true, phone: true, address: true, country: true,
    },
  });
  console.log(`Loaded ${churches.length} existing US churches`);
  return churches as ExistingChurch[];
}
```

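The `<loc>` extraction in `getAllChurchUrls` and the slug derivation used later as `discovermassId` can be tried offline before any request is sent. The sketch below runs the same regex against a hand-written sitemap fragment; `extractChurchUrls` and `slugFromUrl` are hypothetical helper names, and the second church URL is invented for illustration.

```typescript
// Hand-written sitemap fragment; only the first church URL appears in the plan,
// the other two entries are made up to exercise the filter.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://discovermass.com/church/st-paul-the-apostle-chino-hills-ca/</loc></url>
  <url><loc>https://discovermass.com/about/</loc></url>
  <url><loc>https://discovermass.com/church/example-parish-anytown-tx/</loc></url>
</urlset>`;

// Keep only /church/ URLs, using the same pattern as getAllChurchUrls.
function extractChurchUrls(xml: string): string[] {
  const re = /<loc>(https:\/\/discovermass\.com\/church\/[^<]+)<\/loc>/g;
  return Array.from(xml.matchAll(re), m => m[1] ?? '');
}

// Strip the URL prefix and trailing slash to get the slug stored as discovermassId.
function slugFromUrl(url: string): string {
  return url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
}

const churchUrls = extractChurchUrls(sampleXml);
console.log(churchUrls.length);          // 2 (the /about/ page is filtered out)
console.log(churchUrls.map(slugFromUrl));
```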
---

### Task 6: Add processChurch + main() + CLI parsing

**Files:**
- Modify: `scripts/import-discovermass.ts`

- [ ] **Step 1: Add processChurch function**

Append to the file:

```typescript
|
||
// ─── Church Processing ────────────────────────────────────────────────────────
|
||
|
||
async function processChurch(
|
||
url: string,
|
||
existingChurches: ExistingChurch[],
|
||
args: CLIArgs,
|
||
stats: ImportStats,
|
||
): Promise<void> {
|
||
const slug = url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
|
||
stats.total++;
|
||
|
||
try {
|
||
const html = await fetchHtml(url);
|
||
const parsed = parseChurch(html);
|
||
if (!parsed) {
|
||
console.log(` [skip] Could not parse: ${slug}`);
|
||
stats.skipped++;
|
||
return;
|
||
}
|
||
|
||
const masses = parseMassTimes(html);
|
||
const { confessions, adorations } = parseOtherServices(html);
|
||
|
||
if (args.dryRun) {
|
||
console.log(` [dry-run] ${parsed.name} — ${masses.length} masses, ${confessions.length} confessions, ${adorations.length} adorations`);
|
||
return;
|
||
}
|
||
|
||
const candidate = {
|
||
name: parsed.name,
|
||
lat: parsed.lat,
|
||
lng: parsed.lng,
|
||
discovermassId: slug,
|
||
};
|
||
const duplicate = findDuplicateChurch(candidate, existingChurches);
|
||
|
||
if (duplicate) {
|
||
// Update existing church — only fill blank fields, always replace schedules
|
||
const updateData: Record<string, unknown> = { discovermassId: slug };
|
||
if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
|
||
if (!duplicate.website && parsed.website) {
|
||
updateData.website = parsed.website;
|
||
updateData.hasWebsite = true;
|
||
}
|
||
if (parsed.lat !== 0 && duplicate.latitude === 0) {
|
||
updateData.latitude = parsed.lat;
|
||
updateData.longitude = parsed.lng;
|
||
}
|
||
|
||
try {
|
||
await prisma.$transaction(async (tx) => {
|
||
await tx.church.update({ where: { id: duplicate.id }, data: updateData });
|
||
|
||
if (masses.length > 0) {
|
||
await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
|
||
await tx.massSchedule.createMany({
|
||
data: masses.map(m => ({
|
||
churchId: duplicate.id,
|
||
dayOfWeek: m.dayOfWeek,
|
||
time: m.time,
|
||
language: m.language,
|
||
notes: m.notes ?? null,
|
||
})),
|
||
});
|
||
}
|
||
|
||
if (confessions.length > 0) {
|
||
await tx.confessionSchedule.deleteMany({ where: { churchId: duplicate.id } });
|
||
await tx.confessionSchedule.createMany({
|
||
data: confessions.map(c => ({
|
||
churchId: duplicate.id,
|
||
dayOfWeek: c.dayOfWeek,
|
||
startTime: c.startTime,
|
||
endTime: c.endTime,
|
||
notes: c.notes ?? null,
|
||
})),
|
||
});
|
||
}
|
||
|
||
if (adorations.length > 0) {
|
||
await tx.adorationSchedule.deleteMany({ where: { churchId: duplicate.id } });
|
||
await tx.adorationSchedule.createMany({
|
||
data: adorations.map(a => ({
|
||
churchId: duplicate.id,
|
||
dayOfWeek: a.dayOfWeek,
|
||
startTime: a.startTime,
|
||
endTime: a.endTime,
|
||
notes: a.notes ?? null,
|
||
})),
|
||
});
|
||
}
|
||
|
||
await tx.church.update({
|
||
where: { id: duplicate.id },
|
||
data: { lastScrapedAt: new Date() },
|
||
});
|
||
});
|
||
|
||
// Update in-memory entry for within-run dedup
|
||
duplicate.discovermassId = slug;
|
||
stats.updated++;
|
||
} catch (err) {
|
||
if (err instanceof Error && err.message.includes('Unique constraint')) {
|
||
stats.skipped++;
|
||
return;
|
||
}
|
||
throw err;
|
||
}
|
||
} else {
|
||
// Create new church
|
||
try {
|
||
const church = await prisma.church.create({
|
||
data: {
|
||
name: parsed.name,
|
||
address: parsed.address,
|
||
city: parsed.city,
|
||
state: parsed.state,
|
||
zip: parsed.zip,
|
||
country: 'US',
|
||
phone: parsed.phone,
|
||
website: parsed.website,
|
||
hasWebsite: !!parsed.website,
|
||
latitude: parsed.lat,
|
||
longitude: parsed.lng,
|
||
discovermassId: slug,
|
||
source: 'discovermass',
|
||
},
|
||
});
|
||
|
||
// Add to in-memory array for within-run dedup
|
||
existingChurches.push({
|
||
id: church.id,
|
||
          name: parsed.name,
          latitude: parsed.lat,
          longitude: parsed.lng,
          osmId: null,
          baiduId: null,
          masstimesId: null,
          orarimesseId: null,
          massSchedulesPhId: null,
          philmassId: null,
          horariosMisasId: null,
          mszeInfoId: null,
          weekdayMassesId: null,
          messesInfoId: null,
          bohosluzbyId: null,
          miserendId: null,
          kerknetId: null,
          gottesdienstzeitenId: null,
          discovermassId: slug,
          source: 'discovermass',
          website: parsed.website,
          phone: parsed.phone,
          address: parsed.address,
          country: 'US',
        });

        if (masses.length > 0) {
          await prisma.massSchedule.createMany({
            data: masses.map(m => ({
              churchId: church.id,
              dayOfWeek: m.dayOfWeek,
              time: m.time,
              language: m.language,
              notes: m.notes ?? null,
            })),
          });
        }

        if (confessions.length > 0) {
          await prisma.confessionSchedule.createMany({
            data: confessions.map(c => ({
              churchId: church.id,
              dayOfWeek: c.dayOfWeek,
              startTime: c.startTime,
              endTime: c.endTime,
              notes: c.notes ?? null,
            })),
          });
        }

        if (adorations.length > 0) {
          await prisma.adorationSchedule.createMany({
            data: adorations.map(a => ({
              churchId: church.id,
              dayOfWeek: a.dayOfWeek,
              startTime: a.startTime,
              endTime: a.endTime,
              notes: a.notes ?? null,
            })),
          });
        }

        await prisma.church.update({
          where: { id: church.id },
          data: { lastScrapedAt: new Date() },
        });

        stats.created++;
      } catch (err) {
        if (err instanceof Error && err.message.includes('Unique constraint')) {
          stats.skipped++;
          return;
        }
        throw err;
      }
    }

    stats.massSchedulesCreated += masses.length;
    stats.confessionSchedulesCreated += confessions.length;
    stats.adorationSchedulesCreated += adorations.length;

    console.log(
      ` [${duplicate ? 'update' : 'create'}] ${parsed.name} — ` +
      `${masses.length}M ${confessions.length}C ${adorations.length}A — ` +
      `${stats.total} total (${stats.created} new, ${stats.updated} upd, ${stats.errors} err)`
    );
  } catch (err) {
    stats.errors++;
    console.error(` [error] ${slug}: ${err instanceof Error ? err.message : err}`);
  }
}
```
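The inner `catch` recognizes a duplicate insert by searching the error message for `Unique constraint`. Prisma also tags known request errors with a stable `code`, and unique-constraint violations use `P2002`. The following is a minimal sketch of a code-based check, assuming the thrown value exposes a string `code` property the way `PrismaClientKnownRequestError` does. The helper name `isUniqueConstraintError` is hypothetical and not part of this plan.

```typescript
// Hypothetical helper (not in the original plan): detect a Prisma
// unique-constraint violation by its error code instead of its message text.
// Assumption: the error object carries a string `code`, as Prisma's
// PrismaClientKnownRequestError does (P2002 = unique constraint failed).
function isUniqueConstraintError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return code === 'P2002';
}

// Stand-ins for the kinds of errors processChurch might see.
const duplicateErr = Object.assign(new Error('Unique constraint failed'), { code: 'P2002' });
const networkErr = new Error('fetch failed');

console.log(isUniqueConstraintError(duplicateErr)); // true
console.log(isUniqueConstraintError(networkErr));   // false
```

A check on the code keeps working if the wording of the message changes between Prisma versions; whether that matters here depends on how often Prisma is upgraded in this repo.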

- [ ] **Step 2: Add parseCLIArgs + main()**

Append to the file:

```typescript
// ─── CLI Parsing ──────────────────────────────────────────────────────────────

function parseCLIArgs(): CLIArgs {
  const args = process.argv.slice(2);
  const result: CLIArgs = { all: false, dryRun: false };
  for (let i = 0; i < args.length; i++) {
    switch (args[i]) {
      case '--all': result.all = true; break;
      case '--dry-run': result.dryRun = true; break;
      case '--resume-from': result.resumeFrom = parseInt(args[++i], 10); break;
      case '--job-id': result.jobId = args[++i]; break;
    }
  }
  return result;
}

// ─── Main ─────────────────────────────────────────────────────────────────────

async function main() {
  const args = parseCLIArgs();

  if (!args.all) {
    console.error('Usage: npx tsx scripts/import-discovermass.ts --all [--dry-run] [--resume-from N] [--job-id UUID]');
    process.exit(1);
  }

  // Update job status to 'running' if job-id provided
  if (args.jobId) {
    try {
      await prisma.backgroundJob.update({
        where: { id: args.jobId },
        data: { status: 'running', startedAt: new Date() },
      });
    } catch { /* Job might not exist yet */ }
  }

  const stats: ImportStats = {
    total: 0, created: 0, updated: 0, skipped: 0, errors: 0,
    massSchedulesCreated: 0, confessionSchedulesCreated: 0, adorationSchedulesCreated: 0,
  };

  // Set when an exception escapes the import loop, so a crashed run
  // (e.g. a failed sitemap fetch) is not recorded as 'completed'
  let fatal = false;

  try {
    // Step 1: Enumerate all church URLs from sitemaps
    const urls = await getAllChurchUrls();

    // Step 2: Load existing US churches for dedup
    const existingChurches = await loadExistingChurches();

    // Step 3: Apply resume-from offset
    const startIdx = args.resumeFrom ?? 0;
    const churchUrls = urls.slice(startIdx);
    console.log(`\nProcessing ${churchUrls.length} churches (starting from index ${startIdx})...\n`);

    // Step 4: Process each church with 10s delay between requests
    for (let i = 0; i < churchUrls.length; i++) {
      const url = churchUrls[i];
      const overallIdx = startIdx + i;
      console.log(`[${overallIdx + 1}/${urls.length}] ${url}`);

      await processChurch(url, existingChurches, args, stats);

      // Rate limit: 10s delay between church pages (robots.txt Crawl-delay: 10)
      if (i < churchUrls.length - 1) {
        await sleep(REQUEST_DELAY_MS);
      }
    }
  } catch (err) {
    fatal = true;
    throw err;
  } finally {
    // Always print stats and update job status
    console.log('\n─── Import Complete ───────────────────────────────────────');
    console.log(`Total processed: ${stats.total}`);
    console.log(`Created: ${stats.created}`);
    console.log(`Updated: ${stats.updated}`);
    console.log(`Skipped: ${stats.skipped}`);
    console.log(`Errors: ${stats.errors}`);
    console.log(`Mass schedules: ${stats.massSchedulesCreated}`);
    console.log(`Confession sched: ${stats.confessionSchedulesCreated}`);
    console.log(`Adoration sched: ${stats.adorationSchedulesCreated}`);

    if (args.jobId) {
      // Failed if the run crashed or more than 10% of churches errored
      const status = (fatal || stats.errors > stats.total * 0.1) ? 'failed' : 'completed';
      try {
        await prisma.backgroundJob.update({
          where: { id: args.jobId },
          data: {
            status,
            completedAt: new Date(),
            processed: stats.total,
            succeeded: stats.created + stats.updated,
            failed: stats.errors,
            itemsFound: stats.massSchedulesCreated,
          },
        });
      } catch { /* Ignore */ }
    }

    await prisma.$disconnect();
    await pool.end();
  }
}

main().catch((err) => {
  console.error('Fatal error:', err);
  process.exit(1);
});
```
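`parseCLIArgs` passes the `--resume-from` value straight to `parseInt`, which returns `NaN` when the value is missing or non-numeric. `NaN` is not nullish, so `args.resumeFrom ?? 0` does not replace it: `urls.slice(NaN)` still starts at index 0, but the progress lines print `NaN`. A stricter parse could look like the sketch below; `parseResumeFrom` is a hypothetical helper, not something this plan defines.

```typescript
// Hypothetical stricter handling of the --resume-from value.
// Returns the offset, or throws when the value is missing or is not a
// non-negative integer written in plain digits.
function parseResumeFrom(raw: string | undefined): number {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new Error(`--resume-from expects a non-negative integer, got: ${String(raw)}`);
  }
  return Number.parseInt(raw, 10);
}

console.log(parseResumeFrom('1500')); // 1500
```

Inside the `switch`, the `--resume-from` case would then call `parseResumeFrom(args[++i])`, so a mistyped offset stops the run before any request is sent.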

- [ ] **Step 3: Run dry-run on first 3 churches**

```bash
cd /home/albert/Documents/ScraperControl
# Get a few URLs from the first sitemap to test with
curl -s https://discovermass.com/wp-sitemap-posts-item-1.xml | grep -o '<loc>[^<]*</loc>' | head -3
```

Then test dry-run:
```bash
npx tsx scripts/import-discovermass.ts --all --dry-run --resume-from 0
```

Once the first church line appears, let it run for about 30 seconds (3 churches × 10s delay). Expected output:
```
Connecting to database: postgresql://...
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
Loaded XXXX existing US churches

Processing 20284 churches (starting from index 0)...

[1/20284] https://discovermass.com/church/some-church/
[dry-run] Some Church — 8 masses, 2 confessions, 3 adorations
[2/20284] ...
```

The command has no built-in limit, so stop it with Ctrl+C after a few churches.

- [ ] **Step 4: Verify TypeScript compiles**

```bash
npx tsc --noEmit
```

Fix any type errors before committing.

- [ ] **Step 5: Commit**

```bash
git add scripts/import-discovermass.ts
git commit -m "feat: add import-discovermass.ts — USA church importer with 10s crawl delay"
```

---

## Chunk 4: Integration + Docker deployment

### Task 7: package.json + scheduler integration

**Files:**
- Modify: `package.json`
- Modify: `scripts/scheduler.ts`

- [ ] **Step 1: Add import:discovermass to package.json**

Open `package.json`. Find the `"scripts"` section. After `"import:masstimes-api"` (or whichever is last in the import group), add:
```json
"import:discovermass": "tsx scripts/import-discovermass.ts",
```

- [ ] **Step 2: Add discovermass-import case to getJobCommand in scheduler.ts**

Open `scripts/scheduler.ts`. Find the `getJobCommand` function. After the `masstimes-api-import` case block, add:

```typescript
    case 'discovermass-import': {
      // Note: --job-id is appended by startJobProcess() in the scheduler, not here.
      const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
      if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
      return { command: 'npx', args };
    }
```
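To see what the new case hands to the scheduler, the same logic can be pulled into a standalone function. This is only an illustrative sketch: `JobConfig` and `buildDiscovermassCommand` are hypothetical names, and the real `getJobCommand` signature in `scripts/scheduler.ts` may differ.

```typescript
// Self-contained sketch of the command the 'discovermass-import' case produces.
// JobConfig and buildDiscovermassCommand are hypothetical names for illustration.
interface JobConfig {
  resumeFrom?: number;
}

function buildDiscovermassCommand(config?: JobConfig): { command: string; args: string[] } {
  const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
  // A resumeFrom of 0 is falsy, so the flag is omitted; the importer
  // starts at index 0 by default, so the result is the same.
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

const resumed = buildDiscovermassCommand({ resumeFrom: 1500 });
console.log([resumed.command, ...resumed.args].join(' '));
// npx tsx scripts/import-discovermass.ts --all --resume-from 1500
```

Setting `resumeFrom` in the job config is therefore the way to restart an interrupted run from a known index without editing the script.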

- [ ] **Step 3: Add discovermass-import to PIPELINE_GROUPS**

In `scripts/scheduler.ts`, find `PIPELINE_GROUPS`. In the first group's `phases` array, after the `masstimes-api-import` entry, add:

```typescript
{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
```

- [ ] **Step 4: Verify TypeScript compiles**

```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```

- [ ] **Step 5: Commit**

```bash
git add package.json scripts/scheduler.ts
git commit -m "feat: add discovermass-import to scheduler pipeline and package.json"
```

---

### Task 8: Deploy to Docker and run

The importer will run for ~56 hours in the Docker scheduler container. The scheduler picks it up as part of the `PIPELINE_GROUPS` sequence.
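The ~56 hour figure follows from the crawl delay alone: 20,284 pages with a 10 s pause between consecutive requests. A short sketch of the arithmetic, assuming `REQUEST_DELAY_MS` is 10,000 ms (the plan states a 10s delay but this section does not show the constant's value) and ignoring fetch and database time; `delayHours` is a hypothetical helper.

```typescript
const TOTAL_CHURCHES = 20284;     // church count stated in the plan's goal
const REQUEST_DELAY_MS = 10_000;  // assumed value, matching Crawl-delay: 10

// Hours spent sleeping between requests for `remaining` church pages.
// There is one delay fewer than there are pages, because the loop does
// not sleep after the last page.
function delayHours(remaining: number, delayMs: number): number {
  const gaps = Math.max(remaining - 1, 0);
  return (gaps * delayMs) / 3_600_000;
}

console.log(delayHours(TOTAL_CHURCHES, REQUEST_DELAY_MS).toFixed(1));          // "56.3"
console.log(delayHours(TOTAL_CHURCHES - 10000, REQUEST_DELAY_MS).toFixed(1));  // "28.6"
```

The second line is the time left after resuming with `--resume-from 10000`. Actual runs take somewhat longer, since each page fetch and its database writes add to the fixed delay.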

- [ ] **Step 1: Deploy to Docker directory**

```bash
bash /home/albert/Documents/ScraperControl/scripts/deploy-local.sh
```

Or manually:
```bash
rsync -avz --exclude node_modules --exclude .next --exclude '.env*' \
  --exclude .git --exclude .claude --exclude .playwright-mcp \
  ~/Documents/ScraperControl/ /opt/docker/scraper-control/
```

- [ ] **Step 2: Rebuild Docker images to pick up new script**

```bash
cd /opt/docker/scraper-control
docker compose build scheduler
```

Expected: build completes without errors.

- [ ] **Step 3: Create a manual job via the admin API to trigger the import immediately**

The scheduler also runs manually created jobs, which take priority over the pipeline:

```bash
curl -X POST http://localhost:3001/api/admin/jobs \
  -H "Content-Type: application/json" \
  -H "X-Admin-Key: $(grep ADMIN_API_KEY /opt/docker/scraper-control/.env | cut -d= -f2)" \
  -d '{"type": "discovermass-import", "config": {}}'
```

Expected: `{"id": "...", "status": "pending", ...}`

- [ ] **Step 4: Restart the scheduler to pick up the new job**

```bash
cd /opt/docker/scraper-control
docker compose restart scheduler
```

Note that `docker compose restart` reuses the existing container. If the scheduler runs code baked into the image rebuilt in Step 2 rather than a bind-mounted directory, run `docker compose up -d scheduler` instead so the container is recreated from the new image.

- [ ] **Step 5: Monitor the job**

Check logs:
```bash
docker compose logs -f scheduler --since 1m
```

Expected output:
```
Fetching sitemap 1/11...
...
Total church URLs: 20284
Loading existing US churches from DB...
[1/20284] https://discovermass.com/church/...
[update] St. Paul the Apostle — 14M 4C 6A — 1 total (0 new, 1 upd, 0 err)
```

The St. Paul the Apostle church seeded earlier should show as `[update]` (matched by name+proximity), which links the `discovermassId` to the existing record.

- [ ] **Step 6: Verify St. Paul the Apostle was matched (not duplicated)**

Once the importer has processed the Chino Hills area (a few hours into the run, since URLs are ordered alphabetically by state/slug), run:

```bash
psql postgresql://postgres:postgres@192.168.0.145:5434/nearestmass \
  -c "SELECT name, city, state, discovermass_id, created_at FROM churches WHERE name ILIKE '%Paul%' AND city = 'Chino Hills';"
```

Expected: 1 row with `discovermass_id = 'st-paul-the-apostle-chino-hills-ca'` and `created_at` from the earlier seed. A new timestamp would mean the church was created again instead of updated.

- [ ] **Step 7: Let the full run complete (~56 hours)**

The scheduler will log progress. You can check status anytime:

```bash
docker compose logs scheduler --since 10m | tail -20
```

Or use the admin dashboard at `http://192.168.0.241:3001`: the job appears in the Jobs tab with status `running`, and progress is tracked in the `processed`/`succeeded`/`failed` fields.

Final expected stats after completion:
```
Total processed: 20284
Created: ~17000-19000 (new US churches)
Updated: ~1000-3000 (matched against OSM/MassTimes churches)
Errors: < 50 (network blips)
Mass schedules: ~70000-90000
Confession sched: ~30000-50000
Adoration sched: ~10000-20000
```
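With error counts in the expected range, the error-rate check in `main()` records the job as `completed`; it switches to `failed` only when more than 10% of processed churches error out. A small sketch of that threshold in isolation, where `statusFromErrorRate` is a hypothetical name used only for illustration:

```typescript
type JobStatus = 'completed' | 'failed';

// Mirrors the comparison used in main(): strictly more than 10% errors fails the job.
function statusFromErrorRate(total: number, errors: number): JobStatus {
  return errors > total * 0.1 ? 'failed' : 'completed';
}

console.log(statusFromErrorRate(20284, 50));   // completed
console.log(statusFromErrorRate(20284, 2029)); // failed
```

For a full run of 20,284 churches the cutoff is 2,028.4, so 2,028 errors still counts as `completed` and 2,029 is the first count that fails.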