Design: DiscoverMass.com Importer (USA)
Overview
Import 20,284 US Catholic churches with full schedules from discovermass.com. This fills the largest gap in US coverage — MassTimes.org runs with --skip-us, leaving most US churches absent from the database.
Note: St. Paul the Apostle, Chino Hills CA (the motivating example) was manually seeded. The importer will match it by proximity+name and link discovermassId without duplicating it.
Source
- Site: https://discovermass.com
- Coverage: USA only
- Data: 20,284 churches with name, address, phone, website, coordinates, mass times, confessions, adoration
- Backend: WordPress "javo-directory" theme + MassTimes.org data provider
- robots.txt: Crawl-delay: 10 (must be followed)
Enumeration Strategy
11 WordPress item sitemaps, 2000 entries each (last has 284):
https://discovermass.com/wp-sitemap-posts-item-{1..11}.xml
Extract all <loc> URLs. URL pattern:
https://discovermass.com/church/{slug}/
Slug becomes discovermassId (e.g. "st-paul-the-apostle-chino-hills-ca").
No rate limiting on sitemap fetches (11 requests total).
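The enumeration step could be sketched roughly as follows. The function names sitemapUrls and extractSlugs are hypothetical, and the regex-based <loc> extraction is an assumption that holds only if the sitemaps are plain WordPress XML with one URL per <loc> element.

```typescript
// Hypothetical helpers; only the URL patterns come from the design above.
const SITEMAP_COUNT = 11;

/** Build the 11 item-sitemap URLs. */
function sitemapUrls(): string[] {
  const urls: string[] = [];
  for (let i = 1; i <= SITEMAP_COUNT; i++) {
    urls.push(`https://discovermass.com/wp-sitemap-posts-item-${i}.xml`);
  }
  return urls;
}

/** Extract church slugs (future discovermassId values) from one sitemap's XML. */
function extractSlugs(xml: string): string[] {
  const slugs: string[] = [];
  const locPattern =
    /<loc>\s*https:\/\/discovermass\.com\/church\/([^\/<\s]+)\/?\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = locPattern.exec(xml)) !== null) {
    slugs.push(m[1]);
  }
  return slugs;
}
```

URLs that do not follow the /church/{slug}/ pattern are skipped rather than treated as errors.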
HTML Parsing
Server-rendered HTML. No JS execution needed.
Name
<meta property="og:title" content="Church Name" />
Address
Raw text node before <br /><br /> in the address block:
14085 Peyton Drive, Chino Hills, CA 91709
Parse: split on ", " — last segment is "{STATE} {ZIP}", second-to-last is city, rest is street.
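A minimal sketch of that split rule, assuming a two-letter state code and a 5-digit ZIP (optionally ZIP+4). Both of those validations, and the parseAddress name, are assumptions beyond what the design specifies.

```typescript
interface ParsedAddress {
  street: string;
  city: string;
  state: string;
  zip: string;
}

/** Split "street, city, ST ZIP" from the right; returns null if the shape is unexpected. */
function parseAddress(raw: string): ParsedAddress | null {
  const parts = raw.trim().split(", ");
  if (parts.length < 3) return null;

  // Last segment: "{STATE} {ZIP}" (ZIP+4 tolerated as an assumption).
  const stateZip = parts[parts.length - 1]
    .trim()
    .match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/);
  if (!stateZip) return null;

  return {
    // Everything before the city is the street, which may itself contain ", ".
    street: parts.slice(0, -2).join(", "),
    city: parts[parts.length - 2].trim(),
    state: stateZip[1],
    zip: stateZip[2],
  };
}
```

Working from the right keeps street values such as "100 Main St, Suite 2" intact.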
Phone + Website + Coordinates
From <div id="sidebar-info">:
<span class='side-phone attribute'>(909) 465-5503</span>
<span class='side-directions attribute'>
<a href='https://maps.google.com/maps?daddr=33.996887,-117.732407&ll='>Directions</a>
</span>
<span class='side-website attribute'>
<a href='http://www.sptacc.org'>Visit Website</a>
</span>
Extract lat/lng from daddr={lat},{lng} in the Directions href.
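As an illustration, the coordinate extraction might look like the sketch below. The parseDaddr name and the range check are assumptions. Returning null leaves the "missing coordinates" decision to the caller.

```typescript
interface Coordinates {
  latitude: number;
  longitude: number;
}

/** Read daddr={lat},{lng} from a Google Maps directions href. */
function parseDaddr(href: string): Coordinates | null {
  const match = href.match(/[?&]daddr=(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)/);
  if (!match) return null;

  const latitude = Number(match[1]);
  const longitude = Number(match[2]);

  // Reject values outside valid WGS84 ranges (an added sanity check).
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return null;
  return { latitude, longitude };
}
```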
Mass Schedule
The mass schedule <ul> contains <li> with <h5>Mass Times</h5> as first item, then one <li> per day:
<ul>
<li><h5>Mass Times</h5></li>
<li class=""><span class="label">Saturday</span>
<span class='serviceTime'><span class='time'>5:00pm</span></span>,
<span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
</li>
<li class=""><span class="label">Friday</span>
<span class='serviceTime'><span class='time'>8:00am</span></span>,
<span class='serviceTime'><span class='time'>7:00pm</span> - <span class='comment'>1st Fridays</span></span>
</li>
</ul>
Per <li>:
- Day: .label text → full English day name → dayOfWeek integer
- Per .serviceTime: .time text → 24h HH:MM; .language text → language; .comment text → notes
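The per-<li> extraction can be sketched with plain string matching, as below. This is a simplification: a real importer would more likely use an HTML parser, and the names textOfClass and parseMassLi are hypothetical. The sketch returns the raw day label and raw time text; mapping to dayOfWeek and 24h time is a separate step.

```typescript
interface ServiceTime {
  time: string;
  language?: string;
  comment?: string;
}

/** Text of the first <span class='cls'> in the fragment (either quote style). */
function textOfClass(html: string, cls: string): string | undefined {
  const m = html.match(new RegExp(`<span class=['"]${cls}['"]>([^<]*)</span>`));
  return m ? m[1].trim() : undefined;
}

/** Parse one day <li> of the mass schedule into its label and service times. */
function parseMassLi(liHtml: string): { day?: string; services: ServiceTime[] } {
  const day = textOfClass(liHtml, "label");
  // Each chunk after a serviceTime opening tag holds one service's inner spans.
  const chunks = liHtml.split(/<span class=['"]serviceTime['"]>/).slice(1);

  const services: ServiceTime[] = [];
  for (const chunk of chunks) {
    const time = textOfClass(chunk, "time");
    if (!time) continue;
    const service: ServiceTime = { time };
    // Language text is wrapped in parentheses, e.g. "(Vietnamese)".
    const language = textOfClass(chunk, "language")?.replace(/^\(|\)$/g, "");
    const comment = textOfClass(chunk, "comment");
    if (language) service.language = language;
    if (comment) service.comment = comment;
    services.push(service);
  }
  return { day, services };
}
```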
Other Services (Confessions + Adoration)
Separate <ul> containing <li class="Confessions"> and <li class="Adoration">:
<li class="Confessions"><span class="label">Confessions</span>
<span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
<span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
<li class="Adoration"><span class="label">Adoration</span>
<span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
Per .serviceTime inside Confessions/Adoration:
- Day prefix before ":" in abbreviated form ("Tue", "Thr", "Sat", "Fri", "Weekdays", "Daily", "Sun", "Mon", "Wed", "Thu")
- Time range: {start}-{end} in <span class='time'>
- Notes: .comment text
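One way to pull these three pieces out of a single .serviceTime body is sketched below. The parseServiceWindow name is hypothetical, and the regex assumes the exact markup shown in the sample above (day prefix, then the time span, then an optional comment span).

```typescript
interface ServiceWindow {
  dayPrefix: string; // e.g. "Tue", "Weekdays"
  start: string;     // raw 12h text, e.g. "8:30am"
  end: string;       // raw 12h text, e.g. "9:00am"
  comment?: string;
}

/** Parse the inner HTML of one Confessions/Adoration serviceTime span. */
function parseServiceWindow(inner: string): ServiceWindow | null {
  const m = inner.match(
    /^\s*([A-Za-z]+):\s*<span class=['"]time['"]>\s*([^<-]+?)\s*-\s*([^<]+?)\s*<\/span>(?:\s*-\s*<span class=['"]comment['"]>([^<]*)<\/span>)?/,
  );
  if (!m) return null;

  const result: ServiceWindow = { dayPrefix: m[1], start: m[2], end: m[3] };
  if (m[4] !== undefined && m[4].trim() !== "") result.comment = m[4].trim();
  return result;
}
```

Entries that do not match (for example a free-text time) return null so the caller can log and skip them instead of guessing.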
Day Mappings
Full names (mass schedule)
| Label | dayOfWeek |
|---|---|
| Sunday | 0 |
| Monday | 1 |
| Tuesday | 2 |
| Wednesday | 3 |
| Thursday | 4 |
| Friday | 5 |
| Saturday | 6 |
Abbreviated names (confessions/adoration)
| Prefix | dayOfWeek(s) |
|---|---|
| Sun | [0] |
| Mon | [1] |
| Tue | [2] |
| Wed | [3] |
| Thr / Thu | [4] |
| Fri | [5] |
| Sat | [6] |
| Weekdays | [1,2,3,4,5] |
| Daily | [0,1,2,3,4,5,6] |
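Both tables translate directly into lookup maps. The sketch below uses hypothetical names (FULL_DAY, PREFIX_DAYS, daysForPrefix) and, as an assumption, returns an empty array for an unknown prefix so that unexpected labels can be reported rather than silently mapped.

```typescript
// Full day names used by the mass schedule.
const FULL_DAY = new Map<string, number>([
  ["Sunday", 0], ["Monday", 1], ["Tuesday", 2], ["Wednesday", 3],
  ["Thursday", 4], ["Friday", 5], ["Saturday", 6],
]);

// Abbreviated prefixes used by confessions/adoration; one prefix may expand to several days.
const PREFIX_DAYS = new Map<string, number[]>([
  ["Sun", [0]], ["Mon", [1]], ["Tue", [2]], ["Wed", [3]],
  ["Thr", [4]], ["Thu", [4]], ["Fri", [5]], ["Sat", [6]],
  ["Weekdays", [1, 2, 3, 4, 5]],
  ["Daily", [0, 1, 2, 3, 4, 5, 6]],
]);

/** Expand a prefix such as "Tue" or "Weekdays:" into dayOfWeek values. */
function daysForPrefix(prefix: string): number[] {
  const key = prefix.trim().replace(/:$/, "");
  return PREFIX_DAYS.get(key) ?? [];
}
```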
Time conversion
"5:00pm" → "17:00", "12:00pm" → "12:00", "12:00am" → "00:00". Strip leading zeros on input hour then pad to 2 digits in output.
Matching Strategy
1. discovermassId exact match (re-run safety; also matches manually seeded St. Paul)
2. Name + proximity (200m) against existing US churches (via findDuplicateChurch)
3. Unmatched: create new church, country: 'US', source: 'discovermass', latitude/longitude from daddr (or 0 if missing)
When updating an existing church: only fill blank fields (don't overwrite existing phone/website). Always replace mass/confession/adoration schedules.
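The "only fill blank fields" rule for contact data might be expressed as below. The ContactFields shape and the fillBlanks name are hypothetical. Treating empty or whitespace-only strings as blank is an assumption.

```typescript
interface ContactFields {
  phone: string | null;
  website: string | null;
}

/** Keep existing values; take the scraped value only where the existing one is blank. */
function fillBlanks(
  existing: ContactFields,
  incoming: Partial<ContactFields>,
): ContactFields {
  const isBlank = (v: string | null | undefined): boolean =>
    v == null || v.trim() === "";

  return {
    phone: isBlank(existing.phone) ? (incoming.phone ?? existing.phone) : existing.phone,
    website: isBlank(existing.website)
      ? (incoming.website ?? existing.website)
      : existing.website,
  };
}
```

Schedules are deliberately not handled here, since the design replaces them wholesale on every run.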
Schema Addition
discovermassId String? @unique @map("discovermass_id")
@@index([discovermassId])
church-matcher.ts additions
ExistingChurch:
discovermassId: string | null;
ChurchCandidate:
discovermassId?: string;
New pass before the proximity pass:
// Pass N: discovermassId exact match
if (candidate.discovermassId) {
const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
if (match) return match;
}
CLI
npx tsx scripts/import-discovermass.ts --all
npx tsx scripts/import-discovermass.ts --all --dry-run
npx tsx scripts/import-discovermass.ts --all --resume-from 5000
npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
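A hand-rolled sketch of how the script could read these flags follows. The ImportOptions shape and the parseImportArgs name are hypothetical. Interpreting --resume-from as a zero-based index into the enumerated church list is an assumption, since the design does not spell out its semantics.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  resumeFrom: number; // assumed: index into the enumerated slug list
  jobId?: string;
}

/** Parse process.argv.slice(2) style arguments for import-discovermass.ts. */
function parseImportArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = { all: false, dryRun: false, resumeFrom: 0 };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case "--all":
        opts.all = true;
        break;
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--resume-from": {
        const n = Number(argv[++i]);
        if (!Number.isInteger(n) || n < 0) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      case "--job-id": {
        const id = argv[++i];
        if (!id) throw new Error("--job-id expects a value");
        opts.jobId = id;
        break;
      }
      default:
        throw new Error(`Unknown argument: ${argv[i]}`);
    }
  }
  return opts;
}
```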
Scheduler Integration
Add to PIPELINE_GROUPS[0].phases after masstimes-api-import:
{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
Add to getJobCommand():
case 'discovermass-import': {
const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
return { command: 'npx', args };
}
Add to package.json:
"import:discovermass": "tsx scripts/import-discovermass.ts"
Rate Limiting
- Sitemaps: No delay (11 requests)
- Church pages: 10,000ms between requests (respects Crawl-delay: 10)
- Total runtime: ~56 hours (20,284 × 10s)
- Deployment: Docker scheduler container (npm run scheduler → discovermass-import job)
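The throttling and the runtime estimate can be sketched together. The names crawlSequentially and estimateRuntimeHours are hypothetical, and the handler stands in for whatever fetch-and-parse step the importer uses. The estimate counts only the delay, so time spent on each request would add to it.

```typescript
const CRAWL_DELAY_MS = 10000; // from robots.txt Crawl-delay: 10

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Process items one at a time, waiting delayMs between consecutive requests. */
async function crawlSequentially<T>(
  items: T[],
  handle: (item: T, index: number) => Promise<void>,
  opts: { delayMs?: number; resumeFrom?: number } = {},
): Promise<number> {
  const delayMs = opts.delayMs ?? CRAWL_DELAY_MS;
  let processed = 0;
  for (let i = opts.resumeFrom ?? 0; i < items.length; i++) {
    if (processed > 0) await sleep(delayMs); // no wait before the first request
    await handle(items[i], i);
    processed++;
  }
  return processed;
}

/** Delay-only lower bound on total runtime, in hours. */
function estimateRuntimeHours(pageCount: number, delayMs = CRAWL_DELAY_MS): number {
  return (pageCount * delayMs) / 3600000;
}
```

For 20,284 pages at 10s each this gives 202,840 seconds, roughly 56.3 hours, which matches the figure above.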
Estimated Scale
| Metric | Value |
|---|---|
| Churches | 20,284 (majority new for US) |
| Mass schedules | ~80,000 (avg 4 per church) |
| Confession schedules | ~40,000 (avg 2 per church) |
| Runtime | ~56 hours |
| Coordinates | Yes (from Google Maps daddr) |