ScraperControl/docs/superpowers/specs/2026-03-10-discovermass-design.md
albertfj114 bbef80a782 docs: add discovermass.com importer spec and implementation plan
20,284 US churches with mass/confession/adoration schedules.
10s crawl delay (robots.txt), Docker deployment via scheduler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 21:49:51 -04:00


Design: DiscoverMass.com Importer (USA)

Overview

Import 20,284 US Catholic churches with full schedules from discovermass.com. This fills the largest gap in US coverage — MassTimes.org runs with --skip-us, leaving most US churches absent from the database.

Note: St. Paul the Apostle, Chino Hills CA (the motivating example) was manually seeded. The importer will match it by proximity+name and link discovermassId without duplicating it.


Source

  • Site: https://discovermass.com
  • Coverage: USA only
  • Data: 20,284 churches with name, address, phone, website, coordinates, mass times, confessions, adoration
  • Backend: WordPress "javo-directory" theme + MassTimes.org data provider
  • robots.txt: Crawl-delay: 10 — must be followed

Enumeration Strategy

11 WordPress item sitemaps with 2,000 entries each, except the last, which has 284 (10 × 2,000 + 284 = 20,284):

https://discovermass.com/wp-sitemap-posts-item-{1..11}.xml

Extract all <loc> URLs. URL pattern:

https://discovermass.com/church/{slug}/

Slug becomes discovermassId (e.g. "st-paul-the-apostle-chino-hills-ca").

No rate limiting on sitemap fetches (11 requests total).
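The enumeration step could be sketched roughly as follows. The sitemap count and URL shapes come from this spec; the function names (sitemapUrls, extractLocs, slugFromUrl) are hypothetical, and the regex-based <loc> extraction is an assumption that holds only for simple, well-formed sitemap XML.

```typescript
// Sketch only: helper names are illustrative, not part of the existing codebase.
const SITEMAP_COUNT = 11;

// Build the 11 sitemap URLs described in the spec.
function sitemapUrls(): string[] {
  const urls: string[] = [];
  for (let i = 1; i <= SITEMAP_COUNT; i++) {
    urls.push(`https://discovermass.com/wp-sitemap-posts-item-${i}.xml`);
  }
  return urls;
}

// Pull every <loc> value out of one sitemap document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    if (m[1]) locs.push(m[1]);
  }
  return locs;
}

// Turn a church URL into its slug (the discovermassId), or null if it does not match.
function slugFromUrl(url: string): string | null {
  const m = /^https:\/\/discovermass\.com\/church\/([^/]+)\/?$/.exec(url);
  return m && m[1] ? m[1] : null;
}
```

Returning null for non-matching URLs lets the caller skip unexpected sitemap entries instead of storing a malformed discovermassId.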


HTML Parsing

Server-rendered HTML. No JS execution needed.

Name

<meta property="og:title" content="Church Name" />

Address

Raw text node before <br /><br /> in the address block:

14085 Peyton Drive, Chino Hills, CA 91709

Parse: split on ", " — last segment is "{STATE} {ZIP}", second-to-last is city, rest is street.
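A minimal sketch of that split, assuming the last segment is always a two-letter state code followed by a 5-digit ZIP or ZIP+4 (the validation regex and the parseAddress name are assumptions, not taken from the site):

```typescript
interface ParsedAddress {
  street: string;
  city: string;
  state: string;
  zip: string;
}

// Split "street, city, ST ZIP" from the right so commas inside the street part survive.
function parseAddress(raw: string): ParsedAddress | null {
  const parts = raw.trim().split(", ");
  if (parts.length < 3) return null;
  const stateZip = (parts[parts.length - 1] || "").trim();
  const city = (parts[parts.length - 2] || "").trim();
  const street = parts.slice(0, -2).join(", ");
  // Assumed shape of the final segment: "CA 91709" or "CA 91709-1234".
  const m = /^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/.exec(stateZip);
  if (!m || !m[1] || !m[2]) return null;
  return { street, city, state: m[1], zip: m[2] };
}
```

Addresses that do not fit the pattern return null here; whether the importer should then store the raw string or skip the address is left open.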

Phone + Website + Coordinates

From <div id="sidebar-info">:

<span class='side-phone attribute'>(909) 465-5503</span>
<span class='side-directions attribute'>
  <a href='https://maps.google.com/maps?daddr=33.996887,-117.732407&ll='>Directions</a>
</span>
<span class='side-website attribute'>
  <a href='http://www.sptacc.org'>Visit Website</a>
</span>

Extract lat/lng from daddr={lat},{lng} in the Directions href.
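One way to pull the coordinates out of the Directions href is sketched below. The parseDaddr name and the range check are assumptions added for illustration.

```typescript
interface Coordinates {
  latitude: number;
  longitude: number;
}

// Read "daddr={lat},{lng}" from a Google Maps directions link.
function parseDaddr(href: string): Coordinates | null {
  const m = /[?&]daddr=(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)/.exec(href);
  if (!m) return null;
  const latitude = Number(m[1]);
  const longitude = Number(m[2]);
  // Reject values outside the valid coordinate ranges.
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return null;
  return { latitude, longitude };
}
```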

Mass Schedule

The mass schedule <ul> starts with an <li> holding <h5>Mass Times</h5>, followed by one <li> per day:

<ul>
  <li><h5>Mass Times</h5></li>
  <li class=""><span class="label">Saturday</span>
    <span class='serviceTime'><span class='time'>5:00pm</span></span>,
    <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
  </li>
  <li class=""><span class="label">Friday</span>
    <span class='serviceTime'><span class='time'>8:00am</span></span>,
    <span class='serviceTime'><span class='time'>7:00pm</span> - <span class='comment'>1st Fridays</span></span>
  </li>
</ul>

Per <li>:

  • Day: .label text → full English day name → dayOfWeek integer
  • Per .serviceTime: .time text → 24h HH:MM; .language text → language; .comment text → notes
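The per-<li> extraction might look like the sketch below. It uses regular expressions on the sample markup purely to stay dependency-free; an actual importer would more likely use an HTML parser. The names MassEntry, firstSpanText, and parseMassLi are hypothetical, and stripping the parentheses around the language is an assumption.

```typescript
interface MassEntry {
  day: string; // full English day name from .label
  time: string; // raw time text, e.g. "5:00pm"
  language?: string; // from .language, surrounding parentheses removed
  notes?: string; // from .comment
}

// Text of the first <span class="cls"> in a fragment (either quote style).
function firstSpanText(html: string, cls: string): string | undefined {
  const re = new RegExp(`<span class=['"]${cls}['"]>([^<]*)</span>`);
  const m = re.exec(html);
  return m && m[1] !== undefined ? m[1].trim() : undefined;
}

// Parse one day <li>; the header <li> has no .label and yields no entries.
function parseMassLi(liHtml: string): MassEntry[] {
  const day = firstSpanText(liHtml, "label");
  if (!day) return [];
  // Each chunk runs from one serviceTime opening tag to the next.
  const chunks = liHtml.split(/<span class=['"]serviceTime['"]>/).slice(1);
  const entries: MassEntry[] = [];
  for (const chunk of chunks) {
    const time = firstSpanText(chunk, "time");
    if (!time) continue;
    const entry: MassEntry = { day, time };
    const language = firstSpanText(chunk, "language");
    if (language) entry.language = language.replace(/^\(|\)$/g, "");
    const notes = firstSpanText(chunk, "comment");
    if (notes) entry.notes = notes;
    entries.push(entry);
  }
  return entries;
}
```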

Other Services (Confessions + Adoration)

Separate <ul> containing <li class="Confessions"> and <li class="Adoration">:

<li class="Confessions"><span class="label">Confessions</span>
  <span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
  <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
<li class="Adoration"><span class="label">Adoration</span>
  <span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
</li>

Per .serviceTime inside Confessions/Adoration:

  • Day prefix before : — abbreviated form ("Tue", "Thr", "Sat", "Fri", "Weekdays", "Daily", "Sun", "Mon", "Wed", "Thu")
  • Time range: {start}-{end} in <span class='time'>
  • Notes: .comment text
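A regex-based sketch of the Confessions/Adoration extraction, written against the sample markup above. OtherService and parseOtherServiceTimes are hypothetical names, and the pattern assumes the day prefix is a single alphabetic word followed by a colon.

```typescript
interface OtherService {
  dayPrefix: string; // e.g. "Tue", "Weekdays"
  start: string; // raw start time, e.g. "8:30am"
  end: string; // raw end time, e.g. "9:00am"
  notes?: string; // from .comment
}

// Extract every serviceTime inside one Confessions or Adoration <li>.
function parseOtherServiceTimes(liHtml: string): OtherService[] {
  const re =
    /<span class=['"]serviceTime['"]>\s*([A-Za-z]+):\s*<span class=['"]time['"]>([^<-]+)-([^<]+)<\/span>(?:\s*-\s*<span class=['"]comment['"]>([^<]*)<\/span>)?/g;
  const out: OtherService[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(liHtml)) !== null) {
    const entry: OtherService = {
      dayPrefix: m[1] || "",
      start: (m[2] || "").trim(),
      end: (m[3] || "").trim(),
    };
    if (m[4]) entry.notes = m[4].trim();
    out.push(entry);
  }
  return out;
}
```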

Day Mappings

Full names (mass schedule)

Label      dayOfWeek
Sunday     0
Monday     1
Tuesday    2
Wednesday  3
Thursday   4
Friday     5
Saturday   6

Abbreviated names (confessions/adoration)

Prefix     dayOfWeek(s)
Sun        [0]
Mon        [1]
Tue        [2]
Wed        [3]
Thr / Thu  [4]
Fri        [5]
Sat        [6]
Weekdays   [1,2,3,4,5]
Daily      [0,1,2,3,4,5,6]
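The two mappings translate directly into lookup tables. The sketch below uses hypothetical names (FULL_DAY, DAY_PREFIX, dayOfWeek, daysForPrefix) and assumes unknown labels should yield null or an empty list rather than throw.

```typescript
// Full day names used by the mass schedule.
const FULL_DAY: Record<string, number> = {
  Sunday: 0, Monday: 1, Tuesday: 2, Wednesday: 3, Thursday: 4, Friday: 5, Saturday: 6,
};

// Abbreviated prefixes used by confessions and adoration.
const DAY_PREFIX: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3], Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

// Own-property lookup so keys like "constructor" never resolve to inherited members.
function lookup<T>(table: Record<string, T>, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

function dayOfWeek(label: string): number | null {
  const d = lookup(FULL_DAY, label.trim());
  return d === undefined ? null : d;
}

function daysForPrefix(prefix: string): number[] {
  const days = lookup(DAY_PREFIX, prefix.trim());
  return days === undefined ? [] : days.slice();
}
```

daysForPrefix returns a copy so callers can modify the result without altering the shared table.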

Time conversion

"5:00pm" → "17:00", "12:00pm" → "12:00", "12:00am" → "00:00". Strip any leading zero from the input hour, then zero-pad the hour to two digits in the output.

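A minimal sketch of the 12-hour to 24-hour conversion, assuming inputs always have the shape H:MMam or H:MMpm (the to24h name and the null return for malformed input are assumptions):

```typescript
// Convert "5:00pm" style times to zero-padded 24-hour "HH:MM".
function to24h(input: string): string | null {
  const m = /^\s*(\d{1,2}):(\d{2})\s*(am|pm)\s*$/i.exec(input);
  if (!m) return null;
  let hour = parseInt(m[1] || "", 10); // parseInt drops any leading zero
  const minute = m[2] || "";
  const meridiem = (m[3] || "").toLowerCase();
  if (hour < 1 || hour > 12 || parseInt(minute, 10) > 59) return null;
  if (meridiem === "am") {
    if (hour === 12) hour = 0; // 12:xxam is just after midnight
  } else if (hour !== 12) {
    hour += 12; // 12:xxpm stays 12
  }
  return `${hour < 10 ? "0" : ""}${hour}:${minute}`;
}
```

For a confession or adoration range such as "8:30am-9:00am", the same function can be applied to each half after splitting on the hyphen.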

Matching Strategy

  1. discovermassId exact match (re-run safety; also matches the manually seeded St. Paul once a previous run has linked its discovermassId)
  2. Name + proximity (200m) against existing US churches (via findDuplicateChurch)
  3. Unmatched: create new church, country: 'US', source: 'discovermass', latitude/longitude from daddr (or 0 if missing)

When updating an existing church: only fill blank fields (don't overwrite existing phone/website). Always replace mass/confession/adoration schedules.
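The fill-blank-only rule could be sketched as below. The Fields type and fillBlankFields name are hypothetical, and treating null, undefined, and whitespace-only strings as blank is an assumption. Schedule replacement is not covered here.

```typescript
type Fields = Record<string, string | null | undefined>;

function isBlank(value: string | null | undefined): boolean {
  return value === null || value === undefined || value.trim() === "";
}

// Copy incoming values only into fields that are currently blank.
function fillBlankFields(existing: Fields, incoming: Fields): Fields {
  const merged: Fields = { ...existing };
  for (const key of Object.keys(incoming)) {
    const next = incoming[key];
    if (isBlank(merged[key]) && !isBlank(next)) {
      merged[key] = next;
    }
  }
  return merged;
}
```

For example, an existing record with a phone number but no website keeps its phone and gains the scraped website.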


Schema Addition

discovermassId  String?  @unique @map("discovermass_id")
@@index([discovermassId])

church-matcher.ts additions

ExistingChurch:

discovermassId: string | null;

ChurchCandidate:

discovermassId?: string;

New pass before the proximity pass:

// Pass N: discovermassId exact match
if (candidate.discovermassId) {
  const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
  if (match) return match;
}

CLI

npx tsx scripts/import-discovermass.ts --all
npx tsx scripts/import-discovermass.ts --all --dry-run
npx tsx scripts/import-discovermass.ts --all --resume-from 5000
npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
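The four flags above could be parsed along these lines. The ImportOptions shape, the parseImportArgs name, and the default resumeFrom of 0 are assumptions; the spec lists the flags but does not define their exact semantics.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  resumeFrom: number; // assumed default: 0
  jobId?: string;
}

// Parse the arguments that follow the script name (e.g. process.argv.slice(2)).
function parseImportArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = { all: false, dryRun: false, resumeFrom: 0 };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--all") {
      opts.all = true;
    } else if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--resume-from") {
      const n = Number(argv[++i]);
      if (!isFinite(n) || n < 0 || Math.floor(n) !== n) {
        throw new Error("--resume-from expects a non-negative integer");
      }
      opts.resumeFrom = n;
    } else if (arg === "--job-id") {
      const id = argv[++i];
      if (!id) throw new Error("--job-id expects a value");
      opts.jobId = id;
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return opts;
}
```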

Scheduler Integration

Add to PIPELINE_GROUPS[0].phases after masstimes-api-import:

{ name: 'discovermass-import', type: 'discovermass-import', config: {} },

Add to getJobCommand():

case 'discovermass-import': {
  const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}

Add to package.json:

"import:discovermass": "tsx scripts/import-discovermass.ts"

Rate Limiting

  • Sitemaps: No delay (11 requests)
  • Church pages: 10,000ms between requests (respects Crawl-delay: 10)
  • Total runtime: at least ~56 hours (20,284 × 10 s = 202,840 s), excluding fetch and parse time
  • Deployment: Docker scheduler container (npm run scheduler → discovermass-import job)
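The delay and the runtime estimate above can be sketched as follows. The names (CRAWL_DELAY_MS, estimateRuntimeHours, crawlSequentially) are hypothetical, fetchPage stands in for whatever fetch-and-parse routine the importer uses, and treating startIndex as an offset into the enumerated URL list is an assumption about how --resume-from behaves.

```typescript
const CRAWL_DELAY_MS = 10000; // matches Crawl-delay: 10 in robots.txt

// Lower bound on runtime: delay only, ignoring fetch and parse time.
function estimateRuntimeHours(pageCount: number, delayMs: number = CRAWL_DELAY_MS): number {
  return (pageCount * delayMs) / 1000 / 3600;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Visit pages one at a time, pausing between requests; returns the number processed.
async function crawlSequentially(
  urls: string[],
  fetchPage: (url: string) => Promise<void>,
  delayMs: number = CRAWL_DELAY_MS,
  startIndex: number = 0,
): Promise<number> {
  let processed = 0;
  for (let i = startIndex; i < urls.length; i++) {
    const url = urls[i];
    if (url === undefined) continue;
    await fetchPage(url);
    processed++;
    // No pause is needed after the final page.
    if (i < urls.length - 1) await sleep(delayMs);
  }
  return processed;
}
```

estimateRuntimeHours(20284) evaluates to roughly 56.3, in line with the figure above.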

Estimated Scale

Metric                Value
Churches              20,284 (majority new for US)
Mass schedules        ~80,000 (avg 4 per church)
Confession schedules  ~40,000 (avg 2 per church)
Runtime               ~56 hours
Coordinates           Yes (from Google Maps daddr)