ScraperControl/docs/superpowers/specs/2026-03-10-discovermass-design.md
albertfj114 bbef80a782 docs: add discovermass.com importer spec and implementation plan
20,284 US churches with mass/confession/adoration schedules.
10s crawl delay (robots.txt), Docker deployment via scheduler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 21:49:51 -04:00

# Design: DiscoverMass.com Importer (USA)
## Overview
Import 20,284 US Catholic churches with full schedules from discovermass.com. This fills the largest gap in US coverage — MassTimes.org runs with `--skip-us`, leaving most US churches absent from the database.
**Note:** St. Paul the Apostle, Chino Hills CA (the motivating example) was manually seeded. The importer will match it by proximity+name and link `discovermassId` without duplicating it.
---
## Source
- **Site**: https://discovermass.com
- **Coverage**: USA only
- **Data**: 20,284 churches with name, address, phone, website, coordinates, mass times, confessions, adoration
- **Backend**: WordPress "javo-directory" theme + MassTimes.org data provider
- **robots.txt**: `Crawl-delay: 10` — must be followed
---
## Enumeration Strategy
11 WordPress item sitemaps, 2000 entries each (last has 284):
```
https://discovermass.com/wp-sitemap-posts-item-{1..11}.xml
```
Extract all `<loc>` URLs. URL pattern:
```
https://discovermass.com/church/{slug}/
```
Slug becomes `discovermassId` (e.g. `"st-paul-the-apostle-chino-hills-ca"`).
No rate limiting on sitemap fetches (11 requests total).
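The enumeration step can be sketched as below. This is a minimal illustration, not the importer's actual code: `sitemapUrls`, `extractLocs`, and `slugFromUrl` are hypothetical helper names, and the regex-based `<loc>` extraction assumes the sitemaps are flat, well-formed XML.

```typescript
// Number of item sitemaps, taken from the enumeration strategy above.
const SITEMAP_COUNT = 11;

// Build the sitemap URLs from the documented pattern.
function sitemapUrls(): string[] {
  return Array.from(
    { length: SITEMAP_COUNT },
    (_, i) => `https://discovermass.com/wp-sitemap-posts-item-${i + 1}.xml`,
  );
}

// Collect every <loc> value from a sitemap document.
// Assumption: a regex is sufficient here; a real XML parser would be more robust.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    const loc = m[1];
    if (loc) locs.push(loc);
  }
  return locs;
}

// Return the slug from a church URL, or null if the URL does not fit the pattern.
function slugFromUrl(url: string): string | null {
  const m = /^https:\/\/discovermass\.com\/church\/([^/]+)\/?$/.exec(url);
  return m?.[1] ?? null;
}
```

Returning `null` for non-matching URLs lets the caller log and skip unexpected sitemap entries instead of storing a malformed `discovermassId`.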
---
## HTML Parsing
Server-rendered HTML. No JS execution needed.
### Name
`<meta property="og:title" content="Church Name" />`
### Address
Raw text node before `<br /><br />` in the address block:
```
14085 Peyton Drive, Chino Hills, CA 91709
```
Parse: split on `", "` — last segment is `"{STATE} {ZIP}"`, second-to-last is city, rest is street.
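The split rule above could look roughly like this. `parseAddress` and `ParsedAddress` are hypothetical names, and the state/ZIP pattern (two uppercase letters, then a 5-digit or ZIP+4 code) is an assumption about the site's formatting.

```typescript
interface ParsedAddress {
  street: string;
  city: string;
  state: string;
  zip: string;
}

// Split on ", ": last segment is "{STATE} {ZIP}", second-to-last is the city,
// and everything before that is the street (which may itself contain commas).
function parseAddress(raw: string): ParsedAddress | null {
  const parts = raw.trim().split(", ");
  if (parts.length < 3) return null;

  const stateZip = (parts[parts.length - 1] ?? "").trim();
  const city = (parts[parts.length - 2] ?? "").trim();
  const street = parts.slice(0, -2).join(", ").trim();

  // Assumption: state is a two-letter code and ZIP is 5 digits or ZIP+4.
  const m = /^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/.exec(stateZip);
  if (!m) return null;

  return { street, city, state: m[1] ?? "", zip: m[2] ?? "" };
}
```

Joining the leading segments back together keeps streets such as `100 Main St, Suite 2` intact.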
### Phone + Website + Coordinates
From `<div id="sidebar-info">`:
```html
<span class='side-phone attribute'>(909) 465-5503</span>
<span class='side-directions attribute'>
<a href='https://maps.google.com/maps?daddr=33.996887,-117.732407&ll='>Directions</a>
</span>
<span class='side-website attribute'>
<a href='http://www.sptacc.org'>Visit Website</a>
</span>
```
Extract lat/lng from `daddr={lat},{lng}` in the Directions href.
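A possible extraction helper for the `daddr` parameter is sketched here. `extractLatLng` is a hypothetical name; it returns `null` when the parameter is absent or malformed, leaving the fallback described under Matching Strategy to the caller.

```typescript
interface LatLng {
  latitude: number;
  longitude: number;
}

// Read "daddr={lat},{lng}" out of the Directions href.
function extractLatLng(href: string): LatLng | null {
  const m = /[?&]daddr=(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)/.exec(href);
  if (!m) return null;

  const latitude = Number(m[1]);
  const longitude = Number(m[2]);
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null;

  return { latitude, longitude };
}
```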
### Mass Schedule
The mass schedule `<ul>` starts with a heading `<li>` containing `<h5>Mass Times</h5>`, followed by one `<li>` per day:
```html
<ul>
<li><h5>Mass Times</h5></li>
<li class=""><span class="label">Saturday</span>
<span class='serviceTime'><span class='time'>5:00pm</span></span>,
<span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
</li>
<li class=""><span class="label">Friday</span>
<span class='serviceTime'><span class='time'>8:00am</span></span>,
<span class='serviceTime'><span class='time'>7:00pm</span> - <span class='comment'>1st Fridays</span></span>
</li>
</ul>
```
Per `<li>`:
- **Day**: `.label` text → full English day name → `dayOfWeek` integer
- **Per `.serviceTime`**: `.time` text → 24h HH:MM; `.language` text → language; `.comment` text → notes
### Other Services (Confessions + Adoration)
Separate `<ul>` containing `<li class="Confessions">` and `<li class="Adoration">`:
```html
<li class="Confessions"><span class="label">Confessions</span>
<span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
<span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
<li class="Adoration"><span class="label">Adoration</span>
<span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
```
Per `.serviceTime` inside Confessions/Adoration:
- **Day prefix** before `:` — abbreviated form ("Tue", "Thr", "Sat", "Fri", "Weekdays", "Daily", "Sun", "Mon", "Wed", "Thu")
- **Time range**: `{start}-{end}` in `<span class='time'>`
- **Notes**: `.comment` text
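As an illustration of the three parts listed above, the sketch below splits the flattened text of one `.serviceTime` element. It assumes the element's text content reads like `"Sat: 3:30pm-4:30pm - In the Chapel"`; a DOM-based parser reading `.time` and `.comment` directly would not need the regex. `parseServiceText` and `ServiceEntry` are hypothetical names.

```typescript
interface ServiceEntry {
  dayPrefix: string;    // e.g. "Tue", "Weekdays"
  start: string;        // e.g. "8:30am" (not yet converted to 24h)
  end: string;          // e.g. "9:00am"
  notes: string | null; // e.g. "In the Chapel"
}

// Parse "{prefix}: {start}-{end}[ - {notes}]" from a .serviceTime text node.
function parseServiceText(text: string): ServiceEntry | null {
  const re =
    /^\s*([A-Za-z]+):\s*(\d{1,2}:\d{2}[ap]m)\s*-\s*(\d{1,2}:\d{2}[ap]m)(?:\s+-\s+(.+?))?\s*$/i;
  const m = re.exec(text);
  if (!m) return null;

  return {
    dayPrefix: m[1] ?? "",
    start: m[2] ?? "",
    end: m[3] ?? "",
    notes: m[4] ?? null,
  };
}
```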
---
## Day Mappings
### Full names (mass schedule)
| Label | dayOfWeek |
|---|---|
| Sunday | 0 |
| Monday | 1 |
| Tuesday | 2 |
| Wednesday | 3 |
| Thursday | 4 |
| Friday | 5 |
| Saturday | 6 |
### Abbreviated names (confessions/adoration)
| Prefix | dayOfWeek(s) |
|---|---|
| Sun | [0] |
| Mon | [1] |
| Tue | [2] |
| Wed | [3] |
| Thr / Thu | [4] |
| Fri | [5] |
| Sat | [6] |
| Weekdays | [1,2,3,4,5] |
| Daily | [0,1,2,3,4,5,6] |
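Both tables translate directly into lookup maps. The sketch below uses hypothetical names (`dayOfWeek`, `daysForPrefix`) and returns `null` or an empty array for unrecognized labels so that unexpected values can be logged rather than silently mapped.

```typescript
// Full day names used in the mass schedule.
const FULL_DAYS = new Map<string, number>([
  ["Sunday", 0], ["Monday", 1], ["Tuesday", 2], ["Wednesday", 3],
  ["Thursday", 4], ["Friday", 5], ["Saturday", 6],
]);

// Abbreviated prefixes used in confessions/adoration; some expand to several days.
const ABBREVIATED_DAYS = new Map<string, number[]>([
  ["Sun", [0]], ["Mon", [1]], ["Tue", [2]], ["Wed", [3]],
  ["Thr", [4]], ["Thu", [4]], ["Fri", [5]], ["Sat", [6]],
  ["Weekdays", [1, 2, 3, 4, 5]],
  ["Daily", [0, 1, 2, 3, 4, 5, 6]],
]);

// Map a full day label to its dayOfWeek integer, or null if unknown.
function dayOfWeek(label: string): number | null {
  return FULL_DAYS.get(label.trim()) ?? null;
}

// Map an abbreviated prefix (with or without trailing ":") to dayOfWeek values.
function daysForPrefix(prefix: string): number[] {
  const key = prefix.trim().replace(/:$/, "");
  return ABBREVIATED_DAYS.get(key) ?? [];
}
```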
### Time conversion
`"5:00pm"` → `"17:00"`, `"12:00pm"` → `"12:00"`, `"12:00am"` → `"00:00"`. Strip any leading zero from the input hour, then zero-pad the hour to two digits in the output.
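A minimal sketch of this conversion follows. `to24h` is a hypothetical name, and the accepted input shape (`h:mm` followed by `am` or `pm`, case-insensitive) is an assumption based on the samples in this spec.

```typescript
// Convert "5:00pm" style times to zero-padded 24h "HH:MM"; null if unparseable.
function to24h(input: string): string | null {
  const m = /^\s*(\d{1,2}):(\d{2})\s*(am|pm)\s*$/i.exec(input);
  if (!m) return null;

  // parseInt drops any leading zero on the input hour ("08" becomes 8).
  let hour = parseInt(m[1] ?? "", 10);
  const minute = m[2] ?? "";
  const meridiem = (m[3] ?? "").toLowerCase();
  if (hour < 1 || hour > 12 || parseInt(minute, 10) > 59) return null;

  if (meridiem === "am") {
    if (hour === 12) hour = 0; // 12:xxam is just after midnight
  } else if (hour !== 12) {
    hour += 12;                // 1pm..11pm become 13..23; 12pm stays 12
  }

  return `${String(hour).padStart(2, "0")}:${minute}`;
}
```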
---
## Matching Strategy
1. `discovermassId` exact match (re-run safety — also matches manually seeded St. Paul)
2. Name + proximity (200m) against existing US churches (via `findDuplicateChurch`)
3. Unmatched: create new church, `country: 'US'`, `source: 'discovermass'`, `latitude`/`longitude` from `daddr` (or `0` if missing)
When updating an existing church: only fill blank fields (don't overwrite existing phone/website). Always replace mass/confession/adoration schedules.
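The "only fill blank fields" rule for contact data might be expressed as follows. `ChurchContact` and `mergeContact` are hypothetical names, and treating `null` and whitespace-only strings as blank is an assumption. Schedules are not covered here because they are always replaced.

```typescript
interface ChurchContact {
  phone: string | null;
  website: string | null;
}

// A value counts as blank when it is null/undefined or only whitespace (assumption).
function isBlank(value: string | null | undefined): boolean {
  return value == null || value.trim() === "";
}

// Keep the existing value unless it is blank and the scraped value is not.
function pickField(existing: string | null, scraped: string | null | undefined): string | null {
  if (!isBlank(existing)) return existing;
  return isBlank(scraped) ? existing : scraped ?? existing;
}

// Merge scraped contact data into an existing church without overwriting anything.
function mergeContact(existing: ChurchContact, scraped: Partial<ChurchContact>): ChurchContact {
  return {
    phone: pickField(existing.phone, scraped.phone),
    website: pickField(existing.website, scraped.website),
  };
}
```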
---
## Schema Addition
```prisma
discovermassId String? @unique @map("discovermass_id")
@@index([discovermassId])
```
---
## church-matcher.ts additions
`ExistingChurch`:
```typescript
discovermassId: string | null;
```
`ChurchCandidate`:
```typescript
discovermassId?: string;
```
New pass before the proximity pass:
```typescript
// Pass N: discovermassId exact match
if (candidate.discovermassId) {
  const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
  if (match) return match;
}
```
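To show how the ID pass sits ahead of the proximity pass, here is a self-contained sketch. It is not the real `findDuplicateChurch`: the reduced interfaces, the `normalizeName` comparison, and the haversine check are assumptions made for illustration; only the 200 m radius and the pass order come from this spec.

```typescript
interface ExistingChurch {
  name: string;
  latitude: number;
  longitude: number;
  discovermassId: string | null;
}

interface ChurchCandidate {
  name: string;
  latitude: number;
  longitude: number;
  discovermassId?: string;
}

const MATCH_RADIUS_METERS = 200;

// Great-circle distance between two points, in meters (haversine formula).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371000; // mean Earth radius in meters
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Assumption: names are compared case-insensitively with punctuation ignored.
const normalizeName = (s: string) => s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

function findMatch(candidate: ChurchCandidate, existing: ExistingChurch[]): ExistingChurch | null {
  // Pass 1: exact discovermassId match (re-run safety).
  if (candidate.discovermassId) {
    const byId = existing.find(c => c.discovermassId === candidate.discovermassId);
    if (byId) return byId;
  }
  // Pass 2: same normalized name within the match radius.
  const name = normalizeName(candidate.name);
  const nearby = existing.find(
    c =>
      normalizeName(c.name) === name &&
      haversineMeters(c.latitude, c.longitude, candidate.latitude, candidate.longitude) <=
        MATCH_RADIUS_METERS,
  );
  return nearby ?? null;
}
```

With this ordering, a manually seeded church without a `discovermassId` is picked up by the second pass on the first run and by the first pass on every later run, once the ID has been linked.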
---
## CLI
```bash
npx tsx scripts/import-discovermass.ts --all
npx tsx scripts/import-discovermass.ts --all --dry-run
npx tsx scripts/import-discovermass.ts --all --resume-from 5000
npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
```
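The four flags shown above could be parsed with a small hand-rolled loop like the one below. This is a sketch only: `ImportOptions` and `parseImportArgs` are hypothetical names, and the defaults (`resumeFrom` of 0, `jobId` of `null`) and the validation rules are assumptions, since the spec does not define them.

```typescript
interface ImportOptions {
  all: boolean;
  dryRun: boolean;
  resumeFrom: number;
  jobId: string | null;
}

// Parse the flags listed in the CLI section; throws on anything unexpected.
function parseImportArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = { all: false, dryRun: false, resumeFrom: 0, jobId: null };

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--all") {
      opts.all = true;
    } else if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--resume-from") {
      const n = Number(argv[++i]); // a missing value becomes NaN and is rejected below
      if (!Number.isInteger(n) || n < 0) {
        throw new Error("--resume-from expects a non-negative integer");
      }
      opts.resumeFrom = n;
    } else if (arg === "--job-id") {
      const id = argv[++i];
      if (!id) throw new Error("--job-id expects a value");
      opts.jobId = id;
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return opts;
}
```

In a script this would typically be called as `parseImportArgs(process.argv.slice(2))`.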
---
## Scheduler Integration
Add to `PIPELINE_GROUPS[0].phases` after `masstimes-api-import`:
```typescript
{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
```
Add to `getJobCommand()`:
```typescript
case 'discovermass-import': {
  const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```
Add to `package.json`:
```json
"import:discovermass": "tsx scripts/import-discovermass.ts"
```
---
## Rate Limiting
- **Sitemaps**: No delay (11 requests)
- **Church pages**: 10,000ms between requests (respects `Crawl-delay: 10`)
- **Total runtime**: ~56 hours (20,284 × 10s)
- **Deployment**: Docker scheduler container (`npm run scheduler` → `discovermass-import` job)
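The runtime figure and the per-page delay can be checked with a short sketch. `estimatedHours`, `sleep`, and `crawlSequentially` are hypothetical names; the `fetchPage` callback stands in for whatever HTTP client the importer uses.

```typescript
const CRAWL_DELAY_MS = 10000; // from robots.txt: Crawl-delay: 10
const CHURCH_COUNT = 20284;

// Lower bound on runtime: one delay per page, ignoring fetch and parse time.
function estimatedHours(pages: number, delayMs: number): number {
  return (pages * delayMs) / 3600000;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Fetch pages strictly one at a time, pausing delayMs between requests.
async function crawlSequentially<T>(
  urls: string[],
  fetchPage: (url: string) => Promise<T>,
  delayMs: number = CRAWL_DELAY_MS,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    if (url === undefined) continue;
    if (i > 0) await sleep(delayMs); // no wait before the first request
    results.push(await fetchPage(url));
  }
  return results;
}

// 20,284 pages x 10 s = 202,840 s, roughly 56.3 hours.
const hours = estimatedHours(CHURCH_COUNT, CRAWL_DELAY_MS);
```

Because the estimate ignores network and parsing time, the actual run will take somewhat longer than this figure.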
---
## Estimated Scale
| Metric | Value |
|---|---|
| Churches | 20,284 (majority new for US) |
| Mass schedules | ~80,000 (avg 4 per church) |
| Confession schedules | ~40,000 (avg 2 per church) |
| Runtime | ~56 hours |
| Coordinates | Yes (from Google Maps daddr) |