docs: add discovermass.com importer spec and implementation plan
20,284 US churches with mass/confession/adoration schedules. 10s crawl delay (robots.txt), Docker deployment via scheduler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
233
docs/superpowers/specs/2026-03-10-discovermass-design.md
Normal file
@@ -0,0 +1,233 @@
# Design: DiscoverMass.com Importer (USA)

## Overview

Import 20,284 US Catholic churches with full schedules from discovermass.com. This fills the largest gap in US coverage: the MassTimes.org importer runs with `--skip-us`, leaving most US churches absent from the database.

**Note:** St. Paul the Apostle, Chino Hills CA (the motivating example) was manually seeded. The importer will match it by name and proximity and link `discovermassId` without duplicating it.

---
## Source

- **Site**: https://discovermass.com
- **Coverage**: USA only
- **Data**: 20,284 churches with name, address, phone, website, coordinates, mass times, confessions, adoration
- **Backend**: WordPress "javo-directory" theme + MassTimes.org data provider
- **robots.txt**: `Crawl-delay: 10`, which the importer must honor

---
## Enumeration Strategy

The site exposes 11 WordPress item sitemaps with 2,000 entries each (the last has 284, so 10 × 2,000 + 284 = 20,284 URLs):

```
https://discovermass.com/wp-sitemap-posts-item-{1..11}.xml
```

Extract all `<loc>` URLs. URL pattern:

```
https://discovermass.com/church/{slug}/
```

The slug becomes `discovermassId` (e.g. `"st-paul-the-apostle-chino-hills-ca"`).

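
The enumeration step can be sketched as three small pure functions. This is a minimal illustration, not the importer's actual code: the function names and `SITEMAP_COUNT` are hypothetical, and the regex assumes each sitemap holds plain `<loc>URL</loc>` elements.

```typescript
// Hypothetical helpers for the enumeration step; names are illustrative only.
const SITEMAP_COUNT = 11;

// Build the 11 sitemap URLs (item-1 through item-11).
function sitemapUrls(): string[] {
  const urls: string[] = [];
  for (let i = 1; i <= SITEMAP_COUNT; i++) {
    urls.push(`https://discovermass.com/wp-sitemap-posts-item-${i}.xml`);
  }
  return urls;
}

// Collect the text of every <loc> element in one sitemap document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    locs.push(m[1]);
  }
  return locs;
}

// Turn a church URL into its slug. Returns null when the URL does not follow
// the /church/{slug}/ pattern, so non-church pages can be skipped.
function slugFromUrl(url: string): string | null {
  const m = /^https:\/\/discovermass\.com\/church\/([^\/]+)\/?$/.exec(url);
  return m ? m[1] : null;
}
```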

No rate limiting on sitemap fetches (11 requests total).

---
## HTML Parsing

Server-rendered HTML. No JS execution needed.

### Name

`<meta property="og:title" content="Church Name" />`

### Address

Raw text node before `<br /><br />` in the address block:

```
14085 Peyton Drive, Chino Hills, CA 91709
```

Parse: split on `", "`. The last segment is `"{STATE} {ZIP}"`, the second-to-last is the city, and everything before that is the street.

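
A minimal sketch of that split rule follows. `ParsedAddress` and `parseAddress` are hypothetical names, and returning `null` for strings that do not fit the pattern is an assumption; the spec does not say how malformed addresses are handled.

```typescript
// Hypothetical result shape; field names are illustrative.
interface ParsedAddress {
  street: string;
  city: string;
  state: string;
  zip: string;
}

// Split "street, city, ST ZIP" on ", ". Commas inside the street part
// (e.g. a suite number) are kept by rejoining the leading segments.
function parseAddress(raw: string): ParsedAddress | null {
  const parts = raw.trim().split(", ");
  if (parts.length < 3) return null;
  const last = parts[parts.length - 1].trim();
  // Assumption: two-letter state code followed by a 5-digit or ZIP+4 code.
  const stateZip = /^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/.exec(last);
  if (!stateZip) return null;
  return {
    street: parts.slice(0, parts.length - 2).join(", "),
    city: parts[parts.length - 2],
    state: stateZip[1],
    zip: stateZip[2],
  };
}
```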
### Phone + Website + Coordinates

From `<div id="sidebar-info">`:

```html
<span class='side-phone attribute'>(909) 465-5503</span>
<span class='side-directions attribute'>
  <a href='https://maps.google.com/maps?daddr=33.996887,-117.732407&ll='>Directions</a>
</span>
<span class='side-website attribute'>
  <a href='http://www.sptacc.org'>Visit Website</a>
</span>
```

Extract lat/lng from `daddr={lat},{lng}` in the Directions href.

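
The coordinate extraction might look like the sketch below. `parseDaddr` and `Coordinates` are hypothetical names; the range check is an added safeguard, not something the spec requires. A caller would substitute `0` when this returns `null`, per the matching rules in this document.

```typescript
// Hypothetical result shape for the parsed coordinates.
interface Coordinates {
  latitude: number;
  longitude: number;
}

// Read "daddr={lat},{lng}" from a Google Maps directions href.
// Returns null when the parameter is absent or out of range.
function parseDaddr(href: string): Coordinates | null {
  const m = /[?&]daddr=(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)/.exec(href);
  if (!m) return null;
  const latitude = parseFloat(m[1]);
  const longitude = parseFloat(m[2]);
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return null;
  return { latitude, longitude };
}
```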
### Mass Schedule

The mass schedule is a `<ul>` whose first `<li>` holds `<h5>Mass Times</h5>`, followed by one `<li>` per day:

```html
<ul>
  <li><h5>Mass Times</h5></li>
  <li class=""><span class="label">Saturday</span>
    <span class='serviceTime'><span class='time'>5:00pm</span></span>,
    <span class='serviceTime'><span class='time'>7:00pm</span> <span class='language'>(Vietnamese)</span></span>
  </li>
  <li class=""><span class="label">Friday</span>
    <span class='serviceTime'><span class='time'>8:00am</span></span>,
    <span class='serviceTime'><span class='time'>7:00pm</span> - <span class='comment'>1st Fridays</span></span>
  </li>
</ul>
```

Per `<li>`:

- **Day**: `.label` text → full English day name → `dayOfWeek` integer
- **Per `.serviceTime`**: `.time` text → 24h HH:MM; `.language` text → language; `.comment` text → notes

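
As a rough illustration of the per-`<li>` rules, the sketch below pulls the day label and raw service times out of one list item with regular expressions. It assumes the exact markup shown in the sample (every `.serviceTime` ends with an inner span, so it closes with `</span></span>`); a real HTML parser would be more robust. `RawServiceTime`, `spanText`, and `parseMassDay` are hypothetical names.

```typescript
// Hypothetical shape for one entry before time conversion.
interface RawServiceTime {
  time: string;            // as shown on the page, e.g. "5:00pm"
  language: string | null; // e.g. "Vietnamese"
  comment: string | null;  // e.g. "1st Fridays"
}

// Text of the first <span class='NAME'> in a fragment, or null if absent.
function spanText(fragment: string, className: string): string | null {
  const re = new RegExp(`<span class=['"]${className}['"]>([^<]*)</span>`);
  const m = re.exec(fragment);
  return m ? m[1].trim() : null;
}

// Parse one day <li>. Returns null for items without a .label,
// such as the "Mass Times" heading item.
function parseMassDay(liHtml: string): { day: string; times: RawServiceTime[] } | null {
  const day = spanText(liHtml, "label");
  if (day === null) return null;
  const times: RawServiceTime[] = [];
  // Capture everything inside a serviceTime span up to its last inner </span>.
  const re = /<span class=['"]serviceTime['"]>([\s\S]*?<\/span>)\s*<\/span>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(liHtml)) !== null) {
    const time = spanText(m[1], "time");
    if (time === null) continue;
    const language = spanText(m[1], "language");
    times.push({
      time,
      // The page wraps the language in parentheses; strip them.
      language: language ? language.replace(/^\(|\)$/g, "") : null,
      comment: spanText(m[1], "comment"),
    });
  }
  return { day, times };
}
```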
### Other Services (Confessions + Adoration)

Separate `<ul>` containing `<li class="Confessions">` and `<li class="Adoration">`:

```html
<li class="Confessions"><span class="label">Confessions</span>
  <span class='serviceTime'>Tue: <span class='time'>8:30am-9:00am</span></span>
  <span class='serviceTime'>Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
<li class="Adoration"><span class="label">Adoration</span>
  <span class='serviceTime'>Weekdays: <span class='time'>9:00am-6:00pm</span> - <span class='comment'>In the Chapel</span></span>
</li>
```

Per `.serviceTime` inside Confessions/Adoration:

- **Day prefix** before `:` — abbreviated form ("Tue", "Thr", "Sat", "Fri", "Weekdays", "Daily", "Sun", "Mon", "Wed", "Thu")
- **Time range**: `{start}-{end}` in `<span class='time'>`
- **Notes**: `.comment` text

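
The three rules above could be applied to the inner HTML of a single `.serviceTime` span roughly as follows. This is a sketch under assumptions: `parseServiceWindow`, `RawServiceWindow`, and `DAY_PREFIXES` are hypothetical names, the prefix table mirrors the abbreviated-name mapping in this spec, and unknown prefixes are rejected with `null` (the spec does not define that case).

```typescript
// Prefix table taken from the abbreviated-name mapping in this spec.
const DAY_PREFIXES: Record<string, number[]> = {
  Sun: [0], Mon: [1], Tue: [2], Wed: [3], Thr: [4], Thu: [4], Fri: [5], Sat: [6],
  Weekdays: [1, 2, 3, 4, 5],
  Daily: [0, 1, 2, 3, 4, 5, 6],
};

// Hypothetical shape for one confession/adoration window before time conversion.
interface RawServiceWindow {
  days: number[];
  start: string;
  end: string | null;
  comment: string | null;
}

// Parse the inner HTML of one .serviceTime span, for example:
// "Sat: <span class='time'>3:30pm-4:30pm</span> - <span class='comment'>In the Chapel</span>"
function parseServiceWindow(inner: string): RawServiceWindow | null {
  const prefix = /^\s*([A-Za-z]+)\s*:/.exec(inner);
  const time = /<span class=['"]time['"]>([^<]*)<\/span>/.exec(inner);
  if (!prefix || !time) return null;
  const key = prefix[1];
  // Own-property check so keys like "constructor" are not treated as prefixes.
  if (!Object.prototype.hasOwnProperty.call(DAY_PREFIXES, key)) return null;
  const parts = time[1].trim().split("-");
  const comment = /<span class=['"]comment['"]>([^<]*)<\/span>/.exec(inner);
  return {
    days: DAY_PREFIXES[key].slice(),
    start: parts[0].trim(),
    end: parts.length > 1 ? parts[1].trim() : null,
    comment: comment ? comment[1].trim() : null,
  };
}
```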
---

## Day Mappings

### Full names (mass schedule)

| Label | dayOfWeek |
|---|---|
| Sunday | 0 |
| Monday | 1 |
| Tuesday | 2 |
| Wednesday | 3 |
| Thursday | 4 |
| Friday | 5 |
| Saturday | 6 |

### Abbreviated names (confessions/adoration)

| Prefix | dayOfWeek(s) |
|---|---|
| Sun | [0] |
| Mon | [1] |
| Tue | [2] |
| Wed | [3] |
| Thr / Thu | [4] |
| Fri | [5] |
| Sat | [6] |
| Weekdays | [1,2,3,4,5] |
| Daily | [0,1,2,3,4,5,6] |

### Time conversion

`"5:00pm"` → `"17:00"`, `"12:00pm"` → `"12:00"`, `"12:00am"` → `"00:00"`. Strip any leading zero from the input hour, then zero-pad the output hour to two digits.

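
A minimal sketch of the conversion, covering the noon and midnight cases listed above. `to24Hour` is a hypothetical name, and returning `null` for unparseable input is an assumption rather than specified behavior.

```typescript
// Convert a 12-hour time such as "5:00pm" to 24-hour "HH:MM".
// Returns null when the input does not look like h:mm followed by am/pm.
function to24Hour(input: string): string | null {
  // "0*" drops leading zeros on the hour, so "08:30am" and "8:30am" behave the same.
  const m = /^\s*0*(\d{1,2}):(\d{2})\s*([ap])m\s*$/i.exec(input);
  if (!m) return null;
  let hour = parseInt(m[1], 10);
  const minute = parseInt(m[2], 10);
  if (hour < 1 || hour > 12 || minute > 59) return null;
  const isPm = m[3].toLowerCase() === "p";
  if (hour === 12) {
    hour = isPm ? 12 : 0; // 12pm is noon, 12am is midnight
  } else if (isPm) {
    hour += 12;
  }
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(hour) + ":" + pad(minute);
}
```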

---

## Matching Strategy

1. `discovermassId` exact match (re-run safety; also matches the manually seeded St. Paul)
2. Name + proximity (200m) against existing US churches (via `findDuplicateChurch`)
3. Unmatched: create new church, `country: 'US'`, `source: 'discovermass'`, `latitude`/`longitude` from `daddr` (or `0` if missing)

When updating an existing church: only fill blank fields (don't overwrite existing phone/website). Always replace mass/confession/adoration schedules.

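
Two pieces of this strategy are easy to illustrate in isolation: the 200m proximity test and the fill-blank-fields update rule. The sketch below is a stand-in under assumptions, not the project's `findDuplicateChurch` (whose internals are not shown in this spec). It uses the haversine formula for distance; `distanceMeters`, `withinProximity`, `ContactFields`, and `fillBlankFields` are hypothetical names.

```typescript
const EARTH_RADIUS_M = 6371000;
const PROXIMITY_LIMIT_M = 200; // threshold from step 2 above

// Great-circle distance between two points, via the haversine formula.
function distanceMeters(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

function withinProximity(lat1: number, lng1: number, lat2: number, lng2: number): boolean {
  return distanceMeters(lat1, lng1, lat2, lng2) <= PROXIMITY_LIMIT_M;
}

// Hypothetical subset of church fields that follow the "only fill blanks" rule.
interface ContactFields {
  phone: string | null;
  website: string | null;
}

// Keep existing values; copy from the scraped record only where the existing field is blank.
function fillBlankFields(existing: ContactFields, scraped: ContactFields): ContactFields {
  const isBlank = (v: string | null): boolean => v === null || v.trim() === "";
  return {
    phone: isBlank(existing.phone) ? scraped.phone : existing.phone,
    website: isBlank(existing.website) ? scraped.website : existing.website,
  };
}
```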

---

## Schema Addition

```prisma
discovermassId String? @unique @map("discovermass_id")

// `@unique` is already backed by a unique index, so this explicit index is likely redundant
@@index([discovermassId])
```

---
## church-matcher.ts additions

`ExistingChurch`:

```typescript
discovermassId: string | null;
```

`ChurchCandidate`:

```typescript
discovermassId?: string;
```

New pass before the proximity pass:

```typescript
// Pass N: discovermassId exact match
if (candidate.discovermassId) {
  const match = existingChurches.find(c => c.discovermassId === candidate.discovermassId);
  if (match) return match;
}
```

---
## CLI

```bash
npx tsx scripts/import-discovermass.ts --all
npx tsx scripts/import-discovermass.ts --all --dry-run
npx tsx scripts/import-discovermass.ts --all --resume-from 5000
npx tsx scripts/import-discovermass.ts --all --job-id {uuid}
```

---
## Scheduler Integration

Add to `PIPELINE_GROUPS[0].phases` after `masstimes-api-import`:

```typescript
{ name: 'discovermass-import', type: 'discovermass-import', config: {} },
```

Add to `getJobCommand()`:

```typescript
case 'discovermass-import': {
  const args = ['tsx', 'scripts/import-discovermass.ts', '--all'];
  if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
  return { command: 'npx', args };
}
```

Add to `package.json`:

```json
"import:discovermass": "tsx scripts/import-discovermass.ts"
```

---
## Rate Limiting

- **Sitemaps**: No delay (11 requests)
- **Church pages**: 10,000ms between requests (respects `Crawl-delay: 10`)
- **Total runtime**: ~56 hours (20,284 × 10s)
- **Deployment**: Docker scheduler container (`npm run scheduler` → `discovermass-import` job)

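
The runtime figure follows from 20,284 × 10s = 202,840s, or about 56.3 hours, as a lower bound that ignores the time each fetch itself takes. The sketch below reproduces that arithmetic and shows one way to compute the remaining wait between requests. All names are hypothetical, and treating `--resume-from` as an index into the URL list is an assumption.

```typescript
const CRAWL_DELAY_MS = 10000; // from robots.txt "Crawl-delay: 10"
const TOTAL_CHURCHES = 20284;

// Hours needed to fetch the remaining pages at one request per delay interval.
// resumeFrom is assumed to be the number of pages already processed.
function estimateRuntimeHours(total: number, delayMs: number, resumeFrom: number = 0): number {
  const remaining = Math.max(0, total - resumeFrom);
  return (remaining * delayMs) / 3600000;
}

// Milliseconds still to wait before the next request may be sent,
// given the timestamp of the previous request and the current time.
function waitBeforeNext(lastRequestAt: number, now: number, delayMs: number): number {
  return Math.max(0, lastRequestAt + delayMs - now);
}

// Full run: roughly 56.3 hours; resuming at 5000: roughly 42.5 hours.
const fullRunHours = estimateRuntimeHours(TOTAL_CHURCHES, CRAWL_DELAY_MS);
const resumedRunHours = estimateRuntimeHours(TOTAL_CHURCHES, CRAWL_DELAY_MS, 5000);
```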
---

## Estimated Scale

| Metric | Value |
|---|---|
| Churches | 20,284 (majority new for US) |
| Mass schedules | ~80,000 (avg 4 per church) |
| Confession schedules | ~40,000 (avg 2 per church) |
| Runtime | ~56 hours |
| Coordinates | Yes (from Google Maps daddr) |