# weekdaymasses.org.uk Global Importer
## Context
weekdaymasses.org.uk is a UK-based Catholic directory covering ~3,500-4,000 churches worldwide, with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ other countries (India, Sri Lanka, South Korea, Japan, and more). Each area's data is served on a single HTML page, so no pagination or API is needed.
## Data Source
Three area pages cover the entire site:
| Page | URL | Est. Churches |
|------|-----|---------------|
| GB | `/en/area/gb/churches` | ~3,000+ |
| Ireland | `/en/area/ireland/churches` | ~300+ |
| Outside GB | `/en/area/outside-gb/churches` | ~152+ |
Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.
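As a rough illustration, the three top-level pages and the per-area URL pattern could be captured as below. The `BASE_URL` origin and the names `TOP_LEVEL_AREAS`, `areaUrl`, and `resolveAreas` are assumptions made for this sketch, not existing code.

```typescript
// Assumption: origin of the site; only the paths are given in the table above.
const BASE_URL = "https://www.weekdaymasses.org.uk";

// The three pages that together cover every church on the site.
const TOP_LEVEL_AREAS: readonly string[] = ["gb", "ireland", "outside-gb"];

// Build the churches-list URL for one area slug (e.g. "gb" or "india").
function areaUrl(area: string): string {
  return `${BASE_URL}/en/area/${encodeURIComponent(area)}/churches`;
}

// With no explicit area, use only the three top-level pages:
// country/region pages are subsets and would yield duplicates.
function resolveAreas(requestedArea: string | null): string[] {
  return requestedArea ? [requestedArea] : TOP_LEVEL_AREAS.slice();
}
```

Restricting the default run to the three top-level pages means each church is fetched once per run.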
### Data per church
- **Name**: h3 heading, format "Church Name (Location)"
- **Address**: plain text after mass times, with postal/zip code
- **Coordinates**: in map link query params `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
- **Mass times**: format `Day: H.MMam/pm(Language), H.MMam/pm(Language)`
- **Phone**: `Tel: +XX XXXX XXXXXX`
- **Website**: occasional links
- **church_id**: unique numeric identifier in map links
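The heading and map-link fields above could be extracted roughly as follows. This is a minimal sketch: the helper names (`queryParam`, `parseMapLink`, `parseHeading`) and the `MapLinkData` shape are hypothetical, and the exact path of the map link is not assumed, only its query parameters.

```typescript
interface MapLinkData {
  lat: number;
  lon: number;
  churchId: string; // kept as a string; the site uses numeric IDs
}

// Read one query parameter from an href, tolerating HTML-escaped ampersands.
function queryParam(href: string, key: string): string | null {
  const unescaped = href.replace(/&amp;/g, "&");
  const match = unescaped.match(new RegExp("[?&]" + key + "=([^&#]*)"));
  const value = match ? match[1] : undefined;
  return value === undefined ? null : decodeURIComponent(value);
}

// Extract lat, lon and church_id; returns null if any part is missing or invalid.
function parseMapLink(href: string): MapLinkData | null {
  const lat = parseFloat(queryParam(href, "lat") || "");
  const lon = parseFloat(queryParam(href, "lon") || "");
  const churchId = queryParam(href, "church_id");
  if (!isFinite(lat) || !isFinite(lon)) return null;
  if (churchId === null || !/^\d+$/.test(churchId)) return null;
  return { lat, lon, churchId };
}

// Split "Church Name (Location)" on the final parenthesised group.
function parseHeading(heading: string): { name: string; location: string | null } {
  const match = heading.trim().match(/^(.*?)\s*\(([^()]*)\)\s*$/);
  if (!match) return { name: heading.trim(), location: null };
  return { name: (match[1] || "").trim(), location: (match[2] || "").trim() };
}
```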
### Mass time format
```
Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)
```
Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or space-separated combinations such as `Mon Tue Wed Thu Fri`. `Holy Day` entries also occur.
Time format: `H.MMam/pm` (12-hour, dot-separated), which must be converted to 24-hour `HH:MM` (e.g. `5.30pm` becomes `17:30`).
The language in parentheses maps to our `language` field on `mass_schedules`.
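A possible parser for one schedule line is sketched below. It assumes a combined label such as `Mon Tue Wed Thu Fri` applies the listed times to every named day, that days are stored as full names, and that the language suffix may be absent. The names `MassTime`, `to24Hour`, and `parseMassLine` are hypothetical.

```typescript
interface MassTime {
  day: string;             // full day name, or "Holy Day"
  time: string;            // 24-hour "HH:MM"
  language: string | null; // text inside the parentheses, if any
}

// Assumption: days are stored as full names.
const DAY_LABELS: Record<string, string> = {
  Sunday: "Sunday", Mon: "Monday", Tue: "Tuesday", Wed: "Wednesday",
  Thu: "Thursday", Fri: "Friday", Saturday: "Saturday",
};

function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Convert "6.30am" / "5.30pm" to "06:30" / "17:30"; null if malformed.
function to24Hour(raw: string): string | null {
  const m = raw.trim().match(/^(\d{1,2})\.(\d{2})\s*(am|pm)$/i);
  if (!m) return null;
  const hour12 = Number(m[1]);
  const minute = Number(m[2]);
  if (hour12 < 1 || hour12 > 12 || minute > 59) return null;
  const isPm = (m[3] || "").toLowerCase() === "pm";
  const hour24 = (hour12 % 12) + (isPm ? 12 : 0); // 12am -> 0, 12pm -> 12
  return pad2(hour24) + ":" + pad2(minute);
}

// Parse "Day [Day ...]: time(Language), ..." into one entry per day and time.
function parseMassLine(line: string): MassTime[] {
  const colon = line.indexOf(":");
  if (colon === -1) return [];
  const label = line.slice(0, colon).trim();
  const days: string[] = [];
  if (label === "Holy Day") {
    days.push(label);
  } else {
    for (const token of label.split(/\s+/)) {
      const day = DAY_LABELS[token];
      if (day === undefined) return []; // not a schedule line (e.g. "Tel:")
      days.push(day);
    }
  }
  const result: MassTime[] = [];
  for (const entry of line.slice(colon + 1).split(",")) {
    const m = entry.trim().match(/^(\d{1,2}\.\d{2}\s*(?:am|pm))\s*(?:\(([^)]*)\))?$/i);
    const time = m ? to24Hour(m[1] || "") : null;
    if (!m || time === null) continue; // skip anything that is not a time
    for (const day of days) {
      result.push({ day, time, language: m[2] ? m[2].trim() : null });
    }
  }
  return result;
}
```

Lines whose label is not a known day (for example `Tel:`) yield an empty array, so the same function can double as a test for whether a line is a schedule line.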
### Country detection
The address is the last line of each church entry. Country can be detected by:
- GB: UK postal code pattern (e.g. `SW1A 1AA`)
- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in address
- India: 6-digit postal code (e.g. `600088`)
- Others: country name at end of address, or fallback to the area page being scraped
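The checks above could be combined along these lines. This is a sketch under stated assumptions: the regular expressions are approximations of the postcode formats, codes are assumed to be upper-case as in the examples, countries are returned as ISO 3166-1 alpha-2 codes, and `COUNTRY_NAMES` is a hypothetical, deliberately incomplete lookup.

```typescript
// Approximate patterns; they assume upper-case codes as in the examples.
const UK_POSTCODE = /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/;
const EIRCODE = /\b(?:[AC-FHKNPRTV-Y]\d{2}|D6W)\s?[0-9AC-FHKNPRTV-Y]{4}\b/;
const INDIA_PIN = /\b[1-9]\d{5}\b/;

// Hypothetical lookup for a country name at the end of the address.
const COUNTRY_NAMES: Record<string, string> = {
  "india": "IN",
  "sri lanka": "LK",
  "south korea": "KR",
  "japan": "JP",
};

// areaFallback: country implied by the area page being scraped, if any.
function detectCountry(address: string, areaFallback: string | null): string | null {
  // GB first, so "Northern Ireland" addresses with a UK postcode stay GB.
  if (UK_POSTCODE.test(address)) return "GB";
  if (EIRCODE.test(address) || /\bIreland\b/i.test(address)) return "IE";
  if (INDIA_PIN.test(address)) return "IN";
  const parts = address.split(",");
  const lastPart = (parts[parts.length - 1] || "")
    .trim()
    .toLowerCase()
    .replace(/[.\s]+$/, "");
  const byName = COUNTRY_NAMES[lastPart];
  return byName !== undefined ? byName : areaFallback;
}
```

Six-digit postal codes are not unique to India, so on the `outside-gb` page it may be safer to try the trailing country name before the PIN check; the order here simply follows the list above.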
## Design
### Schema
Add to Church model in both BethelGuide and ScraperControl:
```prisma
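// Nullable: only churches found on weekdaymasses.org.uk carry this ID.
weekdayMassesId String? @unique @map("weekday_masses_id")
// @unique already creates a unique index, so this explicit index may be redundant.
@@index([weekdayMassesId])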
```
### Script: `scripts/import-weekdaymasses.ts`
Single script that:
1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
2. Parses HTML into structured church entries
3. Converts mass times from `H.MMam/pm` to `HH:MM` 24h format
4. Detects country from address patterns
5. Matches against existing churches, first by exact `weekdayMassesId`, then by proximity plus name
6. Upserts churches and replaces mass schedules
### HTML parsing strategy
Each church is a block between consecutive h3 headings. Within each block:
- h3 content = church name
- Lines with day labels + times = mass schedule
- Map link = coordinates + church_id
- Last text block before next h3 = address
- `Tel:` prefix = phone
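The block-splitting strategy above might look like the sketch below. It uses regular expressions only to stay dependency-free; a real HTML parser is probably the more robust choice. The markup details (line breaks via `<br>` or closing `</p>`, `</div>`, `</li>`, and double-quoted `href` attributes) are assumptions about the page, and `ChurchBlock` and `splitChurchBlocks` are hypothetical names.

```typescript
interface ChurchBlock {
  heading: string;        // raw h3 text, e.g. "Church Name (Location)"
  lines: string[];        // non-empty text lines in the block, links removed
  mapHref: string | null; // href of the link carrying church_id
  phone: string | null;   // text after "Tel:"
  address: string | null; // last line that is not a phone line
}

function stripTags(html: string): string {
  return html
    .replace(/<[^>]*>/g, "")
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&")
    .trim();
}

function splitChurchBlocks(html: string): ChurchBlock[] {
  // Locate every h3 heading and remember where it starts and ends.
  const headingRe = /<h3\b[^>]*>([\s\S]*?)<\/h3>/gi;
  const headings: { text: string; start: number; end: number }[] = [];
  let m: RegExpExecArray | null;
  while ((m = headingRe.exec(html)) !== null) {
    headings.push({ text: stripTags(m[1] || ""), start: m.index, end: m.index + m[0].length });
  }
  return headings.map((h, i) => {
    // A block runs from the end of this heading to the start of the next one.
    const next = headings[i + 1];
    const body = html.slice(h.end, next ? next.start : html.length);
    const href = body.match(/href="([^"]*church_id=[^"]*)"/i);
    const lines = body
      .replace(/<a\b[^>]*>[\s\S]*?<\/a>/gi, "")        // drop link text such as "Map"
      .replace(/<br\s*\/?>|<\/(?:p|div|li)>/gi, "\n")  // treat these as line breaks
      .split("\n")
      .map((line) => stripTags(line))
      .filter((line) => line.length > 0);
    let phone: string | null = null;
    let address: string | null = null;
    for (const line of lines) {
      if (line.indexOf("Tel:") === 0) phone = line.slice(4).trim();
      else address = line; // ends up as the last non-phone line
    }
    const mapHref = href ? (href[1] || "").replace(/&amp;/g, "&") : null;
    return { heading: h.text, lines, mapHref, phone, address };
  });
}
```

In this sketch a block with no address falls back to its last schedule line, and website links are discarded along with other anchors, so both cases would need extra handling in the real script.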
### CLI flags
- `--all`: import all 3 area pages (the default when `--area` is omitted)
- `--area <name>`: import a specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
- `--dry-run`: parse and match, but make no database writes
- `--resume-from <n>`: skip the first `n` churches
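One way to parse the flags documented here without a CLI library is sketched below (the `--job-id` flag could be handled the same way as `--area`). The `ImportOptions` shape and `parseCliArgs` name are hypothetical, and treating `--all` and `--area` as mutually exclusive is an assumption rather than a documented rule.

```typescript
interface ImportOptions {
  all: boolean;
  area: string | null;
  dryRun: boolean;
  resumeFrom: number; // number of churches to skip
}

// Fetch the value following a flag, rejecting a missing value or another flag.
function requireValue(argv: string[], index: number, flag: string): string {
  const value = argv[index];
  if (value === undefined || value.indexOf("--") === 0) {
    throw new Error("Missing value for " + flag);
  }
  return value;
}

function parseCliArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = { all: false, area: null, dryRun: false, resumeFrom: 0 };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case "--all":
        opts.all = true;
        break;
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--area":
        opts.area = requireValue(argv, ++i, flag);
        break;
      case "--resume-from": {
        const n = Number(requireValue(argv, ++i, flag));
        if (n < 0 || Math.floor(n) !== n) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      default:
        throw new Error("Unknown flag: " + String(flag));
    }
  }
  // Assumption: the two selectors conflict rather than one overriding the other.
  if (opts.all && opts.area !== null) {
    throw new Error("--all and --area cannot be combined");
  }
  return opts;
}
```

In the script this would typically be called as `parseCliArgs(process.argv.slice(2))`.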
- `--job-id <uuid>` — background job tracking
### Church matcher integration
Add `weekdayMassesId` to `ExistingChurch` and `ChurchCandidate`, and add a new match pass to `findDuplicateChurch()`.
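The two-pass idea (exact ID first, then proximity plus name) can be illustrated in isolation as below. This is not the existing `findDuplicateChurch()` logic: the `ChurchRecord` and `Candidate` shapes, the name normalisation, and the 150 m radius are all assumptions chosen for the sketch.

```typescript
interface ChurchRecord {
  id: string;
  name: string;
  lat: number;
  lon: number;
  weekdayMassesId: string | null;
}

interface Candidate {
  name: string;
  lat: number;
  lon: number;
  weekdayMassesId: string;
}

// Great-circle distance in metres (haversine formula, mean Earth radius).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * 6371000 * Math.asin(Math.sqrt(a));
}

// Crude normalisation so "Saint Mary's" and "St. Mary's" compare equal.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/\bsaint\b/g, "st").replace(/[^a-z0-9]/g, "");
}

function matchCandidate(
  candidate: Candidate,
  existing: ChurchRecord[],
  maxMeters: number = 150, // assumed radius
): ChurchRecord | null {
  // Pass 1: exact weekdayMassesId match.
  for (const church of existing) {
    if (church.weekdayMassesId === candidate.weekdayMassesId) return church;
  }
  // Pass 2: nearest church within the radius that has the same normalised name.
  const wanted = normalizeName(candidate.name);
  let best: ChurchRecord | null = null;
  let bestDistance = maxMeters;
  for (const church of existing) {
    if (church.weekdayMassesId !== null) continue; // already claimed by another entry
    if (normalizeName(church.name) !== wanted) continue;
    const d = haversineMeters(candidate.lat, candidate.lon, church.lat, church.lon);
    if (d <= bestDistance) {
      best = church;
      bestDistance = d;
    }
  }
  return best;
}
```

Skipping records that already hold a different `weekdayMassesId` in the second pass keeps the importer from assigning one church two site IDs, which the `@unique` constraint would reject anyway.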
### Scheduler integration
Add `weekdaymasses-import` to the sequential imports group in the pipeline, with a matching `getJobCommand()` case and an npm script.
## Scope
- ~3,500-4,000 churches with mass schedules
- Most GB/Ireland churches already in DB from OSM (will match and add schedules)
- India/Sri Lanka/international churches partially in DB from OSM/gcatholic
- Value: mass schedule data for thousands of churches that currently have none