# weekdaymasses.org.uk Global Importer
## Context

weekdaymasses.org.uk is a UK-based Catholic directory covering roughly 3,500 to 4,000 churches worldwide, with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ other countries (India, Sri Lanka, South Korea, Japan, and more). All data for an area is served on a single HTML page, so no pagination or API is needed.
## Data Source

Three area pages cover the entire site:

| Page | URL | Est. Churches |
|------|-----|---------------|
| GB | `/en/area/gb/churches` | ~3,000+ |
| Ireland | `/en/area/ireland/churches` | ~300+ |
| Outside GB | `/en/area/outside-gb/churches` | ~152+ |

Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.
### Data per church

- **Name**: `h3` heading, in the format "Church Name (Location)"
- **Address**: plain text after the mass times, including the postal/zip code
- **Coordinates**: map link query params, `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
- **Mass times**: format `Day: H.MMam/pm(Language), H.MMam/pm(Language)`
- **Phone**: `Tel: +XX XXXX XXXXXX`
- **Website**: linked for some churches only
- **church_id**: unique numeric identifier, taken from the map link
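As a rough illustration of how the coordinates and `church_id` could be pulled out of a map link, here is a minimal sketch. The `MapLinkData` type, the `parseMapLink` name, and the `/map` path in the usage note are hypothetical; only the `lat`, `lon`, and `church_id` parameter names come from the format above. It assumes the `href` has already been HTML-decoded (`&amp;` turned into `&`).

```typescript
// Hypothetical shape for the values carried by a map link.
interface MapLinkData {
  lat: number;
  lon: number;
  churchId: string; // kept as a string to line up with a string ID column
}

// Extracts lat, lon and church_id from a map link's query string.
// Returns null when a parameter is missing or malformed.
function parseMapLink(href: string): MapLinkData | null {
  const queryStart = href.indexOf("?");
  if (queryStart === -1) return null;

  const params = new URLSearchParams(href.slice(queryStart + 1));
  const lat = parseFloat(params.get("lat") ?? "");
  const lon = parseFloat(params.get("lon") ?? "");
  const churchId = params.get("church_id");

  if (!Number.isFinite(lat) || !Number.isFinite(lon)) return null;
  if (churchId === null || !/^\d+$/.test(churchId)) return null;
  return { lat, lon, churchId };
}
```

For example, `parseMapLink("/map?lat=13.0827&lon=80.2707&church_id=12345")` yields `{ lat: 13.0827, lon: 80.2707, churchId: "12345" }`, while a link without a query string yields `null`.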
### Mass time format

```
Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)
```

Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or combinations such as `Mon Tue Wed Thu Fri`. `Holy Day` entries also appear.

Time format: `H.MMam/pm`, which needs conversion to 24-hour `HH:MM`.

The language in parentheses maps to our `language` field on `mass_schedules`.
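The conversion and line parsing described above might look roughly like the sketch below. The `MassTime` type and the `to24Hour` and `parseMassLine` names are illustrative, not taken from the codebase. The sketch also assumes that the language suffix can be absent, in which case `language` is `null`.

```typescript
interface MassTime {
  day: string;             // e.g. "Sunday", "Mon", "Holy Day"
  time: string;            // 24-hour "HH:MM"
  language: string | null; // text from the parentheses, if any
}

const DAY_LABELS = new Set([
  "Sunday", "Mon", "Tue", "Wed", "Thu", "Fri", "Saturday", "Holy Day",
]);

// Converts "6.30am" or "12.00pm" to 24-hour "HH:MM"; null if malformed.
function to24Hour(raw: string): string | null {
  const m = /^(\d{1,2})\.(\d{2})(am|pm)$/i.exec(raw.trim());
  if (!m) return null;
  let hour = Number(m[1]);
  if (hour < 1 || hour > 12 || Number(m[2]) > 59) return null;
  const isPm = m[3].toLowerCase() === "pm";
  if (hour === 12) hour = isPm ? 12 : 0; // 12am is midnight, 12pm is noon
  else if (isPm) hour += 12;
  return `${String(hour).padStart(2, "0")}:${m[2]}`;
}

// Expands one schedule line into one entry per (day, time) pair.
function parseMassLine(line: string): MassTime[] {
  const sep = line.indexOf(":");
  if (sep === -1) return [];
  const label = line.slice(0, sep).trim();
  // "Holy Day" contains a space, so match it whole before splitting.
  const days = label === "Holy Day" ? [label] : label.split(/\s+/);
  if (!days.every((d) => DAY_LABELS.has(d))) return [];

  const result: MassTime[] = [];
  for (const entry of line.slice(sep + 1).split(",")) {
    const m = /^\s*(\d{1,2}\.\d{2}(?:am|pm))\s*(?:\(([^)]*)\))?\s*$/i.exec(entry);
    if (!m) continue; // skip anything that is not a time entry
    const time = to24Hour(m[1]);
    if (time === null) continue;
    for (const day of days) {
      result.push({ day, time, language: m[2] ?? null });
    }
  }
  return result;
}
```

With this sketch, `parseMassLine("Mon Tue Wed Thu Fri: 6.30am(Tamil)")` produces five entries, each with `time: "06:30"` and `language: "Tamil"`. Unrecognised lines return an empty array, which lets the caller treat them as address or phone text instead.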
### Country detection

The address is the last line of each church entry. The country can be detected from it as follows:

- GB: UK postal code pattern (e.g. `SW1A 1AA`)
- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in the address
- India: 6-digit postal code (e.g. `600088`)
- Others: country name at the end of the address, falling back to the area page being scraped
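One possible shape for these rules is sketched below. The regular expressions are deliberately approximate, and the two-letter country codes, the `COUNTRY_NAMES` table, and the `detectCountry` name are all assumptions made for illustration. The sketch checks the country name before the 6-digit rule because other countries also use 6-digit postal codes; that ordering is a design choice, not something the source specifies.

```typescript
// Approximate patterns; they are not full validators.
const UK_POSTCODE = /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/i;
const EIRCODE = /\b[A-Z]\d[\dW]\s?[A-Z\d]{4}\b/i;
const INDIA_PIN = /\b[1-9]\d{5}\b/;

// Illustrative subset; a real table would cover every listed country.
const COUNTRY_NAMES = new Map<string, string>([
  ["india", "IN"],
  ["sri lanka", "LK"],
  ["south korea", "KR"],
  ["japan", "JP"],
]);

// areaCountry is the country implied by the area page being scraped,
// or null when the page mixes countries (e.g. outside-gb).
function detectCountry(address: string, areaCountry: string | null): string | null {
  if (UK_POSTCODE.test(address)) return "GB";

  // "Northern Ireland" must not be read as the Republic of Ireland.
  if (EIRCODE.test(address) || /(?<!Northern\s)\bIreland\b/i.test(address)) {
    return "IE";
  }

  // Country name as the final comma-separated part of the address.
  const lastPart = (address.split(",").pop() ?? "").trim().toLowerCase();
  const byName = COUNTRY_NAMES.get(lastPart);
  if (byName !== undefined) return byName;

  if (INDIA_PIN.test(address)) return "IN";
  return areaCountry;
}
```

For instance, `detectCountry("St Thomas Mount, Chennai 600088", null)` returns `"IN"`, and an address with no recognisable pattern returns whatever the area page implies.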
## Design

### Schema

Add to the `Church` model in both BethelGuide and ScraperControl:

```prisma
// church_id taken from weekdaymasses.org.uk map links
weekdayMassesId String? @unique @map("weekday_masses_id")
// Likely redundant: @unique already creates an index on this column
@@index([weekdayMassesId])
```
### Script: `scripts/import-weekdaymasses.ts`

A single script that:

1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
2. Parses the HTML into structured church entries
3. Converts mass times from `H.MMam/pm` to 24-hour `HH:MM`
4. Detects the country from address patterns
5. Matches against existing churches, first by `weekdayMassesId` (exact), then by proximity and name
6. Upserts churches and replaces their mass schedules
### HTML parsing strategy

Each church is the block between two consecutive `h3` headings. Within each block:

- `h3` content = church name
- Lines with day labels + times = mass schedule
- Map link = coordinates + `church_id`
- Last text block before the next `h3` = address
- `Tel:` prefix = phone
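The block-splitting idea can be sketched without any HTML library, as below. This is only an approximation: the real page markup is not shown here, so the sketch assumes lines are separated by `<br>` tags or newlines, and a production version would more likely use a proper HTML parser. The `ChurchBlock` type and `parseChurchBlocks` name are hypothetical.

```typescript
interface ChurchBlock {
  name: string;
  location: string | null; // text in the trailing parentheses of the h3
  phone: string | null;
  address: string | null;  // last non-phone text line in the block
  links: string[];         // decoded href values (map link, website)
  lines: string[];         // all text lines, for the schedule parser
}

const decodeAmp = (s: string): string => s.replace(/&amp;/g, "&");
const stripTags = (s: string): string =>
  decodeAmp(s.replace(/<[^>]*>/g, "")).trim();

function parseChurchBlocks(html: string): ChurchBlock[] {
  // Everything before the first h3 is page chrome, so drop chunk 0.
  return html.split(/<h3[^>]*>/i).slice(1).map((chunk) => {
    const close = chunk.search(/<\/h3>/i);
    const heading = stripTags(close === -1 ? chunk : chunk.slice(0, close));
    const body = close === -1 ? "" : chunk.slice(close + "</h3>".length);

    const links: string[] = [];
    const hrefRe = /href="([^"]*)"/gi;
    let m: RegExpExecArray | null;
    while ((m = hrefRe.exec(body)) !== null) links.push(decodeAmp(m[1]));

    // Drop anchors so link text such as "Map" is not mistaken for an address.
    const lines = body
      .replace(/<a\b[^>]*>[\s\S]*?<\/a>/gi, "")
      .split(/<br\s*\/?>|\n/i)
      .map(stripTags)
      .filter((line) => line.length > 0);

    const phoneLine = lines.find((line) => line.startsWith("Tel:"));
    const nonPhone = lines.filter((line) => !line.startsWith("Tel:"));
    const nameMatch = /^(.*?)\s*\(([^)]*)\)$/.exec(heading);

    return {
      name: nameMatch ? nameMatch[1] : heading,
      location: nameMatch ? nameMatch[2] : null,
      phone: phoneLine ? phoneLine.slice("Tel:".length).trim() : null,
      address: nonPhone.length > 0 ? nonPhone[nonPhone.length - 1] : null,
      links,
      lines,
    };
  });
}
```

Note that a block with no address line would report its last schedule line as the address, so a caller may want to reject an `address` that also parses as a mass-time line.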
### CLI flags

- `--all`: import all 3 area pages
- `--area <name>`: import a specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
- `--dry-run`: no database writes
- `--resume-from <n>`: skip the first N churches
- `--job-id <uuid>`: background job tracking
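A hand-rolled parser for these flags could look like the following sketch. The `ImportOptions` type and `parseCliArgs` name are hypothetical, and the behaviour on unknown flags or missing values (throwing an error) is an assumption rather than something the design specifies. With no flags, it defaults to all three area pages.

```typescript
interface ImportOptions {
  areas: string[];      // area slugs to fetch
  dryRun: boolean;      // true means no database writes
  resumeFrom: number;   // number of churches to skip
  jobId: string | null; // background job identifier, if any
}

const DEFAULT_AREAS = ["gb", "ireland", "outside-gb"];

function parseCliArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = {
    areas: DEFAULT_AREAS.slice(),
    dryRun: false,
    resumeFrom: 0,
    jobId: null,
  };
  let i = 0;

  // Reads the value following a flag and advances past it.
  const takeValue = (flag: string): string => {
    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`${flag} requires a value`);
    }
    i += 1;
    return value;
  };

  while (i < argv.length) {
    const flag = argv[i];
    switch (flag) {
      case "--all":
        opts.areas = DEFAULT_AREAS.slice();
        break;
      case "--area":
        opts.areas = [takeValue(flag)];
        break;
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--resume-from": {
        const n = Number(takeValue(flag));
        if (!Number.isInteger(n) || n < 0) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      case "--job-id":
        opts.jobId = takeValue(flag);
        break;
      default:
        throw new Error(`Unknown flag: ${flag}`);
    }
    i += 1;
  }
  return opts;
}
```

In a script this would typically be called as `parseCliArgs(process.argv.slice(2))`.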
### Church matcher integration

Add `weekdayMassesId` to `ExistingChurch` and `ChurchCandidate`, and add a new match pass to `findDuplicateChurch()`.
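To make the two-pass idea concrete, here is a minimal sketch of an exact-ID pass followed by a proximity-and-name pass. The field lists on `ExistingChurch` and `ChurchCandidate` are assumed, since the real types are not shown here, and the function is named `findDuplicateChurchSketch` to make clear it is not the existing matcher. The 150 m radius and the exact-match name normalisation are placeholder choices.

```typescript
// Assumed fields; the real types likely carry more.
interface ExistingChurch {
  id: string;
  name: string;
  lat: number;
  lon: number;
  weekdayMassesId: string | null;
}

interface ChurchCandidate {
  name: string;
  lat: number;
  lon: number;
  weekdayMassesId: string | null;
}

// Great-circle distance in metres (haversine formula).
function haversineMeters(aLat: number, aLon: number, bLat: number, bLon: number): number {
  const R = 6371000; // mean Earth radius in metres
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(bLat - aLat);
  const dLon = toRad(bLon - aLon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

const normalizeName = (name: string): string =>
  name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

function findDuplicateChurchSketch(
  candidate: ChurchCandidate,
  existing: ExistingChurch[],
  maxMeters = 150, // placeholder threshold
): ExistingChurch | null {
  // Pass 1: exact match on the source identifier.
  if (candidate.weekdayMassesId !== null) {
    const byId = existing.find((c) => c.weekdayMassesId === candidate.weekdayMassesId);
    if (byId) return byId;
  }

  // Pass 2: same normalised name, nearest church within the radius.
  const target = normalizeName(candidate.name);
  let best: ExistingChurch | null = null;
  let bestDistance = maxMeters;
  for (const church of existing) {
    if (normalizeName(church.name) !== target) continue;
    const d = haversineMeters(candidate.lat, candidate.lon, church.lat, church.lon);
    if (d <= bestDistance) {
      best = church;
      bestDistance = d;
    }
  }
  return best;
}
```

Running the ID pass first means a church that was renamed or re-geocoded on the source site still resolves to the same row on later imports.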
### Scheduler integration

Add `weekdaymasses-import` to the sequential imports group in the pipeline, along with a `getJobCommand()` case and an npm script.
## Scope

- ~3,500-4,000 churches with mass schedules
- Most GB/Ireland churches are already in the DB from OSM (they will match and gain schedules)
- India/Sri Lanka/international churches are partially in the DB from OSM/gcatholic
- Value: mass schedule data for thousands of churches that currently have none