chore: sync with Gitea master and restore local-only files
Reset local main to gitea/master (new source of truth) and restored local-only files: web scrapers, admin dashboard, ChromaDB integration, debug scripts, and utility libraries that aren't tracked in Gitea.

Gitea master adds: discovermass, buscarmisas-network, hk-parishes, bohosluzby, kerknet, gottesdienstzeiten, miserend importers, ClaimRequest model, forward geocoding, heartbeat healthcheck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
103
docs/plans/2026-03-01-weekdaymasses-importer-design.md
Normal file
@@ -0,0 +1,103 @@
# weekdaymasses.org.uk Global Importer

## Context

weekdaymasses.org.uk is a UK-based Catholic directory covering ~3,500-4,000 churches globally, with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ other countries (India, Sri Lanka, South Korea, Japan, and more). All data for an area is served on a single HTML page, so no pagination or API is needed.

## Data Source

Three area pages cover the entire site:

| Page | URL | Est. Churches |
|------|-----|---------------|
| GB | `/en/area/gb/churches` | ~3,000+ |
| Ireland | `/en/area/ireland/churches` | ~300+ |
| Outside GB | `/en/area/outside-gb/churches` | ~152+ |

Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.

### Data per church

- **Name**: h3 heading, format "Church Name (Location)"
- **Address**: plain text after mass times, with postal/zip code
- **Coordinates**: in map link query params `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
- **Mass times**: format `Day: H.MMam/pm(Language), H.MMam/pm(Language)`
- **Phone**: `Tel: +XX XXXX XXXXXX`
- **Website**: occasional links
- **church_id**: unique numeric identifier in map links
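
As a minimal sketch of how two of the fields above might be extracted, the helpers below parse a map link and a heading. The function and type names (`parseMapLink`, `parseHeading`, `MapLinkData`) are hypothetical, and the code assumes only what is listed above: that `lat`, `lon`, and `church_id` appear as query parameters and that headings look like "Church Name (Location)".

```typescript
// Hypothetical shapes and helper names, for illustration only.
interface MapLinkData {
  lat: number;
  lon: number;
  churchId: string;
}

// Pull lat, lon, and church_id out of a map link href.
// Assumes the values appear as plain query parameters, as described above.
function parseMapLink(href: string): MapLinkData | null {
  // Hrefs copied from raw HTML may still contain "&amp;" separators.
  const clean = href.replace(/&amp;/g, "&");
  const param = (name: string): string | null => {
    const m = clean.match(new RegExp("[?&]" + name + "=([^&#]+)"));
    return m ? m[1] : null;
  };
  const latRaw = param("lat");
  const lonRaw = param("lon");
  const churchId = param("church_id");
  if (latRaw === null || lonRaw === null || churchId === null) return null;
  const lat = Number(latRaw);
  const lon = Number(lonRaw);
  if (!isFinite(lat) || !isFinite(lon)) return null;
  return { lat, lon, churchId };
}

// Split a heading of the form "Church Name (Location)" into its two parts.
// Headings without a trailing parenthesised part keep the full text as the name.
function parseHeading(text: string): { name: string; location: string | null } {
  const trimmed = text.trim();
  const m = trimmed.match(/^(.*?)\s*\(([^()]*)\)$/);
  return m ? { name: m[1], location: m[2].trim() } : { name: trimmed, location: null };
}
```

Keeping `churchId` as a string avoids any numeric formatting surprises when it is later stored in `weekdayMassesId`.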

### Mass time format

```
Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)
```

Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or combinations like `Mon Tue Wed Thu Fri`. `Holy Day` entries also occur.

Time format: `H.MMam/pm`, which needs conversion to 24-hour `HH:MM`.

The language in parentheses maps to our `language` field on mass_schedules.
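
The conversion and line format described above could be handled roughly as follows. This is a sketch under stated assumptions: the names `to24h`, `parseScheduleLine`, and `MassSlot` are hypothetical, `12.xxam` is taken to mean just after midnight and `12.xxpm` just after noon, and a time without a parenthesised language is accepted with `language` set to `null`.

```typescript
// Hypothetical names throughout; a sketch, not the importer's actual code.
interface MassSlot {
  days: string[];
  time: string; // 24-hour "HH:MM"
  language: string | null;
}

const DAY_LABELS = ["Sunday", "Mon", "Tue", "Wed", "Thu", "Fri", "Saturday"];

// Convert "6.30am" or "5.30pm" to "06:30" / "17:30". Returns null if malformed.
// Assumes 12.xxam means just after midnight and 12.xxpm just after noon.
function to24h(time: string): string | null {
  const m = time.trim().match(/^(\d{1,2})\.(\d{2})\s*(am|pm)$/i);
  if (!m) return null;
  let hour = Number(m[1]);
  const minute = Number(m[2]);
  if (hour < 1 || hour > 12 || minute > 59) return null;
  const isPm = m[3].toLowerCase() === "pm";
  if (hour === 12) hour = isPm ? 12 : 0;
  else if (isPm) hour += 12;
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(hour) + ":" + pad(minute);
}

// Parse one schedule line, e.g. "Mon Tue Wed Thu Fri: 6.30am(Tamil)".
function parseScheduleLine(line: string): MassSlot[] {
  const colon = line.indexOf(":");
  if (colon < 0) return [];
  const label = line.slice(0, colon).trim();
  const tokens = label.split(/\s+/);
  // Split only when every token is a known day, so "Holy Day" stays one label.
  const days = tokens.every((t) => DAY_LABELS.indexOf(t) >= 0) ? tokens : [label];
  const slots: MassSlot[] = [];
  for (const part of line.slice(colon + 1).split(",")) {
    const m = part.trim().match(/^(\d{1,2}\.\d{2}\s*(?:am|pm))\s*(?:\(([^)]*)\))?$/i);
    if (!m) continue; // skip fragments that are not a time
    const time = to24h(m[1]);
    if (time === null) continue;
    slots.push({ days, time, language: m[2] ? m[2].trim() : null });
  }
  return slots;
}
```

Because times use a dot rather than a colon, the first colon on a line can safely be treated as the separator between the day label and the times.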

### Country detection

The address is the last line of each church entry. Country can be detected by:
- GB: UK postal code pattern (e.g. `SW1A 1AA`)
- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in address
- India: 6-digit postal code (e.g. `600088`)
- Others: country name at end of address, or fallback to the area page being scraped
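
The rules above might be sketched as a single function. Everything here is an assumption for illustration: the name `detectCountry`, the simplified regular expressions (approximations, not full postcode validators), the ISO 3166-1 alpha-2 return values, the tiny country-name table, and the choice to check a trailing country name before the 6-digit rule (other countries, such as Singapore, also use six-digit codes).

```typescript
// Simplified patterns for illustration; they are not full validators.
const UK_POSTCODE = /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/i;
const EIRCODE = /\b[A-Z]\d{2}\s?[A-Z\d]{4}\b/i;
const SIX_DIGIT_CODE = /\b\d{6}\b/;

// Tiny illustrative lookup; a real table would need many more names.
const COUNTRY_NAMES: { [name: string]: string } = {
  "india": "IN",
  "sri lanka": "LK",
  "south korea": "KR",
  "japan": "JP",
};

// Assumed fallback per area slug; "outside-gb" maps to no single country.
const AREA_FALLBACK: { [area: string]: string } = {
  "gb": "GB",
  "ireland": "IE",
  "india": "IN",
  "sri-lanka": "LK",
};

function has(table: { [key: string]: string }, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(table, key);
}

// Returns an ISO 3166-1 alpha-2 code, or null when nothing matches.
function detectCountry(address: string, area: string): string | null {
  if (UK_POSTCODE.test(address)) return "GB";
  // "Northern Ireland" without a postcode should not be classed as Ireland.
  const mentionsIreland =
    /\bIreland\b/i.test(address) && !/Northern Ireland/i.test(address);
  if (EIRCODE.test(address) || mentionsIreland) return "IE";
  // Country name at the end of the address (text after the last comma).
  const parts = address.split(",");
  const tail = parts[parts.length - 1].trim().toLowerCase();
  if (has(COUNTRY_NAMES, tail)) return COUNTRY_NAMES[tail];
  if (SIX_DIGIT_CODE.test(address)) return "IN";
  // Last resort: the area page being scraped.
  return has(AREA_FALLBACK, area) ? AREA_FALLBACK[area] : null;
}
```

Checking the UK postcode first means addresses in Northern Ireland with a `BT` postcode resolve to GB before the "Ireland" text rule is consulted.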

## Design

### Schema

Add to Church model in both BethelGuide and ScraperControl:

```prisma
weekdayMassesId String? @unique @map("weekday_masses_id")
@@index([weekdayMassesId])
```

### Script: `scripts/import-weekdaymasses.ts`

Single script that:

1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
2. Parses HTML into structured church entries
3. Converts mass times from `H.MMam/pm` to `HH:MM` 24h format
4. Detects country from address patterns
5. Matches against existing churches by `weekdayMassesId` (exact match), then by proximity plus name
6. Upserts churches and replaces mass schedules

### HTML parsing strategy

Each church is a block between consecutive h3 headings. Within each block:
- h3 content = church name
- Lines with day labels + times = mass schedule
- Map link = coordinates + church_id
- Last text block before next h3 = address
- `Tel:` prefix = phone
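
The block-splitting strategy above can be sketched with plain regular expressions. This is a rough illustration, not the importer's parser: the names (`splitChurchBlocks`, `ChurchBlock`, `textLines`) are hypothetical, the real script may well use an HTML parsing library instead, and the code assumes that `<br>` and closing `p`/`div`/`li` tags mark line breaks and that the address is the final visible text line of the block (link text placed after it would need filtering).

```typescript
// Hypothetical shape for one parsed block; field names are illustrative.
interface ChurchBlock {
  heading: string;
  lines: string[];        // visible text lines between this h3 and the next
  mapHref: string | null; // raw href of the first link carrying church_id
  phone: string | null;
  address: string | null;
}

function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
}

// Turn an HTML fragment into non-empty text lines, treating <br> and common
// block-closing tags as line breaks (an assumption about the page markup).
function textLines(fragment: string): string[] {
  return fragment
    .replace(/<br\s*\/?>|<\/(?:p|div|li)>/gi, "\n")
    .split("\n")
    .map((l) => stripTags(l))
    .filter((l) => l.length > 0);
}

function splitChurchBlocks(html: string): ChurchBlock[] {
  const h3 = /<h3\b[^>]*>([\s\S]*?)<\/h3>/gi;
  const blocks: ChurchBlock[] = [];
  let match = h3.exec(html);
  while (match !== null) {
    // The block body runs from the end of this h3 to the start of the next.
    const bodyStart = match.index + match[0].length;
    const next = h3.exec(html);
    const body = html.slice(bodyStart, next ? next.index : html.length);
    const lines = textLines(body);
    let phone: string | null = null;
    const rest: string[] = [];
    for (const line of lines) {
      if (/^Tel:/i.test(line)) phone = line.replace(/^Tel:\s*/i, "");
      else rest.push(line);
    }
    const link = body.match(/<a\b[^>]*href="([^"]*church_id=[^"]*)"/i);
    blocks.push({
      heading: stripTags(match[1]),
      lines,
      mapHref: link ? link[1] : null,
      phone,
      address: rest.length > 0 ? rest[rest.length - 1] : null,
    });
    match = next;
  }
  return blocks;
}
```

Schedule lines stay in `lines` untouched here, so they can be handed to whatever mass-time parser the script uses.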

### CLI flags

- `--all`: import all 3 area pages
- `--area <name>`: import specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
- `--dry-run`: no database writes
- `--resume-from <n>`: skip first N churches
- `--job-id <uuid>`: background job tracking
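
One way the flags above could be parsed is shown below as a dependency-free sketch. The `ImportOptions` shape and the `parseFlags` name are hypothetical, and the behaviour for edge cases (unknown flags throw, `--area` replaces the default list) is an assumption rather than something the design specifies.

```typescript
// Hypothetical options shape; field names are illustrative.
interface ImportOptions {
  areas: string[];
  dryRun: boolean;
  resumeFrom: number;
  jobId: string | null;
}

const ALL_AREAS = ["gb", "ireland", "outside-gb"];

function parseFlags(argv: string[]): ImportOptions {
  // Default matches the design: all three area pages.
  const opts: ImportOptions = {
    areas: ALL_AREAS.slice(),
    dryRun: false,
    resumeFrom: 0,
    jobId: null,
  };
  let i = 0;
  // Consume the argument following a flag, rejecting a missing value.
  const valueFor = (flag: string): string => {
    i += 1;
    const v = argv[i];
    if (v === undefined || v.indexOf("--") === 0) {
      throw new Error(flag + " requires a value");
    }
    return v;
  };
  while (i < argv.length) {
    const flag = argv[i];
    switch (flag) {
      case "--all":
        opts.areas = ALL_AREAS.slice();
        break;
      case "--area":
        opts.areas = [valueFor(flag)];
        break;
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--resume-from": {
        const n = Number(valueFor(flag));
        if (!(n >= 0) || Math.floor(n) !== n) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      case "--job-id":
        opts.jobId = valueFor(flag);
        break;
      default:
        throw new Error("Unknown flag: " + flag);
    }
    i += 1;
  }
  return opts;
}
```

In a Node script this would typically be called as `parseFlags(process.argv.slice(2))`.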

### Church matcher integration

Add `weekdayMassesId` to `ExistingChurch` and `ChurchCandidate`, and add a new match pass to `findDuplicateChurch()`.
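
To illustrate the two-pass idea (exact ID first, then proximity plus name), here is a standalone sketch. It does not reproduce `findDuplicateChurch()` or the real `ExistingChurch` and `ChurchCandidate` types; the `ChurchRecord` and `Candidate` shapes, the `matchChurch` name, the 200 m radius, the normalized-name equality test, and the rule that records already linked to another ID are skipped in the second pass are all assumptions.

```typescript
// Hypothetical shapes; the real ExistingChurch / ChurchCandidate types may differ.
interface ChurchRecord {
  id: string;
  name: string;
  lat: number;
  lon: number;
  weekdayMassesId: string | null;
}

interface Candidate {
  name: string;
  lat: number;
  lon: number;
  weekdayMassesId: string;
}

// Great-circle distance in metres (haversine formula).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371000; // mean Earth radius in metres
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(lat2 - lat1) / 2);
  const sinLon = Math.sin(toRad(lon2 - lon1) / 2);
  const a = sinLat * sinLat + Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * sinLon * sinLon;
  return 2 * R * Math.asin(Math.sqrt(a));
}

function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();
}

function matchChurch(candidate: Candidate, existing: ChurchRecord[], maxMeters = 200): ChurchRecord | null {
  // Pass 1: exact match on the source's own identifier.
  for (const c of existing) {
    if (c.weekdayMassesId === candidate.weekdayMassesId) return c;
  }
  // Pass 2: nearest record with the same normalized name within the radius.
  const wanted = normalizeName(candidate.name);
  let best: ChurchRecord | null = null;
  let bestDist = maxMeters;
  for (const c of existing) {
    if (c.weekdayMassesId !== null) continue; // already linked to another listing
    if (normalizeName(c.name) !== wanted) continue;
    const d = haversineMeters(candidate.lat, candidate.lon, c.lat, c.lon);
    if (d <= bestDist) {
      best = c;
      bestDist = d;
    }
  }
  return best;
}
```

Strict name equality is deliberately conservative; a fuzzier comparison would catch more OSM records at the cost of more false merges.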

### Scheduler integration

Add `weekdaymasses-import` to the sequential imports group in the pipeline, along with a `getJobCommand()` case and an npm script.

## Scope

- ~3,500-4,000 churches with mass schedules
- Most GB/Ireland churches already in DB from OSM (will match and add schedules)
- India/Sri Lanka/international churches partially in DB from OSM/gcatholic
- Value: mass schedule data for thousands of churches that currently have none