ScraperControl/docs/plans/2026-03-01-weekdaymasses-importer-design.md

weekdaymasses.org.uk Global Importer

Context

weekdaymasses.org.uk is a UK-based Catholic directory listing roughly 3,500-4,000 churches worldwide, with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ other countries (India, Sri Lanka, South Korea, Japan, and more). Each area's data is served on a single HTML page, so no pagination handling or API is needed.

Data Source

Three area pages cover the entire site:

| Page       | URL                          | Est. churches |
|------------|------------------------------|---------------|
| GB         | /en/area/gb/churches         | ~3,000+       |
| Ireland    | /en/area/ireland/churches    | ~300+         |
| Outside GB | /en/area/outside-gb/churches | ~152+         |

Individual country/region pages (e.g. /en/area/india/churches) are subsets of these three.

Data per church

  • Name: h3 heading, format "Church Name (Location)"
  • Address: plain text after mass times, with postal/zip code
  • Coordinates: in map link query params lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN
  • Mass times: format Day: H.MMam/pm(Language), H.MMam/pm(Language)
  • Phone: Tel: +XX XXXX XXXXXX
  • Website: occasional links
  • church_id: unique numeric identifier in map links
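As a minimal sketch of how the coordinates and church_id could be pulled out of a map link, the helper below reads the three query parameters listed above. The function name, the returned field names, and the example link path are assumptions for illustration; only the parameter names lat, lon, and church_id come from this document.

```typescript
// Hypothetical result shape; field names are illustrative.
interface MapLinkData {
  lat: number;
  lon: number;
  churchId: string;
}

// Extract lat, lon and church_id from a map link's query string.
// Returns null when any parameter is missing or malformed.
function parseMapLink(href: string): MapLinkData | null {
  const queryStart = href.indexOf("?");
  if (queryStart === -1) return null;
  // HTML attributes often encode "&" as "&amp;", so normalise first.
  const query = href.slice(queryStart + 1).replace(/&amp;/g, "&");
  const params: { [key: string]: string } = {};
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    if (eq > 0) params[pair.slice(0, eq)] = decodeURIComponent(pair.slice(eq + 1));
  }
  const rawLat = params["lat"];
  const rawLon = params["lon"];
  const churchId = params["church_id"];
  if (!rawLat || !rawLon || !churchId) return null;
  if (!/^\d+$/.test(churchId)) return null; // church_id is numeric per the source
  const lat = Number(rawLat);
  const lon = Number(rawLon);
  if (!isFinite(lat) || !isFinite(lon)) return null;
  return { lat, lon, churchId };
}
```

Keeping churchId as a string matches the String? type proposed for weekdayMassesId and avoids losing leading zeros, should any exist.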

Mass time format

Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)

Day labels: Sunday, Mon, Tue, Wed, Thu, Fri, Saturday, or space-separated combinations such as Mon Tue Wed Thu Fri. Holy Day entries also occur.

Time format: H.MMam/pm — needs conversion to 24h HH:MM.

Language in parentheses maps to our language field on mass_schedules.
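The format above can be parsed with a sketch like the following. The names MassTime, to24h, and parseMassLine are hypothetical, and the handling of "Holy Day" as a single label is an assumption about how those entries are written.

```typescript
// Hypothetical parsed shape; names are illustrative.
interface MassTime {
  days: string[];          // labels as written on the page, e.g. ["Mon", "Tue"]
  time: string;            // 24-hour "HH:MM"
  language: string | null; // text inside the parentheses, if any
}

// Convert "6.30am" or "12.05pm" to "06:30" or "12:05"; null on unexpected input.
function to24h(raw: string): string | null {
  const m = /^(\d{1,2})\.(\d{2})\s*(am|pm)$/i.exec(raw.trim());
  if (!m) return null;
  let hour = Number(m[1]);
  const minute = Number(m[2]);
  if (hour < 1 || hour > 12 || minute > 59) return null;
  const isPm = m[3].toLowerCase() === "pm";
  if (hour === 12) hour = isPm ? 12 : 0; // 12.xxam is just after midnight
  else if (isPm) hour += 12;
  const hh = hour < 10 ? "0" + hour : String(hour);
  return hh + ":" + m[2];
}

// Parse one schedule line such as "Sunday: 6.30am(Tamil), 5.30pm(English)".
function parseMassLine(line: string): MassTime[] {
  const colon = line.indexOf(":"); // times use ".", so the first ":" ends the label
  if (colon === -1) return [];
  const label = line.slice(0, colon).trim();
  // Assumption: "Holy Day" is one label rather than two day names.
  const days = /^holy day/i.test(label) ? [label] : label.split(/\s+/);
  const result: MassTime[] = [];
  for (const part of line.slice(colon + 1).split(",")) {
    const m = /^\s*([^()\s]+)\s*(?:\(([^)]*)\))?\s*$/.exec(part);
    if (!m) continue;
    const time = to24h(m[1]);
    if (time === null) continue; // skip anything that is not a recognisable time
    result.push({ days: days, time: time, language: m[2] ? m[2].trim() : null });
  }
  return result;
}
```

Entries that do not match the expected pattern are skipped rather than guessed at, which keeps malformed lines from producing wrong schedule rows.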

Country detection

The address is the last line of each church entry. Country can be detected by:

  • GB: UK postal code pattern (e.g. SW1A 1AA)
  • Ireland: Irish Eircode (e.g. D01 F5P2) or "Ireland" in address
  • India: 6-digit postal code (e.g. 600088); some other countries also use 6-digit codes, so this is best cross-checked against the area page
  • Others: country name at end of address, or fallback to the area page being scraped
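The detection rules above could be sketched as follows. The regular expressions are simplified approximations of the UK postcode and Eircode formats, the two lookup tables are small hypothetical samples, and returning ISO 3166-1 alpha-2 codes is an assumption. The sketch checks an explicit country name before the 6-digit rule, which is one possible ordering rather than something this document prescribes.

```typescript
// Simplified patterns; both assume codes are written in upper case.
const UK_POSTCODE = /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/;
const EIRCODE = /\b(?:[A-Z]\d{2}|D6W)\s?[A-Z\d]{4}\b/;
const SIX_DIGIT = /\b\d{6}\b/;

// Hypothetical sample tables; a real importer would need fuller lists.
const COUNTRY_NAMES: { [name: string]: string } = {
  "india": "IN",
  "sri lanka": "LK",
  "south korea": "KR",
  "japan": "JP",
};
const AREA_DEFAULT_COUNTRY: { [area: string]: string } = {
  "gb": "GB",
  "ireland": "IE",
  "india": "IN",
  "sri-lanka": "LK",
};

// Returns a country code, or null when nothing can be inferred.
function detectCountry(address: string, area: string): string | null {
  const text = address.trim();
  // UK postcode first, so "Northern Ireland" addresses resolve to GB.
  if (UK_POSTCODE.test(text)) return "GB";
  if (EIRCODE.test(text) || /\bIreland\b/i.test(text)) return "IE";
  // An explicit country name at the end of the address.
  const lower = text.toLowerCase();
  for (const name of Object.keys(COUNTRY_NAMES)) {
    if (lower.slice(-name.length) === name) return COUNTRY_NAMES[name];
  }
  // 6-digit code treated as Indian, following the rule in this document.
  if (SIX_DIGIT.test(text)) return "IN";
  // Fall back to the area page being scraped; "outside-gb" has no default.
  return AREA_DEFAULT_COUNTRY[area] || null;
}
```

For example, detectCountry("1 Example Street, London SW1A 1AA", "gb") yields "GB", while an address with no recognisable pattern on the outside-gb page yields null and would need manual review or another signal.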

Design

Schema

Add to Church model in both BethelGuide and ScraperControl:

weekdayMassesId String? @unique @map("weekday_masses_id")
// Note: @unique already creates a unique index, so the explicit @@index below is likely redundant
@@index([weekdayMassesId])

Script: scripts/import-weekdaymasses.ts

Single script that:

  1. Fetches area pages (default: all 3; filterable with --area gb|ireland|outside-gb|india|...)
  2. Parses HTML into structured church entries
  3. Converts mass times from H.MMam/pm to HH:MM 24h format
  4. Detects country from address patterns
  5. Matches against existing churches, first by weekdayMassesId (exact), then by proximity and name
  6. Upserts churches and replaces mass schedules

HTML parsing strategy

Each church is a block between consecutive h3 headings. Within each block:

  • h3 content = church name
  • Lines with day labels + times = mass schedule
  • Map link = coordinates + church_id
  • Last text block before next h3 = address
  • Tel: prefix = phone
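The block-splitting strategy above might look roughly like this. It is a regex-based sketch that assumes a guessed markup (h3 headings followed by text separated by br or closing p/div/li tags); the real page structure is not documented here, and a proper HTML parser would be more robust. All names are hypothetical.

```typescript
// Hypothetical intermediate shape for one church block.
interface RawChurchBlock {
  name: string;
  location: string | null; // text in trailing parentheses of the heading
  lines: string[];         // visible text lines in the block
  hrefs: string[];         // raw href values, e.g. the map link
  phone: string | null;
  address: string | null;
}

// Remove tags, decode "&amp;", and collapse whitespace.
function stripTags(fragment: string): string {
  return fragment
    .replace(/<[^>]+>/g, " ")
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

function splitChurchBlocks(html: string): RawChurchBlock[] {
  const blocks: RawChurchBlock[] = [];
  const parts = html.split(/<h3\b[^>]*>/i);
  // parts[0] is everything before the first h3, so start at 1.
  for (let i = 1; i < parts.length; i++) {
    const part = parts[i];
    const close = part.search(/<\/h3>/i);
    if (close === -1) continue;
    const heading = stripTags(part.slice(0, close));
    const body = part.slice(close + "</h3>".length);

    // Heading format: "Church Name (Location)".
    const nameMatch = /^(.*?)\s*\(([^()]*)\)$/.exec(heading);

    const hrefs: string[] = [];
    const hrefPattern = /href\s*=\s*"([^"]*)"/gi;
    let h: RegExpExecArray | null;
    while ((h = hrefPattern.exec(body)) !== null) hrefs.push(h[1]);

    const lines = body
      .replace(/<br\s*\/?>|<\/(?:p|div|li)>/gi, "\n")
      .split("\n")
      .map((line) => stripTags(line))
      .filter((line) => line.length > 0);

    let phone: string | null = null;
    let address: string | null = null;
    for (const line of lines) {
      if (/^Tel:/i.test(line)) phone = line.replace(/^Tel:\s*/i, "");
      else address = line; // last non-phone line wins; may be wrong if no address exists
    }

    blocks.push({
      name: nameMatch ? nameMatch[1] : heading,
      location: nameMatch ? nameMatch[2] : null,
      lines: lines,
      hrefs: hrefs,
      phone: phone,
      address: address,
    });
  }
  return blocks;
}
```

Keeping the raw lines and hrefs on each block lets later steps (schedule parsing, coordinate extraction, country detection) run independently and makes --dry-run output easier to inspect.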

CLI flags

  • --all — import all 3 area pages
  • --area <name> — import specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
  • --dry-run — no database writes
  • --resume-from <n> — skip first N churches
  • --job-id <uuid> — background job tracking
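A dependency-free sketch of parsing these flags is shown below. The ImportOptions shape and function names are hypothetical, and treating --area as a single value that overrides the default of all three pages is an assumption based on the flag descriptions above.

```typescript
// Hypothetical options object produced from the flags above.
interface ImportOptions {
  areas: string[];
  dryRun: boolean;
  resumeFrom: number;
  jobId: string | null;
}

const DEFAULT_AREAS = ["gb", "ireland", "outside-gb"];

// Return the value following a flag, or throw if it is missing.
function requireValue(argv: string[], index: number, flag: string): string {
  const value = argv[index];
  if (value === undefined || value.indexOf("--") === 0) {
    throw new Error(flag + " expects a value");
  }
  return value;
}

// Parse arguments such as ["--area", "india", "--dry-run"].
function parseCliArgs(argv: string[]): ImportOptions {
  const opts: ImportOptions = {
    areas: DEFAULT_AREAS.slice(), // default: all three area pages
    dryRun: false,
    resumeFrom: 0,
    jobId: null,
  };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case "--all":
        opts.areas = DEFAULT_AREAS.slice();
        break;
      case "--area":
        opts.areas = [requireValue(argv, ++i, flag)];
        break;
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--resume-from": {
        const n = Number(requireValue(argv, ++i, flag));
        if (!(n >= 0) || Math.floor(n) !== n) {
          throw new Error("--resume-from expects a non-negative integer");
        }
        opts.resumeFrom = n;
        break;
      }
      case "--job-id":
        opts.jobId = requireValue(argv, ++i, flag);
        break;
      default:
        throw new Error("Unknown flag: " + flag);
    }
  }
  return opts;
}
```

In the script this would typically be called as parseCliArgs(process.argv.slice(2)). Rejecting unknown flags early avoids a long import run starting with a mistyped option silently ignored.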

Church matcher integration

Add weekdayMassesId to ExistingChurch and ChurchCandidate, and add a new match pass on that field in findDuplicateChurch().
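To illustrate the intended match order (exact weekdayMassesId first, then proximity plus name), here is a standalone sketch. It does not reproduce the real ExistingChurch, ChurchCandidate, or findDuplicateChurch() definitions, which are not shown in this document; the simplified types, the 200 m radius, and the name comparison are all assumptions.

```typescript
// Simplified stand-ins for the real matcher types.
interface StoredChurch {
  id: string;
  name: string;
  latitude: number;
  longitude: number;
  weekdayMassesId: string | null;
}
interface ScrapedChurch {
  name: string;
  latitude: number;
  longitude: number;
  weekdayMassesId: string;
}

// Great-circle distance in metres (haversine formula).
function haversineMeters(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371000; // mean Earth radius in metres
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(lat2 - lat1) / 2);
  const sinLon = Math.sin(toRad(lon2 - lon1) / 2);
  const a = sinLat * sinLat + Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * sinLon * sinLon;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Lower-case, drop common punctuation, collapse whitespace.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[.,'’()\-]/g, "").replace(/\s+/g, " ").trim();
}

// Pass 1: exact weekdayMassesId. Pass 2: nearest church within the radius
// whose normalised name equals or contains the candidate's (or vice versa).
function matchByIdThenProximity(
  candidate: ScrapedChurch,
  existing: StoredChurch[],
  maxDistanceMeters: number = 200 // assumed threshold
): StoredChurch | null {
  for (const church of existing) {
    if (church.weekdayMassesId === candidate.weekdayMassesId) return church;
  }
  const wanted = normalizeName(candidate.name);
  if (wanted.length === 0) return null;
  let best: StoredChurch | null = null;
  let bestDistance = Infinity;
  for (const church of existing) {
    const distance = haversineMeters(
      candidate.latitude, candidate.longitude, church.latitude, church.longitude
    );
    if (distance > maxDistanceMeters || distance >= bestDistance) continue;
    const have = normalizeName(church.name);
    if (have.length === 0) continue;
    if (have.indexOf(wanted) !== -1 || wanted.indexOf(have) !== -1) {
      best = church;
      bestDistance = distance;
    }
  }
  return best;
}
```

Running the exact-ID pass first means re-imports stay stable even if a church's name or coordinates change on the source site; the proximity pass only matters for the first import of each church.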

Scheduler integration

Add weekdaymasses-import to the sequential imports group in the pipeline, along with a getJobCommand() case and an npm script.

Scope

  • ~3,500-4,000 churches with mass schedules
  • Most GB/Ireland churches already in DB from OSM (will match and add schedules)
  • India/Sri Lanka/international churches partially in DB from OSM/gcatholic
  • Value: mass schedule data for thousands of churches that currently have none