# Design: DiscoverMass.com Importer (USA)

## Overview

Import 20,284 US Catholic churches with full schedules from discovermass.com. This fills the largest gap in US coverage: MassTimes.org runs with `--skip-us`, leaving most US churches absent from the database.

**Note:** St. Paul the Apostle, Chino Hills CA (the motivating example) was manually seeded. The importer will match it by proximity+name and link `discovermassId` without duplicating it.

---

## Source

- **Site**: https://discovermass.com
- **Coverage**: USA only
- **Data**: 20,284 churches with name, address, phone, website, coordinates, mass times, confessions, adoration
- **Backend**: WordPress "javo-directory" theme + MassTimes.org data provider
- **robots.txt**: `Crawl-delay: 10`, which must be followed

---

## Enumeration Strategy

11 WordPress item sitemaps, 2000 entries each (last has 284):

```
https://discovermass.com/wp-sitemap-posts-item-{1..11}.xml
```

Extract all `<loc>` URLs. URL pattern:

```
https://discovermass.com/church/{slug}/
```

Slug becomes `discovermassId` (e.g. `"st-paul-the-apostle-chino-hills-ca"`).

No rate limiting on sitemap fetches (11 requests total).

---

## HTML Parsing

Server-rendered HTML. No JS execution needed.

### Name

``

### Address

Raw text node before `` in the address block:

```
14085 Peyton Drive, Chino Hills, CA 91709
```

Parse: split on `", "`. The last segment is `"{STATE} {ZIP}"`, the second-to-last is the city, and the rest is the street.

### Phone + Website + Coordinates

From `