Compare commits
26 Commits
5c7bc4cfed...main
| Author | SHA1 | Date |
|---|---|---|
| | 027ca59a01 | |
| | 9d0af3289a | |
| | 6d1c7eb3c5 | |
| | 206b64b9b8 | |
| | 4609fd97db | |
| | 2c51513851 | |
| | 76cca3ba75 | |
| | 3cf1465fb6 | |
| | 92265cf27f | |
| | 8075072c24 | |
| | 3ebbc3732f | |
| | eedb442e78 | |
| | 38274174a9 | |
| | 328d146201 | |
| | 9aea12f4b0 | |
| | 033f805965 | |
| | 3bd4d2e2f9 | |
| | 73d8e8990c | |
| | 3cb780a692 | |
| | 8f7c4d1698 | |
| | 857eaedbcf | |
| | 93d8a9080a | |
| | da4aa61860 | |
| | 9593e08983 | |
| | 2b37c2d5f2 | |
| | dde083c32e | |
6
.gitignore
vendored
Normal file
@@ -0,0 +1,6 @@
node_modules/
.next/
.env
.env.*
.claude/
.worktrees/
254
CLAUDE.md
Normal file
@@ -0,0 +1,254 @@
# Role in Ecosystem

**ScraperControl** is the data pipeline for the Church project: it handles scraping, enrichment, ChromaDB semantic search, and data transfer to Neon production. It runs on the Synology NAS (Docker), not Vercel.

- **Schema sync**: Handled by `npm run sync` from the `Church/` root directory. No need to manually copy schema files.
- **Coordinated deployment**: Use `npm run deploy` from `Church/` root for full pipeline deployment.
- **Schema source of truth**: BethelGuide. Never run `prisma migrate` in ScraperControl.

---

# Claude Instructions for ScraperControl

## Project Overview

**ScraperControl** is the scraping, enrichment, and data management backend for the NearestMass church finder. It provides:

1. **Admin Dashboard** (Next.js): Job management UI at port 3001
2. **Web Scrapers**: Playwright-based scrapers for extracting mass schedules from church websites
3. **Enrichment Pipelines**: Google Places, FreeSearch, reverse geocode enrichment
4. **ChromaDB Integration**: Semantic search for deduplication, content classification, and change detection
5. **Scheduler**: Database-driven job queue for automated scraping

### Shared Database Architecture

ScraperControl and BethelGuide share the **same NAS PostgreSQL database** (192.168.0.145:5434). BethelGuide is the **schema source of truth**. After any schema change in BethelGuide:

1. Copy `BethelGuide/prisma/schema.prisma` → `ScraperControl/prisma/schema.prisma`
2. Run `npx prisma generate` in ScraperControl (NOT `migrate`)
3. Rebuild Docker containers if needed

---

## Tech Stack

| Layer | Technology |
|-------|------------|
| Admin UI | Next.js 16, React 19, Tailwind CSS v4 |
| Database | Shared NAS PostgreSQL (192.168.0.145:5434) |
| ORM | Prisma 7 (`@prisma/adapter-pg` + `pg` Pool) |
| Web Scraping | Playwright (headless Chromium) |
| Vector DB | ChromaDB (192.168.0.145:8000) |
| Embeddings | Ollama on MacBook (192.168.0.75:11434) with nomic-embed-text |
| Scheduling | node-cron + database-driven job queue |
| Containerization | Docker, Docker Compose |

---

## Project Structure

```
src/
├── app/                            # Next.js Admin Dashboard (port 3001)
│   ├── page.tsx                    # Main dashboard (Jobs, Scrapes, Search tabs)
│   └── api/admin/                  # Admin API routes
│       ├── jobs/                   # Job management (GET/POST/PATCH)
│       ├── scrape-log/             # Recently scraped churches log
│       └── freesearch-log/         # FreeSearch results log
│
├── chromadb/                       # ChromaDB integration
│   ├── client.ts                   # ChromaDB client singleton
│   ├── embeddings.ts               # OpenAI-compatible embedding helper (Ollama)
│   ├── collections.ts              # Collection definitions (5 collections)
│   └── queries.ts                  # Query helpers per use case
│
├── lib/                            # Core business logic
│   ├── db.ts                       # Prisma client singleton
│   ├── admin-auth.ts               # Timing-safe API key auth
│   ├── geo.ts                      # Haversine distance (minimal)
│   ├── scraper-service.ts          # Scraper orchestration
│   ├── overpass-client.ts          # OpenStreetMap Overpass API
│   ├── church-matcher.ts           # Church matching/dedup
│   └── masstimes-scraper.ts        # MassTimes.org integration
│
└── scrapers/                       # Web scraping system
    ├── base-scraper.ts             # Base class
    ├── index.ts                    # Exports
    ├── registry.ts                 # Strategy registry
    ├── url-discovery.ts            # Mass schedule URL finder
    ├── strategies/                 # Language-specific scrapers
    │   ├── generic.ts              # Fallback (10+ languages)
    │   ├── english.ts
    │   ├── french.ts
    │   ├── german.ts
    │   ├── italian.ts
    │   └── spanish.ts
    └── i18n/                       # Internationalization
        ├── day-names.ts            # Day name patterns per language
        └── day-ranges.ts           # Day range parsing ("Monday-Friday")

scripts/                            # CLI scripts
├── scrape-churches.ts              # Scrape churches by language
├── scrape-masstimes.ts             # Scrape from MassTimes.org
├── import-osm-churches.ts          # Import from OpenStreetMap
├── import-osm-region.ts            # Import specific OSM region
├── enrich-with-google-places.ts    # Google Places enrichment
├── enrich-with-freesearch.ts       # FreeSearch website enrichment
├── enrich-with-reverse-geocode.ts  # Reverse geocode enrichment
├── scheduler.ts                    # Background job scheduler
├── dedup-mass-schedules.ts         # Mass schedule deduplication
├── dedup-churches.ts               # Church dedup via ChromaDB
├── transfer-enriched-to-neon.ts    # NAS → Neon production sync
├── populate-chromadb.ts            # Bulk-populate ChromaDB collections
├── populate-city-normalized.ts
├── save-schedules-to-db.ts
├── test-scraper.ts                 # Test scraper on a URL
├── test-url-discovery.ts           # Test URL discovery
├── test-edge-cases.ts              # International edge case tests
└── debug/                          # Debug/investigation scripts (~44 files)
```
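
The tree lists `src/lib/geo.ts` as a minimal Haversine helper, but its contents are not part of this diff. The following is only a sketch of the standard formula under stated assumptions: the names `LatLng` and `haversineKm` and the 6371 km mean Earth radius are illustrative choices, not the project's actual code.

```typescript
// Hypothetical sketch; not the actual contents of src/lib/geo.ts.
interface LatLng {
  lat: number; // latitude in degrees
  lng: number; // longitude in degrees
}

// Mean Earth radius in kilometres (assumed constant for this sketch).
const EARTH_RADIUS_KM = 6371;

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points using the Haversine formula.
function haversineKm(a: LatLng, b: LatLng): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  // Clamp to 1 so floating-point drift can never push asin out of range.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}
```

With this radius, one degree of longitude along the equator comes out at roughly 111 km, which is a quick sanity check for any implementation.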

---

## Common Commands

```bash
# === DEVELOPMENT ===
npm run dev              # Start admin dashboard (localhost:3001)
npm run build            # Build Next.js app

# === SCRAPING ===
npm run scrape:churches  # Scrape churches (pass --language, --all flags)
npm run scrape:masstimes # Scrape from MassTimes.org
npm run test:scraper     # Test scraper on a URL
npm run test:discover    # Test URL discovery

# === ENRICHMENT ===
npm run enrich:places    # Google Places enrichment
npm run enrich:freesearch # FreeSearch website enrichment

# === DATA MANAGEMENT ===
npm run dedup:masses     # Deduplicate mass schedules
npm run import:osm       # Import churches from OpenStreetMap
npm run transfer:neon    # Transfer enriched data to Neon production
npm run scheduler        # Start background job scheduler

# === CHROMADB ===
npx tsx scripts/populate-chromadb.ts --all                          # Populate all collections
npx tsx scripts/populate-chromadb.ts --collection church_identity   # Single collection
npx tsx scripts/dedup-churches.ts --threshold 0.15                  # Find duplicate churches

# === DOCKER (on NAS) ===
docker compose build scraper                                # Build scraper image
docker compose --profile tools run --rm scraper <command>   # Run one-off scraper
docker compose up -d scheduler freesearch-enrichment        # Start background services
```

---

## ChromaDB Integration

### Collections

| Collection | Purpose | Documents |
|---|---|---|
| `church_identity` | Deduplication | `{name} {address} {city} {country}` |
| `search_results` | FreeSearch matching | `{title} {snippet} {url}` |
| `page_classification` | Content classification | Page text (first 2000 chars) |
| `schedule_sections` | Schedule detection | Text blocks with mass times |
| `page_snapshots` | Change detection | Full page text |
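
The `church_identity` row describes its document as the template `{name} {address} {city} {country}`. A minimal sketch of how such a string might be assembled follows; the type `ChurchIdentityInput`, the function `buildChurchIdentityDocument`, and the choice to skip empty fields and collapse whitespace are all assumptions for illustration, since the real builder is not shown in this diff.

```typescript
// Hypothetical input shape; the real Prisma model may differ.
interface ChurchIdentityInput {
  name: string;
  address?: string | null;
  city?: string | null;
  country?: string | null;
}

// Joins the fields in template order, skipping missing or blank values
// and collapsing internal whitespace (both are assumed normalisation rules).
function buildChurchIdentityDocument(church: ChurchIdentityInput): string {
  return [church.name, church.address, church.city, church.country]
    .filter((part): part is string => typeof part === "string" && part.trim().length > 0)
    .map((part) => part.trim().replace(/\s+/g, " "))
    .join(" ");
}
```

Keeping the normalisation in one function means the same string is produced when populating the collection and when querying it, which matters for distance-based deduplication.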

### Infrastructure

- **ChromaDB server**: `http://192.168.0.145:8000` (on NAS)
- **Embedding API**: `http://192.168.0.75:11434/v1` (Ollama on MacBook M1)
- **Embedding model**: `nomic-embed-text` (~270MB, fast on M1)

### Prerequisite

Ollama must be running on the MacBook with LAN access enabled:
```bash
OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull nomic-embed-text
```
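
Because the embedding endpoint is described as OpenAI-compatible, a request to it would follow the OpenAI embeddings shape: a `POST` to `<base>/embeddings` with a JSON body containing `model` and `input`. The sketch below only builds that request; `buildEmbeddingRequest` and `EmbeddingRequest` are hypothetical names, and the actual helper in `src/chromadb/embeddings.ts` is not shown in this diff.

```typescript
// Hypothetical request description that can be handed to fetch(url, init).
interface EmbeddingRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Builds an OpenAI-style embeddings request for the given base URL and model.
function buildEmbeddingRequest(baseUrl: string, model: string, input: string[]): EmbeddingRequest {
  return {
    // Strip trailing slashes so ".../v1/" and ".../v1" behave the same.
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input }),
    },
  };
}
```

A caller would pass `EMBEDDING_API_URL` and `EMBEDDING_MODEL` from the environment and read one vector per input from the `data[i].embedding` fields of the OpenAI-style response.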

---

## Docker Services

| Service | Profile | Purpose |
|---|---|---|
| app | (default) | Admin dashboard on port 3001 |
| scraper | tools | Generic scraper (on-demand) |
| scraper-english | scraper-english | English language scraper |
| scraper-french | scraper-french | French language scraper |
| scraper-german | scraper-german | German language scraper |
| scraper-italian | scraper-italian | Italian language scraper |
| scraper-spanish | scraper-spanish | Spanish language scraper |
| scraper-generic | scraper-generic | Generic fallback scraper |
| scheduler | (default) | Background job scheduler |
| freesearch-enrichment | (default) | FreeSearch enrichment daemon |

---

## Environment Variables

```env
DATABASE_URL="postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"
ADMIN_API_KEY=your-secret-key
CHROMADB_URL=http://192.168.0.145:8000
EMBEDDING_API_URL=http://192.168.0.75:11434/v1
EMBEDDING_MODEL=nomic-embed-text
GOOGLE_PLACES_API_KEY=your-google-key
FREESEARCH_URL=http://192.168.0.145:3111
```
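
`ADMIN_API_KEY` is the secret that `src/lib/admin-auth.ts` is described as checking with a timing-safe comparison. That file is not part of this diff, so the following is a sketch of one common way to do it with Node's `crypto.timingSafeEqual`; the function name `apiKeyMatches` and the decision to hash both values first are assumptions.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper; not the actual contents of src/lib/admin-auth.ts.
// timingSafeEqual requires equal-length inputs, so both keys are hashed
// to fixed-size SHA-256 digests before comparing.
function apiKeyMatches(provided: string | null | undefined, expected: string | undefined): boolean {
  if (!provided || !expected) return false; // missing header or unset env var
  const providedDigest = createHash("sha256").update(provided).digest();
  const expectedDigest = createHash("sha256").update(expected).digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

An API route would call it with the key taken from the request and `process.env.ADMIN_API_KEY`, returning a 401 response when it yields `false`.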

---

## NAS Deployment

ScraperControl is deployed on the Synology NAS at `/volume1/docker/scraper-control/`.

### Container Layout

| Container | Purpose | Port |
|-----------|---------|------|
| scraper-control-app-1 | Admin dashboard | 3001 |
| scraper-control-scheduler-1 | Job scheduler | - |
| scraper-control-freesearch-enrichment-1 | FreeSearch daemon | - |

The `db` container (`nearestmass-db-1`) is managed by BethelGuide's compose file at `/volume1/docker/nearestmass/`. ScraperControl joins the same `nearestmass_default` external Docker network; no `depends_on` is allowed because `db` is in a different compose file.

### Deploying Updates

```bash
# From local machine:
bash scripts/deploy-to-nas.sh

# Or manually:
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/

ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scraper && /usr/local/bin/docker compose up -d scheduler freesearch-enrichment'
```

### Rebuilding Admin Dashboard

```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build app && /usr/local/bin/docker compose up -d app'
```

### Important Notes

- **DO NOT** add `depends_on: db` to any service; `db` is in BethelGuide's compose file
- The `.env` on NAS uses host IP (`192.168.0.145:5434`) for scripts run outside Docker
- The `docker-compose.yml` environment overrides use `db:5432` (Docker DNS via shared network)
- Docker binary on NAS is at `/usr/local/bin/docker`

### NAS Docker Health

The Synology NAS (4 CPU, 17GB RAM) runs 23 containers across 7 projects. Church project containers (5) all have memory limits and log rotation. See `memory/nas-docker-health.md` for full inventory.

**Scheduler hardening**: Uses `detached: true` + process group kill to prevent orphaned Chromium processes, `init: true` for zombie reaping, 24h job timeout, 8GB memory limit.
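
The `detached: true` plus process group kill technique can be sketched as follows. On POSIX systems, a child spawned with `detached: true` becomes the leader of a new process group, and passing a negative PID to `process.kill` signals every process in that group, including any Chromium children. The helper names `spawnInOwnGroup` and `killProcessGroup` are hypothetical, and the injectable `kill` parameter exists only to make the sketch testable; the scheduler's real code is not shown here.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Spawns a command as the leader of its own process group (POSIX behaviour
// of `detached: true`), so the whole tree can be signalled at once.
function spawnInOwnGroup(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: "inherit" });
}

// Signals the child's entire process group by negating its PID.
// Returns false if the child never started or the group is already gone.
function killProcessGroup(
  child: Pick<ChildProcess, "pid">,
  signal: NodeJS.Signals = "SIGTERM",
  kill: (pid: number, signal: NodeJS.Signals) => unknown = (pid, sig) => process.kill(pid, sig),
): boolean {
  if (child.pid === undefined) return false;
  try {
    kill(-child.pid, signal); // negative PID targets the process group
    return true;
  } catch {
    return false; // typically ESRCH: no such process group
  }
}
```

A timeout handler might call `killProcessGroup(child, "SIGTERM")` first and escalate to `"SIGKILL"` after a grace period; that escalation policy is a design choice outside what this document specifies.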

**Maintenance**: Docker is on /volume1 (15TB free). Run `docker builder prune -f` occasionally to keep build cache tidy.
30
Dockerfile
Normal file
@@ -0,0 +1,30 @@
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json* ./
COPY prisma ./prisma/
RUN npm ci && npx prisma generate

FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
ENV NEXT_TELEMETRY_DISABLED=1
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
ENV PORT=3001
ENV HOSTNAME="0.0.0.0"

RUN addgroup --system --gid 1001 nodejs && \
    adduser --system --uid 1001 nextjs

COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static

USER nextjs
EXPOSE 3001
CMD ["node", "server.js"]
21
Dockerfile.scraper
Normal file
@@ -0,0 +1,21 @@
FROM node:20-bookworm-slim

# Install Playwright system dependencies + Chromium
RUN apt-get update && \
    npx playwright install --with-deps chromium && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY package.json package-lock.json ./
COPY prisma ./prisma/
RUN npm ci
RUN npx prisma generate

COPY src ./src/
COPY scripts ./scripts/
COPY tsconfig.json ./

# Default: run the masstimes scraper
CMD ["npx", "tsx", "scripts/scrape-masstimes.ts"]
315
docker-compose.yml
Normal file
@@ -0,0 +1,315 @@
x-scraper-logging: &scraper-logging
  driver: json-file
  options:
    max-size: "50m"
    max-file: "3"

x-scraper-limits: &scraper-limits
  deploy:
    resources:
      limits:
        memory: 4G

services:
  db:
    image: postgres:15-alpine
    ports:
      - "5434:5432"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-postgres}
      - POSTGRES_DB=nearestmass
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
    shm_size: 256m
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"

  app:
    build: .
    ports:
      - "3001:3001"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - ADMIN_API_KEY=${ADMIN_API_KEY}
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 1G
    logging:
      driver: json-file
      options:
        max-size: "20m"
        max-file: "3"

  scraper:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - tools
    <<: *scraper-limits
    logging: *scraper-logging

  # English scraper (on-demand via scheduler or API)
  scraper-english:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "english", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-english
    <<: *scraper-limits
    logging: *scraper-logging

  # Generic scraper (for languages without dedicated scrapers)
  scraper-generic:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "generic", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-generic
    <<: *scraper-limits
    logging: *scraper-logging

  # French scraper (on-demand via scheduler or API)
  scraper-french:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "french", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-french
    <<: *scraper-limits
    logging: *scraper-logging

  # German scraper (on-demand via scheduler or API)
  scraper-german:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "german", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-german
    <<: *scraper-limits
    logging: *scraper-logging

  # Italian scraper (on-demand via scheduler or API)
  scraper-italian:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "italian", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-italian
    <<: *scraper-limits
    logging: *scraper-logging

  # Spanish scraper (on-demand via scheduler or API)
  scraper-spanish:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "spanish", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-spanish
    <<: *scraper-limits
    logging: *scraper-logging

  # Polish scraper (on-demand via scheduler or API)
  scraper-polish:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "polish", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-polish
    <<: *scraper-limits
    logging: *scraper-logging

  # Portuguese scraper (on-demand via scheduler or API)
  scraper-portuguese:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "portuguese", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-portuguese
    <<: *scraper-limits
    logging: *scraper-logging

  # Dutch scraper (on-demand via scheduler or API)
  scraper-dutch:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "dutch", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-dutch
    <<: *scraper-limits
    logging: *scraper-logging

  # Czech scraper (on-demand via scheduler or API)
  scraper-czech:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "czech", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-czech
    <<: *scraper-limits
    logging: *scraper-logging

  # Hungarian scraper (on-demand via scheduler or API)
  scraper-hungarian:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    command: ["npx", "tsx", "scripts/scrape-churches.ts", "--all", "--language", "hungarian", "--max-failures", "10"]
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
    profiles:
      - scraper-hungarian
    <<: *scraper-limits
    logging: *scraper-logging

  scheduler:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    init: true  # tini as PID 1, reaps zombie Chromium processes
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - CHROMADB_URL=${CHROMADB_URL}
      - BAIDU_MAPS_API_KEY=${BAIDU_MAPS_API_KEY}
    command: ["npx", "tsx", "scripts/scheduler.ts"]
    volumes:
      - ./logs:/app/logs
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 8G
    stop_grace_period: 30s
    healthcheck:
      test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
      interval: 90s
      timeout: 10s
      retries: 3
      start_period: 90s
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"

  freesearch-enrichment:
    build:
      context: .
      dockerfile: Dockerfile.scraper
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/nearestmass
      - FREESEARCH_URL=${FREESEARCH_URL}
      - CHROMADB_URL=${CHROMADB_URL}
    command: ["npx", "tsx", "scripts/enrich-with-freesearch.ts", "--continuous"]
    volumes:
      - ./logs:/app/logs
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"

volumes:
  postgres_data:
43
docs/plans/2026-02-25-parallel-scrapers-design.md
Normal file
@@ -0,0 +1,43 @@
# Parallel Scrapers with Country Mapping Fix

## Problem

The scheduler runs scrapers sequentially, one language at a time. With 19,996 unscraped churches queued across 10 language scrapers, a full cycle takes days. The English scraper alone runs 30+ hours. Additionally, 1,414 churches in unmapped countries (BE, CH, IN, etc.) fall through to the generic scraper instead of being handled by appropriate language scrapers.

## Changes

### 1. Country Mapping Additions (scraper-service.ts)

Add to `COUNTRY_SCRAPER_MAP`:
- **English**: IN, SG, MY, KE, JM, TT, GH, NG, ZA, TZ, UG
- **French**: BE, LU
- **German**: CH, SI
- **Italian**: HR, RO

### 2. Parallel Pipeline Groups (scheduler.ts)

Replace sequential `PIPELINE_PHASES` array with grouped phases:

| Group | Phases | Concurrency |
|-------|--------|-------------|
| 1 | osm-import, gcatholic-import | Sequential (shared data) |
| 2 | english, french, german | Parallel (3) |
| 3 | polish, spanish, italian | Parallel (3) |
| 4 | portuguese, czech, dutch | Parallel (3) |
| 5 | hungarian, generic | Parallel (2) |

Scheduler starts all jobs in a group simultaneously, waits for all to finish, then advances to the next group.
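
The group-by-group behaviour described above can be sketched in a few lines. This is an illustration only: `JobGroup` and `runGroups` are hypothetical names, jobs are modelled as plain async functions rather than spawned child processes, and swallowing individual job failures so that sibling jobs keep running is an assumed policy.

```typescript
type GroupMode = "sequential" | "parallel";

// Hypothetical shape: each job is an async function standing in for a scraper run.
interface JobGroup {
  name: string;
  mode: GroupMode;
  jobs: Array<() => Promise<void>>;
}

// Runs groups one after another. Within a parallel group all jobs start
// together and the group finishes only when every job has settled.
async function runGroups(groups: JobGroup[]): Promise<void> {
  for (const group of groups) {
    if (group.mode === "parallel") {
      // allSettled: one failed scraper does not abandon its siblings.
      await Promise.allSettled(group.jobs.map((job) => job()));
    } else {
      for (const job of group.jobs) {
        // Assumed policy: a failed sequential job does not stop the group.
        await job().catch(() => undefined);
      }
    }
  }
}
```

The total time for a parallel group is set by its slowest job, which is why grouping a 30+ hour scraper with shorter ones still shortens the overall cycle.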

### 3. Generic Scraper Deprioritized

- Moved to last group
- Pre-check query: skip if no unscraped churches in generic queue (avoids wasteful re-scrapes)

### 4. Resource Changes

- Scheduler container memory limit: 4GB → 10GB (3 concurrent Playwright/Chromium processes)
- No new Docker containers or compose changes needed; the existing child process spawning approach is kept

## Approach

Approach B: parallel child processes inside the scheduler container. No Docker-in-Docker. The scheduler already spawns `npx tsx` processes; we just allow multiple to run concurrently instead of waiting for each to finish before starting the next.
423
docs/plans/2026-02-25-parallel-scrapers.md
Normal file
@@ -0,0 +1,423 @@
# Parallel Scrapers Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Run language scrapers in parallel groups of 3, add missing country mappings, and deprioritize the generic scraper.

**Architecture:** Replace sequential pipeline phases with grouped phases. Each parallel group runs its jobs concurrently (max 3) and waits for all of them to complete before advancing. Import phases stay sequential. The scheduler tracks an `activeGroupJobs` counter for the current group instead of advancing on every job completion.

**Tech Stack:** TypeScript, Node `child_process` spawn, Prisma, Docker Compose

---

### Task 1: Add Missing Country Mappings

**Files:**
- Modify: `src/lib/scraper-service.ts:29-45`

**Step 1: Update COUNTRY_SCRAPER_MAP**

Add these entries to the existing `COUNTRY_SCRAPER_MAP` object at `src/lib/scraper-service.ts:29`:

```typescript
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  US: 'english', CA: 'english', GB: 'english',
  AU: 'english', NZ: 'english', IE: 'english', PH: 'english',
  IN: 'english', SG: 'english', MY: 'english', KE: 'english',
  JM: 'english', TT: 'english', GH: 'english', NG: 'english',
  ZA: 'english', TZ: 'english', UG: 'english',
  FR: 'french', BE: 'french', LU: 'french',
  ES: 'spanish', MX: 'spanish', AR: 'spanish', CO: 'spanish',
  CL: 'spanish', PE: 'spanish', EC: 'spanish', VE: 'spanish',
  CR: 'spanish', PA: 'spanish', GT: 'spanish', CU: 'spanish',
  HN: 'spanish', SV: 'spanish', NI: 'spanish', BO: 'spanish',
  PY: 'spanish', UY: 'spanish', DO: 'spanish',
  IT: 'italian', SM: 'italian', VA: 'italian',
  HR: 'italian', RO: 'italian',
  DE: 'german', AT: 'german', LI: 'german',
  CH: 'german', SI: 'german',
  PL: 'polish',
  PT: 'portuguese', BR: 'portuguese',
  NL: 'dutch',
  CZ: 'czech', SK: 'czech',
  HU: 'hungarian',
};
```

Also update `buildLanguageFilter` at `src/lib/scraper-service.ts:346-463` to include the new countries in each language filter's country list:

- `english` filter (line 356): add `'IN', 'SG', 'MY', 'KE', 'JM', 'TT', 'GH', 'NG', 'ZA', 'TZ', 'UG'`
- `french` filter (line 366): add `'BE', 'LU'` → `{ in: ['FR', 'BE', 'LU'] }`
- `spanish` filter: already has all needed countries
- `italian` filter (line 387): add `'HR', 'RO'` → `{ in: ['IT', 'SM', 'VA', 'HR', 'RO'] }`
- `german` filter (line 397): add `'CH', 'SI'` → `{ in: ['DE', 'AT', 'LI', 'CH', 'SI'] }`
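
Since the same country codes have to be added in two places (the map and each filter's list), one way to keep them from drifting apart is to derive the per-language list from the map. The sketch below illustrates that idea on an abbreviated copy of the map; `countriesForLanguage` and `scraperForCountry` are hypothetical helpers and are not part of this plan, and the `generic` fallback mirrors the fall-through behaviour described for unmapped countries.

```typescript
// Abbreviated copy for illustration; the full map lives in scraper-service.ts.
const COUNTRY_SCRAPER_MAP: Record<string, string> = {
  FR: "french", BE: "french", LU: "french",
  IT: "italian", SM: "italian", VA: "italian", HR: "italian", RO: "italian",
  DE: "german", AT: "german", LI: "german", CH: "german", SI: "german",
};

// Hypothetical helper: all country codes routed to one language scraper,
// in the order they appear in the map.
function countriesForLanguage(language: string): string[] {
  return Object.entries(COUNTRY_SCRAPER_MAP)
    .filter(([, scraper]) => scraper === language)
    .map(([country]) => country);
}

// Hypothetical helper: unmapped countries fall through to the generic scraper.
function scraperForCountry(countryCode: string): string {
  return COUNTRY_SCRAPER_MAP[countryCode.toUpperCase()] ?? "generic";
}
```

A filter could then be written as `{ in: countriesForLanguage('french') }` instead of a hand-maintained array; whether that refactor is worth doing is left open here.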

**Step 2: Verify build**

Run: `npm run build`
Expected: Build succeeds with no errors

**Step 3: Commit**

```bash
git add src/lib/scraper-service.ts
git commit -m "feat: add missing country mappings to language scrapers

Add BE/LU→french, CH/SI→german, HR/RO→italian, IN/SG/MY/KE/JM/TT/GH/NG/ZA/TZ/UG→english.
~1,400 previously unmapped churches now routed to proper language scrapers."
```

---

### Task 2: Rewrite Scheduler for Parallel Groups

**Files:**
- Modify: `scripts/scheduler.ts`

**Step 1: Replace pipeline data structure**

Replace the `PipelinePhase` interface, `PIPELINE_PHASES` array (lines 27-49), and `CycleState` interface (lines 53-69) with:

```typescript
interface PipelinePhase {
  name: string;
  type: string;
  language?: string;
  config: Record<string, unknown>;
}

interface PipelineGroup {
  name: string;
  phases: PipelinePhase[];
  mode: 'sequential' | 'parallel';
}

const PIPELINE_GROUPS: PipelineGroup[] = [
  {
    name: 'imports',
    mode: 'sequential',
    phases: [
      { name: 'osm-import-p1', type: 'osm-import', config: { priority: 1 } },
      { name: 'gcatholic-import', type: 'gcatholic-import', config: { delay: 2000 } },
    ],
  },
  {
    name: 'scrapers-batch-1',
    mode: 'parallel',
    phases: [
      { name: 'scraper-english', type: 'scraper', language: 'english', config: { allMode: true, maxFailures: 10, language: 'english' } },
      { name: 'scraper-french', type: 'scraper', language: 'french', config: { allMode: true, maxFailures: 10, language: 'french' } },
      { name: 'scraper-german', type: 'scraper', language: 'german', config: { allMode: true, maxFailures: 10, language: 'german' } },
    ],
  },
  {
    name: 'scrapers-batch-2',
    mode: 'parallel',
    phases: [
      { name: 'scraper-polish', type: 'scraper', language: 'polish', config: { allMode: true, maxFailures: 10, language: 'polish' } },
      { name: 'scraper-spanish', type: 'scraper', language: 'spanish', config: { allMode: true, maxFailures: 10, language: 'spanish' } },
      { name: 'scraper-italian', type: 'scraper', language: 'italian', config: { allMode: true, maxFailures: 10, language: 'italian' } },
    ],
  },
  {
    name: 'scrapers-batch-3',
    mode: 'parallel',
    phases: [
      { name: 'scraper-portuguese', type: 'scraper', language: 'portuguese', config: { allMode: true, maxFailures: 10, language: 'portuguese' } },
      { name: 'scraper-czech', type: 'scraper', language: 'czech', config: { allMode: true, maxFailures: 10, language: 'czech' } },
      { name: 'scraper-dutch', type: 'scraper', language: 'dutch', config: { allMode: true, maxFailures: 10, language: 'dutch' } },
    ],
  },
  {
    name: 'scrapers-batch-4',
    mode: 'parallel',
    phases: [
      { name: 'scraper-hungarian', type: 'scraper', language: 'hungarian', config: { allMode: true, maxFailures: 10, language: 'hungarian' } },
      { name: 'scraper-generic', type: 'scraper', language: 'generic', config: { allMode: true, maxFailures: 10, language: 'generic' } },
    ],
  },
];
```
|
||||
|
||||
**Step 2: Replace CycleState**

```typescript
interface CycleState {
  currentGroupIndex: number;
  currentSequentialPhaseIndex: number; // for sequential groups, tracks which phase within the group
  cycleNumber: number;
  cycleStartedAt: Date | null;
  lastCycleCompletedAt: Date | null;
  waitingForCooldown: boolean;
  activeGroupJobs: number; // how many jobs still running in the current group
}

const cycleState: CycleState = {
  currentGroupIndex: 0,
  currentSequentialPhaseIndex: 0,
  cycleNumber: 0,
  cycleStartedAt: null,
  lastCycleCompletedAt: null,
  waitingForCooldown: false,
  activeGroupJobs: 0,
};
```

**Step 3: Rewrite pollAndAdvancePipeline**

Replace the entire `pollAndAdvancePipeline` function (lines 306-385) and `advancePipelinePhase` function (lines 387-390) with:

```typescript
async function pollAndAdvancePipeline(): Promise<void> {
  try {
    // 1. Check for manual pending jobs from admin API (priority over pipeline)
    if (runningJobs.size === 0) {
      const manualJob = await prisma.backgroundJob.findFirst({
        where: {
          status: 'pending',
          NOT: { config: { path: ['pipelineManaged'], equals: true } },
        },
        orderBy: { createdAt: 'asc' },
      });

      if (manualJob) {
        log(`Found manual job: ${manualJob.type}${manualJob.language ? `:${manualJob.language}` : ''} (${manualJob.id})`);
        await startJobProcess(
          manualJob.id,
          manualJob.type,
          manualJob.language,
          manualJob.config as Record<string, unknown> | null
        );
        return;
      }
    }

    // 2. If jobs are still running for the current group, wait
    if (cycleState.activeGroupJobs > 0) {
      return;
    }

    // 3. If in cooldown, check if expired
    if (cycleState.waitingForCooldown) {
      if (cycleState.lastCycleCompletedAt) {
        const elapsed = Date.now() - cycleState.lastCycleCompletedAt.getTime();
        if (elapsed < CYCLE_COOLDOWN_MS) {
          const remaining = Math.round((CYCLE_COOLDOWN_MS - elapsed) / 60_000);
          if (remaining % 30 === 0 || remaining <= 5) {
            log(`Cooldown: ${remaining} minutes remaining before next cycle`);
          }
          return;
        }
      }
      cycleState.waitingForCooldown = false;
      cycleState.currentGroupIndex = 0;
      cycleState.currentSequentialPhaseIndex = 0;
      cycleState.cycleStartedAt = null; // reset so step 5 records and logs the start of the new cycle
      log('Cooldown expired, starting new cycle');
    }

    // 4. If past the last group, complete the cycle
    if (cycleState.currentGroupIndex >= PIPELINE_GROUPS.length) {
      cycleState.cycleNumber++;
      cycleState.lastCycleCompletedAt = new Date();
      cycleState.waitingForCooldown = true;
      const cooldownHours = CYCLE_COOLDOWN_MS / (60 * 60 * 1000);
      log(`=== Cycle ${cycleState.cycleNumber} complete! Entering ${cooldownHours}h cooldown ===`);
      return;
    }

    // 5. Start the current group
    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];

    if (cycleState.currentGroupIndex === 0 && cycleState.currentSequentialPhaseIndex === 0 && !cycleState.cycleStartedAt) {
      cycleState.cycleStartedAt = new Date();
      log(`=== Starting cycle ${cycleState.cycleNumber + 1} ===`);
    }

    if (group.mode === 'parallel') {
      // Launch all phases in the group concurrently
      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (parallel, ${group.phases.length} jobs)`);
      cycleState.activeGroupJobs = group.phases.length;

      for (const phase of group.phases) {
        const jobId = await createPendingJob(
          phase.type,
          phase.language,
          { ...phase.config, pipelineManaged: true }
        );
        await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
      }
    } else {
      // Sequential: run one phase at a time within the group
      const phaseIndex = cycleState.currentSequentialPhaseIndex;
      if (phaseIndex >= group.phases.length) {
        // All phases in this sequential group are done
        cycleState.currentGroupIndex++;
        cycleState.currentSequentialPhaseIndex = 0;
        return; // Will pick up next group on next poll
      }

      const phase = group.phases[phaseIndex];
      log(`Pipeline group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length}: ${group.name} (sequential ${phaseIndex + 1}/${group.phases.length}: ${phase.name})`);
      cycleState.activeGroupJobs = 1;

      const jobId = await createPendingJob(
        phase.type,
        phase.language,
        { ...phase.config, pipelineManaged: true }
      );
      await startJobProcess(jobId, phase.type, phase.language || null, phase.config);
    }
  } catch (err) {
    logError(`Error in pipeline: ${err}`);
  }
}

function onJobCompleted(): void {
  cycleState.activeGroupJobs--;

  if (cycleState.activeGroupJobs <= 0) {
    cycleState.activeGroupJobs = 0;
    const group = PIPELINE_GROUPS[cycleState.currentGroupIndex];

    if (group?.mode === 'sequential') {
      cycleState.currentSequentialPhaseIndex++;
      // Check if there are more phases in this sequential group
      if (cycleState.currentSequentialPhaseIndex < group.phases.length) {
        return; // Don't advance group yet
      }
    }

    // Advance to next group
    cycleState.currentGroupIndex++;
    cycleState.currentSequentialPhaseIndex = 0;
    log(`Group "${group?.name}" complete, advancing to group ${cycleState.currentGroupIndex + 1}`);
  }
}
```

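The counter-based advance in `onJobCompleted` can be exercised in isolation. The following is a minimal sketch, assuming a reduced state object and a two-group pipeline; `State`, `Group`, `completeJob`, and the phase names are hypothetical and exist only for illustration, they are not part of `scheduler.ts`.

```typescript
type Mode = 'sequential' | 'parallel';

interface Group { name: string; mode: Mode; phases: string[] }
interface State { groupIndex: number; phaseIndex: number; activeJobs: number }

// Pure version of the onJobCompleted() logic: decrement the counter and
// advance only once every job of the current group has finished.
function completeJob(state: State, groups: Group[]): State {
  const next: State = { ...state, activeJobs: Math.max(0, state.activeJobs - 1) };
  if (next.activeJobs > 0) return next; // parallel siblings are still running

  const group = groups[next.groupIndex];
  if (group?.mode === 'sequential') {
    next.phaseIndex++;
    if (next.phaseIndex < group.phases.length) return next; // more phases left in this group
  }

  next.groupIndex++;
  next.phaseIndex = 0;
  return next;
}

// Hypothetical two-group pipeline used only for this illustration.
const groups: Group[] = [
  { name: 'imports', mode: 'sequential', phases: ['import-a', 'import-b'] },
  { name: 'scrapers-batch-1', mode: 'parallel', phases: ['x', 'y', 'z'] },
];

let state: State = { groupIndex: 0, phaseIndex: 0, activeJobs: 1 };
state = completeJob(state, groups);
console.log(state); // still in the sequential group, now pointing at its second phase
```

A parallel group only advances when the last of its jobs reports completion, which is why the poll loop can simply return while `activeGroupJobs > 0`.
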
**Step 4: Update startJobProcess callbacks**

In the `child.on('close')` callback (line 442) and `child.on('error')` callback (line 472), replace `advancePipelinePhase()` with `onJobCompleted()`.

**Step 5: Update crash recovery**

In `recoverFromCrash` (lines 259-268), replace the `PIPELINE_PHASES.findIndex` logic with a search through `PIPELINE_GROUPS`:

```typescript
  if (lastRunningPipelineJob) {
    for (let gi = 0; gi < PIPELINE_GROUPS.length; gi++) {
      const group = PIPELINE_GROUPS[gi];
      const phaseIdx = group.phases.findIndex(
        p => p.type === lastRunningPipelineJob.type &&
          (p.language || null) === (lastRunningPipelineJob.language || null)
      );
      if (phaseIdx >= 0) {
        cycleState.currentGroupIndex = gi;
        cycleState.currentSequentialPhaseIndex = group.mode === 'sequential' ? phaseIdx : 0;
        log(`Resuming pipeline from group ${gi + 1}: ${group.name}`);
        break;
      }
    }
  }
```

**Step 6: Update heartbeat log in main()**

Replace the heartbeat cron (lines 551-562) and the startup log (lines 574-580) to reference groups instead of phases:

```typescript
  cron.schedule('0 * * * *', () => {
    const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
      ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
      : 'none';
    const jobs = runningJobs.size > 0
      ? `Running: ${[...runningJobs.keys()].join(', ')}`
      : 'No jobs running';
    const state = cycleState.waitingForCooldown
      ? 'cooldown'
      : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
    log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
  }, { timezone: 'UTC' });
```

For the startup log:

```typescript
  log('=== Scheduler running (parallel grouped pipeline) ===');
  log(`Pipeline groups (${PIPELINE_GROUPS.length}):`);
  for (let i = 0; i < PIPELINE_GROUPS.length; i++) {
    const g = PIPELINE_GROUPS[i];
    const phaseNames = g.phases.map(p => p.name).join(', ');
    log(`  ${i + 1}. ${g.name} [${g.mode}]: ${phaseNames}`);
  }
```

**Step 7: Remove dead Google Places env log**

Delete lines 167-169 (the `GOOGLE_PLACES_API_KEY` log in `validateEnvironment`).

**Step 8: Verify build**

Run: `npm run build`
Expected: Build succeeds

**Step 9: Commit**

```bash
git add scripts/scheduler.ts
git commit -m "feat: parallel grouped pipeline scheduler

Replace sequential pipeline with grouped phases. Import phases run
sequentially, scraper phases run in parallel groups of up to 3. This
reduces cycle time from days to hours. Generic scraper moved to last group."
```

---

### Task 3: Increase Scheduler Memory Limit

**Files:**
- Modify: `docker-compose.yml:217-220`

**Step 1: Increase memory limit**

Change the scheduler service's `deploy.resources.limits.memory` from `4G` to `10G`:

```yaml
    deploy:
      resources:
        limits:
          memory: 10G
```

**Step 2: Commit**

```bash
git add docker-compose.yml
git commit -m "chore: increase scheduler memory to 10G for parallel scrapers"
```

---

### Task 4: Deploy and Verify

**Step 1: Deploy to NAS**

```bash
rsync -avz --exclude 'node_modules' --exclude '.next' --exclude '.git' --exclude '.env.local' --exclude '*.log' \
  /Users/albert/Documents/Projects/Church/ScraperControl/ albert@192.168.0.145:/volume1/docker/scraper-control/
```

**Step 2: Rebuild and restart scheduler**

```bash
ssh albert@192.168.0.145 'cd /volume1/docker/scraper-control && /usr/local/bin/docker compose build scheduler && /usr/local/bin/docker compose up -d scheduler'
```

**Step 3: Verify logs show parallel groups**

```bash
ssh albert@192.168.0.145 '/usr/local/bin/docker logs --tail 30 scraper-control-scheduler-1'
```

Expected: Logs show "parallel grouped pipeline", group listings with `[parallel]` and `[sequential]` tags, and eventually multiple concurrent `Running:` entries in heartbeat.

72
docs/plans/2026-02-26-horariosmisas-spain-design.md
Normal file
@@ -0,0 +1,72 @@
# Spain Church Importer (horariosmisas.com) — Design

## Overview

Import ~10,000 Spanish churches with mass schedules from horariosmisas.com. Static WordPress site with fully permissive robots.txt and sitemaps. No Playwright needed — simple HTTP + HTML parsing.

## Data Source

- **Site:** https://horariosmisas.com
- **Coverage:** 18,000+ churches claimed, ~10,000 in sitemaps across 52 Spanish provinces
- **Data:** Church name, address, phone, website, mass schedules (summer/winter seasonal variants)
- **No coordinates** — addresses only. Forward geocoding via Nominatim as a separate pass.
- **robots.txt:** Fully permissive (`User-agent: * / Disallow:`)
- **Sitemaps:** 20 post sitemaps + 7 category sitemaps

## Architecture

### Two-Pass Approach

**Pass 1: Import** — Fetch all churches from sitemaps, parse HTML, match against existing Spanish OSM churches, upsert with mass schedules. Unmatched churches created with address but no coordinates.

**Pass 2: Geocode** — Forward-geocode unmatched churches via Nominatim public API (`address → lat/lng`). 1 req/sec rate limit.

### Schema Change

Add `horariosMisasId String? @unique` to Church model (same pattern as `philmassId`, `massSchedulesPhId`). Update church matcher and all existing importers.

### URL Structure

- Sitemap index: `/sitemap_index.xml` → 20 post sitemaps
- Church pages: `/{province}/{city}/{church-slug}/`
- Non-church posts (filtered out): `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, etc.

### HTML Parsing

- **Name:** `<h1>Church Name (City)</h1>` — strip `(City)` suffix
- **Address:** `<p>📌 <strong>Street, PostalCode City (Province)</strong></p>`
- **Phone:** `<strong>Teléfono:</strong> <a href="tel:...">...</a>`
- **Website:** `<strong>Página Web:</strong> <a href="...">...</a>`
- **Schedule:** `<table>` with `DÍA`/`HORARIO` columns
  - Two seasonal tables: `☀️ Horario de verano` and `⛄ Misas en invierno`
  - Import seasonally appropriate one (Oct-May = winter, Jun-Sep = summer)
  - Day names: Lunes, Martes, Miércoles, Jueves, Viernes, Sábado, Domingos y Festivos
  - Day ranges: "Lunes a Viernes" (Monday-Friday)
  - Time format: `HH:MMh` (24-hour), multiple per cell via `<br>`
  - Annotations stripped: `(familias)`, etc.

### Matching Strategy

1. `horariosMisasId` exact match (for re-imports)
2. Name + proximity against existing Spanish churches (from OSM)
3. Unmatched: create new church with address, country=ES, no coordinates

### CLI

```
npx tsx scripts/import-horariosmisas.ts --all
npx tsx scripts/import-horariosmisas.ts --all --dry-run
npx tsx scripts/import-horariosmisas.ts --province madrid
npx tsx scripts/import-horariosmisas.ts --all --geocode
npx tsx scripts/import-horariosmisas.ts --geocode-only
npx tsx scripts/import-horariosmisas.ts --all --resume-from 5000
```

### Rate Limiting

- Import: 1.5s between requests (~10,000 × 1.5s = ~4.2 hours)
- Geocode: 1s between requests (Nominatim public API limit)

### Scheduler Integration

Add to PIPELINE_GROUPS imports group (sequential, after philmass-import).

322
docs/plans/2026-02-26-horariosmisas-spain.md
Normal file
@@ -0,0 +1,322 @@
# Spain Church Importer (horariosmisas.com) — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Import ~10,000 Spanish churches with mass schedules from horariosmisas.com, with optional Nominatim forward geocoding for unmatched churches.

**Architecture:** Sitemap-driven importer. Fetch 20 post sitemaps for church URLs, parse static WordPress HTML for names/addresses/schedule tables, match against existing Spanish OSM churches, upsert with mass schedules. Separate geocoding pass via Nominatim public API.

**Tech Stack:** TypeScript, Prisma, HTML parsing (regex — no Playwright), Nominatim geocoding API.

---

## Task 1: Add `horariosMisasId` to Prisma Schema

**Files:**
- Modify: `prisma/schema.prisma`

**Step 1: Add field and index**

After the `philmassId` line (around line 38), add:

```prisma
  horariosMisasId String? @unique @map("horarios_misas_id") // horariosmisas.com URL slug
```

And add an index in the `@@index` block (around line 78):

```prisma
  @@index([horariosMisasId])
```

**Step 2: Push schema to NAS database**

```bash
npx prisma db push --accept-data-loss
```

Expected: `Your database is now in sync with your Prisma schema.`

**Step 3: Regenerate Prisma client**

```bash
npx prisma generate
```

**Step 4: Push schema to Neon production**

```bash
npx prisma db push --url "$(grep DATABASE_URL .env.production | sed 's/DATABASE_URL="//' | sed 's/"$//')" --accept-data-loss
```

**Step 5: Commit**

```bash
git add prisma/schema.prisma
git commit -m "feat: add horariosMisasId to Church model for Spain import"
```

---

## Task 2: Extend Church Matcher and Existing Importers

**Files:**
- Modify: `src/lib/church-matcher.ts`
- Modify: `scripts/import-osm-churches.ts`
- Modify: `scripts/import-gcatholic.ts`
- Modify: `scripts/import-baidu-churches.ts`
- Modify: `scripts/import-osm-region.ts`
- Modify: `scripts/import-orarimesse.ts`
- Modify: `scripts/import-mass-schedules-ph.ts`
- Modify: `scripts/import-philmass.ts`

### Step 1: Update church-matcher.ts

In `ExistingChurch` interface (line ~11-26), add after `philmassId`:

```typescript
  horariosMisasId: string | null;
```

In `ChurchCandidate` type (line ~113-122), add after `philmassId`:

```typescript
  horariosMisasId?: string;
```

In `findDuplicateChurch()`, add a new pass after the fifth pass (philmassId match, line ~169-175) and before the proximity+name pass:

```typescript
  // Sixth pass: exact horariosMisasId match
  if (candidate.horariosMisasId) {
    const horariosMisasMatch = existingChurches.find(
      (church) => church.horariosMisasId === candidate.horariosMisasId
    );
    if (horariosMisasMatch) return horariosMisasMatch;
  }
```

Update the comment on the proximity pass to say "Seventh pass".

### Step 2: Update all existing importers

In every importer that queries churches with a `select` clause containing `philmassId: true`, add:

```typescript
  horariosMisasId: true,
```

In every importer that creates/pushes churches with `philmassId: null`, add:

```typescript
  horariosMisasId: null,
```

**Files to update:** `import-osm-churches.ts`, `import-gcatholic.ts`, `import-baidu-churches.ts`, `import-osm-region.ts`, `import-orarimesse.ts`, `import-mass-schedules-ph.ts`, `import-philmass.ts`

### Step 3: Verify build

```bash
npx tsc --noEmit
```

Expected: No errors.

### Step 4: Commit

```bash
git add src/lib/church-matcher.ts scripts/import-*.ts
git commit -m "feat: add horariosMisasId to church matcher and all importers"
```

---

## Task 3: Create `import-horariosmisas.ts`

**Files:**
- Create: `scripts/import-horariosmisas.ts`

### Architecture

This importer follows the exact same structure as `scripts/import-mass-schedules-ph.ts`. Key differences:

- **Sitemap:** Fetches 20 post sitemaps from sitemap index (not a single sitemap)
- **URL filtering:** Church URLs have 3 path segments (`/{province}/{city}/{slug}/`). Non-church URLs (blog posts, daily readings) are filtered out.
- **Schedule parsing:** Two seasonal tables (summer/winter). Import seasonally appropriate one based on current month.
- **Day names:** Spanish (`Lunes`, `Martes`, etc.) with range support (`Lunes a Viernes`)
- **Times:** 24-hour `HH:MMh` format (e.g., `08:00h`, `20:30h`)
- **No coordinates:** Churches created with `latitude: 0, longitude: 0` — geocoded separately
- **Geocoding:** Optional `--geocode` flag uses Nominatim public API (1 req/sec)

### Constants

```typescript
const SITE_BASE = 'https://horariosmisas.com';
const SITEMAP_INDEX_URL = `${SITE_BASE}/sitemap_index.xml`;
const USER_AGENT = 'NearestMass-Importer/1.0 (parish data aggregator; contact: privacy@nearestmass.com)';
const REQUEST_DELAY_MS = 1500;
const NOMINATIM_DELAY_MS = 1100;
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search';
```

### Spanish Day Mapping

```typescript
const DAY_MAP: Record<string, number[]> = {
  'domingos y festivos': [0],
  'domingos': [0],
  'domingo': [0],
  'lunes': [1],
  'martes': [2],
  'miércoles': [3],
  'miercoles': [3],
  'jueves': [4],
  'viernes': [5],
  'sábado': [6],
  'sabado': [6],
  'sábados': [6],
  'sabados': [6],
};
```

### Sitemap Fetching

1. Fetch sitemap index → extract `post-sitemap*.xml` URLs
2. Fetch each post sitemap → extract URLs with exactly 3 path segments
3. Filter out non-church URLs (patterns: `/misas-diarias/`, `/santos-del-dia/`, `/oraciones/`, `/noticias/`, `/blog/`, `/contacto/`, `/aviso-legal/`, `/politica-de-privacidad/`, `/politica-de-cookies/`)
4. Deduplicate by slug

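The URL filtering and deduplication steps above could look roughly like this. It is a sketch, not the importer's actual code: `churchSlug` and `dedupeChurchUrls` are hypothetical helper names, and using the full three-segment path as the dedup key is an assumption (the plan only says "by slug").

```typescript
// Patterns listed in step 3 of the plan.
const NON_CHURCH_PATTERNS = [
  '/misas-diarias/', '/santos-del-dia/', '/oraciones/', '/noticias/', '/blog/',
  '/contacto/', '/aviso-legal/', '/politica-de-privacidad/', '/politica-de-cookies/',
];

// Returns "province/city/slug" for church pages, or null for anything else.
// Assumption: the whole path is used as the key so that identical church
// slugs in different cities do not collide.
function churchSlug(url: string): string | null {
  const { pathname } = new URL(url);
  if (NON_CHURCH_PATTERNS.some((pattern) => pathname.includes(pattern))) return null;
  const segments = pathname.split('/').filter(Boolean);
  return segments.length === 3 ? segments.join('/') : null;
}

// Keeps the first URL seen for each slug.
function dedupeChurchUrls(urls: string[]): Map<string, string> {
  const bySlug = new Map<string, string>();
  for (const url of urls) {
    const slug = churchSlug(url);
    if (slug !== null && !bySlug.has(slug)) bySlug.set(slug, url);
  }
  return bySlug;
}
```
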
### HTML Parsing

**Church name:** `<h1>Church Name (City)</h1>` → strip `(City)` suffix

**Address:** `📌 <strong>Calle Goya, 26 28001 Madrid (Madrid)</strong>` → extract street, postal code (5-digit `\b\d{5}\b`), city (text after postal code), strip `(Province)` suffix

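As an illustration of the address rule, here is a minimal sketch that splits the inner text of the `<strong>` element. The `ParsedAddress` shape and the `parseAddress` name are hypothetical, and the sketch assumes the first standalone 5-digit number is the postal code.

```typescript
interface ParsedAddress {
  street: string;
  postalCode: string | null;
  city: string | null;
}

function parseAddress(raw: string): ParsedAddress {
  // Strip the trailing "(Province)" suffix first.
  const text = raw.replace(/\s*\([^)]*\)\s*$/, '').trim();
  const match = text.match(/\b(\d{5})\b/);
  if (!match || match.index === undefined) {
    // No postal code found: keep everything as the street.
    return { street: text, postalCode: null, city: null };
  }
  const street = text.slice(0, match.index).replace(/[,\s]+$/, '');
  const city = text.slice(match.index + match[1].length).trim() || null;
  return { street, postalCode: match[1], city };
}

console.log(parseAddress('Calle Goya, 26 28001 Madrid (Madrid)'));
```
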
**Phone:** `<strong>Teléfono:</strong> <a href="tel:...">number</a>`

**Website:** `<strong>Página Web:</strong> <a href="url">...</a>`

**Schedule tables:** Find `<table>` elements with DÍA/HORARIO headers. Split by seasonal headings (☀️ verano / ⛄ invierno). Pick seasonally appropriate section (Oct-May = winter, Jun-Sep = summer). Parse `<td>` cells: first cell = day name(s), second cell = times. Times in `HH:MMh` format extracted via regex `(\d{1,2}):(\d{2})\s*h?`.

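The season choice and the time extraction can be sketched as two small pure functions. `seasonFor` and `extractTimes` are hypothetical names; the regex is the one quoted above, and zero-padding the hour to `HH:MM` is an assumption about the desired stored format.

```typescript
type Season = 'summer' | 'winter';

// Jun-Sep = summer, Oct-May = winter. Date#getMonth() is 0-based (5 = June, 8 = September).
function seasonFor(date: Date): Season {
  const month = date.getMonth();
  return month >= 5 && month <= 8 ? 'summer' : 'winter';
}

// Extracts normalized "HH:MM" strings from the HTML of one HORARIO cell.
function extractTimes(cellHtml: string): string[] {
  const text = cellHtml
    .replace(/<br\s*\/?>/gi, ' ')  // several times per cell are separated by <br>
    .replace(/\([^)]*\)/g, ' ');   // drop annotations such as "(familias)"
  const pattern = /(\d{1,2}):(\d{2})\s*h?/g;
  const times: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const hour = Number(match[1]);
    const minute = Number(match[2]);
    if (hour < 24 && minute < 60) {
      times.push(`${hour < 10 ? '0' : ''}${hour}:${match[2]}`);
    }
  }
  return times;
}
```

Note that picking the table by the current month means a church's stored schedule reflects the season in which the import ran, so a re-import after the season changes is needed to refresh it.
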
### Day Range Resolution

Support ranges like `Lunes a Viernes` → [1,2,3,4,5] and compound entries like `Lunes, Miércoles y Viernes` → [1,3,5].

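One possible way to resolve these labels is sketched below. `DAY_INDEX` and `resolveDays` are hypothetical names, and the sketch assumes ranges may wrap around the week (for example `Sábado a Lunes`), which the plan does not specify.

```typescript
// Single-day lookup; 0 = Sunday, matching the DAY_MAP convention.
const DAY_INDEX: Record<string, number> = {
  domingo: 0, domingos: 0,
  lunes: 1, martes: 2,
  'miércoles': 3, miercoles: 3,
  jueves: 4, viernes: 5,
  'sábado': 6, sabado: 6, 'sábados': 6, sabados: 6,
};

function resolveDays(label: string): number[] {
  // "Domingos y Festivos" is treated as plain Sunday.
  const text = label.toLowerCase().replace(/\s+y\s+festivos/, '').trim();

  // Range form: "<day> a <day>".
  const range = text.match(/^(\S+)\s+a\s+(\S+)$/);
  if (range) {
    const start = DAY_INDEX[range[1]];
    const end = DAY_INDEX[range[2]];
    if (start === undefined || end === undefined) return [];
    const days: number[] = [];
    for (let day = start; ; day = (day + 1) % 7) {
      days.push(day);
      if (day === end) break;
    }
    return days;
  }

  // Compound form: "<day>, <day> y <day>".
  return text
    .split(/\s*,\s*|\s+y\s+/)
    .map((part) => DAY_INDEX[part.trim()])
    .filter((day): day is number => day !== undefined);
}
```
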
### Geocoding (--geocode / --geocode-only)

Query Nominatim with: `{address}, Spain` → fallback to `{postalCode} {city}, Spain` → fallback to `{city}, Spain`. Use `countrycodes=es` parameter. Max 1 req/sec.

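The fallback chain can be expressed as an ordered list of query strings that the caller tries until one returns a result. This sketch only builds the queries and URLs and makes no network requests; `GeoInput`, `nominatimQueries`, and `nominatimUrl` are hypothetical names, and `format=jsonv2` with `limit=1` are assumed choices among the parameters the Nominatim search endpoint accepts.

```typescript
interface GeoInput {
  address?: string;
  postalCode?: string;
  city?: string;
}

// Most specific query first, least specific last.
function nominatimQueries(church: GeoInput): string[] {
  const queries: string[] = [];
  if (church.address) queries.push(`${church.address}, Spain`);
  if (church.postalCode && church.city) queries.push(`${church.postalCode} ${church.city}, Spain`);
  if (church.city) queries.push(`${church.city}, Spain`);
  return queries;
}

function nominatimUrl(query: string): string {
  const params = new URLSearchParams({
    q: query,
    format: 'jsonv2',
    countrycodes: 'es',
    limit: '1',
  });
  return `https://nominatim.openstreetmap.org/search?${params.toString()}`;
}
```

Each attempted query counts against the rate limit, so a church that needs all three fallbacks takes about three delay intervals rather than one.
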
### Matching Strategy

1. `horariosMisasId` exact match (primary — for re-imports)
2. Name + proximity against existing Spanish OSM churches (secondary)
3. Unmatched: create new church with `latitude: 0, longitude: 0`, country=ES

### CLI

```
--all                 Import all churches from sitemaps
--province <name>     Import only churches from this province
--dry-run             No database writes
--geocode             After import, geocode unmatched churches
--geocode-only        Only geocode (skip import)
--resume-from <n>     Skip first N churches
--job-id <uuid>       Background job tracking
```

### Mass Schedule Language

Set `language: 'Spanish'` on all created mass schedules.

### Step 1: Create the file

Use `scripts/import-mass-schedules-ph.ts` as the structural template. Implement all functions described above.

### Step 2: Verify build

```bash
npx tsc --noEmit
```

### Step 3: Dry-run test

```bash
npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run
```

### Step 4: Commit

```bash
git add scripts/import-horariosmisas.ts
git commit -m "feat: add horariosmisas.com Spain church importer"
```

---

## Task 4: Add to Scheduler Pipeline and npm Scripts

**Files:**
- Modify: `scripts/scheduler.ts`
- Modify: `package.json`

### Step 1: Add to PIPELINE_GROUPS

In `scripts/scheduler.ts`, in the `imports` group (line ~40-51), add after the `philmass-import` entry:

```typescript
      { name: 'horariosmisas-import', type: 'horariosmisas-import', config: {} },
```

### Step 2: Add getJobCommand case

In the `getJobCommand` function (around line 182), before the `default:` case, add:

```typescript
    case 'horariosmisas-import': {
      const args = ['tsx', 'scripts/import-horariosmisas.ts', '--all', '--geocode'];
      if (config?.province) args.push('--province', String(config.province));
      if (config?.resumeFrom) args.push('--resume-from', String(config.resumeFrom));
      return { command: 'npx', args };
    }
```

### Step 3: Add npm scripts

In `package.json`, add after the `"import:philmass"` line:

```json
    "import:horariosmisas": "tsx scripts/import-horariosmisas.ts",
```

### Step 4: Verify build

```bash
npx tsc --noEmit
```

### Step 5: Commit

```bash
git add scripts/scheduler.ts package.json
git commit -m "feat: add horariosmisas import to scheduler pipeline"
```

---

## Verification

1. **Dry run on single province**: `npx tsx scripts/import-horariosmisas.ts --province navarra --dry-run`
   - Verify: church names parsed correctly, schedules extracted, matches found
2. **Dry run on Madrid**: `npx tsx scripts/import-horariosmisas.ts --province madrid --dry-run`
   - Verify: larger province, summer/winter schedule selection, address parsing
3. **Single province real import**: `npx tsx scripts/import-horariosmisas.ts --province navarra`
   - Verify: churches created/updated, mass schedules in database
4. **Geocode test**: `npx tsx scripts/import-horariosmisas.ts --geocode-only --dry-run`
   - Verify: finds churches needing geocoding, Nominatim returns coordinates
5. **Full import**: `npx tsx scripts/import-horariosmisas.ts --all --geocode`

## Runtime Estimate

- Sitemap fetch: 20 sitemaps x 1.5s = ~30s
- Import: ~10,000 churches x 1.5s = ~4.2 hours
- Geocode: depends on unmatched count x 1.1s

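The figures above are pure delay arithmetic and can be reproduced with a one-line helper. `estimateHours` is a hypothetical name, and the 2,000 unmatched churches in the last call is an illustrative figure, not a measured one. The estimate ignores network latency and parsing time, so the real run will take somewhat longer.

```typescript
// Total time in hours for `count` requests spaced `delayMs` apart.
function estimateHours(count: number, delayMs: number): number {
  return (count * delayMs) / 3_600_000;
}

console.log(estimateHours(20, 1500) * 3600); // sitemap fetch, in seconds
console.log(estimateHours(10_000, 1500));    // import pass, in hours
console.log(estimateHours(2_000, 1100));     // geocode pass for a hypothetical 2,000 unmatched churches
```
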
103
docs/plans/2026-03-01-weekdaymasses-importer-design.md
Normal file
@@ -0,0 +1,103 @@
# weekdaymasses.org.uk Global Importer

## Context

weekdaymasses.org.uk is a UK-based Catholic directory covering ~3,500-4,000 churches globally with mass schedules, coordinates, addresses, and phone numbers. It covers GB, Ireland, and 49+ international countries (India, Sri Lanka, South Korea, Japan, and more). All data is served on a single HTML page per area, so no pagination or API is needed.

## Data Source

Three area pages cover the entire site:

| Page | URL | Est. Churches |
|------|-----|---------------|
| GB | `/en/area/gb/churches` | ~3,000+ |
| Ireland | `/en/area/ireland/churches` | ~300+ |
| Outside GB | `/en/area/outside-gb/churches` | ~152+ |

Individual country/region pages (e.g. `/en/area/india/churches`) are subsets of these three.

### Data per church

- **Name**: h3 heading, format "Church Name (Location)"
- **Address**: plain text after mass times, with postal/zip code
- **Coordinates**: in map link query params `lat=XX.XXXX&lon=YY.YYYY&church_id=NNNNN`
- **Mass times**: format `Day: HH.MMam/pm(Language), HH.MMam/pm(Language)`
- **Phone**: `Tel: +XX XXXX XXXXXX`
- **Website**: occasional links
- **church_id**: unique numeric identifier in map links

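Pulling the coordinates and `church_id` out of a map link could be done with `URLSearchParams`, as in this sketch. `MapLinkData` and `parseMapLink` are hypothetical names, and the `/map` path in the usage line is made up; only the three query parameter names come from the site description above.

```typescript
interface MapLinkData {
  lat: number;
  lon: number;
  churchId: string;
}

function parseMapLink(href: string): MapLinkData | null {
  const queryStart = href.indexOf('?');
  if (queryStart < 0) return null;
  // Attribute values taken from raw HTML may still contain "&amp;" instead of "&".
  const query = href.slice(queryStart + 1).replace(/&amp;/g, '&');
  const params = new URLSearchParams(query);
  const lat = params.get('lat');
  const lon = params.get('lon');
  const churchId = params.get('church_id');
  if (lat === null || lon === null || churchId === null) return null;
  const latNum = Number.parseFloat(lat);
  const lonNum = Number.parseFloat(lon);
  if (!Number.isFinite(latNum) || !Number.isFinite(lonNum)) return null;
  return { lat: latNum, lon: lonNum, churchId };
}

console.log(parseMapLink('/map?lat=13.0827&lon=80.2707&church_id=12345'));
```
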
### Mass time format

```
Sunday: 6.30am(Tamil), 8.30am(Tamil), 5.30pm(English)
Mon Tue Wed Thu Fri: 6.30am(Tamil)
Saturday: 6.30am(Tamil), 5.30pm(English)
```

Day labels: `Sunday`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Saturday`, or combinations like `Mon Tue Wed Thu Fri`. Also `Holy Day` entries.

Time format: `H.MMam/pm` — needs conversion to 24h `HH:MM`.

Language in parentheses maps to our `language` field on mass_schedules.

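A line in this format can be parsed as sketched below. `to24h`, `parseMassLine`, and the result shapes are hypothetical; the sketch assumes 0 = Sunday and leaves `Holy Day` lines with an empty day list, since the design does not say how those should be stored.

```typescript
interface ParsedMass { time: string; language: string | null }
interface ParsedLine { days: number[]; masses: ParsedMass[] }

const DAY_LABELS: Record<string, number> = {
  sunday: 0, sun: 0, monday: 1, mon: 1, tuesday: 2, tue: 2, wednesday: 3, wed: 3,
  thursday: 4, thu: 4, friday: 5, fri: 5, saturday: 6, sat: 6,
};

// "6.30am" -> "06:30", "5.30pm" -> "17:30", "12.15am" -> "00:15".
function to24h(time: string): string | null {
  const match = time.trim().match(/^(\d{1,2})\.(\d{2})\s*(am|pm)$/i);
  if (!match) return null;
  let hour = Number(match[1]) % 12; // 12am -> 0, 12pm -> 0 (+12 below)
  if (match[3].toLowerCase() === 'pm') hour += 12;
  return `${hour < 10 ? '0' : ''}${hour}:${match[2]}`;
}

function parseMassLine(line: string): ParsedLine | null {
  const sep = line.indexOf(':');
  if (sep < 0) return null;
  // Unknown labels such as "Holy Day" are dropped, leaving an empty day list.
  const days = line.slice(0, sep).trim().toLowerCase().split(/\s+/)
    .map((label) => DAY_LABELS[label])
    .filter((day): day is number => day !== undefined);
  const rest = line.slice(sep + 1);
  const pattern = /(\d{1,2}\.\d{2}\s*(?:am|pm))\s*(?:\(([^)]*)\))?/gi;
  const masses: ParsedMass[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(rest)) !== null) {
    const time = to24h(match[1]);
    if (time) masses.push({ time, language: match[2] ?? null });
  }
  return { days, masses };
}
```
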
### Country detection

The address is the last line of each church entry. Country can be detected by:
- GB: UK postal code pattern (e.g. `SW1A 1AA`)
- Ireland: Irish Eircode (e.g. `D01 F5P2`) or "Ireland" in address
- India: 6-digit postal code (e.g. `600088`)
- Others: country name at end of address, or fallback to the area page being scraped

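The detection rules above might be ordered as in this sketch. `detectCountry` is a hypothetical name, the regexes are simplified approximations of the UK postcode and Eircode formats, and returning ISO 3166 two-letter codes is an assumption. A 6-digit code is not unique to India, so that rule is checked last before the fallback and may need tightening for other countries on the Outside GB page.

```typescript
// Simplified patterns, not full validators.
const EIRCODE = /\b[A-Z]\d{2}\s?[A-Z0-9]{4}\b/i;                  // e.g. D01 F5P2
const UK_POSTCODE = /\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b/i;     // e.g. SW1A 1AA
const SIX_DIGIT = /\b\d{6}\b/;                                    // e.g. 600088

function detectCountry(address: string, areaFallback: string): string {
  if (EIRCODE.test(address)) return 'IE';
  // Checked before the "Ireland" keyword so Northern Ireland postcodes resolve to GB.
  if (UK_POSTCODE.test(address)) return 'GB';
  if (/\bireland\b/i.test(address)) return 'IE';
  if (SIX_DIGIT.test(address)) return 'IN';
  // The country-name lookup for other countries is omitted here; fall back to the area page.
  return areaFallback;
}
```
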
## Design

### Schema

Add to Church model in both BethelGuide and ScraperControl:

```prisma
  weekdayMassesId String? @unique @map("weekday_masses_id")
  @@index([weekdayMassesId])
```

### Script: `scripts/import-weekdaymasses.ts`

Single script that:

1. Fetches area pages (default: all 3; filterable with `--area gb|ireland|outside-gb|india|...`)
2. Parses HTML into structured church entries
3. Converts mass times from `H.MMam/pm` to `HH:MM` 24h format
4. Detects country from address patterns
5. Matches against existing churches by `weekdayMassesId` (exact) then proximity+name
6. Upserts churches and replaces mass schedules

### HTML parsing strategy

Each church is a block between consecutive h3 headings. Within each block:
- h3 content = church name
- Lines with day labels + times = mass schedule
- Map link = coordinates + church_id
- Last text block before next h3 = address
- `Tel:` prefix = phone

### CLI flags

- `--all` — import all 3 area pages
- `--area <name>` — import specific area (gb, ireland, outside-gb, india, sri-lanka, etc.)
- `--dry-run` — no database writes
- `--resume-from <n>` — skip first N churches
- `--job-id <uuid>` — background job tracking

### Church matcher integration

Add `weekdayMassesId` to `ExistingChurch`, `ChurchCandidate`, and a new match pass in `findDuplicateChurch()`.

### Scheduler integration

Add `weekdaymasses-import` to the sequential imports group in the pipeline, with `getJobCommand()` case and npm script.

## Scope

- ~3,500-4,000 churches with mass schedules
- Most GB/Ireland churches already in DB from OSM (will match and add schedules)
- India/Sri Lanka/international churches partially in DB from OSM/gcatholic
- Value: mass schedule data for thousands of churches that currently have none

309
docs/superpowers/plans/2026-03-28-freesearch-stability.md
Normal file
@@ -0,0 +1,309 @@
|
||||
# FreeSearch Stability & Scheduler Healthcheck Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Make the `freesearch-enrichment` container stay alive when FreeSearch is down, clean up stale running jobs on restart, and fix the scheduler's perpetually-failing Docker healthcheck.
|
||||
|
||||
**Architecture:** Three targeted edits across two scripts and docker-compose. `enrich-with-freesearch.ts` gets a `waitForFreeSearch()` startup loop and a stale-job cleanup before job creation. `scheduler.ts` writes a heartbeat file on each hourly cron tick. `docker-compose.yml` swaps the `pgrep` healthcheck for a file-age check on that heartbeat file.
|
||||
|
||||
**Tech Stack:** TypeScript/tsx, Prisma, Docker Compose, node-cron, bash (healthcheck command)
|
||||
|
||||
---
|
||||
|
||||
## Files
|
||||
|
||||
- Modify: `scripts/enrich-with-freesearch.ts:872-880` — add `waitForFreeSearch()` function
|
||||
- Modify: `scripts/enrich-with-freesearch.ts:1272-1296` — replace startup exit with wait call + stale job cleanup
|
||||
- Modify: `scripts/scheduler.ts:747-758` — write heartbeat file in hourly cron
|
||||
- Modify: `docker-compose.yml:275-280` — replace scheduler healthcheck
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Add `waitForFreeSearch()` to the enrichment script
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/enrich-with-freesearch.ts`
|
||||
|
||||
The existing `healthCheck()` function (line 872) returns a boolean. We add `waitForFreeSearch()` directly below it — a loop that calls `healthCheck()` and sleeps with exponential backoff until it succeeds.
|
||||
|
||||
- [ ] **Step 1: Add `waitForFreeSearch()` after `healthCheck()`**
|
||||
|
||||
In `scripts/enrich-with-freesearch.ts`, find this block (around line 872):
|
||||
|
||||
```typescript
async function healthCheck(): Promise<boolean> {
  try {
    const resp = await axios.get(`${FREESEARCH_URL}/api/health`, { timeout: 5000 });
    return resp.status === 200;
  } catch {
    return false;
  }
}
```
|
||||
|
||||
Add the following function immediately after it:
|
||||
|
||||
```typescript
async function waitForFreeSearch(): Promise<void> {
  let backoffMs = 30_000;
  const maxBackoffMs = 300_000; // 5 minutes
  let attempt = 0;

  while (!shuttingDown) {
    attempt++;
    const healthy = await healthCheck();
    if (healthy) {
      if (attempt > 1) log('FreeSearch is back. Continuing...');
      return;
    }
    const waitSec = Math.round(backoffMs / 1000);
    logError(`FreeSearch not reachable at ${FREESEARCH_URL} (attempt ${attempt}). Retrying in ${waitSec}s...`);
    await sleep(backoffMs);
    backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
  }
}
```
|
||||
|
||||
- [ ] **Step 2: Replace the startup health check block in `main()`**
|
||||
|
||||
Find this block in `main()` (around line 1272):
|
||||
|
||||
```typescript
// Health check
log('Checking FreeSearch health...');
const healthy = await healthCheck();
if (!healthy) {
  logError(`FreeSearch not reachable at ${FREESEARCH_URL}`);
  logError('Make sure FreeSearch is running and accessible.');
  process.exit(1);
}
log('FreeSearch health check: OK');
```
|
||||
|
||||
Replace with:
|
||||
|
||||
```typescript
// Wait for FreeSearch to be reachable (indefinite retry with backoff)
log('Waiting for FreeSearch to be reachable...');
await waitForFreeSearch();
if (shuttingDown) return;
log('FreeSearch health check: OK');
```
|
||||
|
||||
- [ ] **Step 3: Add stale job cleanup before job creation**
|
||||
|
||||
Find this block in `main()` (around line 1291):
|
||||
|
||||
```typescript
// Job tracking
let jobId = await createOrResumeJob(args);
if (!jobId) {
  jobId = await createNewJob({ countryCode, limit, continuous, dryRun, reSearch });
}
log(`Job ID: ${jobId}`);
```
|
||||
|
||||
Replace with:
|
||||
|
||||
```typescript
// Job tracking — clean up any running jobs left by a previous container restart
await prisma.backgroundJob.updateMany({
  where: { type: 'freesearch-enrichment', status: 'running' },
  data: { status: 'failed', error: 'Container restarted', completedAt: new Date() },
});

let jobId = await createOrResumeJob(args);
if (!jobId) {
  jobId = await createNewJob({ countryCode, limit, continuous, dryRun, reSearch });
}
log(`Job ID: ${jobId}`);
```
|
||||
|
||||
- [ ] **Step 4: Verify the script compiles**
|
||||
|
||||
```bash
|
||||
cd /home/albert/Documents/ScraperControl
|
||||
npx tsc --noEmit
|
||||
```
|
||||
|
||||
Expected: no errors (or only pre-existing errors unrelated to this change).
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/enrich-with-freesearch.ts
|
||||
git commit -m "fix: wait for FreeSearch on startup instead of exiting; clean stale jobs"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Write heartbeat file in scheduler
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/scheduler.ts`
|
||||
|
||||
The scheduler already has an hourly cron that logs a heartbeat message (lines 747-758). We add a single `fs.writeFileSync` call inside it to write the timestamp to `/app/logs/scheduler.heartbeat`. The `logs/` directory is already created by `ensureLogsDir()` at startup.
|
||||
|
||||
- [ ] **Step 1: Add heartbeat file write inside the hourly cron**
|
||||
|
||||
Find this block in `scripts/scheduler.ts` (around line 747):
|
||||
|
||||
```typescript
// Heartbeat every hour — logs cycle state
cron.schedule('0 * * * *', () => {
  const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
    ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
    : 'none';
  const jobs = runningJobs.size > 0
    ? `Running: ${[...runningJobs.keys()].join(', ')}`
    : 'No jobs running';
  const state = cycleState.waitingForCooldown
    ? 'cooldown'
    : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
  log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
}, { timezone: 'UTC' });
log('Registered cron job: heartbeat (hourly)');
```
|
||||
|
||||
Replace with:
|
||||
|
||||
```typescript
// Heartbeat every hour — logs cycle state and writes heartbeat file for Docker healthcheck
cron.schedule('0 * * * *', () => {
  const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
    ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
    : 'none';
  const jobs = runningJobs.size > 0
    ? `Running: ${[...runningJobs.keys()].join(', ')}`
    : 'No jobs running';
  const state = cycleState.waitingForCooldown
    ? 'cooldown'
    : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
  log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
  fs.writeFileSync(path.join(LOGS_DIR, 'scheduler.heartbeat'), new Date().toISOString());
}, { timezone: 'UTC' });
log('Registered cron job: heartbeat (hourly)');
```
|
||||
|
||||
`fs` and `path` are already imported in `scheduler.ts`. `LOGS_DIR` is already defined as `'/app/logs'`.
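
As a rough sketch of the same write wrapped in a helper, the snippet below assumes a hypothetical `writeHeartbeat(logsDir)` function. The name appears in the design spec, but this body is only an illustration, not the code in `scheduler.ts`. It writes to a temporary file and renames it so that a healthcheck never reads a half-written file, and it logs instead of throwing when the write fails. The demo directory is a temp directory chosen for the example.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper (not from scheduler.ts): writes the current ISO
// timestamp to <logsDir>/scheduler.heartbeat and reports success.
function writeHeartbeat(logsDir: string, now: Date = new Date()): boolean {
  const target = path.join(logsDir, 'scheduler.heartbeat');
  const tmp = `${target}.tmp`;
  try {
    fs.writeFileSync(tmp, now.toISOString());
    fs.renameSync(tmp, target); // rename is atomic within one filesystem
    return true;
  } catch (err) {
    // A failed write is logged so the cron callback itself keeps running.
    console.error(`Heartbeat write failed: ${(err as Error).message}`);
    return false;
  }
}

// Demo: write into a throwaway directory instead of /app/logs.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'heartbeat-demo-'));
writeHeartbeat(demoDir, new Date('2026-03-28T14:00:00.000Z'));
```

If a helper like this were also called once at startup, the heartbeat file would exist before the first hourly tick. Whether that is desirable is a design choice this plan does not make.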
|
||||
|
||||
- [ ] **Step 2: Verify the script compiles**
|
||||
|
||||
```bash
|
||||
cd /home/albert/Documents/ScraperControl
|
||||
npx tsc --noEmit
|
||||
```
|
||||
|
||||
Expected: no errors.
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/scheduler.ts
|
||||
git commit -m "fix: write heartbeat file for Docker healthcheck"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Fix scheduler healthcheck in docker-compose.yml
|
||||
|
||||
**Files:**
|
||||
- Modify: `docker-compose.yml`
|
||||
|
||||
- [ ] **Step 1: Replace the scheduler healthcheck**
|
||||
|
||||
Find this block in `docker-compose.yml` (around line 275):
|
||||
|
||||
```yaml
healthcheck:
  test: ["CMD-SHELL", "pgrep -f scheduler.ts || exit 1"]
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 30s
```
|
||||
|
||||
Replace with:
|
||||
|
||||
```yaml
healthcheck:
  test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
  interval: 90s
  timeout: 10s
  retries: 3
  start_period: 90s
```
|
||||
|
||||
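The `find ... -mmin -120` check passes only if the file exists and was modified within the last 120 minutes (2 hours), so one missed hourly tick is tolerated. The `start_period: 90s` only suppresses failures during the first 90 seconds after start. Because the heartbeat is written on the hourly cron tick, the container can still report `unhealthy` until the first tick fires (up to 60 minutes), as noted in Task 4, Step 6.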
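
For readers less familiar with `find -mmin`, the same file-age test can be sketched in TypeScript. `isHeartbeatFresh` is a hypothetical name, and the comparison only approximates how `find` rounds minutes. It illustrates the idea and is not part of the healthcheck itself.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical illustration of the healthcheck rule: the file must exist
// and its modification time must be newer than maxAgeMinutes.
function isHeartbeatFresh(
  file: string,
  maxAgeMinutes: number,
  nowMs: number = Date.now(),
): boolean {
  try {
    const ageMs = nowMs - fs.statSync(file).mtimeMs;
    return ageMs < maxAgeMinutes * 60_000;
  } catch {
    return false; // a missing or unreadable file counts as unhealthy
  }
}

// Demo: a freshly written file in a throwaway directory.
const demoFile = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'heartbeat-check-')),
  'scheduler.heartbeat',
);
fs.writeFileSync(demoFile, new Date().toISOString());
console.log(isHeartbeatFresh(demoFile, 120));
```

A file last modified three hours ago, or a file that does not exist, would make the function return `false`, which corresponds to the `|| exit 1` branch of the shell command.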
|
||||
|
||||
- [ ] **Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add docker-compose.yml
|
||||
git commit -m "fix: replace pgrep healthcheck with heartbeat file check"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Deploy and verify
|
||||
|
||||
- [ ] **Step 1: Sync dev directory to Docker deployment**
|
||||
|
||||
```bash
|
||||
cd /home/albert/Documents/ScraperControl
|
||||
bash scripts/deploy-local.sh
|
||||
```
|
||||
|
||||
Expected: rsync output showing the three changed files transferred to `/opt/docker/scraper-control/`.
|
||||
|
||||
- [ ] **Step 2: Restart the two affected containers**
|
||||
|
||||
```bash
|
||||
docker compose -f /opt/docker/scraper-control/docker-compose.yml restart freesearch-enrichment scheduler
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify freesearch-enrichment is stable**
|
||||
|
||||
```bash
|
||||
docker logs scraper-control-freesearch-enrichment-1 --tail 30 -f
|
||||
```
|
||||
|
||||
Expected: logs showing "Waiting for FreeSearch to be reachable..." with retry messages if FreeSearch is still down, OR "FreeSearch health check: OK" and normal enrichment if FreeSearch is up. Container should NOT exit. Wait 2 minutes to confirm no restart.
|
||||
|
||||
- [ ] **Step 4: Confirm stale jobs were cleaned up**
|
||||
|
||||
```bash
docker exec scraper-control-db-1 psql -U postgres -d nearestmass \
  -c "SELECT type, status, started_at, completed_at, error FROM background_jobs WHERE type = 'freesearch-enrichment' ORDER BY started_at DESC LIMIT 5;"
```
|
||||
|
||||
Expected: the two previously-stuck `running` jobs from Mar 22 and Mar 26 now show `status = 'failed'` with `error = 'Container restarted'`.
|
||||
|
||||
- [ ] **Step 5: Verify scheduler heartbeat file is written**
|
||||
|
||||
The heartbeat file is new, so it will not exist until the first hourly cron tick fires. Wait for the next tick (up to 60 minutes), then check:
|
||||
|
||||
```bash
|
||||
docker exec scraper-control-scheduler-1 cat /app/logs/scheduler.heartbeat
|
||||
```
|
||||
|
||||
Expected: an ISO timestamp, e.g. `2026-03-28T14:00:00.000Z`
|
||||
|
||||
- [ ] **Step 6: Verify scheduler becomes healthy**
|
||||
|
||||
```bash
|
||||
docker ps --format "table {{.Names}}\t{{.Status}}" | grep scheduler
|
||||
```
|
||||
|
||||
Expected: `scraper-control-scheduler-1 Up X hours (healthy)` — but only after the first heartbeat fires AND Docker's `start_period` (90s) passes. If the next cron tick hasn't happened yet, `status` will remain `starting` or `unhealthy` until it does.
|
||||
|
||||
To force an immediate test without waiting for the cron:
|
||||
|
||||
```bash
docker exec scraper-control-scheduler-1 bash -c \
  "date -u +%Y-%m-%dT%H:%M:%S.000Z > /app/logs/scheduler.heartbeat && echo 'written'"
docker exec scraper-control-scheduler-1 \
  find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . && echo "PASS" || echo "FAIL"
```
|
||||
|
||||
Expected: `written` then `PASS`.
|
||||
103
docs/superpowers/specs/2026-03-28-freesearch-stability-design.md
Normal file
@@ -0,0 +1,103 @@
|
||||
# FreeSearch Stability & Scheduler Healthcheck Fix
|
||||
|
||||
**Date:** 2026-03-28
|
||||
**Status:** Approved
|
||||
**Scope:** `scripts/enrich-with-freesearch.ts`, `scripts/scheduler.ts`, `docker-compose.yml`
|
||||
|
||||
---
|
||||
|
||||
## Problem Summary
|
||||
|
||||
Three related infrastructure reliability issues identified during health check:
|
||||
|
||||
1. **FreeSearch crash loop** — `freesearch-enrichment` container restarts every ~60s because startup health check calls `process.exit(1)` when FreeSearch API is unreachable. The circuit breaker (which handles mid-run outages) lives inside `runContinuous()` and is never reached.
|
||||
|
||||
2. **Stale running jobs** — Each container restart creates a new `freesearch-enrichment` DB job without cleaning up the previous `running` one. Two jobs from Mar 22 and Mar 26 are permanently stuck as `running`.
|
||||
|
||||
3. **Scheduler healthcheck failing** — `node:20-bookworm-slim` does not include `procps`/`pgrep`. The healthcheck command `pgrep -f scheduler.ts` exits 1 silently → scheduler shows as `unhealthy` despite working correctly.
|
||||
|
||||
---
|
||||
|
||||
## Fix 1: FreeSearch Startup Resilience
|
||||
|
||||
### Change
|
||||
|
||||
Replace the `process.exit(1)` startup health check in `main()` with a `waitForFreeSearch()` function.
|
||||
|
||||
### Behavior
|
||||
|
||||
- Polls `GET /api/health` with exponential backoff: 30s → 60s → 120s → 240s → cap at 300s (5 min)
|
||||
- Waits indefinitely — container stays alive until FreeSearch comes back
|
||||
- Logs each attempt: `"FreeSearch not reachable, retrying in 120s..."`
|
||||
- Logs recovery: `"FreeSearch is back, continuing..."`
|
||||
- Proceeds to job setup and `runContinuous()` once health check passes
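
The schedule above can be sketched as a small function. `backoffSchedule` is a hypothetical helper written for this illustration. It assumes the same doubling rule and 300s cap described in the list.

```typescript
// Hypothetical helper: lists the first `count` wait times of a doubling
// backoff that starts at initialMs and never exceeds maxMs.
function backoffSchedule(initialMs: number, maxMs: number, count: number): number[] {
  const waits: number[] = [];
  let current = initialMs;
  for (let i = 0; i < count; i++) {
    waits.push(current);
    current = Math.min(current * 2, maxMs);
  }
  return waits;
}

console.log(
  backoffSchedule(30_000, 300_000, 6)
    .map((ms) => `${ms / 1000}s`)
    .join(' -> '),
);
// prints "30s -> 60s -> 120s -> 240s -> 300s -> 300s"
```

After the fifth attempt every further wait stays at the cap, so an extended outage costs one health request every 5 minutes.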
|
||||
|
||||
### Stale job cleanup (same function)
|
||||
|
||||
Before creating a new DB job in `main()`, run a cleanup:
|
||||
|
||||
```typescript
await prisma.backgroundJob.updateMany({
  where: { type: 'freesearch-enrichment', status: 'running' },
  data: { status: 'failed', error: 'Container restarted', completedAt: new Date() },
});
```
|
||||
|
||||
This fixes the two existing stuck jobs and prevents the pattern from recurring on future restarts.
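
To make the effect of the `updateMany` call concrete, here is an in-memory sketch of the same rule. The `Job` shape and the `failStaleJobs` name are assumptions for illustration and do not reflect the actual Prisma model.

```typescript
type JobStatus = 'running' | 'completed' | 'failed';

// Assumed, simplified job record (not the real background_jobs schema).
interface Job {
  id: string;
  type: string;
  status: JobStatus;
  error?: string;
  completedAt?: Date;
}

// Marks every running job of the given type as failed and returns how many
// records were changed, mirroring the where/data pair of the cleanup query.
function failStaleJobs(jobs: Job[], type: string, now: Date = new Date()): number {
  let updated = 0;
  for (const job of jobs) {
    if (job.type === type && job.status === 'running') {
      job.status = 'failed';
      job.error = 'Container restarted';
      job.completedAt = now;
      updated++;
    }
  }
  return updated;
}

const demoJobs: Job[] = [
  { id: 'a', type: 'freesearch-enrichment', status: 'running' },
  { id: 'b', type: 'freesearch-enrichment', status: 'completed' },
  { id: 'c', type: 'other-job', status: 'running' },
];
const changed = failStaleJobs(demoJobs, 'freesearch-enrichment');
console.log(changed, demoJobs.map((j) => `${j.id}:${j.status}`).join(' '));
// prints "1 a:failed b:completed c:running"
```

Only running jobs of the matching type are touched. Completed jobs and running jobs of other types keep their status.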
|
||||
|
||||
### Files changed
|
||||
|
||||
- `scripts/enrich-with-freesearch.ts`: ~25 lines
|
||||
|
||||
---
|
||||
|
||||
## Fix 2: Scheduler Healthcheck
|
||||
|
||||
### Change
|
||||
|
||||
Replace `pgrep`-based healthcheck with a heartbeat file approach.
|
||||
|
||||
**In `scheduler.ts`:** Add `writeHeartbeat()` call inside the existing hourly cron handler. Writes current ISO timestamp to `/app/logs/scheduler.heartbeat`.
|
||||
|
||||
**In `docker-compose.yml`:** Replace healthcheck:
|
||||
|
||||
```yaml
# Before
test: ["CMD-SHELL", "pgrep -f scheduler.ts || exit 1"]
interval: 60s
timeout: 10s
retries: 3
start_period: 30s

# After
test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
interval: 90s
timeout: 10s
retries: 3
start_period: 90s
```
|
||||
|
||||
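The `./logs` volume is already mounted. `start_period: 90s` only suppresses failures during the first 90 seconds after start, so the check can still report `unhealthy` until the first hourly cron tick writes the heartbeat file.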
|
||||
|
||||
### Files changed
|
||||
|
||||
- `scripts/scheduler.ts`: ~5 lines
|
||||
- `docker-compose.yml`: 4 lines
|
||||
|
||||
---
|
||||
|
||||
## Fix 3: Deploy
|
||||
|
||||
```bash
|
||||
bash scripts/deploy-local.sh
|
||||
docker compose -f /opt/docker/scraper-control/docker-compose.yml restart freesearch-enrichment scheduler
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
- `freesearch-enrichment` container stays running even when FreeSearch is down, resumes enrichment when it comes back
|
||||
- No new stale `running` freesearch-enrichment jobs after container restarts
|
||||
- `scheduler` container shows as `healthy` in `docker ps`
|
||||
- No behavioral changes to enrichment logic itself
|
||||
6
next-env.d.ts
vendored
Normal file
@@ -0,0 +1,6 @@
|
||||
/// <reference types="next" />
|
||||
/// <reference types="next/image-types/global" />
|
||||
import "./.next/types/routes.d.ts";
|
||||
|
||||
// NOTE: This file should not be edited
|
||||
// see https://nextjs.org/docs/app/api-reference/config/typescript for more information.
|
||||
9
next.config.ts
Normal file
@@ -0,0 +1,9 @@
|
||||
import type { NextConfig } from 'next';
|
||||
|
||||
const nextConfig: NextConfig = {
|
||||
output: 'standalone',
|
||||
poweredByHeader: false,
|
||||
reactStrictMode: true,
|
||||
};
|
||||
|
||||
export default nextConfig;
|
||||
8677
package-lock.json
generated
Normal file
File diff suppressed because it is too large
@@ -29,7 +29,7 @@
|
||||
"import:msze-info": "tsx scripts/import-msze-info.ts",
|
||||
"import:weekdaymasses": "tsx scripts/import-weekdaymasses.ts",
|
||||
"import:masstimes-api": "tsx scripts/import-masstimes-api.ts",
|
||||
"import:discovermass": "tsx scripts/import-discovermass.ts",
|
||||
"dedup:geo": "tsx scripts/find-geo-duplicates.ts",
|
||||
"postinstall": "prisma generate"
|
||||
},
|
||||
"dependencies": {
|
||||
|
||||
8
postcss.config.mjs
Normal file
@@ -0,0 +1,8 @@
|
||||
/** @type {import('postcss-load-config').Config} */
|
||||
const config = {
|
||||
plugins: {
|
||||
'@tailwindcss/postcss': {},
|
||||
},
|
||||
};
|
||||
|
||||
export default config;
|
||||
@@ -42,10 +42,11 @@ model Church {
|
||||
messesInfoId String? @unique @map("messes_info_id")
|
||||
bohosluzbyId String? @unique @map("bohosluzby_id")
|
||||
miserendId String? @unique @map("miserend_id")
|
||||
kerknetId String? @unique @map("kerknet_id")
|
||||
gottesdienstzeitenId String? @unique @map("gottesdienstzeiten_id")
|
||||
discovermassId String? @unique @map("discovermass_id")
|
||||
gottesdienstzeitenId String? @unique @map("gottesdienstzeiten_id")
|
||||
kerknetId String? @unique @map("kerknet_id")
|
||||
buscarmisasNetworkId String? @unique @map("buscarmisas_network_id")
|
||||
gcatholicId String? @unique @map("gcatholic_id")
|
||||
claimed Boolean @default(false)
|
||||
claimedAt DateTime? @map("claimed_at")
|
||||
lastScrapedAt DateTime? @map("last_scraped_at")
|
||||
@@ -59,6 +60,7 @@ model Church {
|
||||
googleSearchedAt DateTime? @map("google_searched_at") // When Google Places enrichment was attempted
|
||||
createdAt DateTime @default(now()) @map("created_at")
|
||||
updatedAt DateTime @updatedAt @map("updated_at")
|
||||
parochiaSlug String? @map("parochia_slug")
|
||||
|
||||
dioceseId String? @map("diocese_id")
|
||||
|
||||
@@ -95,10 +97,11 @@ model Church {
|
||||
@@index([messesInfoId])
|
||||
@@index([bohosluzbyId])
|
||||
@@index([miserendId])
|
||||
@@index([kerknetId])
|
||||
@@index([gottesdienstzeitenId])
|
||||
@@index([discovermassId])
|
||||
@@index([gottesdienstzeitenId])
|
||||
@@index([kerknetId])
|
||||
@@index([buscarmisasNetworkId])
|
||||
@@index([gcatholicId])
|
||||
@@index([dioceseId])
|
||||
@@index([claimedByUserId])
|
||||
@@map("churches")
|
||||
|
||||
165
scripts/debug/analyze-enrichment-priority.ts
Normal file
@@ -0,0 +1,165 @@
|
||||
import { config } from 'dotenv';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
|
||||
// Load .env.local first, then .env
|
||||
config({ path: '.env.local' });
|
||||
config({ path: '.env' });
|
||||
|
||||
const connectionString = process.env.DATABASE_URL;
|
||||
|
||||
if (!connectionString) {
|
||||
throw new Error('DATABASE_URL environment variable is not set');
|
||||
}
|
||||
|
||||
const pool = new Pool({ connectionString });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
interface CountryStats {
|
||||
country: string;
|
||||
totalChurches: number;
|
||||
withWebsite: number;
|
||||
withoutWebsite: number;
|
||||
websitePercent: number;
|
||||
needEnrichment: number;
|
||||
priority: number;
|
||||
}
|
||||
|
||||
async function analyzeEnrichmentPriority() {
|
||||
try {
|
||||
console.log('Analyzing enrichment priority by country...\n');
|
||||
|
||||
// Get all OSM churches grouped by country
|
||||
const churches = await prisma.church.findMany({
|
||||
where: {
|
||||
source: 'osm',
|
||||
},
|
||||
select: {
|
||||
country: true,
|
||||
hasWebsite: true,
|
||||
website: true,
|
||||
},
|
||||
});
|
||||
|
||||
// Group by country and calculate stats
|
||||
const byCountry = churches.reduce((acc, church) => {
|
||||
const country = church.country || 'Unknown';
|
||||
if (!acc[country]) {
|
||||
acc[country] = {
|
||||
country,
|
||||
totalChurches: 0,
|
||||
withWebsite: 0,
|
||||
withoutWebsite: 0,
|
||||
websitePercent: 0,
|
||||
needEnrichment: 0,
|
||||
priority: 0,
|
||||
};
|
||||
}
|
||||
|
||||
acc[country].totalChurches++;
|
||||
if (church.hasWebsite || church.website) {
|
||||
acc[country].withWebsite++;
|
||||
} else {
|
||||
acc[country].withoutWebsite++;
|
||||
acc[country].needEnrichment++;
|
||||
}
|
||||
|
||||
return acc;
|
||||
}, {} as Record<string, CountryStats>);
|
||||
|
||||
// Calculate percentages and priority score
|
||||
const stats = Object.values(byCountry).map((stat) => {
|
||||
stat.websitePercent = (stat.withWebsite / stat.totalChurches) * 100;
|
||||
|
||||
// Priority formula:
|
||||
// - Weight heavily on churches needing enrichment (80%)
|
||||
// - Weight on low website coverage (20%)
|
||||
// This favors large countries with low coverage
|
||||
const needWeight = stat.needEnrichment / 1000; // Normalize to thousands
|
||||
const coverageGap = 100 - stat.websitePercent; // How much coverage is missing
|
||||
stat.priority = needWeight * 0.8 + (coverageGap / 100) * needWeight * 0.2;
|
||||
|
||||
return stat;
|
||||
});
|
||||
|
||||
// Sort by priority (highest first)
|
||||
stats.sort((a, b) => b.priority - a.priority);
|
||||
|
||||
// Display results
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log('ENRICHMENT PRIORITY RANKING');
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log('');
|
||||
console.log('Priority formula: need * 0.8 + need * coverage_gap * 0.2, where need = churches_needing_enrichment / 1000 and coverage_gap = (100 - coverage%) / 100');
|
||||
console.log('This favors countries with many churches and low website coverage.');
|
||||
console.log('');
|
||||
console.log('Rank | Country | Total | Need Enrichment | Coverage | Priority Score');
|
||||
console.log('─────┼─────────┼───────┼────────────────┼──────────┼────────────────');
|
||||
|
||||
stats.forEach((stat, index) => {
|
||||
const rank = String(index + 1).padStart(4);
|
||||
const country = stat.country.padEnd(7);
|
||||
const total = String(stat.totalChurches).padStart(5);
|
||||
const need = String(stat.needEnrichment).padStart(15);
|
||||
const coverage = `${stat.websitePercent.toFixed(1)}%`.padStart(8);
|
||||
const priority = stat.priority.toFixed(2).padStart(14);
|
||||
|
||||
console.log(`${rank} | ${country} | ${total} | ${need} | ${coverage} | ${priority}`);
|
||||
});
|
||||
|
||||
console.log('');
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log('');
|
||||
|
||||
// Show top 10 with details
|
||||
console.log('TOP 10 COUNTRIES TO PRIORITIZE:');
|
||||
console.log('');
|
||||
|
||||
stats.slice(0, 10).forEach((stat, index) => {
|
||||
console.log(`${index + 1}. ${stat.country}`);
|
||||
console.log(` Total churches: ${stat.totalChurches.toLocaleString()}`);
|
||||
console.log(` Need enrichment: ${stat.needEnrichment.toLocaleString()} (${(100 - stat.websitePercent).toFixed(1)}% missing)`);
|
||||
console.log(` Current coverage: ${stat.websitePercent.toFixed(1)}%`);
|
||||
console.log(` Priority score: ${stat.priority.toFixed(2)}`);
|
||||
console.log('');
|
||||
});
|
||||
|
||||
// Calculate enrichment timeline
|
||||
const totalNeedEnrichment = stats.reduce((sum, s) => sum + s.needEnrichment, 0);
|
||||
const daysAtFullSpeed = Math.ceil(totalNeedEnrichment / 390);
|
||||
const monthsAtFullSpeed = (daysAtFullSpeed / 30).toFixed(1);
|
||||
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log('ENRICHMENT TIMELINE');
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log(`Total churches needing enrichment: ${totalNeedEnrichment.toLocaleString()}`);
|
||||
console.log(`At 390 churches/day (free tier): ${daysAtFullSpeed} days (~${monthsAtFullSpeed} months)`);
|
||||
console.log('');
|
||||
|
||||
// Output country priority order for the script
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log('COUNTRY PRIORITY ORDER (for enrichment script)');
|
||||
console.log('═══════════════════════════════════════════════════════════════════════════');
|
||||
console.log('');
|
||||
console.log('const COUNTRY_PRIORITY = [');
|
||||
stats
|
||||
.filter((s) => s.needEnrichment > 0)
|
||||
.forEach((stat, index) => {
|
||||
const comma = index < stats.filter((s) => s.needEnrichment > 0).length - 1 ? ',' : '';
|
||||
console.log(` '${stat.country}'${comma} // ${stat.needEnrichment.toLocaleString()} churches`);
|
||||
});
|
||||
console.log('];');
|
||||
console.log('');
|
||||
|
||||
} catch (error) {
|
||||
console.error('Error:', error);
|
||||
process.exitCode = 1; // set the exit code without skipping the cleanup in the finally block
|
||||
} finally {
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
}
|
||||
|
||||
analyzeEnrichmentPriority();
|
||||
66
scripts/debug/check-2-real-bugs.ts
Normal file
@@ -0,0 +1,66 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Check the 2 potentially real bugs
|
||||
*/
|
||||
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
async function checkRealBugs() {
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
|
||||
console.log('=== 1. Iglesia de San Fernando (trying Spanish page) ===\n');
|
||||
|
||||
scraper.setCountry('ES');
|
||||
const spanishUrl = 'https://www.parroquiasanfernandomaspalomas.net/'; // Remove /de/
|
||||
const result1 = await scraper.scrape(spanishUrl);
|
||||
|
||||
console.log(`URL: ${spanishUrl}`);
|
||||
console.log(`Success: ${result1.success}`);
|
||||
console.log(`Schedules: ${result1.schedules.length}`);
|
||||
console.log(`Error: ${result1.error || 'none'}\n`);
|
||||
|
||||
if (result1.schedules.length > 0) {
|
||||
console.log('Sample schedules:');
|
||||
result1.schedules.slice(0, 5).forEach(s => {
|
||||
const days = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
|
||||
console.log(` ${days[s.dayOfWeek]} ${s.time} - ${s.language} ${s.massType}`);
|
||||
});
|
||||
}
|
||||
|
||||
console.log('\n=== 2. Kościół (Poland) ===\n');
|
||||
|
||||
scraper.setCountry('PL');
|
||||
const result2 = await scraper.scrape('http://parafialubojna.pl');
|
||||
|
||||
console.log(`Success: ${result2.success}`);
|
||||
console.log(`Schedules: ${result2.schedules.length}`);
|
||||
console.log(`Error: ${result2.error || 'none'}\n`);
|
||||
|
||||
if (result2.schedules.length > 0) {
|
||||
console.log('Sample schedules:');
|
||||
result2.schedules.slice(0, 5).forEach(s => {
|
||||
const days = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
|
||||
console.log(` ${days[s.dayOfWeek]} ${s.time} - ${s.language} ${s.massType}`);
|
||||
});
|
||||
} else if (result2.rawHtml) {
|
||||
const text = result2.rawHtml
|
||||
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
|
||||
.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
|
||||
.replace(/<[^>]+>/g, ' ')
|
||||
.replace(/\s+/g, ' ')
|
||||
.toLowerCase();
|
||||
|
||||
// Look for Polish schedule keywords
|
||||
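// indexOf returns -1 (truthy) when a keyword is missing, so chaining with || never falls through; take the first keyword that is actually found
const scheduleIndex = ['msze', 'msza', 'nabożeńst'].map((kw) => text.indexOf(kw)).find((i) => i !== -1) ?? -1;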
|
||||
if (scheduleIndex !== -1) {
|
||||
const snippet = text.substring(scheduleIndex, scheduleIndex + 300);
|
||||
console.log('Found schedule section:');
|
||||
console.log(snippet);
|
||||
}
|
||||
}
|
||||
|
||||
await scraper.close();
|
||||
}
|
||||
|
||||
checkRealBugs().catch(console.error);
|
||||
79
scripts/debug/check-enrichment-detail.ts
Normal file
@@ -0,0 +1,79 @@
|
||||
import { Pool } from 'pg';
|
||||
import * as dotenv from 'dotenv';
|
||||
import * as path from 'path';
|
||||
|
||||
// Load .env.local first (takes precedence), then .env
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
const pool = new Pool({
|
||||
connectionString: process.env.DATABASE_URL,
|
||||
});
|
||||
|
||||
async function checkEnrichmentDetail() {
|
||||
try {
|
||||
console.log('Connecting to database...\n');
|
||||
|
||||
// Check churches awaiting enrichment
|
||||
const pendingResult = await pool.query(`
|
||||
SELECT
|
||||
country,
|
||||
COUNT(*) as pending_count
|
||||
FROM churches
|
||||
WHERE google_place_id IS NULL
|
||||
GROUP BY country
|
||||
ORDER BY pending_count DESC
|
||||
LIMIT 20;
|
||||
`);
|
||||
|
||||
console.log('=== Churches Awaiting Enrichment (Top 20 Countries) ===');
|
||||
let totalPending = 0;
|
||||
pendingResult.rows.forEach((row) => {
|
||||
console.log(`${row.country}: ${row.pending_count} churches`);
|
||||
totalPending += parseInt(row.pending_count, 10);
|
||||
});
|
||||
console.log(`\nTotal pending shown: ${totalPending}`);
|
||||
|
||||
// Check total stats
|
||||
const statsResult = await pool.query(`
|
||||
SELECT
|
||||
COUNT(*) as total_churches,
|
||||
COUNT(CASE WHEN google_place_id IS NOT NULL THEN 1 END) as enriched,
|
||||
COUNT(CASE WHEN google_place_id IS NULL THEN 1 END) as pending
|
||||
FROM churches;
|
||||
`);
|
||||
|
||||
console.log('\n=== Overall Stats ===');
|
||||
console.log(`Total churches: ${statsResult.rows[0].total_churches}`);
|
||||
console.log(`Enriched: ${statsResult.rows[0].enriched} (${((statsResult.rows[0].enriched / statsResult.rows[0].total_churches) * 100).toFixed(2)}%)`);
|
||||
console.log(`Pending: ${statsResult.rows[0].pending} (${((statsResult.rows[0].pending / statsResult.rows[0].total_churches) * 100).toFixed(2)}%)`);
|
||||
|
||||
// Check enrichment rate
|
||||
const rateResult = await pool.query(`
|
||||
SELECT
|
||||
DATE(updated_at) as date,
|
||||
COUNT(*) as enriched_count
|
||||
FROM churches
|
||||
WHERE google_place_id IS NOT NULL
|
||||
AND updated_at > NOW() - INTERVAL '7 days'
|
||||
GROUP BY DATE(updated_at)
|
||||
ORDER BY date DESC;
|
||||
`);
|
||||
|
||||
console.log('\n=== Enrichment Activity (Last 7 Days) ===');
|
||||
if (rateResult.rows.length === 0) {
|
||||
console.log('No enrichment activity in the last 7 days');
|
||||
} else {
|
||||
rateResult.rows.forEach((row) => {
|
||||
console.log(`${row.date}: ${row.enriched_count} churches`);
|
||||
});
|
||||
}
|
||||
|
||||
} catch (error) {
|
||||
console.error('Error checking enrichment detail:', error);
|
||||
} finally {
|
||||
await pool.end();
|
||||
}
|
||||
}
|
||||
|
||||
checkEnrichmentDetail();
|
||||
146
scripts/debug/check-enrichment-status.ts
Normal file
@@ -0,0 +1,146 @@
import { config } from 'dotenv';
import { PrismaClient } from '@prisma/client';
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';

// Load .env.local first, then .env
config({ path: '.env.local' });
config({ path: '.env' });

const connectionString = process.env.DATABASE_URL;

if (!connectionString) {
  throw new Error('DATABASE_URL environment variable is not set');
}

const pool = new Pool({ connectionString });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

async function checkEnrichmentStatus() {
  try {
    console.log('Checking enrichment status...\n');

    // Overall stats
    const totalOSM = await prisma.church.count({
      where: { source: 'osm' },
    });

    const enriched = await prisma.church.count({
      where: {
        source: 'osm',
        googlePlaceId: { not: null },
      },
    });

    const withWebsite = await prisma.church.count({
      where: {
        source: 'osm',
        hasWebsite: true,
      },
    });

    const needEnrichment = await prisma.church.count({
      where: {
        source: 'osm',
        hasWebsite: false,
        website: null,
      },
    });

    // Recently enriched (last 24 hours)
    const yesterday = new Date();
    yesterday.setDate(yesterday.getDate() - 1);

    const recentlyEnriched = await prisma.church.count({
      where: {
        source: 'osm',
        googlePlaceId: { not: null },
        updatedAt: { gte: yesterday },
      },
    });

    // Get top 10 priority countries status
    const PRIORITY_COUNTRIES = ['FR', 'DE', 'ES', 'PL', 'BR', 'PT', 'PH', 'CZ', 'MX', 'HU'];

    console.log('═══════════════════════════════════════════════════════════════');
    console.log('OVERALL ENRICHMENT STATUS');
    console.log('═══════════════════════════════════════════════════════════════');
    console.log(`Total OSM churches: ${totalOSM.toLocaleString()}`);
    console.log(`Churches with Google Place ID: ${enriched.toLocaleString()} (${((enriched / totalOSM) * 100).toFixed(2)}%)`);
    console.log(`Churches with websites: ${withWebsite.toLocaleString()} (${((withWebsite / totalOSM) * 100).toFixed(2)}%)`);
    console.log(`Need enrichment: ${needEnrichment.toLocaleString()} (${((needEnrichment / totalOSM) * 100).toFixed(2)}%)`);
    console.log('');
    console.log(`Recently enriched (24h): ${recentlyEnriched.toLocaleString()}`);
    console.log('');

    // Priority countries breakdown
    console.log('═══════════════════════════════════════════════════════════════');
    console.log('TOP 10 PRIORITY COUNTRIES STATUS');
    console.log('═══════════════════════════════════════════════════════════════');
    console.log('');

    for (const country of PRIORITY_COUNTRIES) {
      const total = await prisma.church.count({
        where: { source: 'osm', country },
      });

      const countryEnriched = await prisma.church.count({
        where: {
          source: 'osm',
          country,
          googlePlaceId: { not: null },
        },
      });

      const countryWithWebsite = await prisma.church.count({
        where: {
          source: 'osm',
          country,
          OR: [
            { hasWebsite: true },
            { googlePlaceId: { not: null } },
          ],
        },
      });

      const countryNeedEnrichment = await prisma.church.count({
        where: {
          source: 'osm',
          country,
          hasWebsite: false,
          website: null,
        },
      });

      // Guard against countries with no rows, which would otherwise print NaN%
      const websitePercent = total > 0 ? (countryWithWebsite / total) * 100 : 0;
      const enrichedPercent = total > 0 ? (countryEnriched / total) * 100 : 0;

      console.log(`${country.padEnd(4)} | Total: ${String(total).padStart(6)} | Enriched: ${String(countryEnriched).padStart(5)} (${enrichedPercent.toFixed(1)}%) | With Website: ${String(countryWithWebsite).padStart(5)} (${websitePercent.toFixed(1)}%) | Need: ${String(countryNeedEnrichment).padStart(6)}`);
    }

    console.log('');

    // Estimate timeline
    const daysRemaining = Math.ceil(needEnrichment / 390);
    const monthsRemaining = (daysRemaining / 30).toFixed(1);

    console.log('═══════════════════════════════════════════════════════════════');
    console.log('TIMELINE ESTIMATE');
    console.log('═══════════════════════════════════════════════════════════════');
    console.log(`At 390 churches/day:`);
    console.log(`  Days remaining: ${daysRemaining} days`);
    console.log(`  Months remaining: ~${monthsRemaining} months`);
    console.log(`  Estimated completion: ${new Date(Date.now() + daysRemaining * 24 * 60 * 60 * 1000).toLocaleDateString()}`);
    console.log('');

  } catch (error) {
    console.error('Error:', error);
    process.exit(1);
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }
}

checkEnrichmentStatus();
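The percentage and timeline arithmetic in `check-enrichment-status.ts` can be pulled out into pure helpers, which makes the zero-total case explicit. This is a minimal sketch and not code from the repository: `percent` and `estimateTimeline` are hypothetical names, and the default of 390 churches/day only mirrors the constant hard-coded in the script.

```typescript
// Hypothetical helpers (not part of the repository) that mirror the
// arithmetic used in check-enrichment-status.ts.

/** Formats part/total as a percentage string; a zero total yields 0 instead of NaN. */
function percent(part: number, total: number, digits = 2): string {
  if (total <= 0) return (0).toFixed(digits);
  return ((part / total) * 100).toFixed(digits);
}

interface TimelineEstimate {
  daysRemaining: number;
  monthsRemaining: string; // one decimal place, using 30-day months as the script does
  completion: Date;
}

/** Estimates how long `remaining` items take at `perDay` items per day. */
function estimateTimeline(
  remaining: number,
  perDay = 390, // assumption: same daily rate as the script's constant
  now: Date = new Date(),
): TimelineEstimate {
  if (perDay <= 0) throw new RangeError('perDay must be positive');
  const daysRemaining = Math.ceil(remaining / perDay);
  return {
    daysRemaining,
    monthsRemaining: (daysRemaining / 30).toFixed(1),
    completion: new Date(now.getTime() + daysRemaining * 24 * 60 * 60 * 1000),
  };
}
```

Passing `now` explicitly keeps the estimate deterministic, which is convenient if the helper is ever unit-tested.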
78
scripts/debug/check-enrichment.ts
Normal file
@@ -0,0 +1,78 @@
import { Pool } from 'pg';
import * as dotenv from 'dotenv';
import * as path from 'path';

// Load .env.local first (takes precedence), then .env
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
});

async function checkEnrichment() {
  try {
    console.log('Connecting to database...');

    // Check total enriched churches
    const totalResult = await pool.query(`
      SELECT
        COUNT(*) as total_enriched,
        COUNT(CASE WHEN updated_at > NOW() - INTERVAL '24 hours' THEN 1 END) as enriched_last_24h,
        MAX(updated_at) as last_enrichment
      FROM churches
      WHERE google_place_id IS NOT NULL;
    `);

    console.log('\n=== Google Enrichment Summary ===');
    console.log(`Total churches with Google Place ID: ${totalResult.rows[0].total_enriched}`);
    console.log(`Enriched in last 24 hours: ${totalResult.rows[0].enriched_last_24h}`);
    console.log(`Last enrichment: ${totalResult.rows[0].last_enrichment}`);

    // Check by country
    const countryResult = await pool.query(`
      SELECT
        country,
        COUNT(*) as enriched_count,
        COUNT(CASE WHEN updated_at > NOW() - INTERVAL '24 hours' THEN 1 END) as enriched_last_24h
      FROM churches
      WHERE google_place_id IS NOT NULL
      GROUP BY country
      ORDER BY enriched_last_24h DESC
      LIMIT 10;
    `);

    console.log('\n=== Top Countries Enriched (Last 24h) ===');
    countryResult.rows.forEach((row) => {
      console.log(`${row.country}: ${row.enriched_last_24h} new / ${row.enriched_count} total`);
    });

    // Check recent enrichments with details
    const recentResult = await pool.query(`
      SELECT
        name,
        city,
        country,
        google_place_id,
        updated_at
      FROM churches
      WHERE google_place_id IS NOT NULL
        AND updated_at > NOW() - INTERVAL '24 hours'
      ORDER BY updated_at DESC
      LIMIT 20;
    `);

    console.log('\n=== Recent Enrichments (Last 24h, sample) ===');
    recentResult.rows.forEach((row) => {
      const timestamp = row.updated_at ? new Date(row.updated_at).toISOString() : 'unknown';
      console.log(`${row.name}, ${row.city}, ${row.country} - ${timestamp}`);
    });

  } catch (error) {
    console.error('Error checking enrichment:', error);
  } finally {
    await pool.end();
  }
}

checkEnrichment();
45
scripts/debug/check-german-office-hours.ts
Normal file
@@ -0,0 +1,45 @@
#!/usr/bin/env tsx
/**
 * Check the full section text for German church to understand office hours pattern
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function checkGerman() {
  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('DE');

  const result = await scraper.scrape('https://www.alterpeter.de/');

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Find Monday section
    const montagIndex = text.indexOf('montag');
    if (montagIndex !== -1) {
      const montagContext = text.substring(montagIndex, montagIndex + 200);
      console.log('=== Monday (Montag) context ===');
      console.log(montagContext);
      console.log('');
    }

    // Find Sunday section
    const sonntagIndex = text.indexOf('sonntag');
    if (sonntagIndex !== -1) {
      const sonntagContext = text.substring(sonntagIndex, sonntagIndex + 300);
      console.log('=== Sunday (Sonntag) context ===');
      console.log(sonntagContext);
      console.log('');
    }
  }

  await scraper.close();
}

checkGerman().catch(console.error);
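The same HTML-to-text chain appears in several of these debug scripts, so it could live in one shared helper. The sketch below is illustrative only: `htmlToText` and `contextAfter` are hypothetical names that do not exist in the repository, and the regular expressions are copied from the script above.

```typescript
// Hypothetical shared helpers; the replace() chain is the one used by the
// debug scripts, everything else is an assumption made for illustration.

/** Strips <script>/<style> blocks and tags, collapses whitespace, lowercases. */
function htmlToText(html: string): string {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .toLowerCase();
}

/** Returns `length` characters starting at the first `keyword`, or null if absent. */
function contextAfter(text: string, keyword: string, length: number): string | null {
  const index = text.indexOf(keyword);
  return index === -1 ? null : text.substring(index, index + length);
}

const sampleText = htmlToText('<p>Montag</p><script>x()</script><b>9:00 Uhr</b>');
console.log(contextAfter(sampleText, 'montag', 11));
```

Because the text is lowercased, keywords passed to `contextAfter` need to be lowercase as well. The result is not trimmed, so it can start or end with a single space.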
51
scripts/debug/check-neon-poland.ts
Normal file
@@ -0,0 +1,51 @@
#!/usr/bin/env tsx
import { config } from 'dotenv';
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

// Load environment variables
config({ path: '.env.local' });
config({ path: '.env' });

async function main() {
  const connectionString = process.env.DATABASE_URL || '';
  console.log('DATABASE_URL:', connectionString.replace(/:[^:@]+@/, ':****@'));

  const pool = new Pool({ connectionString });
  const adapter = new PrismaPg(pool);
  const prisma = new PrismaClient({ adapter });

  console.log('PrismaClient created:', !!prisma);
  console.log('prisma.churches:', !!prisma.churches);

  await prisma.$connect();

  const count = await prisma.churches.count({ where: { country: 'PL' } });
  console.log(`Poland churches in Neon: ${count}`);

  const withSchedules = await prisma.churches.count({
    where: {
      country: 'PL',
      massSchedules: { some: {} }
    }
  });
  console.log(`With mass schedules: ${withSchedules}`);

  // Sample a few churches
  const sample = await prisma.churches.findMany({
    where: { country: 'PL' },
    include: { massSchedules: true },
    take: 3
  });

  console.log('\nSample churches:');
  for (const church of sample) {
    console.log(`  - ${church.name} (${church.city}): ${church.massSchedules.length} schedules`);
  }

  await prisma.$disconnect();
  await pool.end();
}

main().catch(console.error);
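The password masking applied to `DATABASE_URL` before logging can be checked in isolation. This is a small sketch using the same regular expression as the script; the function name `maskConnectionString` and the sample URLs are made up for illustration.

```typescript
// Hypothetical wrapper around the masking regex used in check-neon-poland.ts.
// It replaces the first ":<password>@" run with ":****@".
function maskConnectionString(connectionString: string): string {
  return connectionString.replace(/:[^:@]+@/, ':****@');
}

// Sample values are invented; they are not real credentials.
console.log(maskConnectionString('postgresql://user:secret@host:5432/db'));
console.log(maskConnectionString('postgresql://host/db'));
```

One caveat of this pattern: a password that itself contains `:` is only partly masked, because the match starts at the last colon before the `@`. Parsing the string with the standard `URL` class and blanking its `password` field would likely be more robust.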
38
scripts/debug/check-niedziela-occurrences.ts
Normal file
@@ -0,0 +1,38 @@
#!/usr/bin/env tsx
import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function check() {
  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('PL');

  const result = await scraper.scrape('http://parafialubojna.pl');
  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Collect every occurrence of "niedziela" (Sunday) with surrounding context
    const niedziela_matches = [];
    let idx = 0;
    while ((idx = text.indexOf('niedziela', idx)) !== -1) {
      niedziela_matches.push({
        position: idx,
        context: text.substring(Math.max(0, idx-30), idx+70)
      });
      idx++;
    }

    console.log(`niedziela occurrences: ${niedziela_matches.length}\n`);
    niedziela_matches.forEach((m, i) => {
      console.log(`Occurrence ${i+1} at position ${m.position}:`);
      console.log(`  "${m.context}"`);
      console.log('');
    });
  }
  await scraper.close();
}

// Report failures instead of leaving an unhandled promise rejection
check().catch(console.error);
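The occurrence-scanning loop above generalises to any keyword. The following sketch is an assumption-laden illustration rather than repository code: `findOccurrences` and `Occurrence` are hypothetical names, and the defaults of 30 and 70 characters simply mirror the window used in the script.

```typescript
// Hypothetical generalisation of the "niedziela" scan.
interface Occurrence {
  position: number;
  context: string;
}

/** Finds every (possibly overlapping) occurrence of `needle` with surrounding context. */
function findOccurrences(text: string, needle: string, before = 30, after = 70): Occurrence[] {
  const matches: Occurrence[] = [];
  // An empty needle would make indexOf succeed forever, so bail out early.
  if (needle.length === 0) return matches;

  let idx = 0;
  while ((idx = text.indexOf(needle, idx)) !== -1) {
    matches.push({
      position: idx,
      context: text.substring(Math.max(0, idx - before), idx + after),
    });
    idx++; // advance by one so overlapping matches are still found
  }
  return matches;
}

const demo = findOccurrences('aXbXc', 'X', 1, 2);
console.log(demo);
```

Note that `after` is measured from the start of the match, as in the original loop, so the context includes the keyword itself.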
34
scripts/debug/check-osm-counts.ts
Normal file
@@ -0,0 +1,34 @@
import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function main() {
  const totalRes = await pool.query(`SELECT COUNT(*) as total FROM churches WHERE source = 'osm'`);
  console.log('Total OSM churches:', totalRes.rows[0].total);

  const countryRes = await pool.query(`SELECT country, COUNT(*) as count FROM churches WHERE source = 'osm' AND country IS NOT NULL GROUP BY country ORDER BY count DESC LIMIT 40`);
  console.log('\nTop 40 countries by OSM church count:');
  for (const row of countryRes.rows) {
    console.log(`  ${row.country}: ${row.count}`);
  }

  // Check key countries that were under-imported
  const keyCountries = ['AT','HR','UA','RO','LV','BY','RS','BA','MK','AL','EE','GE','AM','RU','IN','JP','CA','US','MX','AR','CO','ID','CN'];
  const keyRes = await pool.query(`SELECT country, COUNT(*) as count FROM churches WHERE source = 'osm' AND country = ANY($1) GROUP BY country ORDER BY count DESC`, [keyCountries]);
  console.log('\nKey countries to check (were under-imported):');
  const found = new Map(keyRes.rows.map((r: any) => [r.country, r.count]));
  for (const c of keyCountries) {
    console.log(`  ${c}: ${found.get(c) || 0}`);
  }

  // Total countries
  const countriesRes = await pool.query(`SELECT COUNT(DISTINCT country) as total FROM churches WHERE source = 'osm'`);
  console.log(`\nTotal countries with OSM data: ${countriesRes.rows[0].total}`);

  await pool.end();
}
// Report query failures instead of leaving an unhandled promise rejection
main().catch((e) => { console.error(e); process.exit(1); });
88
scripts/debug/check-production-db.ts
Executable file
@@ -0,0 +1,88 @@
#!/usr/bin/env tsx

/**
 * Check production database (Neon) for data
 * Run with: npx tsx scripts/debug/check-production-db.ts
 */

import { Pool } from 'pg';
import { config } from 'dotenv';

// Load environment variables (.env.local overrides .env)
config({ path: '.env.local' });
config({ path: '.env' });

const connectionString = process.env.DATABASE_URL;

if (!connectionString) {
  console.error('❌ DATABASE_URL not found in environment');
  process.exit(1);
}

console.log('🔍 Checking production database...');
console.log('📍 Connection:', connectionString.includes('neon.tech') ? 'Neon (Production)' : 'localhost');

const pool = new Pool({ connectionString });

async function checkDatabase() {
  try {
    // Test connection
    console.log('\n1️⃣ Testing database connection...');
    await pool.query('SELECT NOW()');
    console.log('✅ Database connection successful');

    // Check tables exist
    console.log('\n2️⃣ Checking tables...');
    const tablesResult = await pool.query(`
      SELECT table_name
      FROM information_schema.tables
      WHERE table_schema = 'public'
      ORDER BY table_name
    `);
    console.log(`✅ Found ${tablesResult.rows.length} tables:`, tablesResult.rows.map(r => r.table_name).join(', '));

    // Check churches
    console.log('\n3️⃣ Checking churches...');
    const churchCount = await pool.query('SELECT COUNT(*) FROM "churches"');
    console.log(`📊 Churches: ${churchCount.rows[0].count}`);

    if (parseInt(churchCount.rows[0].count) > 0) {
      const sampleChurch = await pool.query('SELECT id, name, city, state, latitude, longitude FROM "churches" LIMIT 1');
      console.log('📍 Sample church:', sampleChurch.rows[0]);
    } else {
      console.log('⚠️ No churches found in database!');
    }

    // Check mass schedules
    console.log('\n4️⃣ Checking mass schedules...');
    const massCount = await pool.query('SELECT COUNT(*) FROM "mass_schedules"');
    console.log(`📊 Mass schedules: ${massCount.rows[0].count}`);

    // Check liturgical days
    console.log('\n5️⃣ Checking liturgical days...');
    const liturgicalCount = await pool.query('SELECT COUNT(*) FROM "liturgical_days"');
    console.log(`📊 Liturgical days: ${liturgicalCount.rows[0].count}`);

    // Check today's liturgical data (date is computed in UTC)
    const today = new Date().toISOString().split('T')[0];
    const todayData = await pool.query(
      'SELECT * FROM "liturgical_days" WHERE date = $1',
      [today]
    );
    if (todayData.rows.length > 0) {
      console.log(`✅ Today's liturgical data exists:`, todayData.rows[0].season);
    } else {
      console.log(`⚠️ No liturgical data for today (${today})`);
    }

    console.log('\n✨ Database check complete!\n');

  } catch (error) {
    console.error('❌ Error:', error);
    process.exit(1);
  } finally {
    await pool.end();
  }
}

checkDatabase();
164
scripts/debug/check-scraper-status.ts
Normal file
@@ -0,0 +1,164 @@
import { config } from 'dotenv';
import { PrismaClient } from '@prisma/client';
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';

// Load .env.local first, then .env
config({ path: '.env.local' });
config({ path: '.env' });

const connectionString = process.env.DATABASE_URL;

if (!connectionString) {
  throw new Error('DATABASE_URL environment variable is not set');
}

const pool = new Pool({ connectionString });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

async function checkScraperStatus() {
  try {
    console.log('Checking mass schedule scraper status...\n');

    // Overall church stats
    const totalChurches = await prisma.church.count();

    const churchesWithWebsites = await prisma.church.count({
      where: {
        OR: [
          { website: { not: null } },
          { massScheduleUrl: { not: null } },
        ],
      },
    });

    const churchesScraped = await prisma.church.count({
      where: { lastScrapedAt: { not: null } },
    });

    // Mass schedule stats
    const totalMassSchedules = await prisma.massSchedule.count();

    const churchesWithSchedules = await prisma.church.count({
      where: {
        massSchedules: {
          some: {},
        },
      },
    });

    // Recently scraped (last 7 days)
    const weekAgo = new Date();
    weekAgo.setDate(weekAgo.getDate() - 7);

    const recentlyScraped = await prisma.church.count({
      where: {
        lastScrapedAt: { gte: weekAgo },
      },
    });

    // Get scraper sources
    const bySource = await prisma.church.groupBy({
      by: ['source'],
      _count: {
        id: true,
      },
    });

    console.log('═══════════════════════════════════════════════════════════════');
    console.log('CHURCH DATA SOURCES');
    console.log('═══════════════════════════════════════════════════════════════');
    bySource.forEach((source) => {
      const percent = ((source._count.id / totalChurches) * 100).toFixed(1);
      console.log(`${source.source.padEnd(12)} | ${String(source._count.id).padStart(7)} churches (${percent}%)`);
    });
    console.log('');

    console.log('═══════════════════════════════════════════════════════════════');
    console.log('MASS SCHEDULE SCRAPING STATUS');
    console.log('═══════════════════════════════════════════════════════════════');
    console.log(`Total churches: ${totalChurches.toLocaleString()}`);
    console.log(`Churches with websites: ${churchesWithWebsites.toLocaleString()} (${((churchesWithWebsites / totalChurches) * 100).toFixed(1)}%)`);
    console.log(`Churches ever scraped: ${churchesScraped.toLocaleString()} (${((churchesScraped / totalChurches) * 100).toFixed(1)}%)`);
    console.log(`Churches with mass schedules: ${churchesWithSchedules.toLocaleString()} (${((churchesWithSchedules / totalChurches) * 100).toFixed(1)}%)`);
    console.log(`Total mass schedules: ${totalMassSchedules.toLocaleString()}`);
    console.log('');
    console.log(`Scraped in last 7 days: ${recentlyScraped.toLocaleString()}`);
    console.log('');

    // Average schedules per church
    if (churchesWithSchedules > 0) {
      const avgSchedules = totalMassSchedules / churchesWithSchedules;
      console.log(`Average schedules per church: ${avgSchedules.toFixed(1)} masses/week`);
      console.log('');
    }

    // Get sample of recently scraped churches
    const recentSample = await prisma.church.findMany({
      where: {
        lastScrapedAt: { not: null },
      },
      select: {
        name: true,
        city: true,
        state: true,
        country: true,
        lastScrapedAt: true,
        website: true,
        source: true,
        _count: {
          select: {
            massSchedules: true,
          },
        },
      },
      orderBy: { lastScrapedAt: 'desc' },
      take: 10,
    });

    console.log('═══════════════════════════════════════════════════════════════');
    console.log('RECENTLY SCRAPED CHURCHES (Last 10)');
    console.log('═══════════════════════════════════════════════════════════════');
    if (recentSample.length === 0) {
      console.log('No churches have been scraped yet.');
    } else {
      recentSample.forEach((church, index) => {
        const location = [church.city, church.state, church.country].filter(Boolean).join(', ');
        console.log(`${index + 1}. ${church.name} (${location})`);
        console.log(`   Source: ${church.source}`);
        console.log(`   Website: ${church.website || 'None'}`);
        console.log(`   Last scraped: ${church.lastScrapedAt?.toLocaleString() || 'Never'}`);
        console.log(`   Mass schedules: ${church._count.massSchedules}`);
        console.log('');
      });
    }

    // Churches ready to scrape (have website, not scraped)
    const readyToScrape = await prisma.church.count({
      where: {
        OR: [
          { website: { not: null } },
          { massScheduleUrl: { not: null } },
        ],
        lastScrapedAt: null,
      },
    });

    console.log('═══════════════════════════════════════════════════════════════');
    console.log('SCRAPING POTENTIAL');
    console.log('═══════════════════════════════════════════════════════════════');
    console.log(`Churches ready to scrape: ${readyToScrape.toLocaleString()}`);
    console.log(`  (have website, never scraped)`);
    console.log('');

  } catch (error) {
    console.error('Error:', error);
    process.exit(1);
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }
}

checkScraperStatus();
47
scripts/debug/compare-schemas.ts
Normal file
@@ -0,0 +1,47 @@
import { Pool } from 'pg';

async function getColumns(pool: Pool, table: string) {
  const result = await pool.query(
    `SELECT column_name, data_type FROM information_schema.columns WHERE table_name = $1 ORDER BY ordinal_position`,
    [table]
  );
  return result.rows;
}

async function run() {
  // Credentials must not be committed to the repository, so both connection
  // strings are read from the environment. NAS_DATABASE_URL and
  // NEON_DATABASE_URL are assumed variable names, not existing configuration.
  const nasUrl = process.env.NAS_DATABASE_URL;
  const neonUrl = process.env.NEON_DATABASE_URL;
  if (!nasUrl || !neonUrl) {
    throw new Error('NAS_DATABASE_URL and NEON_DATABASE_URL must both be set');
  }

  const nas = new Pool({ connectionString: nasUrl });
  const neon = new Pool({
    connectionString: neonUrl,
    ssl: { rejectUnauthorized: false },
  });

  for (const table of ['churches', 'mass_schedules', 'confession_schedules', 'adoration_schedules']) {
    const nasCols = await getColumns(nas, table);
    const neonCols = await getColumns(neon, table);

    const nasNames = new Set(nasCols.map((c) => c.column_name));
    const neonNames = new Set(neonCols.map((c) => c.column_name));

    const onlyNas = nasCols.filter((c) => !neonNames.has(c.column_name));
    const onlyNeon = neonCols.filter((c) => !nasNames.has(c.column_name));

    if (onlyNas.length > 0 || onlyNeon.length > 0) {
      console.log(`\n=== ${table} ===`);
      if (onlyNas.length) {
        console.log('  NAS only:');
        for (const c of onlyNas) console.log(`    - ${c.column_name} (${c.data_type})`);
      }
      if (onlyNeon.length) {
        console.log('  Neon only:');
        for (const c of onlyNeon) console.log(`    - ${c.column_name} (${c.data_type})`);
      }
    } else {
      console.log(`\n=== ${table} === (schemas match)`);
    }
  }

  await nas.end();
  await neon.end();
}

run().catch((e) => { console.error(e); process.exit(1); });
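The comparison at the heart of `compare-schemas.ts` is a two-way set difference on column names, which can be exercised without a database. The sketch below is illustrative: `diffColumns` and `ColumnInfo` are hypothetical names, and the sample rows are invented to imitate `information_schema.columns` results.

```typescript
// Hypothetical pure version of the NAS/Neon column comparison.
interface ColumnInfo {
  column_name: string;
  data_type: string;
}

/** Returns the columns present on only one side, compared by name. */
function diffColumns(left: ColumnInfo[], right: ColumnInfo[]) {
  const leftNames = new Set(left.map((c) => c.column_name));
  const rightNames = new Set(right.map((c) => c.column_name));
  return {
    onlyLeft: left.filter((c) => !rightNames.has(c.column_name)),
    onlyRight: right.filter((c) => !leftNames.has(c.column_name)),
  };
}

// Invented sample rows for illustration only.
const nasSample: ColumnInfo[] = [
  { column_name: 'id', data_type: 'text' },
  { column_name: 'has_website', data_type: 'boolean' },
];
const neonSample: ColumnInfo[] = [
  { column_name: 'id', data_type: 'text' },
  { column_name: 'timezone', data_type: 'text' },
];
const schemaDiff = diffColumns(nasSample, neonSample);
console.log(schemaDiff);
```

As in the script, only names are compared, so a column that exists on both sides with different `data_type` values is reported as matching. The script's query also filters on `table_name` alone, so same-named tables in other schemas would be mixed into the result.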
48
scripts/debug/data-overview.ts
Normal file
@@ -0,0 +1,48 @@
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function main() {
  const c = await pool.connect();

  const total = await c.query('SELECT count(*) FROM "Church"');
  console.log('\n=== DATABASE OVERVIEW ===');
  console.log('Churches total:', Number(total.rows[0].count).toLocaleString());

  const withWebsite = await c.query('SELECT count(*) FROM "Church" WHERE website IS NOT NULL');
  console.log('With website:', Number(withWebsite.rows[0].count).toLocaleString());

  const withSchedules = await c.query('SELECT count(DISTINCT "churchId") FROM "MassSchedule"');
  console.log('With mass schedules:', Number(withSchedules.rows[0].count).toLocaleString());

  const enrichedGoogle = await c.query('SELECT count(*) FROM "Church" WHERE "googlePlaceId" IS NOT NULL');
  console.log('Google Places enriched:', Number(enrichedGoogle.rows[0].count).toLocaleString());

  const totalSchedules = await c.query('SELECT count(*) FROM "MassSchedule"');
  console.log('Total mass schedules:', Number(totalSchedules.rows[0].count).toLocaleString());

  const countries = await c.query('SELECT country, count(*) as cnt FROM "Church" GROUP BY country ORDER BY cnt DESC LIMIT 15');
  console.log('\n=== TOP COUNTRIES ===');
  for (const r of countries.rows) console.log('  ' + (r.country || '(null)') + ':', Number(r.cnt).toLocaleString());

  const sources = await c.query('SELECT source, count(*) as cnt FROM "Church" GROUP BY source ORDER BY cnt DESC LIMIT 10');
  console.log('\n=== CHURCH SOURCES ===');
  for (const r of sources.rows) console.log('  ' + (r.source || '(null)') + ':', Number(r.cnt).toLocaleString());

  const lastScrape = await c.query('SELECT "lastScrapedAt" FROM "Church" WHERE "lastScrapedAt" IS NOT NULL ORDER BY "lastScrapedAt" DESC LIMIT 1');
  console.log('\n=== LAST SCRAPE ===');
  console.log(lastScrape.rows[0]?.lastScrapedAt || 'No scrapes yet');

  const jobs = await c.query('SELECT status, count(*) as cnt FROM "ScrapeJob" GROUP BY status ORDER BY cnt DESC');
  console.log('\n=== JOB STATUS ===');
  for (const r of jobs.rows) console.log('  ' + r.status + ':', Number(r.cnt).toLocaleString());

  const schedulesByLang = await c.query('SELECT language, count(*) as cnt FROM "MassSchedule" GROUP BY language ORDER BY cnt DESC LIMIT 10');
  console.log('\n=== SCHEDULES BY LANGUAGE ===');
  for (const r of schedulesByLang.rows) console.log('  ' + (r.language || '(null)') + ':', Number(r.cnt).toLocaleString());

  c.release();
  await pool.end();
}

main().catch(e => { console.error(e.message); process.exit(1); });
58
scripts/debug/debug-french-page.ts
Normal file
@@ -0,0 +1,58 @@
#!/usr/bin/env tsx
/**
 * Debug a specific French page to see why scraping failed
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function debugPage() {
  const url = 'https://www.chemin-neuf.fr/'; // Last failed church
  console.log(`Debugging: ${url}\n`);

  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('FR');

  const result = await scraper.scrape(url);

  console.log(`Success: ${result.success}`);
  console.log(`Schedules found: ${result.schedules.length}`);
  if (result.error) console.log(`Error: ${result.error}`);

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    console.log('\n=== Page Text Sample (first 2000 chars) ===');
    console.log(text.substring(0, 2000));
    console.log('\n');

    // Check for French day names
    const frenchDays = ['dimanche', 'lundi', 'mardi', 'mercredi', 'jeudi', 'vendredi', 'samedi'];
    console.log('=== French day names found ===');
    for (const day of frenchDays) {
      if (text.includes(day)) {
        console.log(`✓ Found: ${day}`);
      }
    }

    // Check for time patterns
    console.log('\n=== Time patterns (sample) ===');
    const timeRegex = /\d{1,2}[h:\.]\s*\d{0,2}\s*(?:AM|PM|am|pm|Uhr|uur|h)?/g;
    const times = text.match(timeRegex);
    if (times) {
      console.log(`Found ${times.length} time-like patterns:`);
      console.log(times.slice(0, 20).join(', '));
    } else {
      console.log('No time patterns found');
    }
  }

  await scraper.close();
}

debugPage().catch(console.error);
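The time-pattern check above can be tried on a short string to see what the regular expression actually captures. This is a small sketch, not repository code: `extractTimes` is a hypothetical name and the sample sentence is invented; the pattern itself is copied from the script.

```typescript
// Same pattern as in debug-french-page.ts.
const TIME_REGEX = /\d{1,2}[h:\.]\s*\d{0,2}\s*(?:AM|PM|am|pm|Uhr|uur|h)?/g;

/** Hypothetical helper: returns all time-like substrings, trimmed. */
function extractTimes(text: string): string[] {
  // String.prototype.match with a global regex returns null when nothing matches.
  const found = text.match(TIME_REGEX);
  return found ? found.map((t) => t.trim()) : [];
}

// Invented sample text for illustration.
const sampleTimes = extractTimes('messe dimanche 10h30 et 18:00');
console.log(sampleTimes);
```

The trailing `\s*` in the pattern means a raw match can carry a space after the digits, which is why the sketch trims each result. The script lowercases the page text first, so the uppercase `AM`, `PM` and `Uhr` alternatives never match there.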
65
scripts/debug/debug-german-duplicates.ts
Normal file
@@ -0,0 +1,65 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Debug why German church has duplicate schedules
|
||||
*/
|
||||
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
// Temporarily patch GenericScraper to log sections
|
||||
const originalParse = GenericScraper.prototype['parseSchedules'];
|
||||
GenericScraper.prototype['parseSchedules'] = function(html: string) {
|
||||
const text = html
|
||||
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
|
||||
.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
|
||||
.replace(/<[^>]+>/g, ' ')
|
||||
.replace(/\s+/g, ' ')
|
||||
.toLowerCase();
|
||||
|
||||
// Call findScheduleSections and log result
|
||||
const sections = this['findScheduleSections'](text);
|
||||
|
||||
console.log('\n=== Sections found ===\n');
|
||||
const dayNames = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'];
|
||||
sections.forEach((section: any, i: number) => {
|
||||
console.log(`Section ${i + 1}: ${dayNames[section.day]} (day ${section.day})`);
|
||||
console.log(` Text preview: "${section.text.substring(0, 100)}..."`);
|
||||
});
|
||||
console.log(`\nTotal sections: ${sections.length}\n`);
|
||||
|
||||
// Continue with normal processing
|
||||
const result = originalParse.call(this, html);
|
||||
|
||||
console.log(`\n=== Extracted times per section ===\n`);
|
||||
const schedsByDay: Record<number, typeof result> = {};
|
||||
for (const sched of result) {
|
||||
if (!schedsByDay[sched.dayOfWeek]) schedsByDay[sched.dayOfWeek] = [];
|
||||
schedsByDay[sched.dayOfWeek].push(sched);
|
||||
}
|
||||
|
||||
for (let i = 0; i < 7; i++) {
|
||||
if (schedsByDay[i]) {
|
||||
console.log(`${dayNames[i]}: ${schedsByDay[i].map(s => s.time).join(', ')}`);
|
||||
}
|
||||
}
|
||||
|
||||
return result;
|
||||
};
|
||||
|
||||
async function testGerman() {
|
||||
const url = 'https://www.alterpeter.de/';
|
||||
console.log(`Testing: ${url}`);
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
scraper.setCountry('DE');
|
||||
|
||||
const result = await scraper.scrape(url);
|
||||
|
||||
console.log(`\n=== Final Result ===`);
|
||||
console.log(`Success: ${result.success}`);
|
||||
console.log(`Total schedules: ${result.schedules.length}`);
|
||||
|
||||
await scraper.close();
|
||||
}
|
||||
|
||||
testGerman().catch(console.error);
|
||||
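The script above swaps a prototype method for a logging wrapper and then delegates to the original. A minimal, self-contained sketch of that pattern follows. `DemoParser` and `patchParseWithLog` are hypothetical names used only for illustration (they are not part of ScraperControl), and the sketch also returns a restore function, which the script above does not do.

```typescript
// Hypothetical stand-in for a scraper class; NOT the real GenericScraper.
class DemoParser {
  parse(input: string): string[] {
    return input
      .split(',')
      .map(s => s.trim())
      .filter(s => s.length > 0);
  }
}

// Replace `parse` on the prototype with a wrapper that records each call and
// then delegates to the original. Returns a function that undoes the patch.
function patchParseWithLog(log: string[]): () => void {
  const original = DemoParser.prototype.parse;
  DemoParser.prototype.parse = function (this: DemoParser, input: string): string[] {
    const result = original.call(this, input);
    log.push(`parse(${JSON.stringify(input)}) -> ${result.length} items`);
    return result;
  };
  return () => {
    DemoParser.prototype.parse = original;
  };
}
```

Keeping a reference to the original and restoring it afterwards avoids leaking the patch into other code that shares the same module instance.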
44
scripts/debug/debug-masstimes.ts
Normal file
@@ -0,0 +1,44 @@
import { chromium } from 'playwright';

async function main() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const url = 'https://masstimes.org/search?lat=32.7765&lng=-79.9311&type=parish';
  console.log('Loading:', url);

  await page.goto(url, { waitUntil: 'networkidle', timeout: 60000 });

  // Wait for Angular to render
  await page.waitForTimeout(5000);

  // Take screenshot
  await page.screenshot({ path: '/tmp/masstimes-debug.png', fullPage: true });
  console.log('Screenshot saved to /tmp/masstimes-debug.png');

  // Get page HTML
  const html = await page.content();
  console.log('\n--- PAGE HTML (first 5000 chars) ---\n');
  console.log(html.substring(0, 5000));

  // Try to find any visible text that looks like church names
  const visibleText = await page.evaluate(() => {
    const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
    const texts: string[] = [];
    let node;
    while ((node = walker.nextNode())) {
      const text = node.textContent?.trim();
      if (text && text.length > 10 && text.length < 100) {
        texts.push(text);
      }
    }
    return texts.slice(0, 50);
  });

  console.log('\n--- VISIBLE TEXT SNIPPETS ---\n');
  visibleText.forEach((t, i) => console.log(`${i + 1}. ${t}`));

  await browser.close();
}

main().catch(console.error);
74
scripts/debug/debug-paroquia-paz.ts
Normal file
@@ -0,0 +1,74 @@
#!/usr/bin/env tsx
/**
 * Deep dive into Paróquia da Paz parsing bug
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function debugPaz() {
  const url = 'https://www.paroquiadapaz.org.br/';
  console.log(`Debugging: ${url}\n`);

  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('BR');

  const result = await scraper.scrape(url);

  console.log(`Success: ${result.success}`);
  console.log(`Schedules: ${result.schedules.length}\n`);

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Find where days appear
    console.log('=== Finding day + time patterns ===\n');

    const days = ['domingo', 'segunda', 'terça', 'terca', 'quarta', 'quinta', 'sexta', 'sábado', 'sabado'];

    for (const day of days) {
      const dayIndex = text.indexOf(day);
      if (dayIndex !== -1) {
        // Show context around the day (100 chars before and 200 after)
        const before = Math.max(0, dayIndex - 100);
        const after = Math.min(text.length, dayIndex + 200);
        const snippet = text.substring(before, after);

        console.log(`${day.toUpperCase()}:`);
        console.log(` Position: ${dayIndex}`);
        console.log(` Context: ...${snippet}...`);
        console.log('');
      }
    }

    // Check for "h" time format specifically
    console.log('\n=== Checking "h" time format ===');
    const hTimeRegex = /(\d{1,2})h(\d{2})?/g;
    const hTimes = text.match(hTimeRegex);
    if (hTimes) {
      console.log(`Found ${hTimes.length} "h" format times:`);
      console.log(hTimes.slice(0, 30).join(', '));
    }

    // Look for schedule structure
    console.log('\n=== Looking for schedule structure ===');
    const scheduleKeywords = ['horário', 'horario', 'missa', 'missas', 'santa missa'];
    for (const keyword of scheduleKeywords) {
      const index = text.indexOf(keyword);
      if (index !== -1) {
        const snippet = text.substring(index, Math.min(text.length, index + 500));
        console.log(`\nFound "${keyword}" at position ${index}:`);
        console.log(snippet.substring(0, 300));
      }
    }
  }

  await scraper.close();
}

debugPaz().catch(console.error);
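The "h" time check above only lists raw matches such as `16h` or `8h30`. As a rough sketch of how such matches could be normalized to `HH:MM`, the helper below adds a word boundary and range checks. `normalizeHTimes` and `pad2` are hypothetical names; how the real parser converts these values is not shown in this diff, so treat this as an illustration under those assumptions.

```typescript
// Left-pad a clock component to two digits without relying on padStart.
function pad2(n: number): string {
  return (n < 10 ? '0' : '') + n;
}

// Convert "h"-style times ("16h", "8h30") found in `text` to "HH:MM".
// Assumption: an hour without minutes means ":00".
function normalizeHTimes(text: string): string[] {
  // \b keeps the match from starting in the middle of a longer number;
  // (?!\d) rejects partial minute groups such as "8h3".
  const pattern = /\b(\d{1,2})h(\d{2})?(?!\d)/g;
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    const hour = parseInt(m[1], 10);
    const minute = m[2] ? parseInt(m[2], 10) : 0;
    if (hour > 23 || minute > 59) continue; // skip implausible clock values
    out.push(pad2(hour) + ':' + pad2(minute));
  }
  return out;
}
```

For example, `normalizeHTimes('sábados: 8h, 16h e 18h.')` yields `['08:00', '16:00', '18:00']`.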
150
scripts/debug/debug-parsing-bugs.ts
Normal file
@@ -0,0 +1,150 @@
#!/usr/bin/env tsx
/**
 * Debug the 5 parsing bugs identified in top 5 test
 */

import { config } from 'dotenv';
config({ path: '.env.local' });
config({ path: '.env' });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import { GenericScraper } from '../../src/scrapers/strategies/generic';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

// The churches with parsing bugs
const BUG_CHURCHES = [
  { name: 'St. Marien', country: 'DE', searchTerm: 'St. Marien' },
  { name: 'Santuario de Manalagua', country: 'ES', searchTerm: 'Santuario de Manalagua' },
  { name: 'Kościół pw. Najświętszego Serca', country: 'PL', searchTerm: 'Najświętszego Serca Pana Jez' },
  { name: 'Paróquia de Nossa Senhora do Desterro', country: 'BR', searchTerm: 'Nossa Senhora do Desterro' },
  { name: 'Paróquia da Paz', country: 'BR', searchTerm: 'Paróquia da Paz' },
];

async function debugBugs() {
  console.log('Debugging parsing bugs...\n');

  const scraper = new GenericScraper();
  await scraper.init();

  for (const bug of BUG_CHURCHES) {
    console.log('═'.repeat(80));
    console.log(`BUG: ${bug.name} (${bug.country})`);
    console.log('═'.repeat(80));

    const church = await prisma.church.findFirst({
      where: {
        country: bug.country,
        name: { contains: bug.searchTerm },
        website: { not: null },
      },
    });

    if (!church) {
      console.log(`❌ Church not found in database\n`);
      continue;
    }

    console.log(`Church: ${church.name}`);
    console.log(`URL: ${church.website}\n`);

    scraper.setCountry(bug.country);

    try {
      const result = await scraper.scrape(church.website!);

      console.log(`Success: ${result.success}`);
      console.log(`Schedules found: ${result.schedules.length}`);
      if (result.error) console.log(`Error: ${result.error}`);

      if (result.rawHtml) {
        const text = result.rawHtml
          .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
          .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
          .replace(/<[^>]+>/g, ' ')
          .replace(/\s+/g, ' ')
          .toLowerCase();

        console.log('\n--- Text Sample (first 1000 chars) ---');
        console.log(text.substring(0, 1000));

        // Check for day names
        console.log('\n--- Day Names Found ---');
        const dayPatterns: Record<string, string[]> = {
          DE: ['sonntag', 'montag', 'dienstag', 'mittwoch', 'donnerstag', 'freitag', 'samstag'],
          ES: ['domingo', 'lunes', 'martes', 'miércoles', 'miercoles', 'jueves', 'viernes', 'sábado', 'sabado'],
          PL: ['niedziela', 'poniedziałek', 'poniedzialek', 'wtorek', 'środa', 'sroda', 'czwartek', 'piątek', 'piatek', 'sobota'],
          BR: ['domingo', 'segunda', 'terça', 'terca', 'quarta', 'quinta', 'sexta', 'sábado', 'sabado'],
        };

        const days = dayPatterns[bug.country] || [];
        const foundDays: string[] = [];
        for (const day of days) {
          if (text.includes(day)) {
            foundDays.push(day);
          }
        }
        console.log(`Found: ${foundDays.join(', ') || 'none'}`);

        // Check for time patterns
        console.log('\n--- Time Patterns Found ---');
        const timeRegex = /\d{1,2}[h:\.]\s*\d{0,2}\s*(?:h|uhr)?/gi;
        const times = text.match(timeRegex);
        if (times) {
          const uniqueTimes = [...new Set(times)].slice(0, 20);
          console.log(`Found ${times.length} time patterns (showing first 20 unique):`);
          console.log(uniqueTimes.join(', '));
        } else {
          console.log('No time patterns found');
        }

        // Look for specific mass schedule keywords
        console.log('\n--- Mass Schedule Keywords ---');
        const keywords: Record<string, string[]> = {
          DE: ['gottesdienst', 'messe', 'heilige messe', 'messzeiten'],
          ES: ['misa', 'horario', 'eucaristía', 'eucaristia'],
          PL: ['msza', 'msze', 'nabożeństwo', 'nabozenstwo'],
          BR: ['missa', 'horário', 'horario', 'eucaristia'],
        };

        const countryKeywords = keywords[bug.country] || [];
        const foundKeywords: string[] = [];
        for (const keyword of countryKeywords) {
          if (text.includes(keyword)) {
            foundKeywords.push(keyword);
          }
        }
        console.log(`Found: ${foundKeywords.join(', ') || 'none'}`);

        // Look for specific problematic patterns
        console.log('\n--- Looking for edge cases ---');

        // Check if the first time match precedes the first matched day name (only meaningful when both exist)
        const hasTimeBeforeDays = foundDays.length > 0 && !!times && text.indexOf(times[0]) < text.indexOf(foundDays[0]);
        console.log(`Times come before days: ${hasTimeBeforeDays ? 'YES (potential issue)' : 'no'}`);

        // Check for table structures (attribute names survive only in the raw HTML, not in the stripped text)
        const hasTables = /colspan|rowspan/i.test(result.rawHtml) || (text.match(/\s+\|\s+/g)?.length || 0) > 5;
        console.log(`Likely table format: ${hasTables ? 'YES (may need special handling)' : 'no'}`);

        // Check for multiple languages on same page
        const hasMultiLang = (text.match(/english|español|espanol|portuguese|português|portugues|deutsch|polski/gi)?.length || 0) > 1;
        console.log(`Multiple languages: ${hasMultiLang ? 'YES (may confuse parser)' : 'no'}`);
      }

      console.log('\n');
    } catch (err: any) {
      console.log(`❌ ERROR: ${err.message}\n`);
    }
  }

  await scraper.close();
  await prisma.$disconnect();
  await pool.end();
}

debugBugs().catch(console.error);
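Several of these debug scripts repeat the same chain of `replace` calls to turn raw HTML into lowercase search text. A minimal sketch of that chain as one helper is shown below. `htmlToSearchText` is a hypothetical name, not an existing export of the repository.

```typescript
// Reduce raw HTML to lowercase, whitespace-collapsed text for substring and
// regex probing. Mirrors the replace chain used in the debug scripts.
function htmlToSearchText(html: string): string {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') // drop inline scripts
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')   // drop inline styles
    .replace(/<[^>]+>/g, ' ')                          // every remaining tag becomes a space
    .replace(/\s+/g, ' ')                              // collapse whitespace runs
    .toLowerCase();
}
```

Because whole tags are replaced, attribute names such as `colspan` never reach the resulting text, so any check for markup structure has to run against the raw HTML. The result is not trimmed, which means a leading or trailing space can remain.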
98
scripts/debug/debug-paz-full-flow.ts
Normal file
@@ -0,0 +1,98 @@
#!/usr/bin/env tsx
/**
 * Debug the full parsing flow with section detection
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';
import { getDayNamesForCountry, buildDayPatterns } from '../../src/scrapers/i18n/day-names';

async function debugFullFlow() {
  const url = 'https://www.paroquiadapaz.org.br/';
  console.log(`Debugging: ${url}\n`);

  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('BR');

  const result = await scraper.scrape(url);

  if (!result.rawHtml) {
    console.log('No HTML received');
    await scraper.close();
    return;
  }

  const text = result.rawHtml
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .toLowerCase();

  // Find the schedule section
  const scheduleIndex = text.indexOf('segundas, terças');
  if (scheduleIndex === -1) {
    console.log('Schedule text not found!');
    await scraper.close();
    return;
  }

  const snippet = text.substring(scheduleIndex, scheduleIndex + 500);
  console.log('Schedule snippet from actual HTML:');
  console.log(snippet);
  console.log('\n');

  // Now test section matching on actual text
  const dayConfigs = getDayNamesForCountry('BR');
  const dayPatterns = buildDayPatterns(dayConfigs);
  const sortedDayNames = Object.keys(dayPatterns).sort((a, b) => b.length - a.length);
  const allDayNamesPattern = sortedDayNames.map(d => d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|');

  console.log('=== Testing sábados and domingos matches ===\n');

  // Test sábados
  const sabadosRegex = new RegExp(
    `(?:^|\\s|[,;:])sábados[:\\s]+([^]*?)(?=${allDayNamesPattern}|$)`,
    'i'
  );
  const sabadosMatch = snippet.match(sabadosRegex);
  console.log('sábados match:', sabadosMatch ? `Found: "${sabadosMatch[1].substring(0, 50)}"` : 'Not found');

  // Test sabados (no accent)
  const sabadosRegex2 = new RegExp(
    `(?:^|\\s|[,;:])sabados[:\\s]+([^]*?)(?=${allDayNamesPattern}|$)`,
    'i'
  );
  const sabadosMatch2 = snippet.match(sabadosRegex2);
  console.log('sabados match:', sabadosMatch2 ? `Found: "${sabadosMatch2[1].substring(0, 50)}"` : 'Not found');

  // Test domingos
  const domingosRegex = new RegExp(
    `(?:^|\\s|[,;:])domingos[:\\s]+([^]*?)(?=${allDayNamesPattern}|$)`,
    'i'
  );
  const domingosMatch = snippet.match(domingosRegex);
  console.log('domingos match:', domingosMatch ? `Found: "${domingosMatch[1].substring(0, 50)}"` : 'Not found');

  console.log('\n=== Final parsed schedules ===\n');
  console.log(`Total: ${result.schedules.length}`);

  const byDay: Record<number, typeof result.schedules> = {};
  for (const sched of result.schedules) {
    if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
    byDay[sched.dayOfWeek].push(sched);
  }

  const dayNames = ['Domingo', 'Segunda', 'Terça', 'Quarta', 'Quinta', 'Sexta', 'Sábado'];
  for (let i = 0; i < 7; i++) {
    if (byDay[i]) {
      console.log(`${dayNames[i]}: ${byDay[i].length} schedules`);
    } else {
      console.log(`${dayNames[i]}: 0 schedules ❌`);
    }
  }

  await scraper.close();
}

debugFullFlow().catch(console.error);
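The per-day summary at the end of the script above groups parsed schedules by `dayOfWeek` and flags days with no entries. A small sketch of that grouping as reusable functions follows. `DemoSchedule`, `groupByDay`, and `missingDays` are hypothetical names, and the schedule shape is assumed from the two fields the scripts read (`dayOfWeek`, `time`).

```typescript
// Assumed minimal shape; the real schedule type likely has more fields.
interface DemoSchedule {
  dayOfWeek: number; // 0 = Sunday ... 6 = Saturday, as in the scripts above
  time: string;
}

// Bucket schedules by day index.
function groupByDay(schedules: DemoSchedule[]): Record<number, DemoSchedule[]> {
  const byDay: Record<number, DemoSchedule[]> = {};
  for (const s of schedules) {
    if (!byDay[s.dayOfWeek]) byDay[s.dayOfWeek] = [];
    byDay[s.dayOfWeek].push(s);
  }
  return byDay;
}

// List the day indexes (0-6) that received no schedules at all.
function missingDays(byDay: Record<number, DemoSchedule[]>): number[] {
  const missing: number[] = [];
  for (let i = 0; i < 7; i++) {
    if (!byDay[i]) missing.push(i);
  }
  return missing;
}
```

Returning the missing days as data, rather than only printing them, makes it possible to assert on parser regressions in a test.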
56
scripts/debug/debug-paz-sections.ts
Normal file
@@ -0,0 +1,56 @@
#!/usr/bin/env tsx
/**
 * Debug which sections are being found
 */

import { getDayNamesForCountry, buildDayPatterns } from '../../src/scrapers/i18n/day-names';

// Simulate the exact text from the page
const scheduleText = `
horário das missas igreja matriz de santo antônio
segundas, terças, quartas e sextas-feiras: 16h e 18h.
quintas-feiras: 16h e 19h (adoração ao santíssimo – 18h).
sábados: 8h, 16h e 18h.
domingos: 8h, 11h, 16h, 18h e 20h.
`.toLowerCase();

console.log('Text to parse:');
console.log(scheduleText);
console.log('');

const dayConfigs = getDayNamesForCountry('BR');
const dayPatterns = buildDayPatterns(dayConfigs);
const sortedDayNames = Object.keys(dayPatterns).sort((a, b) => b.length - a.length);
const allDayNamesPattern = sortedDayNames.map(d => d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|');

console.log('=== COMMA-SEPARATED GROUP MATCHING ===\n');

const dayGroupRegex = new RegExp(
  `((?:${allDayNamesPattern})(?:[,\\s]+(?:e|and|et|und|y)?\\s*(?:${allDayNamesPattern}))+)[:\\s]+([^]*?)(?=(?:${allDayNamesPattern})|$)`,
  'gi'
);

let groupMatch;
let matchCount = 0;
while ((groupMatch = dayGroupRegex.exec(scheduleText)) !== null) {
  matchCount++;
  console.log(`Match #${matchCount}:`);
  console.log(` Day group: "${groupMatch[1]}"`);
  console.log(` Time text: "${groupMatch[2]}"`);
  console.log('');
}

console.log('=== INDIVIDUAL DAY MATCHING ===\n');

for (const [dayName, dayIndex] of Object.entries(dayPatterns)) {
  const escaped = dayName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const regex = new RegExp(
    `(?:^|\\s|[,;:])${escaped}[:\\s]+([^]*?)(?=${allDayNamesPattern}|$)`,
    'i'
  );
  const match = scheduleText.match(regex);
  if (match) {
    console.log(`Found ${dayName} (day ${dayIndex}):`);
    console.log(` Time text: "${match[1].substring(0, 100)}"`);
  }
}
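The scripts build `allDayNamesPattern` by sorting day names longest-first and escaping regex metacharacters before joining with `|`. The sketch below isolates those two steps and shows why the ordering matters: a regex alternation takes the first branch that matches, so a short name listed first would shadow a longer one. `escapeRegExp` and `buildDayAlternation` are hypothetical helper names, and the day names in the usage note are illustrative rather than taken from the real `day-names` module.

```typescript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Join day names into a regex alternation, longest names first, so that
// "sexta-feira" is tried before its prefix "sexta".
function buildDayAlternation(names: string[]): string {
  return names
    .slice() // do not mutate the caller's array
    .sort((a, b) => b.length - a.length)
    .map(s => escapeRegExp(s))
    .join('|');
}
```

With `['sexta', 'sexta-feira']` as input, the resulting pattern is `sexta-feira|sexta`, and matching it against `sexta-feira: 18h` captures the full `sexta-feira`. An unsorted `sexta|sexta-feira` would stop at `sexta` and leave `-feira` in the time text.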
85
scripts/debug/debug-paz-with-logging.ts
Normal file
@@ -0,0 +1,85 @@
#!/usr/bin/env tsx
/**
 * Debug Paróquia da Paz with added logging
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';
import { getDayNamesForCountry, buildDayPatterns } from '../../src/scrapers/i18n/day-names';

async function debugPazWithLogging() {
  const url = 'https://www.paroquiadapaz.org.br/';
  console.log(`Debugging: ${url}\n`);

  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('BR');

  const result = await scraper.scrape(url);

  console.log(`Success: ${result.success}`);
  console.log(`Schedules: ${result.schedules.length}\n`);

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Test the regex pattern manually
    console.log('=== Testing comma-separated day grouping regex ===\n');

    const dayConfigs = getDayNamesForCountry('BR');
    const dayPatterns = buildDayPatterns(dayConfigs);
    const sortedDayNames = Object.keys(dayPatterns).sort((a, b) => b.length - a.length);
    const allDayNamesPattern = sortedDayNames.map(d => d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|');

    console.log('Day patterns:', Object.keys(dayPatterns).join(', '));
    console.log('');

    // The exact regex from the code
    const dayGroupRegex = new RegExp(
      `((?:${allDayNamesPattern})(?:[,\\s]+(?:e|and|et|und|y)?\\s*(?:${allDayNamesPattern}))+)[:\\s]+([^]*?)(?=(?:${allDayNamesPattern})|$)`,
      'gi'
    );

    console.log('Regex pattern:', dayGroupRegex.source.substring(0, 200) + '...\n');

    let groupMatch;
    let matchCount = 0;
    while ((groupMatch = dayGroupRegex.exec(text)) !== null) {
      matchCount++;
      console.log(`Match #${matchCount}:`);
      console.log(` Full match: "${groupMatch[0].substring(0, 100)}"`);
      console.log(` Day group: "${groupMatch[1]}"`);
      console.log(` Time text: "${groupMatch[2].substring(0, 50)}"`);
      console.log('');
    }

    if (matchCount === 0) {
      console.log('No matches found!\n');

      // Try to find the schedule text manually
      const scheduleIndex = text.indexOf('segundas, terças');
      if (scheduleIndex !== -1) {
        const snippet = text.substring(scheduleIndex, scheduleIndex + 300);
        console.log('Found schedule text at position', scheduleIndex);
        console.log('Snippet:', snippet);
        console.log('');

        // Test if individual day names are matching
        console.log('Testing individual day name matches in snippet:');
        for (const dayName of sortedDayNames.slice(0, 10)) {
          if (snippet.includes(dayName)) {
            console.log(` ✓ Found: ${dayName}`);
          }
        }
      }
    }
  }

  await scraper.close();
}

debugPazWithLogging().catch(console.error);
85
scripts/debug/debug-polish-church.ts
Normal file
@@ -0,0 +1,85 @@
#!/usr/bin/env tsx
/**
 * Debug Polish church in detail
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';
import { getDayNamesForCountry, buildDayPatterns } from '../../src/scrapers/i18n/day-names';

async function debugPolish() {
  const url = 'http://parafialubojna.pl';
  console.log(`Debugging: ${url}\n`);

  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('PL');

  const result = await scraper.scrape(url);

  console.log(`Success: ${result.success}`);
  console.log(`Schedules found: ${result.schedules.length}\n`);

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Find the schedule section (indexOf returns -1 when missing, which is truthy, so `||` cannot express the fallback)
    const scheduleIndex = [text.indexOf('msze święte'), text.indexOf('msze swiete')].find(i => i !== -1) ?? -1;
    if (scheduleIndex !== -1) {
      const snippet = text.substring(scheduleIndex, scheduleIndex + 500);
      console.log('Schedule section:');
      console.log(snippet);
      console.log('\n');

      // Test all time pattern matches
      console.log('=== Testing time pattern matches ===\n');

      // Space separator pattern
      const spacePattern = /\b(\d{1,2})\s+(\d{2})(?!\d)/g;
      const spaceMatches = snippet.match(spacePattern);
      console.log('Space-separated times (8 00, 9 30):');
      console.log(spaceMatches ? spaceMatches.join(', ') : 'none');
      console.log('');

      // Colon pattern
      const colonPattern = /\d{1,2}:\d{2}/g;
      const colonMatches = snippet.match(colonPattern);
      console.log('Colon times (8:00, 9:30):');
      console.log(colonMatches ? colonMatches.join(', ') : 'none');
      console.log('');

      // Polish day names
      console.log('=== Polish day names in snippet ===\n');
      const dayConfigs = getDayNamesForCountry('PL');
      const dayPatterns = buildDayPatterns(dayConfigs);

      for (const [dayName, dayNum] of Object.entries(dayPatterns)) {
        if (snippet.includes(dayName)) {
          console.log(`Found: ${dayName} (day ${dayNum})`);
        }
      }
    }
  }

  console.log('\n=== Parsed schedules ===\n');
  const byDay: Record<number, typeof result.schedules> = {};
  for (const sched of result.schedules) {
    if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
    byDay[sched.dayOfWeek].push(sched);
  }

  const dayNames = ['Niedziela', 'Poniedziałek', 'Wtorek', 'Środa', 'Czwartek', 'Piątek', 'Sobota'];
  for (let i = 0; i < 7; i++) {
    if (byDay[i]) {
      console.log(`${dayNames[i]}: ${byDay[i].map(s => s.time).join(', ')}`);
    }
  }

  await scraper.close();
}

debugPolish().catch(console.error);
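When a script looks for the first of several spellings (accented and unaccented) in page text, each `indexOf` result has to be compared against `-1` explicitly. A minimal sketch of such a lookup follows. `firstIndexOfAny` is a hypothetical helper name, not something defined in the repository.

```typescript
// Return the position of the first needle (in list order) that occurs in
// `text`, or -1 when none of them occurs.
function firstIndexOfAny(text: string, needles: string[]): number {
  for (const needle of needles) {
    const idx = text.indexOf(needle);
    if (idx !== -1) return idx;
  }
  return -1;
}
```

Chaining the lookups with `||` does not work for this purpose: a miss returns `-1`, which is truthy, so the second lookup is never evaluated, while a hit at position `0` is falsy and wrongly falls through to it.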
79
scripts/debug/debug-polish-sunday-monday.ts
Normal file
@@ -0,0 +1,79 @@
#!/usr/bin/env tsx
/**
 * Debug why Sunday and Monday aren't parsing for Polish church
 */

import { getDayNamesForCountry, buildDayPatterns } from '../../src/scrapers/i18n/day-names';

// Exact schedule text from website
const text = `msze święte niedziela i uroczystości: 8 00 , 9 30 (lubojenka), 11 00 , 16 00 w lipcu i sierpniu nie ma mszy popołudniowej!--> dni powszednie: poniedziałek: godz. 8 00 wtorek - sobota: godz. 18 00`.toLowerCase();

console.log('Text to parse:');
console.log(text);
console.log('\n');

const dayConfigs = getDayNamesForCountry('PL');
const dayPatterns = buildDayPatterns(dayConfigs);
const sortedDayNames = Object.keys(dayPatterns).sort((a, b) => b.length - a.length);
const allDayNamesPattern = sortedDayNames.map(d => d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|');

console.log('=== Testing niedziela (Sunday) ===\n');

// Current regex pattern
const niedziela = 'niedziela';
const escaped = niedziela.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const regex = new RegExp(
  `(?:^|\\s|[,;:])${escaped}(?:(?:[^:]{1,50})?:|\\s+)([^]*?)(?=${allDayNamesPattern}|$)`,
  'i'
);

const match = text.match(regex);
if (match) {
  console.log(`✓ Matched!`);
  console.log(` Full match: "${match[0].substring(0, 100)}"`);
  console.log(` Captured text: "${match[1].substring(0, 100)}"`);
  console.log('');

  // Check if times can be extracted
  const spacePattern = /\b(\d{1,2})\s+(\d{2})(?!\d)/g;
  const times = match[1].match(spacePattern);
  console.log(` Times found: ${times ? times.join(', ') : 'none'}`);
} else {
  console.log(`✗ NOT matched`);
}

console.log('\n=== Testing poniedziałek (Monday) ===\n');

const ponieRegex = new RegExp(
  `(?:^|\\s|[,;:])poniedziałek(?:(?:[^:]{1,50})?:|\\s+)([^]*?)(?=${allDayNamesPattern}|$)`,
  'i'
);

const ponieMatch = text.match(ponieRegex);
if (ponieMatch) {
  console.log(`✓ Matched!`);
  console.log(` Full match: "${ponieMatch[0].substring(0, 100)}"`);
  console.log(` Captured text: "${ponieMatch[1].substring(0, 100)}"`);
  console.log('');

  const times = ponieMatch[1].match(/\b(\d{1,2})\s+(\d{2})(?!\d)/g);
  console.log(` Times found: ${times ? times.join(', ') : 'none'}`);
} else {
  console.log(`✗ NOT matched`);
}

console.log('\n=== Analyzing why niedziela might fail ===\n');

// The issue might be "niedziela i uroczystości:" - the phrase is long
// Check if the lookahead is hitting "uroczystości" before getting to the times
const niedziela_index = text.indexOf('niedziela');
const next_day_index = Math.min(
  ...sortedDayNames
    .filter(d => d !== 'niedziela')
    .map(d => text.indexOf(d, niedziela_index))
    .filter(i => i > 0)
);

console.log(`niedziela position: ${niedziela_index}`);
console.log(`Next day name position: ${next_day_index}`);
console.log(`Text between: "${text.substring(niedziela_index, next_day_index)}"`);
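The Polish page writes times with a space between hour and minute (`8 00`, `9 30`), which the space-separator pattern above detects. The sketch below shows one way those matches could be turned into `HH:MM` values with basic range checks. `parseSpaceSeparatedTimes` and `twoDigits` are hypothetical names, and the range checks are an assumption; the real parser's handling is not shown in this diff.

```typescript
// Left-pad a clock component to two digits.
function twoDigits(n: number): string {
  return (n < 10 ? '0' : '') + n;
}

// Extract "H MM" / "HH MM" times and format them as "HH:MM".
function parseSpaceSeparatedTimes(text: string): string[] {
  // Same shape as the debug pattern: 1-2 digit hour, whitespace, 2 digit
  // minute, and no further digit directly after the minutes.
  const pattern = /\b(\d{1,2})\s+(\d{2})(?!\d)/g;
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    const hour = parseInt(m[1], 10);
    const minute = parseInt(m[2], 10);
    if (hour > 23 || minute > 59) continue; // not a plausible clock time
    out.push(twoDigits(hour) + ':' + twoDigits(minute));
  }
  return out;
}
```

A pattern this loose also matches unrelated number pairs (for example parts of phone numbers), so it is safer to apply it only to text already attributed to a day section.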
44
scripts/debug/debug-thursday-context.ts
Normal file
@@ -0,0 +1,44 @@
#!/usr/bin/env tsx
import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function main() {
  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('DE');

  const result = await scraper.scrape('https://www.alterpeter.de/');

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Find "montag bis donnerstag" pattern
    const pattern = /montag[^]*?bis[^]*?donnerstag/gi;
    const matches = [...text.matchAll(pattern)];

    console.log(`Found ${matches.length} instances of "montag bis donnerstag":\n`);

    for (let i = 0; i < matches.length; i++) {
      const match = matches[i];
      const matchIndex = match.index || 0;
      const contextBefore = text.substring(Math.max(0, matchIndex - 150), matchIndex);
      const contextAfter = text.substring(matchIndex, Math.min(text.length, matchIndex + 250));

      console.log(`=== Instance ${i + 1} ===`);
      console.log(`Position: ${matchIndex}`);
      console.log(`\nContext BEFORE (150 chars):`);
      console.log(`"${contextBefore}"`);
      console.log(`\nContext AFTER (250 chars):`);
      console.log(`"${contextAfter}"`);
      console.log('');
    }
  }

  await scraper.close();
}

main().catch(console.error);
45
scripts/debug/debug-zero-time.ts
Normal file
@@ -0,0 +1,45 @@
#!/usr/bin/env tsx
import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function main() {
  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('DE');

  const result = await scraper.scrape('https://www.alterpeter.de/');

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Find all instances of "00 uhr" pattern
    let idx = 0;
    let count = 0;
    const pattern = /\b00\s*uhr/g;
    let match;

    console.log('Looking for "00 uhr" patterns:\n');

    while ((match = pattern.exec(text)) !== null) {
      count++;
      const matchIndex = match.index;
      const contextBefore = text.substring(Math.max(0, matchIndex - 50), matchIndex);
      const contextAfter = text.substring(matchIndex, Math.min(text.length, matchIndex + 100));

      console.log(`=== Occurrence ${count} at position ${matchIndex} ===`);
      console.log(`BEFORE: "...${contextBefore}"`);
      console.log(`MATCH + AFTER: "${contextAfter}..."`);
      console.log('');
    }

    console.log(`Total "00 uhr" occurrences: ${count}`);
  }

  await scraper.close();
}

main().catch(console.error);
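Both this script and the "montag bis donnerstag" one print a window of text before and after every regex hit. A sketch of that technique as a single reusable function follows. `findWithContext` and `MatchContext` are hypothetical names introduced for illustration.

```typescript
interface MatchContext {
  index: number;  // start offset of the match in the text
  before: string; // up to `beforeLen` characters preceding the match
  after: string;  // the match plus following text, up to `afterLen` characters
}

function findWithContext(
  text: string,
  pattern: RegExp,
  beforeLen: number,
  afterLen: number
): MatchContext[] {
  // exec() only advances through the text when the global flag is set,
  // so work on a copy of the pattern that is guaranteed to have it.
  const flags = pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags;
  const re = new RegExp(pattern.source, flags);
  const hits: MatchContext[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    hits.push({
      index: m.index,
      before: text.substring(Math.max(0, m.index - beforeLen), m.index),
      after: text.substring(m.index, Math.min(text.length, m.index + afterLen)),
    });
    // A zero-length match would never advance lastIndex; step past it.
    if (m[0].length === 0) re.lastIndex++;
  }
  return hits;
}
```

Copying the pattern with the `g` flag guards against an infinite loop when a caller passes a non-global regex, since `exec` would otherwise keep returning the first match.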
37
scripts/debug/export-de-from-neon.ts
Normal file
@@ -0,0 +1,37 @@
#!/usr/bin/env tsx
import { config } from 'dotenv';
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import fs from 'fs/promises';

config({ path: '.env.local' });

async function main() {
  console.log('📦 Exporting Germany from Neon...');

  const pool = new Pool({ connectionString: process.env.DATABASE_URL });
  const adapter = new PrismaPg(pool);
  const prisma = new PrismaClient({ adapter });

  await prisma.$connect();

  const churches = await prisma.churches.findMany({
    where: { country: 'DE' },
    include: {
      massSchedules: true,
      confessionSchedules: true,
      adorationSchedules: true,
    }
  });

  console.log(`Found ${churches.length} churches in Germany`);

  await fs.writeFile('export-DE.json', JSON.stringify(churches, null, 2));
  console.log(`✅ Exported to export-DE.json`);

  await prisma.$disconnect();
  await pool.end();
}

main().catch(console.error);
60
scripts/debug/export-from-nas.ts
Normal file
@@ -0,0 +1,60 @@
#!/usr/bin/env tsx
/**
 * Export churches from NAS database to JSON
 * Run this ON THE NAS (uses DATABASE_URL from .env)
 */

import { PrismaClient } from '@prisma/client';
import fs from 'fs/promises';

async function main() {
  const country = process.argv[2] || 'PL';

  console.log(`📦 Exporting ${country} data from database...`);
  console.log(`DATABASE_URL: ${process.env.DATABASE_URL?.replace(/:[^:@]+@/, ':****@')}`);

  const prisma = new PrismaClient();

  try {
    await prisma.$connect();
    console.log('✅ Connected to database');

    // Export churches with all schedules
    const churches = await prisma.churches.findMany({
      where: { country },
      include: {
        massSchedules: true,
        confessionSchedules: true,
        adorationSchedules: true,
      }
    });

    console.log(`Found ${churches.length} churches in ${country}`);

    // Count schedules
    const massSchedules = churches.reduce((sum, c) => sum + (c.massSchedules?.length || 0), 0);
    const confessionSchedules = churches.reduce((sum, c) => sum + (c.confessionSchedules?.length || 0), 0);
    const adorationSchedules = churches.reduce((sum, c) => sum + (c.adorationSchedules?.length || 0), 0);

    // Save to file
    const exportFile = `export-${country}.json`;
    await fs.writeFile(exportFile, JSON.stringify(churches, null, 2));

    console.log(`\n✅ Exported to ${exportFile}`);
    console.log(` - ${churches.length} churches`);
    console.log(` - ${massSchedules} mass schedules`);
    console.log(` - ${confessionSchedules} confession schedules`);
    console.log(` - ${adorationSchedules} adoration schedules`);
    console.log(`\nDownload with:`);
    console.log(` scp albert@192.168.0.145:/volume1/docker/nearestmass/${exportFile} .`);

    await prisma.$disconnect();

  } catch (error) {
    console.error('❌ Export failed:', error);
    await prisma.$disconnect();
    process.exit(1);
  }
}

main().catch(console.error);
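Several of these scripts log `DATABASE_URL` with the password masked by `replace(/:[^:@]+@/, ':****@')`. A minimal sketch of that masking as a standalone helper follows. The function name `maskDatabaseUrl` and the example URL are assumptions for illustration and are not part of the repository.

```typescript
// Hypothetical helper (not in the repository): masks the password part of a
// connection string with the same regex the debug scripts use when logging.
function maskDatabaseUrl(url: string | undefined): string {
  if (!url) return '(DATABASE_URL not set)';
  // Matches the first ":<text>@" where <text> contains no ':' or '@'.
  return url.replace(/:[^:@]+@/, ':****@');
}

// Made-up example URL, for illustration only.
console.log(maskDatabaseUrl('postgresql://user:secret@localhost:5432/db'));
```

One caveat of this pattern: for a URL that has a user but no password (`scheme://user@host`), the first match is `://user@`, so the user name and the slashes are replaced as well. That is harmless for logging, but the output is no longer a valid URL.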
230
scripts/debug/export-import-to-neon.ts
Normal file
@@ -0,0 +1,230 @@
#!/usr/bin/env tsx
/**
 * Export churches from local NAS database and import to Neon
 */

import { PrismaClient } from '@prisma/client';
import fs from 'fs/promises';
import path from 'path';

interface ExportStats {
  churches: number;
  massSchedules: number;
  confessionSchedules: number;
  adorationSchedules: number;
}

async function exportFromNAS(country: string): Promise<ExportStats> {
  console.log(`📦 Exporting ${country} data from NAS...`);

  // Set DATABASE_URL to NAS
  const originalUrl = process.env.DATABASE_URL;
  process.env.DATABASE_URL = 'postgresql://postgres:postgres@192.168.0.145:5432/nearestmass';

  const nasPrisma = new PrismaClient();

  try {
    await nasPrisma.$connect();
    console.log('✅ Connected to NAS database');

    // Export churches with all schedules
    const churches = await nasPrisma.churches.findMany({
      where: { country },
      include: {
        massSchedules: true,
        confessionSchedules: true,
        adorationSchedules: true,
      }
    });

    console.log(`Found ${churches.length} churches in ${country}`);

    // Count schedules
    const stats: ExportStats = {
      churches: churches.length,
      massSchedules: churches.reduce((sum, c) => sum + (c.massSchedules?.length || 0), 0),
      confessionSchedules: churches.reduce((sum, c) => sum + (c.confessionSchedules?.length || 0), 0),
      adorationSchedules: churches.reduce((sum, c) => sum + (c.adorationSchedules?.length || 0), 0),
    };

    // Save to file
    const exportFile = path.join(process.cwd(), `export-${country}.json`);
    await fs.writeFile(exportFile, JSON.stringify(churches, null, 2));
    console.log(`✅ Exported to ${exportFile}`);
    console.log(` - ${stats.churches} churches`);
    console.log(` - ${stats.massSchedules} mass schedules`);
    console.log(` - ${stats.confessionSchedules} confession schedules`);
    console.log(` - ${stats.adorationSchedules} adoration schedules`);

    await nasPrisma.$disconnect();

    // Restore original DATABASE_URL
    if (originalUrl) {
      process.env.DATABASE_URL = originalUrl;
    }

    return stats;

  } catch (error) {
    console.error('❌ Export failed:', error);
    await nasPrisma.$disconnect();

    // Restore original DATABASE_URL
    if (originalUrl) {
      process.env.DATABASE_URL = originalUrl;
    }

    throw error;
  }
}

async function importToNeon(country: string, dryRun: boolean = true): Promise<void> {
  console.log(`\n📤 Importing ${country} data to Neon...`);
  if (dryRun) {
    console.log('🔍 DRY RUN MODE - No data will be written');
  }

  // Read export file
  const exportFile = path.join(process.cwd(), `export-${country}.json`);
  const data = JSON.parse(await fs.readFile(exportFile, 'utf-8'));
  console.log(`Loaded ${data.length} churches from export file`);

  // Connect to Neon
  const neonPrisma = new PrismaClient();

  try {
    await neonPrisma.$connect();
    console.log('✅ Connected to Neon database');

    let inserted = 0;
    let updated = 0;
    let errors = 0;

    for (const church of data) {
      try {
        const massSchedules = church.massSchedules || [];
        const confessionSchedules = church.confessionSchedules || [];
        const adorationSchedules = church.adorationSchedules || [];

        // Remove relations and ids
        delete church.massSchedules;
        delete church.confessionSchedules;
        delete church.adorationSchedules;
        delete church.id;

        if (!dryRun) {
          // Upsert church based on coordinates
          const result = await neonPrisma.churches.upsert({
            where: {
              latitude_longitude: {
                latitude: church.latitude,
                longitude: church.longitude
              }
            },
            create: church,
            update: church
          });

          // Check if it was an insert or update
          const existing = await neonPrisma.churches.findFirst({
            where: {
              latitude: church.latitude,
              longitude: church.longitude,
              createdAt: { lt: new Date(Date.now() - 1000) } // Created more than 1 sec ago
            }
          });

          if (existing) {
            updated++;
          } else {
            inserted++;
          }

          // Insert schedules
          for (const schedule of massSchedules) {
            delete schedule.id;
            await neonPrisma.massSchedules.create({
              data: {
                ...schedule,
                churchId: result.id
              }
            });
          }

          for (const schedule of confessionSchedules) {
            delete schedule.id;
            await neonPrisma.confessionSchedules.create({
              data: {
                ...schedule,
                churchId: result.id
              }
            });
          }

          for (const schedule of adorationSchedules) {
            delete schedule.id;
            await neonPrisma.adorationSchedules.create({
              data: {
                ...schedule,
                churchId: result.id
              }
            });
          }
        } else {
          // Dry run - just count
          inserted++;
        }

        // Log progress on the combined count so that runs consisting mostly
        // of updates do not print a progress line on every iteration
        if ((inserted + updated) % 100 === 0) {
          console.log(`Progress: ${inserted + updated} churches processed...`);
        }

      } catch (error) {
        errors++;
        console.error(`Error importing church ${church.name}:`, error instanceof Error ? error.message : error);
      }
    }

    console.log('\n✅ Import complete!');
    console.log(` - ${inserted} churches inserted`);
    console.log(` - ${updated} churches updated`);
    console.log(` - ${errors} errors`);

    await neonPrisma.$disconnect();

  } catch (error) {
    console.error('❌ Import failed:', error);
    await neonPrisma.$disconnect();
    throw error;
  }
}

async function main() {
  const country = process.argv[2] || 'PL';
  const mode = process.argv[3] || 'dry-run';
  const dryRun = mode === 'dry-run';

  console.log('🌍 Export/Import to Neon');
  console.log('========================\n');

  try {
    // Step 1: Export from NAS
    const stats = await exportFromNAS(country);

    // Step 2: Import to Neon
    await importToNeon(country, dryRun);

    if (dryRun) {
      console.log('\n💡 This was a DRY RUN. To actually import to Neon, run:');
      console.log(` npx tsx scripts/debug/export-import-to-neon.ts ${country} real-import`);
    } else {
      console.log('\n🎉 Data successfully uploaded to Neon!');
    }

  } catch (error) {
    console.error('❌ Process failed:', error);
    process.exit(1);
  }
}

main().catch(console.error);
41
scripts/debug/find-donnerstag-sections.ts
Normal file
@@ -0,0 +1,41 @@
#!/usr/bin/env tsx
import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function main() {
  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('DE');

  const result = await scraper.scrape('https://www.alterpeter.de/');

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    // Find all instances of "donnerstag" (Thursday)
    let idx = 0;
    let count = 0;
    while ((idx = text.indexOf('donnerstag', idx)) !== -1) {
      count++;
      const contextBefore = text.substring(Math.max(0, idx - 100), idx);
      const contextAfter = text.substring(idx, Math.min(text.length, idx + 200));

      console.log(`=== Donnerstag occurrence ${count} at position ${idx} ===`);
      console.log(`BEFORE: "...${contextBefore}"`);
      console.log(`AFTER: "${contextAfter}..."`);
      console.log('');

      idx++;
    }

    console.log(`Total "donnerstag" occurrences: ${count}`);
  }

  await scraper.close();
}

main().catch(console.error);
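The same tag-stripping regex chain and occurrence scan recur across these debug scripts. A minimal sketch of how they could be factored into two small functions follows. The names `htmlToText` and `findAll` and the sample HTML are assumptions for illustration; nothing here is taken from the repository beyond the regexes themselves.

```typescript
// Hypothetical helper: the regex chain the debug scripts apply to rawHtml.
function htmlToText(html: string): string {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') // drop inline scripts
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')   // drop inline styles
    .replace(/<[^>]+>/g, ' ')                         // replace remaining tags with a space
    .replace(/\s+/g, ' ')                             // collapse whitespace runs
    .toLowerCase();
}

// Hypothetical helper: returns the start position of every occurrence of `needle`.
function findAll(haystack: string, needle: string): number[] {
  const positions: number[] = [];
  if (needle === '') return positions; // an empty needle would never terminate the loop
  let idx = haystack.indexOf(needle);
  while (idx !== -1) {
    positions.push(idx);
    idx = haystack.indexOf(needle, idx + 1);
  }
  return positions;
}

// Made-up sample markup, for illustration only.
const sampleHtml =
  '<style>p{}</style><p>Donnerstag 18:00 Uhr</p><script>x()</script><p>donnerstag 9.00</p>';
const plainText = htmlToText(sampleHtml);
const hits = findAll(plainText, 'donnerstag');
console.log(plainText);
console.log(hits);
```

Because the text is lowercased before searching, both spellings of the day name in the sample are found. A regex-based stripper like this is only a rough approximation of an HTML parser, which is usually acceptable for debugging output.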
42
scripts/debug/find-office-hours-pattern.ts
Normal file
@@ -0,0 +1,42 @@
#!/usr/bin/env tsx
import { GenericScraper } from '../../src/scrapers/strategies/generic';

async function main() {
  const scraper = new GenericScraper();
  await scraper.init();
  scraper.setCountry('DE');

  const result = await scraper.scrape('https://www.alterpeter.de/');

  if (result.rawHtml) {
    const text = result.rawHtml
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .toLowerCase();

    const idx = text.indexOf('9.00 – 12.00');
    if (idx !== -1) {
      console.log('Context around "9.00 – 12.00":');
      console.log(text.substring(Math.max(0, idx - 150), idx + 200));
    } else {
      console.log('Pattern "9.00 – 12.00" not found');

      // Try alternative patterns
      const patterns = ['9.00', '9:00', '09:00', '09.00'];
      for (const pattern of patterns) {
        const idx2 = text.indexOf(pattern);
        if (idx2 !== -1) {
          console.log(`\nFound "${pattern}" at position ${idx2}:`);
          console.log(text.substring(Math.max(0, idx2 - 100), idx2 + 150));
          break;
        }
      }
    }
  }

  await scraper.close();
}

main().catch(console.error);
102
scripts/debug/identify-top5-bugs.ts
Normal file
@@ -0,0 +1,102 @@
#!/usr/bin/env tsx
/**
 * Identify which churches are flagged as "parsing bugs" in top 5 test
 */

import { config } from 'dotenv';
config({ path: '.env.local' });
config({ path: '.env' });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import { GenericScraper } from '../../src/scrapers/strategies/generic';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

const COUNTRIES = [
  { code: 'FR', name: 'France' },
  { code: 'DE', name: 'Germany' },
  { code: 'ES', name: 'Spain' },
  { code: 'PL', name: 'Poland' },
  { code: 'BR', name: 'Brazil' },
];

async function identifyBugs() {
  console.log('Identifying "parsing bugs" from top 5 test...\n');

  const scraper = new GenericScraper();
  await scraper.init();

  const bugs: Array<{
    country: string;
    church: string;
    url: string;
    hasDays: boolean;
    hasTimes: boolean;
  }> = [];

  for (const country of COUNTRIES) {
    const churches = await prisma.church.findMany({
      where: {
        country: country.code,
        website: { not: null },
        source: 'osm',
      },
      take: 10,
      orderBy: { createdAt: 'asc' },
    });

    scraper.setCountry(country.code);

    for (const church of churches) {
      try {
        const result = await scraper.scrape(church.website!);

        if (!result.success && result.rawHtml) {
          const text = result.rawHtml
            .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
            .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
            .replace(/<[^>]+>/g, ' ')
            .replace(/\s+/g, ' ')
            .toLowerCase();

          // Check for day names and times
          const hasDays = text.match(/\b(sunday|monday|tuesday|wednesday|thursday|friday|saturday|dimanche|lundi|mardi|mercredi|jeudi|vendredi|samedi|sonntag|montag|dienstag|mittwoch|donnerstag|freitag|samstag|domingo|domingos|lunes|martes|miércoles|miercoles|jueves|viernes|sábado|sabado|sábados|sabados|niedziela|poniedziałek|poniedzialek|wtorek|środa|sroda|czwartek|piątek|piatek|sobota|segunda|segundas|terça|terca|terças|tercas|quarta|quartas|quinta|quintas|sexta|sextas)\b/i);

          // Note: everything after the leading digits is optional, so this
          // pattern matches any page that contains at least one digit
          const hasTimes = text.match(/\d{1,2}[h:\.]?\s*\d{0,2}\s*(am|pm|h|uhr)?/i);

          if (hasDays && hasTimes) {
            bugs.push({
              country: country.name,
              church: church.name,
              url: church.website!,
              hasDays: !!hasDays,
              hasTimes: !!hasTimes,
            });
          }
        }
      } catch (err: any) {
        // Skip errors
      }
    }
  }

  await scraper.close();

  console.log(`\n${'='.repeat(80)}`);
  console.log(`FOUND ${bugs.length} POTENTIAL PARSING BUGS\n`);

  bugs.forEach((bug, i) => {
    console.log(`${i + 1}. ${bug.church} (${bug.country})`);
    console.log(` URL: ${bug.url}`);
    console.log('');
  });

  await prisma.$disconnect();
  await pool.end();
}

identifyBugs().catch(console.error);
232
scripts/debug/import-to-neon.ts
Normal file
@@ -0,0 +1,232 @@
#!/usr/bin/env tsx
/**
 * Import churches from JSON export to Neon database
 * Run this LOCALLY (uses DATABASE_URL from .env pointing to Neon)
 */

import { PrismaClient } from '@prisma/client';
import fs from 'fs/promises';
import path from 'path';

interface ChurchExport {
  id: string;
  name: string;
  latitude: number;
  longitude: number;
  country: string;
  massSchedules?: any[];
  confessionSchedules?: any[];
  adorationSchedules?: any[];
  [key: string]: any;
}

async function main() {
  const country = process.argv[2] || 'PL';
  const mode = process.argv[3] || 'dry-run';
  const dryRun = mode === 'dry-run';

  console.log(`📤 Importing ${country} data to Neon...`);
  console.log(`DATABASE_URL: ${process.env.DATABASE_URL?.replace(/:[^:@]+@/, ':****@')}`);

  if (dryRun) {
    console.log('🔍 DRY RUN MODE - No data will be written');
  }

  // Read export file
  const exportFile = path.join(process.cwd(), `export-${country}.json`);

  try {
    const data: ChurchExport[] = JSON.parse(await fs.readFile(exportFile, 'utf-8'));
    console.log(`Loaded ${data.length} churches from export file`);

    // Connect to Neon
    const prisma = new PrismaClient();

    try {
      await prisma.$connect();
      console.log('✅ Connected to Neon database');

      let inserted = 0;
      let updated = 0;
      let skipped = 0;
      let errors = 0;
      let totalMassSchedules = 0;
      let totalConfessionSchedules = 0;
      let totalAdorationSchedules = 0;

      for (const church of data) {
        try {
          const massSchedules = church.massSchedules || [];
          const confessionSchedules = church.confessionSchedules || [];
          const adorationSchedules = church.adorationSchedules || [];

          // Remove relations and ids
          delete church.massSchedules;
          delete church.confessionSchedules;
          delete church.adorationSchedules;
          delete church.id;

          if (!dryRun) {
            // Check if church already exists
            const existing = await prisma.churches.findFirst({
              where: {
                latitude: church.latitude,
                longitude: church.longitude
              }
            });

            if (existing) {
              // Update existing church
              await prisma.churches.update({
                where: { id: existing.id },
                data: church
              });

              // Delete existing schedules
              await prisma.massSchedules.deleteMany({
                where: { churchId: existing.id }
              });
              await prisma.confessionSchedules.deleteMany({
                where: { churchId: existing.id }
              });
              await prisma.adorationSchedules.deleteMany({
                where: { churchId: existing.id }
              });

              // Insert new schedules
              for (const schedule of massSchedules) {
                delete schedule.id;
                await prisma.massSchedules.create({
                  data: {
                    ...schedule,
                    churchId: existing.id
                  }
                });
                totalMassSchedules++;
              }

              for (const schedule of confessionSchedules) {
                delete schedule.id;
                await prisma.confessionSchedules.create({
                  data: {
                    ...schedule,
                    churchId: existing.id
                  }
                });
                totalConfessionSchedules++;
              }

              for (const schedule of adorationSchedules) {
                delete schedule.id;
                await prisma.adorationSchedules.create({
                  data: {
                    ...schedule,
                    churchId: existing.id
                  }
                });
                totalAdorationSchedules++;
              }

              updated++;
            } else {
              // Create new church
              const result = await prisma.churches.create({
                data: church
              });

              // Insert schedules
              for (const schedule of massSchedules) {
                delete schedule.id;
                await prisma.massSchedules.create({
                  data: {
                    ...schedule,
                    churchId: result.id
                  }
                });
                totalMassSchedules++;
              }

              for (const schedule of confessionSchedules) {
                delete schedule.id;
                await prisma.confessionSchedules.create({
                  data: {
                    ...schedule,
                    churchId: result.id
                  }
                });
                totalConfessionSchedules++;
              }

              for (const schedule of adorationSchedules) {
                delete schedule.id;
                await prisma.adorationSchedules.create({
                  data: {
                    ...schedule,
                    churchId: result.id
                  }
                });
                totalAdorationSchedules++;
              }

              inserted++;
            }
          } else {
            // Dry run - just count
            inserted++;
            totalMassSchedules += massSchedules.length;
            totalConfessionSchedules += confessionSchedules.length;
            totalAdorationSchedules += adorationSchedules.length;
          }

          if ((inserted + updated) % 100 === 0) {
            console.log(`Progress: ${inserted + updated} churches processed...`);
          }

        } catch (error) {
          errors++;
          console.error(`Error importing church ${church.name}:`, error instanceof Error ? error.message : error);
        }
      }

      console.log('\n✅ Import complete!');
      console.log(` - ${inserted} churches inserted`);
      console.log(` - ${updated} churches updated`);
      console.log(` - ${skipped} churches skipped`);
      console.log(` - ${errors} errors`);
      console.log(` - ${totalMassSchedules} mass schedules`);
      console.log(` - ${totalConfessionSchedules} confession schedules`);
      console.log(` - ${totalAdorationSchedules} adoration schedules`);

      await prisma.$disconnect();

      if (dryRun) {
        console.log('\n💡 This was a DRY RUN. To actually import to Neon, run:');
        console.log(` npx tsx scripts/debug/import-to-neon.ts ${country} real-import`);
      } else {
        console.log('\n🎉 Data successfully uploaded to Neon!');
      }

    } catch (error) {
      console.error('❌ Import failed:', error);
      await prisma.$disconnect();
      throw error;
    }

  } catch (error) {
    if (error instanceof Error && 'code' in error && error.code === 'ENOENT') {
      console.error(`❌ Export file not found: ${exportFile}`);
      console.error(`\nFirst, export data from NAS:`);
      console.error(` ssh albert@192.168.0.145`);
      console.error(` cd /volume1/docker/nearestmass`);
      console.error(` /usr/local/bin/docker compose --profile tools run --rm scraper npx tsx scripts/debug/export-from-nas.ts ${country}`);
      console.error(`\nThen download the export:`);
      console.error(` scp albert@192.168.0.145:/volume1/docker/nearestmass/export-${country}.json .`);
      console.error(`\nFinally, run this import script again.`);
    } else {
      console.error('❌ Process failed:', error);
    }
    process.exit(1);
  }
}

main().catch(console.error);
84
scripts/debug/investigate-8-bugs.ts
Normal file
@@ -0,0 +1,84 @@
#!/usr/bin/env tsx
/**
 * Investigate the 8 potential parsing bugs
 */

import { GenericScraper } from '../../src/scrapers/strategies/generic';

const BUGS = [
  { name: 'Chapelle Saint-Jean-XXIII', country: 'FR', url: 'https://www.chemin-neuf.fr/' },
  { name: 'St. Marien', country: 'DE', url: 'https://www.willehad.de/start/' },
  { name: 'Iglesia de San Fernando', country: 'ES', url: 'https://www.parroquiasanfernandomaspalomas.net/de/' },
  { name: 'Monestir de Sant Esperit', country: 'ES', url: 'https://www.santoespiritu.org/' },
  { name: 'Santuario de Manalagua', country: 'ES', url: 'http://tierrasdeburgos.blogspot.com.es/2013/12/escultura-del-agua-santuario-de.html' },
  { name: 'Kościół pw. Najświętszego Serca', country: 'PL', url: 'http://parafialubojna.pl' },
  { name: 'Paróquia do Desterro', country: 'BR', url: 'https://paroquiaportodegalinhas.blogspot.com.br/' },
  { name: 'Catedral Diocesana', country: 'BR', url: 'http://diocesedejuazeiro.org.br/' },
];

async function investigate() {
  console.log('Investigating 8 potential bugs...\n');

  const scraper = new GenericScraper();
  await scraper.init();

  for (let i = 0; i < BUGS.length; i++) {
    const bug = BUGS[i];
    console.log(`${'='.repeat(80)}`);
    console.log(`${i + 1}. ${bug.name} (${bug.country})`);
    console.log(` ${bug.url}`);
    console.log('='.repeat(80));

    scraper.setCountry(bug.country);

    try {
      const result = await scraper.scrape(bug.url);

      console.log(`Success: ${result.success}`);
      console.log(`Schedules: ${result.schedules.length}`);
      console.log(`Error: ${result.error || 'none'}`);

      if (!result.success && result.rawHtml) {
        const text = result.rawHtml
          .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
          .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
          .replace(/<[^>]+>/g, ' ')
          .replace(/\s+/g, ' ')
          .toLowerCase();

        // Check page type
        console.log('\nPage analysis:');
        if (text.includes('blogspot')) {
          console.log(' ⚠️ Blogspot page (likely blog post, not church website)');
        }
        if (text.includes('hotel') || text.includes('reservation') || text.includes('booking')) {
          console.log(' ⚠️ Contains hotel/booking keywords');
        }
        if (text.includes('restaurant') || text.includes('menu')) {
          console.log(' ⚠️ Contains restaurant keywords');
        }
        if (text.includes('404') || text.includes('not found') || text.includes('error')) {
          console.log(' ⚠️ Error/404 page');
        }

        // Check if it has schedule keywords
        const hasScheduleKeywords = text.match(/(mass|messe|misa|missa|horário|horario|gottesdienst|eucarist)/i);
        console.log(` Schedule keywords: ${hasScheduleKeywords ? '✓ Found' : '✗ Not found'}`);

        // Show sample text
        // indexOf returns -1 (truthy) on a miss, so an `||` chain never falls
        // through to the next keyword; take the first keyword that is found
        const massIndex = ['mass', 'messe', 'misa', 'missa']
          .map((keyword) => text.indexOf(keyword))
          .find((index) => index !== -1) ?? 0;
        const sampleStart = Math.max(0, massIndex - 50);
        const sample = text.substring(sampleStart, sampleStart + 300);
        console.log(`\n Sample text: "${sample.substring(0, 200)}..."`);
      }

      console.log('\n');
    } catch (err: any) {
      console.log(`ERROR: ${err.message}\n\n`);
    }
  }

  await scraper.close();
}

investigate().catch(console.error);
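When a script looks up the first of several keywords with `String.prototype.indexOf`, a miss returns `-1`, which is truthy. Chaining lookups with `||` therefore stops at the first miss instead of trying the next keyword. A minimal sketch of the difference follows; the helper name `firstIndexOfAny` and the sample sentence are assumptions for illustration and are not part of the repository.

```typescript
// Hypothetical helper: index of the first keyword (in list order) that occurs
// in `haystack`, or -1 when none of the keywords is present.
function firstIndexOfAny(haystack: string, keywords: string[]): number {
  for (const keyword of keywords) {
    const idx = haystack.indexOf(keyword);
    if (idx !== -1) return idx;
  }
  return -1;
}

// Made-up sample sentence, for illustration only.
const pageText = 'horario de misa: domingo 10:00';

// Naive chain: indexOf('mass') is -1, which is truthy, so `||` stops there.
const naiveIndex = pageText.indexOf('mass') || pageText.indexOf('misa');

// Explicit check against -1 falls through to the keyword that is present.
const fixedIndex = firstIndexOfAny(pageText, ['mass', 'messe', 'misa', 'missa']);

console.log({ naiveIndex, fixedIndex });
```

The `||` chain only moves on when a lookup returns `0`, that is, when the keyword sits at the very start of the text, which is the opposite of the intended fallback.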
134
scripts/debug/list-church-websites.ts
Normal file
@@ -0,0 +1,134 @@
import { config } from 'dotenv';
import { PrismaClient } from '@prisma/client';
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';

// Load .env.local first, then .env
config({ path: '.env.local' });
config({ path: '.env' });

const connectionString = process.env.DATABASE_URL;

if (!connectionString) {
  throw new Error('DATABASE_URL environment variable is not set');
}

const pool = new Pool({ connectionString });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

async function listChurchWebsites() {
  try {
    console.log('Fetching churches from database...\n');

    const churches = await prisma.church.findMany({
      select: {
        id: true,
        name: true,
        city: true,
        state: true,
        country: true,
        website: true,
        googlePlaceId: true,
      },
      orderBy: [
        { country: 'asc' },
        { state: 'asc' },
        { city: 'asc' },
      ],
    });

    console.log(`Total churches: ${churches.length}`);

    const withWebsite = churches.filter(c => c.website);
    const withGoogle = churches.filter(c => c.googlePlaceId);
    const withoutWebsite = churches.filter(c => !c.website);

    console.log(`Churches with website: ${withWebsite.length}`);
    console.log(`Churches with Google Place ID: ${withGoogle.length}`);
    console.log(`Churches without website: ${withoutWebsite.length}\n`);

    // Group by country
    const byCountry = churches.reduce((acc, church) => {
      const country = church.country || 'Unknown';
      if (!acc[country]) {
        acc[country] = [];
      }
      acc[country].push(church);
      return acc;
    }, {} as Record<string, typeof churches>);

    // Write to file
    let output = '# Church Websites\n\n';
    output += `Generated: ${new Date().toISOString()}\n\n`;
    output += `## Summary\n`;
    output += `- Total churches: ${churches.length}\n`;
    output += `- With website: ${withWebsite.length} (${((withWebsite.length / churches.length) * 100).toFixed(1)}%)\n`;
    output += `- With Google Place ID: ${withGoogle.length} (${((withGoogle.length / churches.length) * 100).toFixed(1)}%)\n`;
    output += `- Without website: ${withoutWebsite.length} (${((withoutWebsite.length / churches.length) * 100).toFixed(1)}%)\n\n`;

    // Add country breakdown
    output += `## By Country\n\n`;
    Object.entries(byCountry)
      .sort(([, a], [, b]) => b.length - a.length)
      .forEach(([country, countryChurches]) => {
        const withSite = countryChurches.filter(c => c.website).length;
        const withGoogle = countryChurches.filter(c => c.googlePlaceId).length;
        output += `### ${country} (${countryChurches.length} churches)\n`;
        output += `- With website: ${withSite} (${((withSite / countryChurches.length) * 100).toFixed(1)}%)\n`;
        output += `- With Google Place ID: ${withGoogle} (${((withGoogle / countryChurches.length) * 100).toFixed(1)}%)\n\n`;
      });

    // List all websites
    output += `## All Websites\n\n`;
    Object.entries(byCountry)
      .sort(([a], [b]) => a.localeCompare(b))
      .forEach(([country, countryChurches]) => {
        output += `### ${country}\n\n`;
        countryChurches.forEach(church => {
          const location = [church.city, church.state, church.country].filter(Boolean).join(', ');
          if (church.website) {
            output += `- **${church.name}** (${location})\n`;
            output += ` - Website: ${church.website}\n`;
            if (church.googlePlaceId) {
              output += ` - Google Place ID: ${church.googlePlaceId}\n`;
            }
            output += ` - DB ID: ${church.id}\n\n`;
          }
        });
      });

    // List churches without websites
    output += `## Churches Without Websites\n\n`;
    Object.entries(byCountry)
      .sort(([a], [b]) => a.localeCompare(b))
      .forEach(([country, countryChurches]) => {
        const without = countryChurches.filter(c => !c.website);
        if (without.length > 0) {
          output += `### ${country}\n\n`;
          without.forEach(church => {
            const location = [church.city, church.state, church.country].filter(Boolean).join(', ');
            output += `- **${church.name}** (${location})\n`;
            if (church.googlePlaceId) {
              output += ` - Google Place ID: ${church.googlePlaceId}\n`;
            }
            output += ` - DB ID: ${church.id}\n\n`;
          });
        }
      });

    // Write to file
    const fs = await import('fs/promises');
    await fs.writeFile('church-websites.md', output);
    console.log('✓ Written to church-websites.md');

  } catch (error) {
    console.error('Error:', error);
    process.exit(1);
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }
}

listChurchWebsites();
44
scripts/debug/list-tables.ts
Normal file
@@ -0,0 +1,44 @@
|
||||
import { Pool } from 'pg';
|
||||
import * as dotenv from 'dotenv';
|
||||
import * as path from 'path';
|
||||
|
||||
// Load .env.local first (takes precedence), then .env
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
const pool = new Pool({
|
||||
connectionString: process.env.DATABASE_URL,
|
||||
});
|
||||
|
||||
async function listTables() {
|
||||
try {
|
||||
console.log('Connecting to database...');
|
||||
console.log('DATABASE_URL:', process.env.DATABASE_URL?.replace(/:[^:@]+@/, ':****@'));
|
||||
|
||||
// List all tables
|
||||
const result = await pool.query(`
|
||||
SELECT table_name
|
||||
FROM information_schema.tables
|
||||
WHERE table_schema = 'public'
|
||||
ORDER BY table_name;
|
||||
`);
|
||||
|
||||
console.log('\n=== Tables in Database ===');
|
||||
if (result.rows.length === 0) {
|
||||
console.log('No tables found!');
|
||||
} else {
|
||||
result.rows.forEach((row) => {
|
||||
console.log(`- ${row.table_name}`);
|
||||
});
|
||||
}
|
||||
|
||||
console.log(`\nTotal: ${result.rows.length} tables`);
|
||||
|
||||
} catch (error) {
|
||||
console.error('Error listing tables:', error);
// Report the failure to the shell; the finally block still closes the pool
process.exitCode = 1;
|
||||
} finally {
|
||||
await pool.end();
|
||||
}
|
||||
}
|
||||
|
||||
listTables();
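The password masking used when logging `DATABASE_URL` above can be pulled into a small reusable function. This is a minimal sketch, not code from the repository: the name `maskDatabaseUrl` and the `'(not set)'` fallback are assumptions made for illustration, and the pattern only handles passwords that contain neither `:` nor `@`.

```typescript
// Hypothetical helper (not part of the repository): hides the password
// portion of a connection string before it is written to a log.
function maskDatabaseUrl(url: string | undefined): string {
  if (!url) return '(not set)';
  // Same pattern as list-tables.ts: replace the first ":<password>@" segment.
  return url.replace(/:[^:@]+@/, ':****@');
}

console.log(maskDatabaseUrl('postgresql://user:secret@localhost:5432/app'));
```

A URL without credentials passes through unchanged, because the pattern requires an `@` after the colon-delimited segment.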
|
||||
167
scripts/debug/pipeline-report.js
Normal file
@@ -0,0 +1,167 @@
|
||||
const { Client } = require("pg");
|
||||
const client = new Client({
|
||||
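// Prefer DATABASE_URL from the environment; fall back to the local NAS database
connectionString: process.env.DATABASE_URL || "postgresql://postgres:postgres@192.168.0.145:5434/nearestmass"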
|
||||
});
|
||||
|
||||
const queries = [
|
||||
{
|
||||
name: "1. Overall church counts by country (top 20)",
|
||||
sql: `SELECT country, COUNT(*) as total,
|
||||
COUNT(*) FILTER (WHERE website IS NOT NULL) as has_website,
|
||||
COUNT(*) FILTER (WHERE last_scraped_at IS NOT NULL) as scraped,
|
||||
COUNT(*) FILTER (WHERE has_website = true) as has_website_flag,
|
||||
COUNT(*) FILTER (WHERE website_language IS NOT NULL) as has_language
|
||||
FROM churches
|
||||
GROUP BY country
|
||||
ORDER BY total DESC
|
||||
LIMIT 20`
|
||||
},
|
||||
{
|
||||
name: "2. Total mass schedule counts",
|
||||
sql: `SELECT COUNT(*) as total_schedules,
|
||||
COUNT(DISTINCT church_id) as churches_with_schedules
|
||||
FROM mass_schedules`
|
||||
},
|
||||
{
|
||||
name: "3. Scrape results by language",
|
||||
sql: `SELECT website_language as language,
|
||||
COUNT(*) as total_with_language,
COUNT(*) FILTER (WHERE last_scraped_at IS NOT NULL) as scraped
FROM churches
WHERE website_language IS NOT NULL
GROUP BY website_language
ORDER BY total_with_language DESC`
|
||||
},
|
||||
{
|
||||
name: "4. Churches with websites but never scraped",
|
||||
sql: `SELECT COUNT(*) as has_website_not_scraped
|
||||
FROM churches
|
||||
WHERE website IS NOT NULL AND last_scraped_at IS NULL`
|
||||
},
|
||||
{
|
||||
name: "5. Overall pipeline funnel",
|
||||
sql: `SELECT
|
||||
COUNT(*) as total_churches,
|
||||
COUNT(*) FILTER (WHERE website IS NOT NULL) as has_website,
|
||||
COUNT(*) FILTER (WHERE last_scraped_at IS NOT NULL) as attempted_scrape,
|
||||
COUNT(*) FILTER (WHERE website_language IS NOT NULL) as has_detected_language,
|
||||
(SELECT COUNT(DISTINCT church_id) FROM mass_schedules) as has_schedules_saved,
|
||||
(SELECT COUNT(*) FROM mass_schedules) as total_schedule_rows
|
||||
FROM churches`
|
||||
},
|
||||
{
|
||||
name: "6. Recent scrape activity (last 7 days) by language",
|
||||
sql: `SELECT website_language as language,
|
||||
COUNT(*) as scraped_last_7d
|
||||
FROM churches
|
||||
WHERE last_scraped_at > NOW() - INTERVAL '7 days'
|
||||
GROUP BY website_language
|
||||
ORDER BY scraped_last_7d DESC`
|
||||
},
|
||||
{
|
||||
name: "7. Background job history (last 15 completed/failed jobs)",
|
||||
sql: `SELECT type, language, status,
|
||||
created_at::date as created,
|
||||
completed_at::date as completed,
|
||||
ROUND(CAST(EXTRACT(EPOCH FROM (completed_at - created_at))/3600 AS numeric), 2) as hours,
|
||||
total_items, processed, succeeded, failed
|
||||
FROM background_jobs
|
||||
WHERE status IN ('completed', 'failed')
|
||||
ORDER BY completed_at DESC NULLS LAST
|
||||
LIMIT 15`
|
||||
},
|
||||
{
|
||||
name: "8. Mass schedule breakdown by day of week",
|
||||
sql: `SELECT day_of_week,
|
||||
CASE day_of_week
|
||||
WHEN 0 THEN 'Sunday' WHEN 1 THEN 'Monday' WHEN 2 THEN 'Tuesday'
|
||||
WHEN 3 THEN 'Wednesday' WHEN 4 THEN 'Thursday' WHEN 5 THEN 'Friday'
|
||||
WHEN 6 THEN 'Saturday' ELSE 'Other'
|
||||
END as day_name,
|
||||
COUNT(*) as count
|
||||
FROM mass_schedules
|
||||
GROUP BY day_of_week
|
||||
ORDER BY day_of_week`
|
||||
},
|
||||
{
|
||||
name: "9. Churches with schedules by country (top 15)",
|
||||
sql: `SELECT c.country,
|
||||
COUNT(DISTINCT c.id) as total_churches,
|
||||
COUNT(DISTINCT ms.church_id) as churches_with_schedules,
|
||||
ROUND(100.0 * COUNT(DISTINCT ms.church_id) / NULLIF(COUNT(DISTINCT c.id), 0), 1) as coverage_pct,
|
||||
COUNT(ms.id) as total_schedule_rows
|
||||
FROM churches c
|
||||
LEFT JOIN mass_schedules ms ON ms.church_id = c.id
|
||||
GROUP BY c.country
|
||||
ORDER BY total_churches DESC
|
||||
LIMIT 15`
|
||||
},
|
||||
{
|
||||
name: "10. Enrichment sources - how churches were found",
|
||||
sql: `SELECT source, COUNT(*) as count
|
||||
FROM churches
|
||||
GROUP BY source
|
||||
ORDER BY count DESC`
|
||||
},
|
||||
{
|
||||
name: "11. Google Places enrichment impact",
|
||||
sql: `SELECT
|
||||
COUNT(*) FILTER (WHERE google_place_id IS NOT NULL) as has_google_place,
|
||||
COUNT(*) FILTER (WHERE google_place_id IS NOT NULL AND website IS NOT NULL) as google_with_website,
|
||||
COUNT(*) FILTER (WHERE google_place_id IS NULL) as no_google_place,
|
||||
COUNT(*) FILTER (WHERE google_searched_at IS NOT NULL) as google_searched,
|
||||
COUNT(*) FILTER (WHERE free_searched_at IS NOT NULL) as free_searched
|
||||
FROM churches`
|
||||
},
|
||||
{
|
||||
name: "12. Website presence by source",
|
||||
sql: `SELECT source,
|
||||
COUNT(*) as total,
|
||||
COUNT(*) FILTER (WHERE website IS NOT NULL) as has_website,
|
||||
ROUND(100.0 * COUNT(*) FILTER (WHERE website IS NOT NULL) / NULLIF(COUNT(*), 0), 1) as website_pct,
|
||||
COUNT(*) FILTER (WHERE google_place_id IS NOT NULL) as has_google_place,
|
||||
COUNT(*) FILTER (WHERE last_scraped_at IS NOT NULL) as scraped
|
||||
FROM churches
|
||||
GROUP BY source
|
||||
ORDER BY total DESC`
|
||||
}
|
||||
];
|
||||
|
||||
async function run() {
|
||||
await client.connect();
|
||||
|
||||
for (const q of queries) {
|
||||
console.log("=".repeat(90));
|
||||
console.log(q.name);
|
||||
console.log("=".repeat(90));
|
||||
try {
|
||||
const res = await client.query(q.sql);
|
||||
if (res.rows.length === 0) {
|
||||
console.log("(no rows returned)");
|
||||
} else {
|
||||
// Calculate column widths
|
||||
const cols = Object.keys(res.rows[0]);
|
||||
const widths = cols.map(c => {
|
||||
const maxData = Math.max(...res.rows.map(r => String(r[c] ?? "NULL").length));
|
||||
return Math.max(c.length, maxData);
|
||||
});
|
||||
|
||||
// Print header
|
||||
console.log(cols.map((c, i) => c.padEnd(widths[i])).join(" | "));
|
||||
console.log(widths.map(w => "-".repeat(w)).join("-+-"));
|
||||
|
||||
// Print rows
|
||||
for (const row of res.rows) {
|
||||
console.log(cols.map((c, i) => String(row[c] ?? "NULL").padEnd(widths[i])).join(" | "));
|
||||
}
|
||||
}
|
||||
console.log("\n(" + res.rows.length + " rows)\n");
|
||||
} catch (err) {
|
||||
console.log("ERROR:", err.message, "\n");
|
||||
}
|
||||
}
|
||||
|
||||
await client.end();
|
||||
}
|
||||
|
||||
run().catch(e => { console.error(e); process.exit(1); });
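The column-width calculation inside `run()` can be isolated into a pure function, which makes the table layout testable without a database. The sketch below is an illustration only: `formatTable` and the `Row` type are hypothetical names, and it is written in TypeScript rather than the plain JavaScript of the script.

```typescript
type Row = Record<string, unknown>;

// Hypothetical helper: renders query rows as an aligned text table,
// mirroring the width calculation in run().
function formatTable(rows: Row[]): string[] {
  if (rows.length === 0) return ['(no rows returned)'];

  const cols = Object.keys(rows[0]);
  // NULL/undefined values are printed as the literal string "NULL".
  const cell = (value: unknown): string => String(value ?? 'NULL');

  // Each column is as wide as its longest cell or its header, whichever is larger.
  const widths = cols.map(c =>
    Math.max(c.length, ...rows.map(r => cell(r[c]).length))
  );

  const header = cols.map((c, i) => c.padEnd(widths[i])).join(' | ');
  const divider = widths.map(w => '-'.repeat(w)).join('-+-');
  const body = rows.map(r =>
    cols.map((c, i) => cell(r[c]).padEnd(widths[i])).join(' | ')
  );
  return [header, divider, ...body];
}

console.log(formatTable([
  { country: 'FR', total: 120 },
  { country: null, total: 7 },
]).join('\n'));
```

Returning an array of lines instead of printing directly keeps the function free of side effects; the caller decides where the output goes.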
|
||||
48
scripts/debug/show-french-success.ts
Normal file
@@ -0,0 +1,48 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Show detailed output from a successful French parse
|
||||
*/
|
||||
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
async function showSuccess() {
|
||||
// One of our successful churches with 16 schedules
|
||||
const url = 'https://laportelatine.org/lieux/couvent-saint-francois-morgon';
|
||||
console.log(`Detailed parse of: ${url}\n`);
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
scraper.setCountry('FR');
|
||||
|
||||
const result = await scraper.scrape(url);
|
||||
|
||||
console.log(`✅ Success: ${result.success}`);
|
||||
console.log(`📅 Schedules found: ${result.schedules.length}\n`);
|
||||
|
||||
// Group by day
|
||||
const byDay: Record<number, typeof result.schedules> = {};
|
||||
for (const sched of result.schedules) {
|
||||
if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
|
||||
byDay[sched.dayOfWeek].push(sched);
|
||||
}
|
||||
|
||||
const dayNames = ['Dimanche', 'Lundi', 'Mardi', 'Mercredi', 'Jeudi', 'Vendredi', 'Samedi'];
|
||||
|
||||
console.log('═══════════════════════════════════════════════');
|
||||
console.log('PARSED SCHEDULE:');
|
||||
console.log('═══════════════════════════════════════════════\n');
|
||||
|
||||
Object.entries(byDay)
|
||||
.sort(([a], [b]) => parseInt(a) - parseInt(b))
|
||||
.forEach(([day, scheds]) => {
|
||||
console.log(`${dayNames[parseInt(day)]}:`);
|
||||
scheds.forEach(s => {
|
||||
console.log(` ${s.time} - ${s.language || 'Unknown'} (${s.massType || 'Mass'})`);
|
||||
});
|
||||
console.log('');
|
||||
});
|
||||
|
||||
await scraper.close();
|
||||
}
|
||||
|
||||
showSuccess().catch(console.error);
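Several of these debug scripts repeat the same "group schedules by day of week" loop. A shared helper along the following lines could replace the copies. This is a sketch under assumptions: `groupByDay` is a hypothetical name, and the `Schedule` interface lists only the fields the scripts read, not the scraper's real result type.

```typescript
// Assumed shape: only the fields the debug scripts read from a schedule.
interface Schedule {
  dayOfWeek: number; // 0 = Sunday ... 6 = Saturday
  time: string;
}

// Hypothetical helper: buckets schedules by day of week.
function groupByDay<T extends Schedule>(schedules: T[]): Map<number, T[]> {
  const groups = new Map<number, T[]>();
  for (const sched of schedules) {
    const bucket = groups.get(sched.dayOfWeek) ?? [];
    bucket.push(sched);
    groups.set(sched.dayOfWeek, bucket);
  }
  // Rebuild the Map so iteration order is ascending day of week.
  return new Map(Array.from(groups.entries()).sort(([a], [b]) => a - b));
}

const grouped = groupByDay([
  { dayOfWeek: 6, time: '18:00' },
  { dayOfWeek: 0, time: '08:00' },
  { dayOfWeek: 0, time: '10:30' },
]);
grouped.forEach((scheds, day) => {
  console.log(day, scheds.map(s => s.time).join(', '));
});
```

A `Map` keyed by number avoids the `parseInt` calls needed when the grouping is stored in a plain object, whose keys are always strings.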
|
||||
28
scripts/debug/test-db-connection.ts
Normal file
@@ -0,0 +1,28 @@
#!/usr/bin/env tsx
/**
 * Test database connection
 */

import { config } from 'dotenv';

// Load .env.local first (takes precedence), then .env
config({ path: '.env.local' });
config({ path: '.env' });

console.log('DATABASE_URL exists:', !!process.env.DATABASE_URL);
// Mask the password so credentials never reach the console
console.log('DATABASE_URL value:', process.env.DATABASE_URL?.replace(/:[^:@]+@/, ':****@') ?? '(not set)');

async function testConnection() {
  // Static imports are hoisted above the config() calls, so the Prisma client is
  // loaded dynamically to make sure the environment is populated first.
  const { prisma } = await import('../../src/lib/db');

  try {
    const count = await prisma.church.count();
    console.log(`✅ Database connection successful!`);
    console.log(`Total churches in database: ${count}`);
  } catch (err) {
    console.log(`❌ Database connection failed:`);
    console.log(err instanceof Error ? err.message : String(err));
    process.exitCode = 1;
  } finally {
    await prisma.$disconnect();
  }
}

testConnection();
180
scripts/debug/test-french-broader.ts
Normal file
@@ -0,0 +1,180 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test more French churches and collect diagnostic data
|
||||
*/
|
||||
|
||||
import { config } from 'dotenv';
|
||||
config({ path: '.env.local' });
|
||||
config({ path: '.env' });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
interface DiagnosticInfo {
|
||||
url: string;
|
||||
churchName: string;
|
||||
success: boolean;
|
||||
schedulesFound: number;
|
||||
hasFrenchDays: boolean;
|
||||
hasTimePatterns: boolean;
|
||||
timePatternsSample: string[];
|
||||
textSample: string;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
async function testFrenchBroader() {
|
||||
console.log('Testing 20 French churches with diagnostics...\n');
|
||||
|
||||
// Get more French churches
|
||||
const churches = await prisma.church.findMany({
|
||||
where: {
|
||||
country: 'FR',
|
||||
website: { not: null },
|
||||
source: 'osm',
|
||||
},
|
||||
take: 20,
|
||||
orderBy: { createdAt: 'asc' },
|
||||
});
|
||||
|
||||
if (churches.length === 0) {
|
||||
console.log('No French churches found.');
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
return;
|
||||
}
|
||||
|
||||
console.log(`Found ${churches.length} French churches to test\n`);
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
scraper.setCountry('FR');
|
||||
|
||||
let successCount = 0;
|
||||
let failCount = 0;
|
||||
const diagnostics: DiagnosticInfo[] = [];
|
||||
|
||||
for (let i = 0; i < churches.length; i++) {
|
||||
const church = churches[i];
|
||||
console.log(`[${i + 1}/${churches.length}] Testing: ${church.name} (${church.city || 'Unknown'})`);
|
||||
console.log(`URL: ${church.website}`);
|
||||
|
||||
try {
|
||||
const result = await scraper.scrape(church.website!);
|
||||
|
||||
// Extract diagnostics
|
||||
let hasFrenchDays = false;
|
||||
let hasTimePatterns = false;
|
||||
let timePatternsSample: string[] = [];
|
||||
let textSample = '';
|
||||
|
||||
if (result.rawHtml) {
|
||||
const text = result.rawHtml
|
||||
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
|
||||
.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
|
||||
.replace(/<[^>]+>/g, ' ')
|
||||
.replace(/\s+/g, ' ')
|
||||
.toLowerCase();
|
||||
|
||||
textSample = text.substring(0, 500);
|
||||
|
||||
const frenchDays = ['dimanche', 'lundi', 'mardi', 'mercredi', 'jeudi', 'vendredi', 'samedi'];
|
||||
hasFrenchDays = frenchDays.some(day => text.includes(day));
|
||||
|
||||
const timeRegex = /\d{1,2}[h:\.]\s*\d{0,2}\s*(?:h)?/g;
|
||||
const times = text.match(timeRegex);
|
||||
if (times) {
|
||||
hasTimePatterns = true;
|
||||
timePatternsSample = [...new Set(times)].slice(0, 10);
|
||||
}
|
||||
}
|
||||
|
||||
const diagnostic: DiagnosticInfo = {
|
||||
url: church.website!,
|
||||
churchName: church.name,
|
||||
success: result.success,
|
||||
schedulesFound: result.schedules.length,
|
||||
hasFrenchDays,
|
||||
hasTimePatterns,
|
||||
timePatternsSample,
|
||||
textSample,
|
||||
error: result.error,
|
||||
};
|
||||
|
||||
diagnostics.push(diagnostic);
|
||||
|
||||
if (result.success && result.schedules.length > 0) {
|
||||
successCount++;
|
||||
console.log(`✅ SUCCESS - ${result.schedules.length} schedules`);
|
||||
} else {
|
||||
failCount++;
|
||||
console.log(`❌ FAILED - ${result.error ?? 'no schedules parsed'}`);
|
||||
if (hasFrenchDays && !hasTimePatterns) {
|
||||
console.log(` 💡 Has French days but no times`);
|
||||
} else if (!hasFrenchDays && hasTimePatterns) {
|
||||
console.log(` 💡 Has times but no French days`);
|
||||
} else if (hasFrenchDays && hasTimePatterns) {
|
||||
console.log(` 💡 Has BOTH days and times - parsing issue!`);
|
||||
console.log(` Sample times: ${timePatternsSample.slice(0, 5).join(', ')}`);
|
||||
} else {
|
||||
console.log(` 💡 No mass schedule content found`);
|
||||
}
|
||||
}
|
||||
console.log('');
|
||||
} catch (err: any) {
|
||||
failCount++;
|
||||
console.log(`❌ ERROR - ${err.message}\n`);
|
||||
diagnostics.push({
|
||||
url: church.website!,
|
||||
churchName: church.name,
|
||||
success: false,
|
||||
schedulesFound: 0,
|
||||
hasFrenchDays: false,
|
||||
hasTimePatterns: false,
|
||||
timePatternsSample: [],
|
||||
textSample: '',
|
||||
error: err.message,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
await scraper.close();
|
||||
|
||||
// Analysis
|
||||
console.log('━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━');
|
||||
console.log(`\nRESULTS: ${successCount}/${churches.length} successful (${((successCount / churches.length) * 100).toFixed(0)}%)`);
|
||||
console.log('');
|
||||
|
||||
// A page counts as failed when the scrape errored or produced no schedules,
// matching the success test used in the loop above.
const failed = diagnostics.filter(d => !(d.success && d.schedulesFound > 0));
const hasBoth = failed.filter(d => d.hasFrenchDays && d.hasTimePatterns);
const hasDaysNoTimes = failed.filter(d => d.hasFrenchDays && !d.hasTimePatterns);
const hasTimesNoDays = failed.filter(d => !d.hasFrenchDays && d.hasTimePatterns);
const hasNeither = failed.filter(d => !d.hasFrenchDays && !d.hasTimePatterns);
|
||||
|
||||
console.log('FAILURE ANALYSIS:');
|
||||
console.log(` Has days + times but failed: ${hasBoth.length} (PARSING BUG)`);
|
||||
console.log(` Has days but no times: ${hasDaysNoTimes.length}`);
|
||||
console.log(` Has times but no days: ${hasTimesNoDays.length}`);
|
||||
console.log(` Has neither: ${hasNeither.length} (no mass schedule on page)`);
|
||||
console.log('');
|
||||
|
||||
if (hasBoth.length > 0) {
|
||||
console.log('⚠️ PARSING BUGS TO FIX (has both days and times but failed):');
|
||||
hasBoth.forEach(d => {
|
||||
console.log(` ${d.churchName}`);
|
||||
console.log(` URL: ${d.url}`);
|
||||
console.log(` Sample times found: ${d.timePatternsSample.slice(0, 5).join(', ')}`);
|
||||
console.log(` Text sample: ${d.textSample.substring(0, 150)}...`);
|
||||
console.log('');
|
||||
});
|
||||
}
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
testFrenchBroader().catch(console.error);
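The diagnostic step in this script (strip markup, then look for French day names and time-like tokens) can be expressed as a pure function that runs without a browser or database. The following is a sketch only: `diagnosePage` and `PageDiagnostics` are hypothetical names, and trimming the regex matches is an addition of this example, since the script's pattern can capture trailing whitespace.

```typescript
const FRENCH_DAYS = ['dimanche', 'lundi', 'mardi', 'mercredi', 'jeudi', 'vendredi', 'samedi'];

interface PageDiagnostics {
  hasFrenchDays: boolean;
  hasTimePatterns: boolean;
  timePatternsSample: string[];
}

// Hypothetical helper: reduces raw HTML to lowercase text and reports
// whether it contains French day names and time-like tokens.
function diagnosePage(rawHtml: string): PageDiagnostics {
  const text = rawHtml
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') // drop inline scripts
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')   // drop inline styles
    .replace(/<[^>]+>/g, ' ')                          // drop remaining tags
    .replace(/\s+/g, ' ')
    .toLowerCase();

  const hasFrenchDays = FRENCH_DAYS.some(day => text.includes(day));

  // Same time regex as the script; matches are trimmed and de-duplicated.
  const matches = text.match(/\d{1,2}[h:\.]\s*\d{0,2}\s*(?:h)?/g) ?? [];
  const timePatternsSample = Array.from(new Set(matches.map(m => m.trim()))).slice(0, 10);

  return { hasFrenchDays, hasTimePatterns: timePatternsSample.length > 0, timePatternsSample };
}

console.log(diagnosePage('<p>Dimanche : 10h30 et 18h</p>'));
```

A page that reports both flags but yields no schedules is the case the script labels a parsing bug, so these two booleans are enough to reproduce its failure buckets.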
|
||||
100
scripts/debug/test-french-scraper.ts
Executable file
@@ -0,0 +1,100 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test international scraper against French churches
|
||||
*/
|
||||
|
||||
import { config } from 'dotenv';
|
||||
config({ path: '.env.local' });
|
||||
config({ path: '.env' });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
async function testFrenchScraper() {
|
||||
console.log('Testing French church mass schedule scraping...\n');
|
||||
|
||||
// Get French churches with websites
|
||||
const churches = await prisma.church.findMany({
|
||||
where: {
|
||||
country: 'FR',
|
||||
website: { not: null },
|
||||
source: 'osm',
|
||||
},
|
||||
take: 5,
|
||||
orderBy: { createdAt: 'asc' },
|
||||
});
|
||||
|
||||
if (churches.length === 0) {
|
||||
console.log('No French churches with websites found.');
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
return;
|
||||
}
|
||||
|
||||
console.log(`Found ${churches.length} French churches to test:\n`);
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
scraper.setCountry('FR');
|
||||
|
||||
let successCount = 0;
|
||||
let failCount = 0;
|
||||
|
||||
for (const church of churches) {
|
||||
console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
|
||||
console.log(`Church: ${church.name}`);
|
||||
console.log(`City: ${church.city || 'Unknown'}`);
|
||||
console.log(`URL: ${church.website}`);
|
||||
console.log('');
|
||||
|
||||
try {
|
||||
const result = await scraper.scrape(church.website!);
|
||||
|
||||
if (result.success && result.schedules.length > 0) {
|
||||
successCount++;
|
||||
console.log(`✅ SUCCESS - Found ${result.schedules.length} schedules\n`);
|
||||
|
||||
// Group by day and show
|
||||
const byDay: Record<number, typeof result.schedules> = {};
|
||||
for (const sched of result.schedules) {
|
||||
if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
|
||||
byDay[sched.dayOfWeek].push(sched);
|
||||
}
|
||||
|
||||
const dayNames = ['Dimanche', 'Lundi', 'Mardi', 'Mercredi', 'Jeudi', 'Vendredi', 'Samedi'];
|
||||
Object.entries(byDay).forEach(([day, scheds]) => {
|
||||
console.log(` ${dayNames[parseInt(day)]}:`);
|
||||
scheds.forEach(s => {
|
||||
console.log(` ${s.time} - ${s.language || 'Unknown'} (${s.massType || 'Mass'})`);
|
||||
});
|
||||
});
|
||||
console.log('');
|
||||
} else {
|
||||
failCount++;
|
||||
console.log(`❌ FAILED - ${result.error ?? 'no schedules parsed'}`);
|
||||
console.log('');
|
||||
}
|
||||
} catch (err: any) {
|
||||
failCount++;
|
||||
console.log(`❌ ERROR - ${err.message}`);
|
||||
console.log('');
|
||||
}
|
||||
}
|
||||
|
||||
await scraper.close();
|
||||
|
||||
console.log('━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━');
|
||||
console.log(`\nRESULTS: ${successCount}/${churches.length} successful (${((successCount / churches.length) * 100).toFixed(0)}%)`);
|
||||
console.log(`Success: ${successCount}, Failed: ${failCount}\n`);
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
testFrenchScraper().catch(console.error);
|
||||
210
scripts/debug/test-international-sample.ts
Normal file
@@ -0,0 +1,210 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test scraper on a diverse sample of international churches
|
||||
* to identify edge cases across different languages and formats
|
||||
*/
|
||||
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
interface TestChurch {
|
||||
name: string;
|
||||
url: string;
|
||||
country: string;
|
||||
language: string;
|
||||
expectedDays?: string; // e.g., "Sun-Sat" or "Sun, Wed, Sat"
|
||||
notes?: string;
|
||||
}
|
||||
|
||||
// Sample churches from different countries/languages
|
||||
const testChurches: TestChurch[] = [
|
||||
// FRENCH
|
||||
{
|
||||
name: 'Saint-Étienne du Mont, Paris',
|
||||
url: 'https://www.saintetiennedumontparis.fr/',
|
||||
country: 'FR',
|
||||
language: 'French',
|
||||
notes: 'French format with "du lundi au vendredi"',
|
||||
},
|
||||
{
|
||||
name: 'Notre-Dame de la Garde, Marseille',
|
||||
url: 'https://www.notredamedelagarde.fr/',
|
||||
country: 'FR',
|
||||
language: 'French',
|
||||
notes: 'Major pilgrimage site',
|
||||
},
|
||||
|
||||
// GERMAN
|
||||
{
|
||||
name: 'St. Peter, Munich',
|
||||
url: 'https://www.alterpeter.de/',
|
||||
country: 'DE',
|
||||
language: 'German',
|
||||
notes: 'German format with "bis" for ranges',
|
||||
},
|
||||
{
|
||||
name: 'Kölner Dom, Cologne',
|
||||
url: 'https://www.koelner-dom.de/',
|
||||
country: 'DE',
|
||||
language: 'German',
|
||||
notes: 'Cathedral with Uhr time format',
|
||||
},
|
||||
|
||||
// SPANISH
|
||||
{
|
||||
name: 'Sagrada Família, Barcelona',
|
||||
url: 'https://sagradafamilia.org/',
|
||||
country: 'ES',
|
||||
language: 'Spanish',
|
||||
notes: 'Major tourist site, may have complex schedule',
|
||||
},
|
||||
{
|
||||
name: 'Parroquia San Miguel, Madrid',
|
||||
url: 'https://www.parroquiasanmiguel.es/',
|
||||
country: 'ES',
|
||||
language: 'Spanish',
|
||||
notes: 'Spanish format with "de lunes a viernes"',
|
||||
},
|
||||
|
||||
// PORTUGUESE
|
||||
{
|
||||
name: 'Basílica da Estrela, Lisbon',
|
||||
url: 'https://www.basilicadaestrela.com/',
|
||||
country: 'PT',
|
||||
language: 'Portuguese',
|
||||
notes: 'Portuguese format',
|
||||
},
|
||||
|
||||
// ITALIAN
|
||||
{
|
||||
name: 'Santa Maria Maggiore, Rome',
|
||||
url: 'https://www.vatican.va/various/basiliche/sm_maggiore/index_it.htm',
|
||||
country: 'IT',
|
||||
language: 'Italian',
|
||||
notes: 'Major basilica',
|
||||
},
|
||||
{
|
||||
name: 'Duomo di Milano',
|
||||
url: 'https://www.duomomilano.it/',
|
||||
country: 'IT',
|
||||
language: 'Italian',
|
||||
notes: 'Cathedral with Italian format',
|
||||
},
|
||||
|
||||
// DUTCH
|
||||
{
|
||||
name: 'Basiliek van de H. Nicolaas, Amsterdam',
|
||||
url: 'https://www.nicolaas-parochie.nl/',
|
||||
country: 'NL',
|
||||
language: 'Dutch',
|
||||
notes: 'Dutch format with "tot" for ranges',
|
||||
},
|
||||
|
||||
// CZECH
|
||||
{
|
||||
name: 'Chrám sv. Víta, Prague',
|
||||
url: 'https://www.katedralasvatehovita.cz/',
|
||||
country: 'CZ',
|
||||
language: 'Czech',
|
||||
notes: 'Czech format',
|
||||
},
|
||||
|
||||
// HUNGARIAN
|
||||
{
|
||||
name: 'Szent István Bazilika, Budapest',
|
||||
url: 'https://www.bazilika.biz/',
|
||||
country: 'HU',
|
||||
language: 'Hungarian',
|
||||
notes: 'Hungarian format',
|
||||
},
|
||||
|
||||
// More complex cases
|
||||
{
|
||||
name: 'Cathédrale Notre-Dame, Strasbourg',
|
||||
url: 'https://www.cathedrale-strasbourg.fr/',
|
||||
country: 'FR',
|
||||
language: 'French',
|
||||
notes: 'Bilingual region (French/German)',
|
||||
},
|
||||
];
|
||||
|
||||
type Outcome = 'success' | 'failed' | 'noSchedules';

// Scrapes one church and returns the outcome so main() can tally the summary.
async function testChurch(church: TestChurch, scraper: GenericScraper): Promise<Outcome> {
  console.log(`\n${'='.repeat(80)}`);
  console.log(`📍 ${church.name}`);
  console.log(` ${church.url}`);
  console.log(` Language: ${church.language} | Country: ${church.country}`);
  if (church.notes) console.log(` Notes: ${church.notes}`);
  console.log(`${'='.repeat(80)}`);

  try {
    scraper.setCountry(church.country);
    const result = await scraper.scrape(church.url);

    if (!result.success) {
      console.log(`❌ FAILED: ${result.error || 'Unknown error'}`);
      return 'failed';
    }

    if (result.schedules.length === 0) {
      console.log(`⚠️ SUCCESS but NO SCHEDULES found`);
      return 'noSchedules';
    }

    // Group by day
    const byDay: Record<number, typeof result.schedules> = {};
    for (const sched of result.schedules) {
      if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
      byDay[sched.dayOfWeek].push(sched);
    }

    const dayNames = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'];
    console.log(`\n✅ Found ${result.schedules.length} schedules:\n`);

    for (let i = 0; i < 7; i++) {
      if (byDay[i]) {
        const times = byDay[i].map(s => {
          let str = s.time;
          if (s.massType) str += ` (${s.massType})`;
          if (s.language && s.language !== 'English') str += ` [${s.language}]`;
          return str;
        }).join(', ');
        console.log(` ${dayNames[i]}: ${times}`);
      }
    }

    return 'success';
  } catch (error) {
    console.log(`❌ ERROR: ${error instanceof Error ? error.message : String(error)}`);
    return 'failed';
  }
}

async function main() {
  const scraper = new GenericScraper();
  await scraper.init();

  console.log('🌍 INTERNATIONAL CHURCH SCRAPER TEST');
  console.log(`Testing ${testChurches.length} churches across ${new Set(testChurches.map(c => c.country)).size} countries`);

  const results: Record<Outcome, number> = {
    success: 0,
    failed: 0,
    noSchedules: 0,
  };

  for (const church of testChurches) {
    // Record each outcome; previously the counters were never updated
    const outcome = await testChurch(church, scraper);
    results[outcome]++;

    // Brief delay between requests to be respectful
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
|
||||
|
||||
await scraper.close();
|
||||
|
||||
console.log(`\n${'='.repeat(80)}`);
|
||||
console.log('📊 SUMMARY');
|
||||
console.log(`${'='.repeat(80)}`);
|
||||
console.log(`Total tested: ${testChurches.length}`);
|
||||
console.log(`✅ Success with schedules: ${results.success}`);
|
||||
console.log(`⚠️ Success but no schedules: ${results.noSchedules}`);
|
||||
console.log(`❌ Failed: ${results.failed}`);
|
||||
}
|
||||
|
||||
main().catch(console.error);
|
||||
36
scripts/debug/test-masstimes-api.ts
Normal file
@@ -0,0 +1,36 @@
|
||||
/**
|
||||
* Quick test script to verify the masstimes.org JSON API scraper works
|
||||
* Usage: npx tsx scripts/debug/test-masstimes-api.ts
|
||||
*/
|
||||
|
||||
import { writeFileSync } from 'fs';
|
||||
import { MassTimesScraper } from '../../src/lib/masstimes-scraper';
|
||||
|
||||
async function main() {
|
||||
console.log('Testing MassTimes.org JSON API Scraper\n');
|
||||
|
||||
const scraper = new MassTimesScraper();
|
||||
|
||||
try {
|
||||
await scraper.init();
|
||||
console.log('Browser initialized\n');
|
||||
|
||||
const lat = 34.852;
|
||||
const lng = -82.394;
|
||||
console.log(`Fetching churches near Greenville, SC (${lat}, ${lng})...\n`);
|
||||
|
||||
const churches = await scraper.scrapeByLocation(lat, lng);
|
||||
|
||||
const outPath = 'scraped-churches.json';
|
||||
writeFileSync(outPath, JSON.stringify(churches, null, 2));
|
||||
console.log(`\nSaved ${churches.length} churches to ${outPath}`);
|
||||
|
||||
} catch (error) {
|
||||
console.error('TEST FAILED:', error);
|
||||
// Set the exit code instead of calling process.exit() so scraper.close() still runs
process.exitCode = 1;
|
||||
} finally {
|
||||
await scraper.close();
|
||||
}
|
||||
}
|
||||
|
||||
main();
|
||||
70
scripts/debug/test-polish-sections.ts
Normal file
@@ -0,0 +1,70 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test which sections are being created for Polish church
|
||||
*/
|
||||
|
||||
import { getDayNamesForCountry, buildDayPatterns } from '../../src/scrapers/i18n/day-names';
|
||||
|
||||
// Exact text from the page
|
||||
const text = `msze święte niedziela i uroczystości: 8 00 , 9 30 (lubojenka), 11 00 , 16 00 w lipcu i sierpniu nie ma mszy popołudniowej!--> dni powszednie: poniedziałek: godz. 8 00 wtorek - sobota: godz. 18 00`.toLowerCase();
|
||||
|
||||
console.log('Text:');
|
||||
console.log(text);
|
||||
console.log('\n');
|
||||
|
||||
const dayConfigs = getDayNamesForCountry('PL');
|
||||
const dayPatterns = buildDayPatterns(dayConfigs);
|
||||
const sortedDayNames = Object.keys(dayPatterns).sort((a, b) => b.length - a.length);
|
||||
const allDayNamesPattern = sortedDayNames.map(d => d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|');
|
||||
|
||||
console.log('=== Testing individual day matching ===\n');
|
||||
|
||||
// Test niedziela specifically
|
||||
const niedziela = 'niedziela';
|
||||
const escaped = niedziela.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
|
||||
const regex = new RegExp(
|
||||
`(?:^|\\s|[,;:])${escaped}[:\\s]+([^]*?)(?=${allDayNamesPattern}|$)`,
|
||||
'i'
|
||||
);
|
||||
|
||||
const match = text.match(regex);
|
||||
if (match) {
|
||||
console.log(`✓ niedziela matched!`);
|
||||
console.log(` Full match: "${match[0].substring(0, 100)}"`);
|
||||
console.log(` Captured text: "${match[1].substring(0, 100)}"`);
|
||||
console.log('');
|
||||
|
||||
// Test if times can be extracted from captured text
|
||||
const spacePattern = /\b(\d{1,2})\s+(\d{2})(?!\d)/g;
|
||||
const times = match[1].match(spacePattern);
|
||||
console.log(` Times in captured text: ${times ? times.join(', ') : 'none'}`);
|
||||
} else {
|
||||
console.log(`✗ niedziela NOT matched`);
|
||||
console.log('');
|
||||
|
||||
// Try simpler regex
|
||||
const simpleRegex = /niedziela[:\s]+(.{0,100})/i;
|
||||
const simpleMatch = text.match(simpleRegex);
|
||||
if (simpleMatch) {
|
||||
console.log(`Simple regex matched: "${simpleMatch[1]}"`);
|
||||
}
|
||||
}
|
||||
|
||||
// Test poniedziałek
|
||||
console.log('\n=== Testing poniedziałek ===\n');
|
||||
|
||||
const ponieRegex = new RegExp(
|
||||
`(?:^|\\s|[,;:])poniedziałek[:\\s]+([^]*?)(?=${allDayNamesPattern}|$)`,
|
||||
'i'
|
||||
);
|
||||
|
||||
const ponieMatch = text.match(ponieRegex);
|
||||
if (ponieMatch) {
|
||||
console.log(`✓ poniedziałek matched!`);
|
||||
console.log(` Captured text: "${ponieMatch[1].substring(0, 100)}"`);
|
||||
|
||||
const times = ponieMatch[1].match(/\b(\d{1,2})\s+(\d{2})(?!\d)/g);
|
||||
console.log(` Times: ${times ? times.join(', ') : 'none'}`);
|
||||
} else {
|
||||
console.log(`✗ poniedziałek NOT matched`);
|
||||
}
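The space-separated time format tested above ("8 00", "16 00") can be normalized once the pattern has matched. The sketch below reuses the same regular expression; the function name `extractSpaceSeparatedTimes` and the range check on hours and minutes are assumptions of this example, not behavior confirmed in `GenericScraper`.

```typescript
// Hypothetical helper: finds times written as "8 00" or "16 00" and
// normalizes them to zero-padded HH:MM strings.
function extractSpaceSeparatedTimes(text: string): string[] {
  // Same pattern as the script: 1-2 digit hour, whitespace, 2 digit minute.
  const pattern = /\b(\d{1,2})\s+(\d{2})(?!\d)/g;
  const times: string[] = [];
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(text)) !== null) {
    const hours = Number(match[1]);
    const minutes = Number(match[2]);
    // Skip digit pairs that cannot be a clock time (assumed filter).
    if (hours > 23 || minutes > 59) continue;
    times.push(`${String(hours).padStart(2, '0')}:${match[2]}`);
  }
  return times;
}

console.log(extractSpaceSeparatedTimes(
  'niedziela i uroczystości: 8 00 , 9 30 (lubojenka), 11 00 , 16 00'
));
```

The range check is a cheap guard against phone numbers and similar digit groups, though it cannot tell a real time from any other plausible pair of numbers.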
|
||||
65
scripts/debug/test-polish-with-logging.ts
Normal file
@@ -0,0 +1,65 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test Polish church with detailed section logging
|
||||
*/
|
||||
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
// Temporarily modify GenericScraper to add logging
|
||||
const originalParse = GenericScraper.prototype['parseSchedules'];
|
||||
GenericScraper.prototype['parseSchedules'] = function(html: string) {
|
||||
const text = html
|
||||
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
|
||||
.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
|
||||
.replace(/<[^>]+>/g, ' ')
|
||||
.replace(/\s+/g, ' ')
|
||||
.toLowerCase();
|
||||
|
||||
// Call findScheduleSections and log result
|
||||
const sections = this['findScheduleSections'](text);
|
||||
|
||||
console.log('\n=== Sections found by findScheduleSections() ===\n');
|
||||
const dayNames = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'];
|
||||
sections.forEach((section: any, i: number) => {
|
||||
console.log(`Section ${i + 1}: ${dayNames[section.day]} (day ${section.day})`);
|
||||
console.log(` Text: "${section.text.substring(0, 80)}..."`);
|
||||
});
|
||||
console.log(`\nTotal sections: ${sections.length}\n`);
|
||||
|
||||
// Continue with normal processing
|
||||
return originalParse.call(this, html);
|
||||
};
|
||||
|
||||
async function testPolish() {
|
||||
const url = 'http://parafialubojna.pl';
|
||||
console.log(`Testing: ${url}`);
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
scraper.setCountry('PL');
|
||||
|
||||
const result = await scraper.scrape(url);
|
||||
|
||||
console.log(`\nFinal result: ${result.success}`);
|
||||
console.log(`Schedules: ${result.schedules.length}\n`);
|
||||
|
||||
if (result.schedules.length > 0) {
|
||||
const byDay: Record<number, typeof result.schedules> = {};
|
||||
for (const sched of result.schedules) {
|
||||
if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
|
||||
byDay[sched.dayOfWeek].push(sched);
|
||||
}
|
||||
|
||||
const dayNamesPL = ['Niedziela', 'Poniedziałek', 'Wtorek', 'Środa', 'Czwartek', 'Piątek', 'Sobota'];
|
||||
console.log('Parsed schedules by day:');
|
||||
for (let i = 0; i < 7; i++) {
|
||||
if (byDay[i]) {
|
||||
console.log(` ${dayNamesPL[i]}: ${byDay[i].map(s => s.time).join(', ')}`);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
await scraper.close();
|
||||
}
|
||||
|
||||
testPolish().catch(console.error);
|
||||
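The script above temporarily replaces a prototype method to log intermediate results and then delegates to the original. A reusable version of that idea might look like the following sketch. `DemoParser` and `wrapWithLogging` are hypothetical names used only for illustration; they are not part of `GenericScraper`.

```typescript
// Hypothetical stand-in for a class whose method we want to observe.
class DemoParser {
  parseItems(input: string): string[] {
    return input.split(',').map((s) => s.trim());
  }
}

// Wraps proto[methodName] so onCall sees the arguments before the original runs.
// Returns a function that restores the original method.
function wrapWithLogging(
  proto: any,
  methodName: string,
  onCall: (args: unknown[]) => void
): () => void {
  const original = proto[methodName];
  proto[methodName] = function (this: unknown, ...args: unknown[]) {
    onCall(args);
    // Preserve `this` so the original method still sees the instance.
    return original.apply(this, args);
  };
  return () => {
    proto[methodName] = original;
  };
}
```

Returning a restore function keeps the patch scoped to the debugging session, which is useful when several debug helpers run in the same process.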
49
scripts/debug/test-time-extraction.ts
Normal file
@@ -0,0 +1,49 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test which time pattern produces the stray "00" match
|
||||
*/
|
||||
|
||||
// Test text from German church
|
||||
const testText = "10:00 uhr lateinisches amt";
|
||||
|
||||
const timePatterns = [
|
||||
{ name: '12-hour AM/PM', pattern: /(\d{1,2}):(\d{2})\s*(AM|PM|am|pm|a\.m\.|p\.m\.)/g },
|
||||
{ name: '12-hour no minutes', pattern: /(?<![:\d])(\d{1,2})\s*(AM|PM|am|pm|a\.m\.|p\.m\.)/g },
|
||||
{ name: '24-hour colon', pattern: /(?<![:\d\w])(\d{1,2}):(\d{2})(?!\s*(AM|PM|am|pm))/g },
|
||||
{ name: 'French/Portuguese h', pattern: /(?<![:\d\w])(\d{1,2})\s*h\s*(\d{2})?(?!\w)/gi },
|
||||
{ name: 'Italian period', pattern: /(?<![:\d\w])(\d{1,2})\.(\d{2})(?=\s|$|,|;|\)|\])/g },
|
||||
{ name: 'German Uhr (old)', pattern: /(\d{1,2})[:\.]?(\d{2})?\s*Uhr/gi },
|
||||
{ name: 'German Uhr (fixed)', pattern: /(?<![:\d])(\d{1,2})[:\.]?(\d{2})?\s*Uhr/gi },
|
||||
{ name: 'Polish space', pattern: /\b(\d{1,2})\s+(\d{2})(?!\d)/g },
|
||||
];
|
||||
|
||||
console.log(`Test text: "${testText}"\n`);
|
||||
|
||||
for (const { name, pattern } of timePatterns) {
|
||||
const matches = [...testText.matchAll(pattern)];
|
||||
if (matches.length > 0) {
|
||||
console.log(`✓ ${name}:`);
|
||||
for (const match of matches) {
|
||||
console.log(` Matched: "${match[0]}" at index ${match.index}`);
|
||||
}
|
||||
} else {
|
||||
console.log(`✗ ${name}: no match`);
|
||||
}
|
||||
}
|
||||
|
||||
// Now test with just "00 uhr"
|
||||
console.log(`\n${'='.repeat(60)}\n`);
|
||||
const testText2 = "00 uhr lateinisches";
|
||||
console.log(`Test text: "${testText2}"\n`);
|
||||
|
||||
for (const { name, pattern } of timePatterns) {
|
||||
const matches = [...testText2.matchAll(pattern)];
|
||||
if (matches.length > 0) {
|
||||
console.log(`✓ ${name}:`);
|
||||
for (const match of matches) {
|
||||
console.log(` Matched: "${match[0]}" at index ${match.index}`);
|
||||
}
|
||||
} else {
|
||||
console.log(`✗ ${name}: no match`);
|
||||
}
|
||||
}
|
||||
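The difference between the old and fixed `Uhr` patterns only shows up when the hour cannot be consumed together with the minutes. The sketch below uses a hypothetical input with a stray space before the colon (`10 :00 uhr`) to make that visible; the input string and the helper `matchesOf` are assumptions for illustration, not data from a real site.

```typescript
// Patterns copied from the debug script above.
const oldUhrPattern = /(\d{1,2})[:\.]?(\d{2})?\s*Uhr/gi;
const fixedUhrPattern = /(?<![:\d])(\d{1,2})[:\.]?(\d{2})?\s*Uhr/gi;

// Collects the full text of every match without mutating the shared pattern.
function matchesOf(pattern: RegExp, text: string): string[] {
  const re = new RegExp(pattern.source, pattern.flags);
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push(m[0]);
  }
  return out;
}

// Hypothetical malformed input: the old pattern latches onto the minutes alone.
const malformedUhrText = '10 :00 uhr';
const oldUhrMatches = matchesOf(oldUhrPattern, malformedUhrText);
const fixedUhrMatches = matchesOf(fixedUhrPattern, malformedUhrText);
```

With the lookbehind in place, digits that directly follow a colon or another digit can no longer start a match, so the minutes are not reported as a standalone hour.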
193
scripts/debug/test-top5-countries.ts
Normal file
@@ -0,0 +1,193 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Quick test of top 5 priority countries
|
||||
*/
|
||||
|
||||
import { config } from 'dotenv';
|
||||
config({ path: '.env.local' });
|
||||
config({ path: '.env' });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
const COUNTRIES = [
|
||||
{ code: 'FR', name: 'France' },
|
||||
{ code: 'DE', name: 'Germany' },
|
||||
{ code: 'ES', name: 'Spain' },
|
||||
{ code: 'PL', name: 'Poland' },
|
||||
{ code: 'BR', name: 'Brazil' },
|
||||
];
|
||||
|
||||
const PER_COUNTRY = 10;
|
||||
|
||||
interface CountryResult {
|
||||
country: string;
|
||||
countryName: string;
|
||||
tested: number;
|
||||
success: number;
|
||||
failed: number;
|
||||
successRate: number;
|
||||
hasBothButFailed: number; // Has days + times but parsing failed
|
||||
totalSchedules: number;
|
||||
sampleSuccess?: string;
|
||||
}
|
||||
|
||||
async function testTop5() {
|
||||
console.log('Testing top 5 priority countries (10 churches each)...\n');
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
|
||||
const results: CountryResult[] = [];
|
||||
|
||||
for (const country of COUNTRIES) {
|
||||
console.log(`\n${'='.repeat(60)}`);
|
||||
console.log(`Testing ${country.name} (${country.code})`);
|
||||
console.log('='.repeat(60));
|
||||
|
||||
const churches = await prisma.church.findMany({
|
||||
where: {
|
||||
country: country.code,
|
||||
website: { not: null },
|
||||
source: 'osm',
|
||||
},
|
||||
take: PER_COUNTRY,
|
||||
orderBy: { createdAt: 'asc' },
|
||||
});
|
||||
|
||||
if (churches.length === 0) {
|
||||
console.log(`No churches with websites found for ${country.name}\n`);
|
||||
continue;
|
||||
}
|
||||
|
||||
scraper.setCountry(country.code);
|
||||
|
||||
let success = 0;
|
||||
let failed = 0;
|
||||
let hasBothButFailed = 0;
|
||||
let totalSchedules = 0;
|
||||
let sampleSuccess: string | undefined;
|
||||
|
||||
for (let i = 0; i < churches.length; i++) {
|
||||
const church = churches[i];
|
||||
process.stdout.write(`[${i + 1}/${churches.length}] ${church.name.substring(0, 40).padEnd(40)} `);
|
||||
|
||||
try {
|
||||
const result = await scraper.scrape(church.website!);
|
||||
|
||||
if (result.success && result.schedules.length > 0) {
|
||||
success++;
|
||||
totalSchedules += result.schedules.length;
|
||||
process.stdout.write(`✅ ${result.schedules.length} schedules\n`);
|
||||
|
||||
if (!sampleSuccess) {
|
||||
sampleSuccess = `${church.name}: ${result.schedules.length} schedules`;
|
||||
}
|
||||
} else {
|
||||
failed++;
|
||||
process.stdout.write(`❌ ${result.error ?? 'no schedules found'}\n`);
|
||||
|
||||
// Check if has both days and times (parsing bug indicator)
|
||||
if (result.rawHtml) {
|
||||
const text = result.rawHtml
|
||||
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
|
||||
.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
|
||||
.replace(/<[^>]+>/g, ' ')
|
||||
.replace(/\s+/g, ' ')
|
||||
.toLowerCase();
|
||||
|
||||
// Check for day names in any language
|
||||
// Unicode-aware boundaries: plain \b treats letters such as "ś" as non-word characters, so \bśroda\b never matches
const hasDays = text.match(/(?<![\p{L}\p{N}])(sunday|monday|tuesday|wednesday|thursday|friday|saturday|dimanche|lundi|mardi|mercredi|jeudi|vendredi|samedi|sonntag|montag|dienstag|mittwoch|donnerstag|freitag|samstag|domingo|lunes|martes|miércoles|miercoles|jueves|viernes|sábado|sabado|niedziela|poniedziałek|poniedzialek|wtorek|środa|sroda|czwartek|piątek|piatek|sobota|segunda|terça|terca|quarta|quinta|sexta)(?![\p{L}\p{N}])/iu);
|
||||
|
||||
const hasTimes = text.match(/\d{1,2}[h:\.]\s*\d{0,2}/);
|
||||
|
||||
if (hasDays && hasTimes) {
|
||||
hasBothButFailed++;
|
||||
process.stdout.write(` ⚠️ Has days + times but failed to parse\n`);
|
||||
}
|
||||
}
|
||||
}
|
||||
} catch (err: any) {
|
||||
failed++;
|
||||
process.stdout.write(`❌ ERROR: ${err.message}\n`);
|
||||
}
|
||||
}
|
||||
|
||||
const successRate = churches.length > 0 ? (success / churches.length) * 100 : 0;
|
||||
|
||||
results.push({
|
||||
country: country.code,
|
||||
countryName: country.name,
|
||||
tested: churches.length,
|
||||
success,
|
||||
failed,
|
||||
successRate,
|
||||
hasBothButFailed,
|
||||
totalSchedules,
|
||||
sampleSuccess,
|
||||
});
|
||||
|
||||
console.log(`\n${country.name} Summary: ${success}/${churches.length} (${successRate.toFixed(0)}%)`);
|
||||
console.log(` Total schedules extracted: ${totalSchedules}`);
|
||||
if (hasBothButFailed > 0) {
|
||||
console.log(` ⚠️ Parsing bugs: ${hasBothButFailed} (has content but failed to parse)`);
|
||||
}
|
||||
}
|
||||
|
||||
await scraper.close();
|
||||
|
||||
// Final summary
|
||||
console.log('\n\n');
|
||||
console.log('═'.repeat(80));
|
||||
console.log('FINAL RESULTS - TOP 5 COUNTRIES');
|
||||
console.log('═'.repeat(80));
|
||||
console.log('');
|
||||
console.log('Country | Tested | Success | Rate | Schedules | Bugs');
|
||||
console.log('─'.repeat(80));
|
||||
|
||||
const totalTested = results.reduce((sum, r) => sum + r.tested, 0);
|
||||
const totalSuccess = results.reduce((sum, r) => sum + r.success, 0);
|
||||
const totalSchedules = results.reduce((sum, r) => sum + r.totalSchedules, 0);
|
||||
const totalBugs = results.reduce((sum, r) => sum + r.hasBothButFailed, 0);
|
||||
|
||||
results.forEach(r => {
|
||||
const country = r.countryName.padEnd(12);
|
||||
const tested = String(r.tested).padStart(6);
|
||||
const success = String(r.success).padStart(7);
|
||||
const rate = `${r.successRate.toFixed(0)}%`.padStart(5);
|
||||
const schedules = String(r.totalSchedules).padStart(9);
|
||||
const bugs = r.hasBothButFailed > 0 ? `⚠️ ${r.hasBothButFailed}` : '✓';
|
||||
|
||||
console.log(`${country} | ${tested} | ${success} | ${rate} | ${schedules} | ${bugs}`);
|
||||
});
|
||||
|
||||
console.log('─'.repeat(80));
|
||||
const avgRate = totalTested > 0 ? (totalSuccess / totalTested) * 100 : 0;
|
||||
console.log(`OVERALL | ${String(totalTested).padStart(6)} | ${String(totalSuccess).padStart(7)} | ${avgRate.toFixed(0).padStart(4)}% | ${String(totalSchedules).padStart(9)} | ${totalBugs > 0 ? `⚠️ ${totalBugs}` : '✓'}`);
|
||||
console.log('');
|
||||
console.log('═'.repeat(80));
|
||||
console.log('');
|
||||
|
||||
if (totalBugs > 0) {
|
||||
console.log(`⚠️ ${totalBugs} parsing bugs detected (has days + times but failed)`);
|
||||
console.log(' These need investigation and fixes.\n');
|
||||
} else {
|
||||
console.log('✅ No parsing bugs! All failures are legitimate (no content or wrong page).\n');
|
||||
}
|
||||
|
||||
console.log(`Total churches tested: ${totalTested}`);
|
||||
console.log(`Total successful: ${totalSuccess} (${avgRate.toFixed(1)}%)`);
|
||||
console.log(`Total mass schedules extracted: ${totalSchedules}`);
|
||||
console.log('');
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
testTop5().catch(console.error);
|
||||
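The day-name check in the script above relies on word boundaries. In JavaScript, `\b` without the `u` flag only treats ASCII letters, digits and `_` as word characters, so a name that starts or ends with a non-ASCII letter (for example `środa`) is not matched. A minimal sketch of a Unicode-aware alternative follows; the helper names `escapeRegExp` and `containsDayName` are hypothetical.

```typescript
// Escapes characters that have special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: true if any of the given day names appears as a whole word.
// Lookarounds on \p{L} / \p{N} replace \b so non-ASCII letters count as word characters.
function containsDayName(text: string, dayNames: string[]): boolean {
  const alternation = dayNames.map(escapeRegExp).join('|');
  const re = new RegExp(
    `(?<![\\p{L}\\p{N}])(?:${alternation})(?![\\p{L}\\p{N}])`,
    'iu'
  );
  return re.test(text);
}

// For comparison: the ASCII-only boundary misses a name starting with "ś".
const asciiBoundaryMiss = /\bśroda\b/i.test('msze: środa 18:00');
```

Building the pattern from a list also keeps the per-language day names in data rather than in one long literal.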
173
scripts/debug/test-website-scraper.ts
Normal file
@@ -0,0 +1,173 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Test website scraper on churches with websites
|
||||
* Analyzes which websites can be scraped successfully
|
||||
*/
|
||||
|
||||
// Load .env
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
import fs from 'fs';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
interface TestResult {
|
||||
churchId: string;
|
||||
name: string;
|
||||
website: string;
|
||||
country: string;
|
||||
success: boolean;
|
||||
massesFound: number;
|
||||
schedules?: { dayOfWeek: number; time: string; massType?: string; language?: string }[];
|
||||
error?: string;
|
||||
}
|
||||
|
||||
function normalizeUrl(url: string): string {
|
||||
if (!url.startsWith('http://') && !url.startsWith('https://')) {
|
||||
return `https://${url}`;
|
||||
}
|
||||
return url;
|
||||
}
|
||||
|
||||
async function testScrapers(limit: number = 50, country?: string) {
|
||||
const results: TestResult[] = [];
|
||||
|
||||
// Get churches with websites
|
||||
const whereClause: any = {
|
||||
website: { not: null },
|
||||
};
|
||||
|
||||
if (country) {
|
||||
whereClause.country = country;
|
||||
}
|
||||
|
||||
const churches = await prisma.church.findMany({
|
||||
where: whereClause,
|
||||
take: limit,
|
||||
orderBy: { createdAt: 'desc' },
|
||||
});
|
||||
|
||||
console.log(`Testing ${churches.length} churches with websites...\n`);
|
||||
|
||||
// Initialize the scraper (launches Playwright browser)
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
|
||||
try {
|
||||
for (let i = 0; i < churches.length; i++) {
|
||||
const church = churches[i];
|
||||
const url = normalizeUrl(church.website!);
|
||||
console.log(`[${i + 1}/${churches.length}] Testing: ${church.name}`);
|
||||
console.log(` Website: ${url}`);
|
||||
|
||||
try {
|
||||
const result = await scraper.scrape(url);
|
||||
|
||||
results.push({
|
||||
churchId: church.id,
|
||||
name: church.name,
|
||||
website: url,
|
||||
country: church.country,
|
||||
success: result.success,
|
||||
massesFound: result.schedules.length,
|
||||
schedules: result.schedules.map((s) => ({
|
||||
dayOfWeek: s.dayOfWeek,
|
||||
time: s.time,
|
||||
massType: s.massType,
|
||||
language: s.language,
|
||||
})),
|
||||
error: result.error,
|
||||
});
|
||||
|
||||
if (result.success) {
|
||||
console.log(` ✓ ${result.schedules.length} masses found`);
|
||||
for (const s of result.schedules) {
|
||||
const days = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
|
||||
console.log(` ${days[s.dayOfWeek]} ${s.time} (${s.language || 'English'}${s.massType ? ', ' + s.massType : ''})`);
|
||||
}
|
||||
} else {
|
||||
console.log(` ✗ No masses found: ${result.error}`);
|
||||
}
|
||||
} catch (error: any) {
|
||||
console.log(` ✗ Error: ${error.message}`);
|
||||
results.push({
|
||||
churchId: church.id,
|
||||
name: church.name,
|
||||
website: url,
|
||||
country: church.country,
|
||||
success: false,
|
||||
massesFound: 0,
|
||||
error: error.message,
|
||||
});
|
||||
}
|
||||
|
||||
console.log('');
|
||||
}
|
||||
} finally {
|
||||
// Always close the browser
|
||||
await scraper.close();
|
||||
}
|
||||
|
||||
// Summary
|
||||
const successful = results.filter((r) => r.success);
|
||||
const failed = results.filter((r) => !r.success);
|
||||
const totalMasses = results.reduce((sum, r) => sum + r.massesFound, 0);
|
||||
|
||||
console.log('============================================================');
|
||||
console.log('Test Summary');
|
||||
console.log('============================================================');
|
||||
console.log(`Total churches tested: ${results.length}`);
|
||||
console.log(`Successful scrapes: ${successful.length} (${(results.length > 0 ? (successful.length / results.length) * 100 : 0).toFixed(1)}%)`);
|
||||
console.log(`Failed scrapes: ${failed.length} (${(results.length > 0 ? (failed.length / results.length) * 100 : 0).toFixed(1)}%)`);
|
||||
console.log(`Total masses found: ${totalMasses}`);
|
||||
console.log('============================================================');
|
||||
|
||||
if (failed.length > 0) {
|
||||
console.log('\nFailed websites:');
|
||||
for (const f of failed) {
|
||||
console.log(` - ${f.name}: ${f.website} (${f.error})`);
|
||||
}
|
||||
}
|
||||
|
||||
console.log('');
|
||||
|
||||
// Export results (without raw HTML to keep file manageable)
|
||||
fs.writeFileSync(
|
||||
'scraper-test-results.json',
|
||||
JSON.stringify(results, null, 2)
|
||||
);
|
||||
console.log('Results saved to scraper-test-results.json');
|
||||
|
||||
return results;
|
||||
}
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const limitIndex = args.indexOf('--limit');
|
||||
const countryIndex = args.indexOf('--country');
|
||||
|
||||
const limit = limitIndex !== -1 ? parseInt(args[limitIndex + 1], 10) : 50;
|
||||
const country = countryIndex !== -1 ? args[countryIndex + 1] : undefined;
|
||||
|
||||
console.log('============================================================');
|
||||
console.log('Website Scraper Testing');
|
||||
console.log('============================================================');
|
||||
console.log(`Limit: ${limit}`);
|
||||
console.log(`Country: ${country || 'All'}`);
|
||||
console.log('============================================================\n');
|
||||
|
||||
await testScrapers(limit, country);
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
main().catch(console.error);
|
||||
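The script above reads `--limit` and `--country` directly from `process.argv` and prints percentages from raw counts. A slightly more defensive version of both steps could look like this sketch; `readIntFlag` and `formatPercent` are hypothetical helpers, not functions from the repository.

```typescript
// Hypothetical helper: reads an integer flag value, falling back when the flag
// is absent, has no value after it, or the value is not a number.
function readIntFlag(argv: string[], flag: string, fallback: number): number {
  const idx = argv.indexOf(flag);
  if (idx === -1 || idx + 1 >= argv.length) return fallback;
  const parsed = Number.parseInt(argv[idx + 1], 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Hypothetical helper: formats part/total as a percentage with one decimal,
// returning "0.0" instead of "NaN" when nothing was tested.
function formatPercent(part: number, total: number): string {
  return total > 0 ? ((part / total) * 100).toFixed(1) : '0.0';
}
```

For example, `readIntFlag(process.argv.slice(2), '--limit', 50)` keeps the default of 50 when `--limit` is the last argument or is followed by a non-numeric value.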
53
scripts/debug/verify-paz-schedules.ts
Normal file
@@ -0,0 +1,53 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Verify Paróquia da Paz schedules are correctly parsed
|
||||
*/
|
||||
|
||||
import { GenericScraper } from '../../src/scrapers/strategies/generic';
|
||||
|
||||
async function verifyPazSchedules() {
|
||||
const url = 'https://www.paroquiadapaz.org.br/';
|
||||
console.log(`Verifying: ${url}\n`);
|
||||
|
||||
const scraper = new GenericScraper();
|
||||
await scraper.init();
|
||||
scraper.setCountry('BR');
|
||||
|
||||
const result = await scraper.scrape(url);
|
||||
|
||||
console.log(`✅ Success: ${result.success}`);
|
||||
console.log(`📅 Schedules found: ${result.schedules.length}\n`);
|
||||
|
||||
// Group by day
|
||||
const byDay: Record<number, typeof result.schedules> = {};
|
||||
for (const sched of result.schedules) {
|
||||
if (!byDay[sched.dayOfWeek]) byDay[sched.dayOfWeek] = [];
|
||||
byDay[sched.dayOfWeek].push(sched);
|
||||
}
|
||||
|
||||
const dayNames = ['Domingo', 'Segunda', 'Terça', 'Quarta', 'Quinta', 'Sexta', 'Sábado'];
|
||||
|
||||
console.log('═══════════════════════════════════════════════');
|
||||
console.log('PARSED SCHEDULE:');
|
||||
console.log('═══════════════════════════════════════════════\n');
|
||||
|
||||
Object.entries(byDay)
|
||||
.sort(([a], [b]) => parseInt(a) - parseInt(b))
|
||||
.forEach(([day, scheds]) => {
|
||||
console.log(`${dayNames[parseInt(day)]}:`);
|
||||
scheds.forEach(s => {
|
||||
console.log(` ${s.time} - ${s.language} ${s.massType}`);
|
||||
});
|
||||
console.log('');
|
||||
});
|
||||
|
||||
console.log('Expected schedule (from website):');
|
||||
console.log('Segunda, Terça, Quarta, Sexta: 16:00 e 18:00');
|
||||
console.log('Quinta: 16:00 e 19:00');
|
||||
console.log('Sábado: 08:00, 16:00 e 18:00');
|
||||
console.log('Domingo: 08:00, 11:00, 16:00, 18:00 e 20:00');
|
||||
|
||||
await scraper.close();
|
||||
}
|
||||
|
||||
verifyPazSchedules().catch(console.error);
|
||||
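The verification script above prints the parsed schedule next to the expected one and leaves the comparison to the reader. The comparison could be automated along these lines. This is a sketch under two assumptions: that `time` is stored as a zero-padded `HH:MM` string, and that day indices follow the script's `dayNames` array (0 = Domingo, 6 = Sábado). The names `ParsedSlot`, `expectedPazByDay` and `diffSchedules` are hypothetical.

```typescript
interface ParsedSlot {
  dayOfWeek: number;
  time: string; // assumed "HH:MM"
}

// Expected times per day index, transcribed from the script's printed expectation.
const expectedPazByDay: Record<number, string[]> = {
  0: ['08:00', '11:00', '16:00', '18:00', '20:00'], // Domingo
  1: ['16:00', '18:00'], // Segunda
  2: ['16:00', '18:00'], // Terça
  3: ['16:00', '18:00'], // Quarta
  4: ['16:00', '19:00'], // Quinta
  5: ['16:00', '18:00'], // Sexta
  6: ['08:00', '16:00', '18:00'], // Sábado
};

// Returns "day@time" keys that are expected but absent, and parsed but not expected.
function diffSchedules(
  parsed: ParsedSlot[],
  expected: Record<number, string[]>
): { missing: string[]; unexpected: string[] } {
  const parsedKeys = parsed.map((s) => `${s.dayOfWeek}@${s.time}`);
  const expectedKeys: string[] = [];
  Object.keys(expected).forEach((day) => {
    expected[Number(day)].forEach((t) => expectedKeys.push(`${day}@${t}`));
  });
  return {
    missing: expectedKeys.filter((k) => parsedKeys.indexOf(k) === -1),
    unexpected: parsedKeys.filter((k) => expectedKeys.indexOf(k) === -1),
  };
}
```

An empty `missing` and `unexpected` list would indicate that the parsed result agrees with the expectation; anything else points at the day and time to inspect.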
97
scripts/dedup-churches.ts
Normal file
@@ -0,0 +1,97 @@
|
||||
/**
|
||||
* Find duplicate churches using ChromaDB semantic similarity.
|
||||
*
|
||||
* Usage:
|
||||
* npx tsx scripts/dedup-churches.ts # Dry run, show duplicates
|
||||
* npx tsx scripts/dedup-churches.ts --threshold 0.15 # Custom similarity threshold
|
||||
* npx tsx scripts/dedup-churches.ts --country US # Only check US churches
|
||||
* npx tsx scripts/dedup-churches.ts --limit 100 # Check first 100 churches
|
||||
*/
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import { findSimilarChurches } from '../src/chromadb/queries';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
const args = process.argv.slice(2);
|
||||
const threshold = args.includes('--threshold')
|
||||
? parseFloat(args[args.indexOf('--threshold') + 1])
|
||||
: 0.15; // Cosine distance threshold (lower = more similar)
|
||||
const country = args.includes('--country')
|
||||
? args[args.indexOf('--country') + 1]
|
||||
: undefined;
|
||||
const limit = args.includes('--limit')
|
||||
? parseInt(args[args.indexOf('--limit') + 1], 10)
|
||||
: 500;
|
||||
|
||||
async function main() {
|
||||
console.log(`Finding duplicate churches (threshold=${threshold}, country=${country || 'all'}, limit=${limit})`);
|
||||
console.log('---');
|
||||
|
||||
const churches = await prisma.church.findMany({
|
||||
take: limit,
|
||||
where: country ? { country } : undefined,
|
||||
orderBy: { name: 'asc' },
|
||||
select: {
|
||||
id: true,
|
||||
name: true,
|
||||
address: true,
|
||||
city: true,
|
||||
country: true,
|
||||
source: true,
|
||||
latitude: true,
|
||||
longitude: true,
|
||||
_count: { select: { massSchedules: true } },
|
||||
},
|
||||
});
|
||||
|
||||
console.log(`Checking ${churches.length} churches...\n`);
|
||||
|
||||
const seen = new Set<string>();
|
||||
let duplicateCount = 0;
|
||||
|
||||
for (const church of churches) {
|
||||
if (seen.has(church.id)) continue;
|
||||
|
||||
const text = `${church.name} ${church.address || ''} ${church.city || ''} ${church.country}`.trim();
|
||||
const similar = await findSimilarChurches(text, {
|
||||
country: church.country,
|
||||
nResults: 5,
|
||||
});
|
||||
|
||||
// Filter to matches within threshold, excluding self
|
||||
const matches = similar.filter(
|
||||
(s) => s.churchId !== church.id && s.distance <= threshold
|
||||
);
|
||||
|
||||
if (matches.length > 0) {
|
||||
duplicateCount++;
|
||||
console.log(`\nPotential duplicate #${duplicateCount}:`);
|
||||
console.log(` Original: "${church.name}" (${church.city || 'no city'}, ${church.country})`);
|
||||
console.log(` ID: ${church.id}, Source: ${church.source}, Schedules: ${church._count.massSchedules}`);
|
||||
console.log(` Lat/Lng: ${church.latitude}, ${church.longitude}`);
|
||||
|
||||
for (const match of matches) {
|
||||
console.log(` Match: "${match.document}" (distance: ${match.distance.toFixed(4)})`);
|
||||
console.log(` ID: ${match.churchId}`);
|
||||
seen.add(match.churchId);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
console.log(`\n---`);
|
||||
console.log(`Found ${duplicateCount} potential duplicate groups from ${churches.length} churches`);
|
||||
console.log(`Threshold: ${threshold} (lower = stricter matching)`);
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
main().catch((err) => {
|
||||
console.error(err);
|
||||
process.exit(1);
|
||||
});
|
||||
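The threshold in the script above is described as a cosine distance, where lower values mean more similar entries. As a rough illustration of what that number measures, the sketch below computes cosine distance as one minus cosine similarity and applies the same "within threshold, excluding self" filter. It assumes the collection is configured for cosine space, as the script's comment suggests; `cosineDistance`, `Candidate` and `withinThreshold` are hypothetical names, not part of `src/chromadb/queries`.

```typescript
// Cosine distance = 1 - cosine similarity; 0 means same direction, 1 means orthogonal.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('cosine distance is undefined for zero vectors');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Candidate {
  churchId: string;
  distance: number;
}

// Keeps candidates at or below the threshold, excluding the record being checked.
function withinThreshold(candidates: Candidate[], selfId: string, threshold: number): Candidate[] {
  return candidates.filter((c) => c.churchId !== selfId && c.distance <= threshold);
}
```

Lowering the threshold makes matching stricter, which reduces false positives at the cost of missing duplicates whose names or addresses are spelled differently.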
72
scripts/dedup-mass-schedules.ts
Normal file
@@ -0,0 +1,72 @@
|
||||
#!/usr/bin/env tsx
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
|
||||
import { Pool } from 'pg';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
|
||||
interface CountResult {
|
||||
churches_with_dups: string;
|
||||
duplicate_rows: string;
|
||||
}
|
||||
|
||||
async function main() {
|
||||
const dryRun = !process.argv.includes('--execute');
|
||||
|
||||
if (dryRun) {
|
||||
console.log('DRY RUN - pass --execute to actually delete duplicates\n');
|
||||
}
|
||||
|
||||
const client = await pool.connect();
|
||||
|
||||
try {
|
||||
const countResult = await client.query<CountResult>(`
|
||||
WITH ranked AS (
|
||||
SELECT church_id,
|
||||
ROW_NUMBER() OVER (
|
||||
PARTITION BY church_id, day_of_week, time, language
|
||||
ORDER BY created_at ASC
|
||||
) AS rn
|
||||
FROM mass_schedules
|
||||
WHERE is_active = true
|
||||
)
|
||||
SELECT COUNT(DISTINCT church_id) AS churches_with_dups,
|
||||
COUNT(*) AS duplicate_rows
|
||||
FROM ranked
|
||||
WHERE rn > 1;
|
||||
`);
|
||||
|
||||
const { churches_with_dups, duplicate_rows } = countResult.rows[0];
|
||||
console.log(`Churches with duplicate schedules: ${churches_with_dups}`);
|
||||
console.log(`Duplicate rows ${dryRun ? 'that would be deleted' : 'to delete'}: ${duplicate_rows}\n`);
|
||||
|
||||
if (!dryRun && Number(duplicate_rows) > 0) {
|
||||
console.log('Deleting duplicates (keeping oldest by created_at)...');
|
||||
|
||||
const deleteResult = await client.query(`
|
||||
WITH ranked AS (
|
||||
SELECT id,
|
||||
ROW_NUMBER() OVER (
|
||||
PARTITION BY church_id, day_of_week, time, language
|
||||
ORDER BY created_at ASC
|
||||
) AS rn
|
||||
FROM mass_schedules
|
||||
WHERE is_active = true
|
||||
)
|
||||
DELETE FROM mass_schedules
|
||||
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
|
||||
`);
|
||||
|
||||
console.log(`Deleted ${deleteResult.rowCount} duplicate mass schedule rows.`);
|
||||
}
|
||||
} finally {
|
||||
client.release();
|
||||
await pool.end();
|
||||
}
|
||||
}
|
||||
|
||||
main().catch((err) => {
|
||||
console.error('Fatal error:', err);
|
||||
process.exit(1);
|
||||
});
|
||||
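The SQL above ranks rows within each `(church_id, day_of_week, time, language)` partition by `created_at` and treats everything after the first row as a duplicate. The same rule can be expressed in memory, which is convenient for unit-testing the selection logic without a database. This is a sketch only; the `ScheduleRow` shape and `findDuplicateIds` are hypothetical and mirror the column names rather than any existing type.

```typescript
interface ScheduleRow {
  id: string;
  churchId: string;
  dayOfWeek: number;
  time: string;
  language: string | null;
  createdAt: Date;
}

// Returns the ids that would be deleted: every row except the oldest in each partition.
function findDuplicateIds(rows: ScheduleRow[]): string[] {
  // Oldest first, matching ORDER BY created_at ASC.
  const sorted = rows.slice().sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime());
  const seenKeys: Record<string, boolean> = {};
  const duplicates: string[] = [];
  for (const row of sorted) {
    // JSON keeps null distinct from the string "null" while still grouping nulls
    // together, as PARTITION BY does.
    const key = JSON.stringify([row.churchId, row.dayOfWeek, row.time, row.language]);
    if (seenKeys[key]) {
      duplicates.push(row.id);
    } else {
      seenKeys[key] = true;
    }
  }
  return duplicates;
}
```

Keeping the oldest row preserves the original `created_at`, which matches the script's stated intent of keeping the earliest entry.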
25
scripts/deploy-local.sh
Executable file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
DEV_PATH="$HOME/Documents/ScraperControl"
|
||||
DOCKER_PATH="/opt/docker/scraper-control"
|
||||
|
||||
echo "Syncing dev → Docker deployment..."
|
||||
|
||||
rsync -avz \
|
||||
--exclude node_modules \
|
||||
--exclude .next \
|
||||
--exclude '.env*' \
|
||||
--exclude .git \
|
||||
--exclude .claude \
|
||||
--exclude .playwright-mcp \
|
||||
"$DEV_PATH/" "$DOCKER_PATH/"
|
||||
|
||||
echo "Restarting Docker services..."
|
||||
cd "$DOCKER_PATH"
|
||||
docker compose build app scheduler freesearch-enrichment
|
||||
docker compose up -d app scheduler freesearch-enrichment
|
||||
docker compose ps
|
||||
docker compose logs --tail 5 scheduler
|
||||
|
||||
echo "Deploy complete!"
|
||||
27
scripts/deploy-to-nas.sh
Executable file
@@ -0,0 +1,27 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
NAS_HOST="albert@192.168.0.145"
|
||||
NAS_PATH="/volume1/docker/scraper-control"
|
||||
LOCAL_PATH="/Users/albert/Documents/Projects/Church/ScraperControl"
|
||||
|
||||
echo "Deploying ScraperControl to NAS..."
|
||||
|
||||
rsync -avz \
|
||||
--exclude 'node_modules' \
|
||||
--exclude '.next' \
|
||||
--exclude '.git' \
|
||||
--exclude '.env.local' \
|
||||
--exclude '*.log' \
|
||||
"$LOCAL_PATH/" \
|
||||
"$NAS_HOST:$NAS_PATH/"
|
||||
|
||||
echo "Rebuilding containers..."
|
||||
ssh "$NAS_HOST" << 'ENDSSH'
|
||||
cd /volume1/docker/scraper-control
|
||||
/usr/local/bin/docker compose build app scraper scheduler
|
||||
/usr/local/bin/docker compose up -d scheduler freesearch-enrichment
|
||||
/usr/local/bin/docker compose ps
|
||||
/usr/local/bin/docker compose logs --tail 5 scheduler
|
||||
ENDSSH
|
||||
|
||||
echo "Deployment complete!"
|
||||
226
scripts/enrich-with-forward-geocode.ts
Normal file
@@ -0,0 +1,226 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Enrich churches that have lat/lng=0 with real coordinates via Nominatim forward geocoding.
|
||||
* After this runs, enrich-with-reverse-geocode fills city/state from the new coordinates.
|
||||
*
|
||||
* Usage:
|
||||
* npx tsx scripts/enrich-with-forward-geocode.ts --country HK --dry-run
|
||||
* npx tsx scripts/enrich-with-forward-geocode.ts --country HK
|
||||
* npx tsx scripts/enrich-with-forward-geocode.ts --limit 10
|
||||
*
|
||||
* Rate limit: 1 request/second (Nominatim usage policy — mandatory).
|
||||
*/
|
||||
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import axios from 'axios';
|
||||
|
||||
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
|
||||
const pool = new Pool({
|
||||
connectionString: dbUrl,
|
||||
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
|
||||
});
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
const NOMINATIM_SEARCH_URL = 'https://nominatim.openstreetmap.org/search';
|
||||
const RATE_LIMIT_MS = 1100;
|
||||
|
||||
// Some regions use a different ISO code in OSM than in our DB
|
||||
const NOMINATIM_COUNTRY_MAP: Record<string, string> = {
|
||||
HK: 'cn', // Hong Kong is part of China in OSM
|
||||
MO: 'cn', // Macau likewise
|
||||
};
|
||||
|
||||
interface ChurchRecord {
|
||||
id: string;
|
||||
name: string;
|
||||
address: string;
|
||||
country: string;
|
||||
city: string | null;
|
||||
state: string | null;
|
||||
}
|
||||
|
||||
interface NominatimSearchResult {
|
||||
lat: string;
|
||||
lon: string;
|
||||
display_name: string;
|
||||
address?: {
|
||||
city?: string;
|
||||
town?: string;
|
||||
village?: string;
|
||||
municipality?: string;
|
||||
state?: string;
|
||||
province?: string;
|
||||
};
|
||||
}
|
||||
|
||||
function log(msg: string) {
|
||||
console.log(`[${new Date().toISOString()}] ${msg}`);
|
||||
}
|
||||
|
||||
function sleep(ms: number): Promise<void> {
|
||||
return new Promise(resolve => setTimeout(resolve, ms));
|
||||
}
|
||||
|
||||
function cleanAddress(address: string): string {
|
||||
return address
|
||||
// Strip trailing city/region suffixes
|
||||
.replace(/,?\s*(H\.K\.|HK|Hong Kong|Kowloon|Kln\.|New Territories|N\.T\.|Lantau Island)\.?\s*$/i, '')
|
||||
// Strip "R.E." (Religious Education suffix used in HK addresses)
|
||||
.replace(/,?\s*R\.E\./i, '')
|
||||
.replace(/\.$/, '')
|
||||
.trim();
|
||||
}
|
||||
|
||||
/**
|
||||
* Fallback: strip any leading non-numeric institution name prefix and floor/unit designators,
|
||||
* returning just the street number onwards. Handles patterns like:
|
||||
* "Canossa School (H.K.) 8 Hoi Chak Street" → "8 Hoi Chak Street"
|
||||
* "G/F., Wai Ming Block, 111 Wing Hong Street" → "111 Wing Hong Street"
|
||||
* "3/F., Chi Wo Commercial Building, 20 Saigon Street" → "20 Saigon Street"
|
||||
*/
|
||||
function extractStreetAddress(address: string): string | null {
|
||||
// Find the first occurrence of a standalone number (house number)
|
||||
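const match =
address.match(/(?:^|,\s*)(\d+[A-Za-z]?(?:\s|,).*)/) ??
// Fallback: number preceded only by whitespace, e.g. "Canossa School (H.K.) 8 Hoi Chak Street"
address.match(/\s(\d+[A-Za-z]?(?:\s|,).*)/);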
|
||||
if (!match) return null;
|
||||
const candidate = match[1].trim();
|
||||
// Must be meaningfully shorter than the full address to be worth retrying
|
||||
return candidate.length < address.length * 0.9 ? cleanAddress(candidate) : null;
|
||||
}
|
||||
|
||||
async function nominatimSearch(query: string, nominatimCountry: string): Promise<NominatimSearchResult | null> {
|
||||
const response = await axios.get(NOMINATIM_SEARCH_URL, {
|
||||
params: {
|
||||
q: query,
|
||||
format: 'json',
|
||||
limit: 1,
|
||||
countrycodes: nominatimCountry,
|
||||
addressdetails: 1,
|
||||
},
|
||||
headers: {
|
||||
'User-Agent': 'NearestMass/1.0 (privacy@nearestmass.com)',
|
||||
'Accept-Language': 'en',
|
||||
},
|
||||
timeout: 15000,
|
||||
});
|
||||
const results: NominatimSearchResult[] = response.data;
|
||||
return results.length > 0 ? results[0] : null;
|
||||
}
|
||||
|
||||
async function forwardGeocode(
|
||||
address: string,
|
||||
countryCode: string
|
||||
): Promise<{ result: NominatimSearchResult; usedFallback: boolean } | null> {
|
||||
const nominatimCountry = NOMINATIM_COUNTRY_MAP[countryCode] ?? countryCode.toLowerCase();
|
||||
const cleaned = cleanAddress(address);
|
||||
|
||||
const primary = await nominatimSearch(cleaned, nominatimCountry);
|
||||
if (primary) return { result: primary, usedFallback: false };
|
||||
|
||||
// Fallback: try just the street-number-onwards portion
|
||||
const streetOnly = extractStreetAddress(address);
|
||||
if (streetOnly && streetOnly !== cleaned) {
|
||||
await sleep(RATE_LIMIT_MS); // respect rate limit between retries
|
||||
const fallback = await nominatimSearch(streetOnly, nominatimCountry);
|
||||
if (fallback) return { result: fallback, usedFallback: true };
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const dryRun = args.includes('--dry-run');
|
||||
const countryIdx = args.indexOf('--country');
|
||||
const limitIdx = args.indexOf('--limit');
|
||||
const countryCode = countryIdx !== -1 ? args[countryIdx + 1] : undefined;
|
||||
const limit = limitIdx !== -1 ? parseInt(args[limitIdx + 1], 10) : undefined;
|
||||
|
||||
log('============================================================');
|
||||
log('Nominatim Forward Geocode Enrichment');
|
||||
log('============================================================');
|
||||
log(`Country: ${countryCode || 'All'}`);
|
||||
log(`Limit: ${limit || 'No limit'}`);
|
||||
log(`Dry run: ${dryRun ? 'Yes' : 'No'}`);
|
||||
log('============================================================');
|
||||
|
||||
const churches = await prisma.church.findMany({
|
||||
where: {
|
||||
latitude: 0,
|
||||
longitude: 0,
|
||||
address: { not: null },
|
||||
...(countryCode ? { country: countryCode } : {}),
|
||||
},
|
||||
select: { id: true, name: true, address: true, country: true, city: true, state: true },
|
||||
orderBy: { createdAt: 'asc' },
|
||||
take: limit,
|
||||
}) as ChurchRecord[];
|
||||
|
||||
log(`Found ${churches.length} churches with lat/lng=0 and an address\n`);
|
||||
|
||||
const stats = { found: 0, notFound: 0, errors: 0 };
|
||||
|
||||
for (const church of churches) {
|
||||
try {
|
||||
const geocoded = await forwardGeocode(church.address, church.country);
|
||||
|
||||
if (!geocoded) {
|
||||
log(` - [NOT FOUND] ${church.name} | ${church.address}`);
|
||||
stats.notFound++;
|
||||
} else {
|
||||
const { result, usedFallback } = geocoded;
|
||||
const lat = parseFloat(result.lat);
|
||||
const lng = parseFloat(result.lon);
|
||||
const city = result.address?.city || result.address?.town ||
|
||||
result.address?.village || result.address?.municipality || null;
|
||||
const state = result.address?.state || result.address?.province || null;
|
||||
|
||||
log(` + [FOUND${usedFallback ? ' (fallback)' : ''}] ${church.name}`);
|
||||
log(` ${church.address}`);
|
||||
log(` → ${lat}, ${lng}${city ? ` (${city})` : ''}`);
|
||||
|
||||
if (!dryRun) {
|
||||
const updateData: Record<string, unknown> = { latitude: lat, longitude: lng };
|
||||
if (city && !church.city) updateData.city = city;
|
||||
if (state && !church.state) updateData.state = state;
|
||||
|
||||
await prisma.church.update({
|
||||
where: { id: church.id },
|
||||
data: updateData,
|
||||
});
|
||||
}
|
||||
|
||||
stats.found++;
|
||||
}
|
||||
} catch (err: any) {
|
||||
log(` ! [ERROR] ${church.name}: ${err.message}`);
|
||||
stats.errors++;
|
||||
}
|
||||
|
||||
await sleep(RATE_LIMIT_MS);
|
||||
}
|
||||
|
||||
log('');
|
||||
log('============================================================');
|
||||
log('Forward Geocode Summary');
|
||||
log('============================================================');
|
||||
log(`Found coords: ${stats.found}`);
|
||||
log(`Not found: ${stats.notFound}`);
|
||||
log(`Errors: ${stats.errors}`);
|
||||
log('============================================================');
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
main().catch(err => {
|
||||
console.error('Fatal error:', err);
|
||||
process.exit(1);
|
||||
});
|
||||
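The address clean-up and street-number fallback used by `forwardGeocode` above can be exercised without calling Nominatim. The following is a minimal sketch, assuming simplified copies of `cleanAddress` and `extractStreetAddress` (the suffix list is shortened, and only the comma-anchored pattern is used). `fallbackQueries` is a hypothetical helper that is not part of the script; it only shows which query strings would be tried, in order.

```typescript
// Simplified copy of the script's cleanAddress: strips a trailing region
// suffix and a trailing period. The suffix list here is shortened.
function cleanAddress(address: string): string {
  return address
    .replace(/,?\s*(Hong Kong|Kowloon|New Territories)\.?\s*$/i, '')
    .replace(/\.$/, '')
    .trim();
}

// Comma-anchored pattern: the house number must start the string or follow
// a comma. Returns null when nothing shorter than the input can be extracted.
function extractStreetAddress(address: string): string | null {
  const match = address.match(/(?:^|,\s*)(\d+[A-Za-z]?(?:\s|,).*)/);
  if (!match) return null;
  const candidate = match[1].trim();
  return candidate.length < address.length * 0.9 ? cleanAddress(candidate) : null;
}

// Hypothetical helper: the queries that would be sent, primary first.
function fallbackQueries(address: string): string[] {
  const cleaned = cleanAddress(address);
  const streetOnly = extractStreetAddress(address);
  return streetOnly && streetOnly !== cleaned ? [cleaned, streetOnly] : [cleaned];
}

const queries = fallbackQueries('G/F., Wai Ming Block, 111 Wing Hong Street, Kowloon');
console.log(queries);
// → [ 'G/F., Wai Ming Block, 111 Wing Hong Street', '111 Wing Hong Street' ]
```

An address that already starts with its street number yields a single query, because the extracted candidate is not meaningfully shorter than the input.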
1371
scripts/enrich-with-freesearch.ts
Normal file
File diff suppressed because it is too large
408
scripts/enrich-with-google-places.ts
Normal file
@@ -0,0 +1,408 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Enrich OSM churches with Google Places data (website, phone, Place ID)
|
||||
*
|
||||
* Usage:
|
||||
* npx tsx scripts/enrich-with-google-places.ts --limit 10 --dry-run
|
||||
* npx tsx scripts/enrich-with-google-places.ts --country BR --limit 100
|
||||
* npx tsx scripts/enrich-with-google-places.ts --all
|
||||
*
|
||||
* Rate Limiting:
|
||||
* - Free tier: $200/month credit
|
||||
* - Text Search: ~$17 per 1000 requests
|
||||
* - $200 / $17 = ~11,764 requests per month
|
||||
* - ~390 churches per day to stay within free tier
|
||||
* - Script uses 2-second delay between requests (max 1,800/hour)
|
||||
*/
|
||||
|
||||
// Load .env for database connection
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
// Use DATABASE_URL from .env (works for both local dev and NAS/production)
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import axios from 'axios';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
const GOOGLE_PLACES_API_KEY = process.env.GOOGLE_PLACES_API_KEY;
|
||||
const PLACES_API_URL = 'https://places.googleapis.com/v1/places:searchText';
|
||||
const RATE_LIMIT_MS = 2000; // 2 seconds between requests
|
||||
|
||||
// --- Job Tracking ---
|
||||
async function createOrResumeJob(args: string[]): Promise<string | null> {
|
||||
const jobIdIndex = args.indexOf('--job-id');
|
||||
if (jobIdIndex !== -1) {
|
||||
const jobId = args[jobIdIndex + 1];
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: { status: 'running', startedAt: new Date() },
|
||||
});
|
||||
return jobId;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
async function createNewJob(config: Record<string, unknown>): Promise<string> {
|
||||
const job = await prisma.backgroundJob.create({
|
||||
data: {
|
||||
type: 'google-enrichment',
|
||||
status: 'running',
|
||||
startedAt: new Date(),
|
||||
config: config as any,
|
||||
},
|
||||
});
|
||||
return job.id;
|
||||
}
|
||||
|
||||
async function updateJobProgress(jobId: string, processed: number, succeeded: number, failed: number, itemsFound: number, totalItems: number): Promise<void> {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: { processed, succeeded, failed, itemsFound, totalItems },
|
||||
});
|
||||
}
|
||||
|
||||
async function checkJobStopping(jobId: string): Promise<boolean> {
|
||||
const job = await prisma.backgroundJob.findUnique({ where: { id: jobId } });
|
||||
return job?.status === 'stopping';
|
||||
}
|
||||
|
||||
async function completeJob(jobId: string, error?: string): Promise<void> {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: {
|
||||
status: error ? 'failed' : 'completed',
|
||||
error,
|
||||
completedAt: new Date(),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
/**
|
||||
* Country priority order — largest OSM church counts first, since those
|
||||
* have the most un-enriched churches. Covers all countries from the
|
||||
* CATHOLIC_COUNTRIES lists in import-osm-churches.ts.
|
||||
*/
|
||||
const COUNTRY_PRIORITY = [
|
||||
// Top tier: 5000+ OSM churches
|
||||
'FR', 'IT', 'ES', 'DE', 'PL', 'BR',
|
||||
// High tier: 1000-5000
|
||||
'PT', 'AT', 'BE', 'CZ', 'PH', 'HU', 'US', 'MX', 'HR', 'GB',
|
||||
'CR', 'SK', 'EC', 'CH', 'AR', 'CA', 'CO', 'NL', 'IE', 'IN',
|
||||
'SI', 'AU',
|
||||
// Medium tier: 100-1000
|
||||
'PE', 'RO', 'KR', 'CL', 'ID', 'LT', 'BO', 'VN', 'BA', 'BY',
|
||||
'UA', 'VE', 'HN', 'UG', 'CD', 'GT', 'CU', 'SV', 'NI', 'PA',
|
||||
'DO', 'CN', 'JP', 'LV', 'RS', 'TZ', 'KE', 'AL', 'RU',
|
||||
// Lower tier: remaining countries
|
||||
'LU', 'MT', 'NZ', 'PG', 'FJ', 'NC', 'PF', 'UY', 'PY', 'HT',
|
||||
'CM', 'RW', 'BI', 'MG', 'MW', 'ZM', 'ZW', 'MZ', 'AO', 'NG',
|
||||
'BJ', 'TG', 'CI', 'BF', 'ML', 'NE', 'SN', 'GN', 'LR', 'SL',
|
||||
'GH', 'GA', 'CG', 'CF', 'TD', 'SD', 'ET', 'ER', 'SO',
|
||||
'TL', 'MY', 'SG', 'TH', 'LA', 'KH', 'MM', 'LK', 'BD', 'PK',
|
||||
'LB', 'IL', 'PS', 'JO', 'SY', 'IQ',
|
||||
'GF', 'SR', 'GY', 'BS', 'BB', 'JM', 'TT', 'GD', 'LC', 'VC',
|
||||
'AG', 'DM', 'KN', 'MC', 'SM', 'VA', 'LI', 'AD',
|
||||
'MK', 'EE', 'GE', 'AM',
|
||||
'NA', 'BW', 'LS', 'SZ', 'MU', 'SC', 'KM', 'CV', 'ST', 'GQ',
|
||||
'DJ', 'GM', 'BT', 'NP', 'AF', 'KZ', 'UZ', 'TM', 'TJ', 'KG',
|
||||
'MN', 'BN', 'MV', 'WS', 'TO', 'VU', 'SB', 'KI', 'NR', 'TV',
|
||||
'FM', 'MH', 'PW',
|
||||
];
|
||||
|
||||
interface GooglePlacesResult {
|
||||
found: boolean;
|
||||
website?: string;
|
||||
phone?: string;
|
||||
placeId?: string;
|
||||
}
|
||||
|
||||
interface EnrichmentStats {
|
||||
processed: number;
|
||||
enriched: number;
|
||||
notFound: number;
|
||||
errors: number;
|
||||
websitesAdded: number;
|
||||
phonesAdded: number;
|
||||
}
|
||||
|
||||
async function searchGooglePlaces(
|
||||
name: string,
|
||||
city: string | null,
|
||||
state: string | null,
|
||||
latitude: number,
|
||||
longitude: number
|
||||
): Promise<GooglePlacesResult> {
|
||||
if (!GOOGLE_PLACES_API_KEY) {
|
||||
throw new Error('GOOGLE_PLACES_API_KEY not set in environment');
|
||||
}
|
||||
|
||||
// Build search query
|
||||
const location = [city, state].filter(Boolean).join(', ');
|
||||
const textQuery = `${name} ${location}`.trim();
|
||||
|
||||
try {
|
||||
const response = await axios.post(
|
||||
PLACES_API_URL,
|
||||
{
|
||||
textQuery,
|
||||
locationBias: {
|
||||
circle: {
|
||||
center: {
|
||||
latitude,
|
||||
longitude,
|
||||
},
|
||||
radius: 500, // 500 meters
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
headers: {
|
||||
'Content-Type': 'application/json',
|
||||
'X-Goog-Api-Key': GOOGLE_PLACES_API_KEY,
|
||||
'X-Goog-FieldMask': 'places.id,places.displayName,places.websiteUri,places.nationalPhoneNumber',
|
||||
},
|
||||
}
|
||||
);
|
||||
|
||||
if (response.data.places && response.data.places.length > 0) {
|
||||
const place = response.data.places[0]; // Take first result
|
||||
return {
|
||||
found: true,
|
||||
website: place.websiteUri || undefined,
|
||||
phone: place.nationalPhoneNumber || undefined,
|
||||
placeId: place.id || undefined,
|
||||
};
|
||||
}
|
||||
|
||||
return { found: false };
|
||||
} catch (error: any) {
|
||||
if (error.response?.status === 429) {
|
||||
console.error('Rate limited by Google Places API');
|
||||
throw new Error('RATE_LIMITED');
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
async function enrichChurches(
|
||||
countryCode?: string,
|
||||
limit?: number,
|
||||
dryRun: boolean = false,
|
||||
jobId?: string | null
|
||||
): Promise<EnrichmentStats> {
|
||||
const stats: EnrichmentStats = {
|
||||
processed: 0,
|
||||
enriched: 0,
|
||||
notFound: 0,
|
||||
errors: 0,
|
||||
websitesAdded: 0,
|
||||
phonesAdded: 0,
|
||||
};
|
||||
|
||||
let churches;
|
||||
|
||||
if (countryCode) {
|
||||
// Manual override: process specific country
|
||||
console.log(`Manual mode: Processing country ${countryCode}`);
|
||||
churches = await prisma.church.findMany({
|
||||
where: {
|
||||
source: 'osm',
|
||||
googleSearchedAt: null,
|
||||
country: countryCode,
|
||||
},
|
||||
take: limit,
|
||||
orderBy: { createdAt: 'asc' },
|
||||
});
|
||||
} else {
|
||||
// Priority mode: sequential through countries (exhaust each before moving on)
|
||||
console.log('Priority mode: Processing countries sequentially');
|
||||
console.log(`Top priority countries: ${COUNTRY_PRIORITY.slice(0, 10).join(', ')}...\n`);
|
||||
|
||||
churches = [];
|
||||
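// Falls back to 390 (the daily free-tier budget) when no limit is given,
// so --all is still capped at 390 churches in priority mode
const targetTotal = limit || 390;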
|
||||
|
||||
for (const country of COUNTRY_PRIORITY) {
|
||||
if (churches.length >= targetTotal) break;
|
||||
|
||||
const remaining = targetTotal - churches.length;
|
||||
const batch = await prisma.church.findMany({
|
||||
where: {
|
||||
source: 'osm',
|
||||
googleSearchedAt: null,
|
||||
country,
|
||||
},
|
||||
take: remaining,
|
||||
orderBy: { createdAt: 'asc' },
|
||||
});
|
||||
|
||||
if (batch.length > 0) {
|
||||
churches.push(...batch);
|
||||
console.log(` Queued ${batch.length} churches from ${country}`);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
console.log(`\nFound ${churches.length} churches to enrich`);
|
||||
console.log('');
|
||||
|
||||
for (const church of churches) {
|
||||
stats.processed++;
|
||||
|
||||
try {
|
||||
console.log(`[${stats.processed}/${churches.length}] ${church.name} (${church.city}, ${church.state})`);
|
||||
|
||||
const result = await searchGooglePlaces(
|
||||
church.name,
|
||||
church.city,
|
||||
church.state,
|
||||
church.latitude,
|
||||
church.longitude
|
||||
);
|
||||
|
||||
if (result.found) {
|
||||
console.log(' ✓ Found on Google Places');
|
||||
|
||||
if (result.website) {
|
||||
console.log(` Website: ${result.website}`);
|
||||
stats.websitesAdded++;
|
||||
}
|
||||
|
||||
if (result.phone) {
|
||||
console.log(` Phone: ${result.phone}`);
|
||||
stats.phonesAdded++;
|
||||
}
|
||||
|
||||
if (!dryRun) {
|
||||
await prisma.church.update({
|
||||
where: { id: church.id },
|
||||
data: {
|
||||
website: result.website || church.website,
|
||||
phone: result.phone || church.phone,
|
||||
googlePlaceId: result.placeId || church.googlePlaceId,
|
||||
hasWebsite: !!(result.website || church.website),
|
||||
googleSearchedAt: new Date(),
|
||||
},
|
||||
});
|
||||
if (result.website || result.phone) {
|
||||
stats.enriched++;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
console.log(' ✗ Not found on Google Places');
|
||||
stats.notFound++;
|
||||
|
||||
// Mark as attempted so we don't re-query this church
|
||||
if (!dryRun) {
|
||||
await prisma.church.update({
|
||||
where: { id: church.id },
|
||||
data: { googleSearchedAt: new Date() },
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Rate limiting
|
||||
await new Promise((resolve) => setTimeout(resolve, RATE_LIMIT_MS));
|
||||
} catch (error: any) {
|
||||
stats.errors++;
|
||||
if (error.message === 'RATE_LIMITED') {
|
||||
console.error(' ⚠ Rate limited, stopping enrichment');
|
||||
break;
|
||||
}
|
||||
console.error(` ✗ Error: ${error.message}`);
|
||||
}
|
||||
|
||||
// Job tracking: update progress every 10 items and check for stop
|
||||
if (jobId && stats.processed % 10 === 0) {
|
||||
await updateJobProgress(jobId, stats.processed, stats.enriched, stats.errors, stats.enriched, churches.length);
|
||||
const stopping = await checkJobStopping(jobId);
|
||||
if (stopping) {
|
||||
console.log('\nJob stop requested via admin dashboard.');
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Progress update every 50 churches
|
||||
if (stats.processed % 50 === 0) {
|
||||
console.log('');
|
||||
console.log(`Progress: ${stats.processed}/${churches.length} processed`);
|
||||
console.log(` Enriched: ${stats.enriched}, Not found: ${stats.notFound}, Errors: ${stats.errors}`);
|
||||
console.log('');
|
||||
}
|
||||
}
|
||||
|
||||
// Final job update
|
||||
if (jobId) {
|
||||
await updateJobProgress(jobId, stats.processed, stats.enriched, stats.errors, stats.enriched, churches.length);
|
||||
}
|
||||
|
||||
return stats;
|
||||
}
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const countryIndex = args.indexOf('--country');
|
||||
const limitIndex = args.indexOf('--limit');
|
||||
const dryRun = args.includes('--dry-run');
|
||||
const all = args.includes('--all');
|
||||
|
||||
const countryCode = countryIndex !== -1 ? args[countryIndex + 1] : undefined;
|
||||
const limit = all ? undefined : limitIndex !== -1 ? parseInt(args[limitIndex + 1], 10) : 10;
|
||||
|
||||
if (!GOOGLE_PLACES_API_KEY) {
|
||||
console.error('Error: GOOGLE_PLACES_API_KEY not set in environment');
|
||||
console.error('Add it to your .env file');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
console.log('============================================================');
|
||||
console.log('Google Places Church Enrichment');
|
||||
console.log('============================================================');
|
||||
console.log(`Country: ${countryCode || 'All'}`);
|
||||
console.log(`Limit: ${limit || 'No limit'}`);
|
||||
console.log(`Dry run: ${dryRun ? 'Yes' : 'No'}`);
|
||||
console.log('============================================================');
|
||||
console.log('');
|
||||
|
||||
|
||||
|
||||
// Job tracking
|
||||
let jobId = await createOrResumeJob(args);
|
||||
if (!jobId && !dryRun) {
|
||||
jobId = await createNewJob({ countryCode, limit, dryRun });
|
||||
}
|
||||
if (jobId) console.log(`Job ID: ${jobId}\n`);
|
||||
|
||||
const stats = await enrichChurches(countryCode, limit, dryRun, jobId);
|
||||
|
||||
console.log('');
|
||||
console.log('============================================================');
|
||||
console.log('Enrichment Summary');
|
||||
console.log('============================================================');
|
||||
console.log(`Churches processed: ${stats.processed}`);
|
||||
console.log(`Churches enriched: ${stats.enriched}`);
|
||||
console.log(`Not found on Google: ${stats.notFound}`);
|
||||
console.log(`Websites added: ${stats.websitesAdded}`);
|
||||
console.log(`Phone numbers added: ${stats.phonesAdded}`);
|
||||
console.log(`Errors encountered: ${stats.errors}`);
|
||||
console.log('============================================================');
|
||||
|
||||
// Complete job
|
||||
if (jobId) {
|
||||
await completeJob(jobId);
|
||||
}
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
main().catch((error) => {
|
||||
console.error('Fatal error:', error);
|
||||
process.exit(1);
|
||||
});
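Priority mode in `enrichChurches` fills one work queue by walking `COUNTRY_PRIORITY` in order and taking as many un-searched churches from each country as still fit under the target. A minimal in-memory sketch of that selection follows, assuming a hypothetical `fillByPriority` helper and a plain map of pending counts standing in for the Prisma queries:

```typescript
// Hypothetical stand-in for the per-country Prisma query: how many
// un-searched churches each country still has.
type PendingCounts = Record<string, number>;

interface QueuedBatch {
  country: string;
  taken: number;
}

// Walk the priority list in order, exhausting each country before moving on,
// and stop once the target total has been queued.
function fillByPriority(
  priority: string[],
  pending: PendingCounts,
  targetTotal: number,
): QueuedBatch[] {
  const batches: QueuedBatch[] = [];
  let queued = 0;
  for (const country of priority) {
    if (queued >= targetTotal) break;
    const remaining = targetTotal - queued;
    const taken = Math.min(pending[country] ?? 0, remaining);
    if (taken > 0) {
      batches.push({ country, taken });
      queued += taken;
    }
  }
  return batches;
}

const batches = fillByPriority(['FR', 'IT', 'ES'], { FR: 250, IT: 0, ES: 500 }, 390);
console.log(batches);
// → [ { country: 'FR', taken: 250 }, { country: 'ES', taken: 140 } ]
```

Countries with nothing pending are skipped, and the last country contributes only what is needed to reach the target.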
|
||||
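The budget figures in the header comment of `scripts/enrich-with-google-places.ts` follow from simple arithmetic. A short sketch, assuming the prices quoted in that comment ($200 monthly credit, about $17 per 1,000 Text Search requests) and a 30-day month; the constant names are illustrative and the prices themselves are not verified here:

```typescript
const MONTHLY_CREDIT_USD = 200; // from the script's header comment
const COST_PER_1000_USD = 17;   // from the script's header comment
const DAYS_PER_MONTH = 30;      // assumption
const DELAY_MS = 2000;          // same value as the script's RATE_LIMIT_MS

// Requests the monthly credit pays for, rounded down.
const requestsPerMonth = Math.floor((MONTHLY_CREDIT_USD / COST_PER_1000_USD) * 1000);

// Daily budget that keeps a month of runs inside the credit.
const requestsPerDay = Math.floor(requestsPerMonth / DAYS_PER_MONTH);

// Upper bound on throughput imposed by the fixed delay between requests.
const maxPerHour = (60 * 60 * 1000) / DELAY_MS;

console.log({ requestsPerMonth, requestsPerDay, maxPerHour });
// → { requestsPerMonth: 11764, requestsPerDay: 392, maxPerHour: 1800 }
```

The script rounds the daily figure down to 390, which is also the default `targetTotal` in priority mode. At 1,800 requests per hour the delay is not the binding constraint; the daily budget is.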
624
scripts/enrich-with-reverse-geocode.ts
Normal file
@@ -0,0 +1,624 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Enrich churches with city/state/zip via Nominatim reverse geocoding (OSM)
|
||||
*
|
||||
* Usage:
|
||||
* npx tsx scripts/enrich-with-reverse-geocode.ts --country FR --limit 10 --dry-run
|
||||
* npx tsx scripts/enrich-with-reverse-geocode.ts --country FR --continuous
|
||||
* npx tsx scripts/enrich-with-reverse-geocode.ts --continuous
|
||||
*
|
||||
* Rate limit: 1 request/second (Nominatim usage policy — mandatory).
|
||||
* Full pass of ~193K churches takes ~2.5 days at the 1.1 s delay used below.
|
||||
*/
|
||||
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import axios from 'axios';
|
||||
|
||||
// Fresh DB connection (not cached singleton)
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
const NOMINATIM_URL = 'https://nominatim.openstreetmap.org/reverse';
|
||||
const RATE_LIMIT_MS = 1100; // Slightly over 1s to stay safe
|
||||
const BATCH_SIZE = 50;
|
||||
const PROGRESS_INTERVAL = 10;
|
||||
|
||||
// --- Job Tracking ---
|
||||
|
||||
async function createOrResumeJob(args: string[]): Promise<string | null> {
|
||||
const jobIdIndex = args.indexOf('--job-id');
|
||||
if (jobIdIndex !== -1) {
|
||||
const jobId = args[jobIdIndex + 1];
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: { status: 'running', startedAt: new Date() },
|
||||
});
|
||||
return jobId;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
async function createNewJob(config: Record<string, unknown>): Promise<string> {
|
||||
const job = await prisma.backgroundJob.create({
|
||||
data: {
|
||||
type: 'reverse-geocode-enrichment',
|
||||
status: 'running',
|
||||
startedAt: new Date(),
|
||||
config,
|
||||
},
|
||||
});
|
||||
return job.id;
|
||||
}
|
||||
|
||||
async function updateJobProgress(jobId: string, stats: EnrichmentStats, totalItems: number): Promise<void> {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: {
|
||||
processed: stats.processed,
|
||||
succeeded: stats.enriched,
|
||||
failed: stats.errors,
|
||||
itemsFound: stats.enriched,
|
||||
totalItems,
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
async function checkJobStopping(jobId: string): Promise<boolean> {
|
||||
const job = await prisma.backgroundJob.findUnique({ where: { id: jobId } });
|
||||
return job?.status === 'stopping';
|
||||
}
|
||||
|
||||
async function completeJob(jobId: string, error?: string): Promise<void> {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: {
|
||||
status: error ? 'failed' : 'completed',
|
||||
error,
|
||||
completedAt: new Date(),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
// --- Types ---
|
||||
|
||||
interface ChurchRecord {
|
||||
id: string;
|
||||
name: string;
|
||||
address: string | null;
|
||||
city: string | null;
|
||||
state: string | null;
|
||||
zip: string | null;
|
||||
country: string;
|
||||
latitude: number;
|
||||
longitude: number;
|
||||
}
|
||||
|
||||
interface NominatimAddress {
|
||||
house_number?: string;
|
||||
road?: string;
|
||||
city?: string;
|
||||
town?: string;
|
||||
village?: string;
|
||||
municipality?: string;
|
||||
hamlet?: string;
|
||||
suburb?: string;
|
||||
neighbourhood?: string;
|
||||
state?: string;
|
||||
province?: string;
|
||||
postcode?: string;
|
||||
country_code?: string;
|
||||
}
|
||||
|
||||
interface NominatimResponse {
|
||||
display_name?: string;
|
||||
address?: NominatimAddress;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
interface EnrichmentStats {
|
||||
processed: number;
|
||||
enriched: number;
|
||||
noCity: number;
|
||||
errors: number;
|
||||
skippedExisting: number;
|
||||
cycles: number;
|
||||
startTime: number;
|
||||
}
|
||||
|
||||
// --- Circuit Breaker ---
|
||||
|
||||
class CircuitBreaker {
|
||||
private failures = 0;
|
||||
private isOpen = false;
|
||||
private backoffMs = 60000; // Start at 60s for Nominatim
|
||||
private readonly maxBackoffMs = 300000; // 5 minutes
|
||||
private readonly threshold = 5;
|
||||
|
||||
async checkAndWait(): Promise<boolean> {
|
||||
if (!this.isOpen) return true;
|
||||
|
||||
log(`Circuit breaker open. Waiting ${Math.round(this.backoffMs / 1000)}s before retry...`);
|
||||
await sleep(this.backoffMs);
|
||||
|
||||
// Try a test request
|
||||
try {
|
||||
const resp = await axios.get(NOMINATIM_URL, {
|
||||
params: { lat: 48.8566, lon: 2.3522, format: 'json' },
|
||||
headers: { 'User-Agent': 'NearestMass/1.0 (privacy@nearestmass.com)' },
|
||||
timeout: 10000,
|
||||
});
|
||||
if (resp.status === 200) {
|
||||
this.reset();
|
||||
log('Circuit breaker closed: Nominatim is back');
|
||||
return true;
|
||||
}
|
||||
} catch {
|
||||
// Still down
|
||||
}
|
||||
|
||||
this.backoffMs = Math.min(this.backoffMs * 2, this.maxBackoffMs);
|
||||
return false;
|
||||
}
|
||||
|
||||
recordFailure() {
|
||||
this.failures++;
|
||||
if (this.failures >= this.threshold && !this.isOpen) {
|
||||
this.isOpen = true;
|
||||
this.backoffMs = 60000;
|
||||
log(`Circuit breaker OPEN after ${this.failures} consecutive failures`);
|
||||
}
|
||||
}
|
||||
|
||||
reset() {
|
||||
if (this.failures > 0 || this.isOpen) {
|
||||
this.failures = 0;
|
||||
this.isOpen = false;
|
||||
this.backoffMs = 60000;
|
||||
}
|
||||
}
|
||||
|
||||
get opened() { return this.isOpen; }
|
||||
}
|
||||
|
||||
// --- Helpers ---
|
||||
|
||||
let shuttingDown = false;
|
||||
|
||||
function log(msg: string) {
|
||||
console.log(`[${new Date().toISOString()}] ${msg}`);
|
||||
}
|
||||
|
||||
function logError(msg: string) {
|
||||
console.error(`[${new Date().toISOString()}] ${msg}`);
|
||||
}
|
||||
|
||||
function sleep(ms: number): Promise<void> {
|
||||
return new Promise(resolve => {
|
||||
const timer = setTimeout(resolve, ms);
|
||||
const check = setInterval(() => {
|
||||
if (shuttingDown) {
|
||||
clearTimeout(timer);
|
||||
clearInterval(check);
|
||||
resolve();
|
||||
}
|
||||
}, 1000);
|
||||
setTimeout(() => clearInterval(check), ms + 100);
|
||||
});
|
||||
}
|
||||
|
||||
// --- Nominatim API ---
|
||||
|
||||
async function reverseGeocode(lat: number, lng: number): Promise<NominatimResponse> {
|
||||
const response = await axios.get(NOMINATIM_URL, {
|
||||
params: {
|
||||
lat,
|
||||
lon: lng,
|
||||
format: 'json',
|
||||
zoom: 16,
|
||||
addressdetails: 1,
|
||||
},
|
||||
headers: {
|
||||
'User-Agent': 'NearestMass/1.0 (privacy@nearestmass.com)',
|
||||
'Accept-Language': 'en',
|
||||
},
|
||||
timeout: 15000,
|
||||
});
|
||||
return response.data;
|
||||
}
|
||||
|
||||
function extractCity(address: NominatimAddress): string | null {
|
||||
return address.city || address.town || address.village ||
|
||||
address.municipality || address.hamlet || null;
|
||||
}
|
||||
|
||||
function extractState(address: NominatimAddress): string | null {
|
||||
return address.state || address.province || null;
|
||||
}
|
||||
|
||||
function extractAddress(address: NominatimAddress): string | null {
|
||||
const parts: string[] = [];
|
||||
if (address.house_number) parts.push(address.house_number);
|
||||
if (address.road) parts.push(address.road);
|
||||
if (parts.length === 0) return null;
|
||||
return parts.join(' ');
|
||||
}
|
||||
|
||||
// --- Database Queries ---
|
||||
|
||||
async function getNextBatch(
|
||||
batchSize: number,
|
||||
countryCode?: string,
|
||||
): Promise<ChurchRecord[]> {
|
||||
return prisma.church.findMany({
|
||||
where: {
|
||||
city: null,
|
||||
latitude: { not: undefined }, // Prisma ignores undefined, so this filter matches every row
|
||||
longitude: { not: undefined },
|
||||
reverseGeocodedAt: null,
|
||||
...(countryCode ? { country: countryCode } : {}),
|
||||
},
|
||||
select: {
|
||||
id: true, name: true, address: true, city: true, state: true, zip: true,
|
||||
country: true, latitude: true, longitude: true,
|
||||
},
|
||||
take: batchSize,
|
||||
orderBy: [
|
||||
{ country: 'asc' },
|
||||
{ createdAt: 'asc' },
|
||||
],
|
||||
});
|
||||
}
|
||||
|
||||
async function getTotalRemaining(countryCode?: string): Promise<number> {
|
||||
return prisma.church.count({
|
||||
where: {
|
||||
city: null,
|
||||
latitude: { not: undefined }, // no-op filter, kept in sync with getNextBatch
|
||||
longitude: { not: undefined },
|
||||
reverseGeocodedAt: null,
|
||||
...(countryCode ? { country: countryCode } : {}),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
// --- Main Processing ---
|
||||
|
||||
async function processChurch(
|
||||
church: ChurchRecord,
|
||||
stats: EnrichmentStats,
|
||||
dryRun: boolean,
|
||||
): Promise<void> {
|
||||
const label = `${church.name} (${church.country})`;
|
||||
|
||||
try {
|
||||
const result = await reverseGeocode(church.latitude, church.longitude);
|
||||
|
||||
if (result.error || !result.address) {
|
||||
log(` - [${stats.processed}] ${label} => no address data`);
|
||||
stats.noCity++;
|
||||
if (!dryRun) {
|
||||
await prisma.church.update({
|
||||
where: { id: church.id },
|
||||
data: { reverseGeocodedAt: new Date() },
|
||||
});
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
const address = extractAddress(result.address);
|
||||
const city = extractCity(result.address);
|
||||
const state = extractState(result.address);
|
||||
const zip = result.address.postcode || null;
|
||||
|
||||
if (city) {
|
||||
const addrStr = address ? `${address}, ` : '';
|
||||
log(` + [${stats.processed}] ${label} => ${addrStr}${city}, ${state || '?'}`);
|
||||
stats.enriched++;
|
||||
} else {
|
||||
log(` - [${stats.processed}] ${label} => no city in response`);
|
||||
stats.noCity++;
|
||||
}
|
||||
|
||||
if (!dryRun) {
|
||||
const updateData: Record<string, unknown> = {
|
||||
reverseGeocodedAt: new Date(),
|
||||
};
|
||||
// Only update fields that are currently null
|
||||
if (address && !church.address) updateData.address = address;
|
||||
if (city && !church.city) updateData.city = city;
|
||||
if (state && !church.state) updateData.state = state;
|
||||
if (zip && !church.zip) updateData.zip = zip;
|
||||
// Update country if currently unknown (XX) and Nominatim returned one
|
||||
const countryCodeResult = result.address.country_code?.toUpperCase();
|
||||
if (church.country === 'XX' && countryCodeResult && countryCodeResult !== 'XX') {
|
||||
updateData.country = countryCodeResult;
|
||||
}
|
||||
|
||||
await prisma.church.update({
|
||||
where: { id: church.id },
|
||||
data: updateData,
|
||||
});
|
||||
}
|
||||
} catch (error: any) {
|
||||
stats.errors++;
|
||||
|
||||
// Handle rate limiting (429)
|
||||
if (error.response?.status === 429) {
|
||||
logError(` ! [${stats.processed}] ${label} => rate limited (429), backing off...`);
|
||||
await sleep(5000); // Extra 5s backoff
|
||||
throw error;
|
||||
}
|
||||
|
||||
// Handle server errors (5xx)
|
||||
if (error.response?.status >= 500) {
|
||||
logError(` ! [${stats.processed}] ${label} => server error (${error.response.status})`);
|
||||
throw error;
|
||||
}
|
||||
|
||||
logError(` ! [${stats.processed}] ${label} => ${error.message}`);
|
||||
// Don't throw for non-retriable errors (just mark as attempted)
|
||||
if (!dryRun) {
|
||||
await prisma.church.update({
|
||||
where: { id: church.id },
|
||||
data: { reverseGeocodedAt: new Date() },
|
||||
});
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
async function runSinglePass(
|
||||
stats: EnrichmentStats,
|
||||
countryCode?: string,
|
||||
limit?: number,
|
||||
dryRun: boolean = false,
|
||||
jobId?: string | null,
|
||||
): Promise<void> {
|
||||
let totalProcessed = 0;
|
||||
const circuitBreaker = new CircuitBreaker();
|
||||
|
||||
while (!shuttingDown) {
|
||||
if (limit && totalProcessed >= limit) break;
|
||||
|
||||
// Circuit breaker check
|
||||
if (circuitBreaker.opened) {
|
||||
const ok = await circuitBreaker.checkAndWait();
|
||||
if (!ok) continue;
|
||||
}
|
||||
|
||||
const batchLimit = limit
|
||||
? Math.min(BATCH_SIZE, limit - totalProcessed)
|
||||
: BATCH_SIZE;
|
||||
|
||||
const churches = await getNextBatch(batchLimit, countryCode);
|
||||
if (churches.length === 0) break;
|
||||
|
||||
for (const church of churches) {
|
||||
if (shuttingDown) break;
|
||||
if (limit && totalProcessed >= limit) break;
|
||||
|
||||
stats.processed++;
|
||||
totalProcessed++;
|
||||
|
||||
try {
|
||||
await processChurch(church, stats, dryRun);
|
||||
circuitBreaker.reset();
|
||||
} catch (error: any) {
|
||||
circuitBreaker.recordFailure();
|
||||
// Already logged in processChurch
|
||||
}
|
||||
|
||||
// Rate limit: 1 request per second
|
||||
if (!shuttingDown) {
|
||||
await sleep(RATE_LIMIT_MS);
|
||||
}
|
||||
|
||||
// Job tracking: update progress every PROGRESS_INTERVAL items
|
||||
if (jobId && stats.processed % PROGRESS_INTERVAL === 0) {
|
||||
await updateJobProgress(jobId, stats, 0);
|
||||
const stopping = await checkJobStopping(jobId);
|
||||
if (stopping) {
|
||||
log('Job stop requested via admin dashboard.');
|
||||
shuttingDown = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Progress logging
|
||||
if (stats.processed % 100 === 0) {
|
||||
const elapsed = (Date.now() - stats.startTime) / 1000;
|
||||
const rate = Math.round((stats.processed / elapsed) * 3600);
|
||||
const enrichRate = stats.processed > 0
|
||||
? ((stats.enriched / stats.processed) * 100).toFixed(1)
|
||||
: '0.0';
|
||||
log(`Progress: ${stats.processed} processed, ${stats.enriched} enriched, ${stats.noCity} no-city, ${stats.errors} errors`);
|
||||
log(` Enrich rate: ${enrichRate}%, Rate: ~${rate}/hour`);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
async function runContinuous(
|
||||
stats: EnrichmentStats,
|
||||
countryCode?: string,
|
||||
jobId?: string | null,
|
||||
): Promise<void> {
|
||||
log('Running in continuous mode. Press Ctrl+C to stop.');
|
||||
const circuitBreaker = new CircuitBreaker();
|
||||
|
||||
while (!shuttingDown) {
|
||||
stats.cycles++;
|
||||
log(`--- Cycle ${stats.cycles} ---`);
|
||||
let processedInCycle = 0;
|
||||
|
||||
while (!shuttingDown) {
|
||||
// Circuit breaker check
|
||||
if (circuitBreaker.opened) {
|
||||
const ok = await circuitBreaker.checkAndWait();
|
||||
if (!ok) continue;
|
||||
}
|
||||
|
||||
const churches = await getNextBatch(BATCH_SIZE, countryCode);
|
||||
if (churches.length === 0) break;
|
||||
|
||||
for (const church of churches) {
|
||||
if (shuttingDown) break;
|
||||
|
||||
stats.processed++;
|
||||
processedInCycle++;
|
||||
|
||||
try {
|
||||
await processChurch(church, stats, false);
|
||||
circuitBreaker.reset();
|
||||
} catch {
|
||||
circuitBreaker.recordFailure();
|
||||
}
|
||||
|
||||
// Rate limit
|
||||
if (!shuttingDown) {
|
||||
await sleep(RATE_LIMIT_MS);
|
||||
}
|
||||
|
||||
// Job tracking
|
||||
if (jobId && stats.processed % PROGRESS_INTERVAL === 0) {
|
||||
await updateJobProgress(jobId, stats, 0);
|
||||
const stopping = await checkJobStopping(jobId);
|
||||
if (stopping) {
|
||||
log('Job stop requested via admin dashboard.');
|
||||
shuttingDown = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Progress logging
|
||||
if (stats.processed % 100 === 0) {
|
||||
const elapsed = (Date.now() - stats.startTime) / 1000;
|
||||
const rate = Math.round((stats.processed / elapsed) * 3600);
|
||||
log(`Progress: ${stats.processed} processed, ${stats.enriched} enriched, ${stats.noCity} no-city, ${stats.errors} errors (~${rate}/hour)`);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (shuttingDown) break;
|
||||
|
||||
    if (processedInCycle === 0) {
      log('No churches needing reverse geocoding. Waiting 1 hour...');
      // Sleep in 10-second slices (360 x 10 s = 1 hour) so a shutdown request is noticed promptly.
      for (let i = 0; i < 360 && !shuttingDown; i++) {
        await sleep(10000);
      }
    } else {
      log(`Cycle ${stats.cycles} complete. ${processedInCycle} churches processed. Brief pause...`);
      await sleep(10000);
    }
}
|
||||
}
|
||||
|
||||
// --- Main ---
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const countryIndex = args.indexOf('--country');
|
||||
const limitIndex = args.indexOf('--limit');
|
||||
const dryRun = args.includes('--dry-run');
|
||||
const continuous = args.includes('--continuous');
|
||||
|
||||
const countryCode = countryIndex !== -1 ? args[countryIndex + 1] : undefined;
|
||||
const limit = limitIndex !== -1 ? parseInt(args[limitIndex + 1], 10) : undefined;
|
||||
|
||||
// Graceful shutdown
|
||||
process.on('SIGTERM', () => {
|
||||
log('Received SIGTERM, finishing current request...');
|
||||
shuttingDown = true;
|
||||
});
|
||||
process.on('SIGINT', () => {
|
||||
log('Received SIGINT, finishing current request...');
|
||||
shuttingDown = true;
|
||||
});
|
||||
|
||||
log('============================================================');
|
||||
log('Nominatim Reverse Geocode Enrichment');
|
||||
log('============================================================');
|
||||
log(`Mode: ${continuous ? 'Continuous' : 'Single pass'}`);
|
||||
log(`Country: ${countryCode || 'All'}`);
|
||||
log(`Limit: ${limit || 'No limit'}`);
|
||||
log(`Dry run: ${dryRun ? 'Yes' : 'No'}`);
|
||||
log(`Rate limit: ${RATE_LIMIT_MS}ms between requests`);
|
||||
log('============================================================');
|
||||
|
||||
// Count remaining
|
||||
const remaining = await getTotalRemaining(countryCode);
|
||||
log(`Churches needing reverse geocoding: ${remaining}`);
|
||||
const estimatedHours = (remaining * RATE_LIMIT_MS / 1000 / 3600).toFixed(1);
|
||||
log(`Estimated time: ~${estimatedHours} hours @ 1 req/sec`);
|
||||
|
||||
if (remaining === 0) {
|
||||
log('Nothing to do!');
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
return;
|
||||
}
|
||||
|
||||
// Job tracking
|
||||
let jobId = await createOrResumeJob(args);
|
||||
if (!jobId) {
|
||||
jobId = await createNewJob({ countryCode, limit, continuous, dryRun });
|
||||
}
|
||||
log(`Job ID: ${jobId}`);
|
||||
|
||||
const stats: EnrichmentStats = {
|
||||
processed: 0,
|
||||
enriched: 0,
|
||||
noCity: 0,
|
||||
errors: 0,
|
||||
skippedExisting: 0,
|
||||
cycles: 0,
|
||||
startTime: Date.now(),
|
||||
};
|
||||
|
||||
if (continuous) {
|
||||
await runContinuous(stats, countryCode, jobId);
|
||||
} else {
|
||||
await runSinglePass(stats, countryCode, limit, dryRun, jobId);
|
||||
}
|
||||
|
||||
// Complete job
|
||||
if (jobId) {
|
||||
await updateJobProgress(jobId, stats, 0);
|
||||
await completeJob(jobId);
|
||||
}
|
||||
|
||||
// Print summary
|
||||
const elapsed = ((Date.now() - stats.startTime) / 1000).toFixed(1);
|
||||
const enrichRate = stats.processed > 0
|
||||
? ((stats.enriched / stats.processed) * 100).toFixed(1)
|
||||
: '0.0';
|
||||
|
||||
log('');
|
||||
log('============================================================');
|
||||
log('Reverse Geocode Enrichment Summary');
|
||||
log('============================================================');
|
||||
log(`Churches processed: ${stats.processed}`);
|
||||
log(`Cities found: ${stats.enriched}`);
|
||||
log(`No city in response: ${stats.noCity}`);
|
||||
log(`Errors: ${stats.errors}`);
|
||||
log(`Enrich rate: ${enrichRate}%`);
|
||||
log(`Elapsed: ${elapsed}s`);
|
||||
if (stats.cycles > 0) {
|
||||
log(`Cycles completed: ${stats.cycles}`);
|
||||
}
|
||||
log('============================================================');
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
main().catch((error) => {
|
||||
logError(`Fatal error: ${error.message}`);
|
||||
process.exit(1);
|
||||
});
|
||||
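The `main()` function above reads `--limit` with `parseInt`, so a non-numeric value becomes `NaN`, which the `if (limit && ...)` checks silently treat as "no limit". The following is a minimal sketch of stricter parsing, together with the same ETA arithmetic the script logs. `parseLimitArg` and `estimateHours` are hypothetical helper names introduced here for illustration; they are not part of the script.

```typescript
// Hypothetical helper (not in the original script): parse --limit strictly.
function parseLimitArg(args: string[]): number | undefined {
  const idx = args.indexOf('--limit');
  if (idx === -1) return undefined; // flag absent: no limit
  const raw = args[idx + 1];
  const value = parseInt(raw, 10);
  // Reject a missing or non-numeric value instead of silently running unlimited.
  if (isNaN(value) || value <= 0) {
    throw new Error(`--limit expects a positive integer, got: ${raw}`);
  }
  return value;
}

// Same arithmetic as the script's estimate: one request every rateLimitMs milliseconds.
function estimateHours(remaining: number, rateLimitMs: number): string {
  return (remaining * rateLimitMs / 1000 / 3600).toFixed(1);
}
```

With this approach a mistyped limit fails fast at startup rather than starting an unbounded run against a rate-limited API.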
328
scripts/enrich-with-wikidata.ts
Normal file
@@ -0,0 +1,328 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Enrich churches with website URLs from Wikidata
|
||||
*
|
||||
* Queries Wikidata SPARQL endpoint for Catholic churches that have official websites,
|
||||
* then matches them to existing churches in the database via proximity + name matching.
|
||||
*
|
||||
* Usage:
|
||||
* npx tsx scripts/enrich-with-wikidata.ts --dry-run
|
||||
* npx tsx scripts/enrich-with-wikidata.ts --execute
|
||||
* npx tsx scripts/enrich-with-wikidata.ts --execute --country DE
|
||||
* npx tsx scripts/enrich-with-wikidata.ts --job-id <uuid>
|
||||
*/
|
||||
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
import axios from 'axios';
|
||||
|
||||
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
const WIKIDATA_SPARQL_URL = 'https://query.wikidata.org/sparql';
|
||||
const MATCH_RADIUS_KM = 1.0; // Max distance for matching
|
||||
const BATCH_SIZE = 500; // SPARQL results per query
|
||||
|
||||
function log(msg: string) {
|
||||
console.log(`[${new Date().toISOString()}] ${msg}`);
|
||||
}
|
||||
|
||||
function logError(msg: string) {
|
||||
console.error(`[${new Date().toISOString()}] ${msg}`);
|
||||
}
|
||||
|
||||
// Haversine distance in km
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371;
  const dLat = (lat2 - lat1) * Math.PI / 180;
  const dLon = (lon2 - lon1) * Math.PI / 180;
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(lat1 * Math.PI / 180) * Math.cos(lat2 * Math.PI / 180) *
    Math.sin(dLon / 2) ** 2;
  return R * 2 * Math.asin(Math.sqrt(a));
}
|
||||
function normalizeForMatch(str: string): string {
  return str.toLowerCase()
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '') // strip accents
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}
|
||||
interface WikidataChurch {
|
||||
label: string;
|
||||
website: string;
|
||||
lat: number;
|
||||
lon: number;
|
||||
wikidataId: string;
|
||||
}
|
||||
|
||||
async function queryWikidata(country?: string, offset = 0): Promise<WikidataChurch[]> {
|
||||
// SPARQL query for Catholic churches with websites
|
||||
let countryFilter = '';
|
||||
if (country) {
|
||||
// Map ISO alpha-2 to Wikidata country item
|
||||
const countryMap: Record<string, string> = {
|
||||
DE: 'Q183', FR: 'Q142', ES: 'Q29', IT: 'Q38', PL: 'Q36',
|
||||
PT: 'Q45', BR: 'Q155', NL: 'Q55', CZ: 'Q213', HU: 'Q28',
|
||||
AT: 'Q40', BE: 'Q31', CH: 'Q39', IE: 'Q27', GB: 'Q145',
|
||||
US: 'Q30', CA: 'Q16', MX: 'Q96', AR: 'Q414', CO: 'Q739',
|
||||
HR: 'Q224', SK: 'Q214', SI: 'Q215',
|
||||
};
|
||||
const qid = countryMap[country];
|
||||
if (qid) {
|
||||
countryFilter = `?church wdt:P17 wd:${qid} .`;
|
||||
}
|
||||
}
|
||||
|
||||
  const sparql = `
    SELECT ?church ?churchLabel ?website ?lat ?lon WHERE {
      ?church wdt:P31/wdt:P279* wd:Q16970 .
      ?church wdt:P140 wd:Q9592 .
      ?church wdt:P856 ?website .
      ?church p:P625 ?coordStatement .
      ?coordStatement ps:P625 ?coord .
      BIND(geof:latitude(?coord) AS ?lat)
      BIND(geof:longitude(?coord) AS ?lon)
      ${countryFilter}
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr,es,it,pt,pl,nl,cs,hu" . }
    }
    ORDER BY ?church
    LIMIT ${BATCH_SIZE}
    OFFSET ${offset}
  `;
|
||||
const response = await axios.get(WIKIDATA_SPARQL_URL, {
|
||||
params: { query: sparql, format: 'json' },
|
||||
headers: {
|
||||
'User-Agent': 'NearestMass/1.0 (https://nearestmass.com; contact: privacy@nearestmass.com)',
|
||||
'Accept': 'application/sparql-results+json',
|
||||
},
|
||||
timeout: 60000,
|
||||
});
|
||||
|
||||
const bindings = response.data?.results?.bindings || [];
|
||||
return bindings.map((b: any) => ({
|
||||
label: b.churchLabel?.value || '',
|
||||
website: b.website?.value || '',
|
||||
lat: parseFloat(b.lat?.value || '0'),
|
||||
lon: parseFloat(b.lon?.value || '0'),
|
||||
wikidataId: b.church?.value?.replace('http://www.wikidata.org/entity/', '') || '',
|
||||
}));
|
||||
}
|
||||
|
||||
interface MatchResult {
|
||||
churchId: string;
|
||||
churchName: string;
|
||||
distance: number;
|
||||
nameScore: number;
|
||||
}
|
||||
|
||||
async function findMatch(wdChurch: WikidataChurch): Promise<MatchResult | null> {
  // Find nearby churches without a website.
  // The +/-0.01 degree bounding box (about 1.1 km north-south, less east-west away
  // from the equator) is only a cheap prefilter; the exact distance is checked below.
  const nearby = await prisma.church.findMany({
    where: {
      website: null,
      latitude: { gte: wdChurch.lat - 0.01, lte: wdChurch.lat + 0.01 },
      longitude: { gte: wdChurch.lon - 0.01, lte: wdChurch.lon + 0.01 },
    },
    select: { id: true, name: true, latitude: true, longitude: true },
    take: 20, // no orderBy is given, so which 20 rows come back is up to the database
  });

  if (nearby.length === 0) return null;

  // Score each candidate by the share of the Wikidata label's words (3+ characters)
  // that also appear in the candidate's name
  const wdNameNorm = normalizeForMatch(wdChurch.label);
  const wdWords = wdNameNorm.split(' ').filter(w => w.length >= 3);

  let bestMatch: MatchResult | null = null;

  for (const church of nearby) {
    const dist = haversineKm(wdChurch.lat, wdChurch.lon, church.latitude, church.longitude);
    if (dist > MATCH_RADIUS_KM) continue;

    const churchNameNorm = normalizeForMatch(church.name);
    const churchWords = churchNameNorm.split(' ').filter(w => w.length >= 3);

    // Count matching words
    let matchingWords = 0;
    for (const w of wdWords) {
      if (churchWords.includes(w)) matchingWords++;
    }

    const nameScore = wdWords.length > 0 ? matchingWords / wdWords.length : 0;

    // Require at least 50% word overlap, unless the candidate is within 100 m
    if (nameScore < 0.5 && dist > 0.1) continue;

    // Prefer the higher name score; break ties by the shorter distance
    if (!bestMatch || nameScore > bestMatch.nameScore ||
        (nameScore === bestMatch.nameScore && dist < bestMatch.distance)) {
      bestMatch = {
        churchId: church.id,
        churchName: church.name,
        distance: dist,
        nameScore,
      };
    }
  }

  return bestMatch;
}
|
||||
// --- Job Tracking ---
|
||||
|
||||
async function createOrResumeJob(args: string[]): Promise<string | null> {
|
||||
const jobIdIndex = args.indexOf('--job-id');
|
||||
if (jobIdIndex !== -1) {
|
||||
const jobId = args[jobIdIndex + 1];
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: { status: 'running', startedAt: new Date() },
|
||||
});
|
||||
return jobId;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const dryRun = !args.includes('--execute');
|
||||
const countryIdx = args.indexOf('--country');
|
||||
const country = countryIdx !== -1 ? args[countryIdx + 1] : undefined;
|
||||
|
||||
log('============================================================');
|
||||
log('Wikidata Church Website Enrichment');
|
||||
log('============================================================');
|
||||
log(`Mode: ${dryRun ? 'Dry run' : 'Execute'}`);
|
||||
log(`Country: ${country || 'All'}`);
|
||||
log('============================================================');
|
||||
|
||||
// Job tracking
|
||||
let jobId = await createOrResumeJob(args);
|
||||
if (!jobId && !dryRun) {
|
||||
const job = await prisma.backgroundJob.create({
|
||||
data: {
|
||||
type: 'wikidata-enrichment',
|
||||
status: 'running',
|
||||
startedAt: new Date(),
|
||||
config: { country, dryRun },
|
||||
},
|
||||
});
|
||||
jobId = job.id;
|
||||
log(`Job ID: ${jobId}`);
|
||||
}
|
||||
|
||||
let totalFetched = 0;
|
||||
let matched = 0;
|
||||
let updated = 0;
|
||||
let noMatch = 0;
|
||||
let alreadyHasWebsite = 0;
|
||||
let offset = 0;
|
||||
|
||||
try {
|
||||
while (true) {
|
||||
log(`Querying Wikidata (offset ${offset})...`);
|
||||
const results = await queryWikidata(country, offset);
|
||||
|
||||
if (results.length === 0) {
|
||||
log('No more results from Wikidata.');
|
||||
break;
|
||||
}
|
||||
|
||||
totalFetched += results.length;
|
||||
log(`Fetched ${results.length} churches from Wikidata (total: ${totalFetched})`);
|
||||
|
||||
for (const wdChurch of results) {
|
||||
if (!wdChurch.website || !wdChurch.lat || !wdChurch.lon) continue;
|
||||
|
||||
const match = await findMatch(wdChurch);
|
||||
|
||||
if (!match) {
|
||||
noMatch++;
|
||||
continue;
|
||||
}
|
||||
|
||||
matched++;
|
||||
log(` Match: "${wdChurch.label}" (${wdChurch.wikidataId}) -> "${match.churchName}" (dist: ${match.distance.toFixed(3)}km, score: ${match.nameScore.toFixed(2)})`);
|
||||
|
||||
if (!dryRun) {
|
||||
await prisma.church.update({
|
||||
where: { id: match.churchId },
|
||||
data: {
|
||||
website: wdChurch.website,
|
||||
hasWebsite: true,
|
||||
},
|
||||
});
|
||||
updated++;
|
||||
}
|
||||
}
|
||||
|
||||
// Rate limit SPARQL queries
|
||||
await new Promise(r => setTimeout(r, 2000));
|
||||
offset += BATCH_SIZE;
|
||||
|
||||
// Update job progress
|
||||
if (jobId) {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: {
|
||||
processed: totalFetched,
|
||||
succeeded: updated,
|
||||
itemsFound: matched,
|
||||
},
|
||||
});
|
||||
|
||||
// Check for stop
|
||||
const job = await prisma.backgroundJob.findUnique({ where: { id: jobId } });
|
||||
if (job?.status === 'stopping') {
|
||||
log('Job stop requested.');
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
} catch (error: any) {
|
||||
logError(`Error: ${error.message}`);
|
||||
if (jobId) {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: { status: 'failed', error: error.message, completedAt: new Date() },
|
||||
});
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
|
||||
// Complete job
|
||||
if (jobId) {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: jobId },
|
||||
data: { status: 'completed', completedAt: new Date(), processed: totalFetched, succeeded: updated, itemsFound: matched },
|
||||
});
|
||||
}
|
||||
|
||||
log('');
|
||||
log('============================================================');
|
||||
log('Wikidata Enrichment Summary');
|
||||
log('============================================================');
|
||||
log(`Wikidata churches fetched: ${totalFetched}`);
|
||||
log(`Matched to DB churches: ${matched}`);
|
||||
log(`Websites updated: ${updated}`);
|
||||
log(`No match found: ${noMatch}`);
|
||||
log(`Already had website: ${alreadyHasWebsite}`);
|
||||
log('============================================================');
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
|
||||
main().catch((error) => {
|
||||
logError(`Fatal error: ${error.message}`);
|
||||
process.exit(1);
|
||||
});
|
||||
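The matching rule in `findMatch()` above combines a distance cut-off with a word-overlap score. The following is a minimal, database-free sketch of that rule so it can be tried on sample names. `haversineKm`, `normalizeForMatch` and `MATCH_RADIUS_KM` are copied from the script; `nameOverlapScore` and `isCandidateAccepted` are hypothetical names for logic that the script writes inline.

```typescript
const MATCH_RADIUS_KM = 1.0; // same cut-off as the script

// Great-circle distance in km (copied from the script).
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const R = 6371;
  const dLat = (lat2 - lat1) * Math.PI / 180;
  const dLon = (lon2 - lon1) * Math.PI / 180;
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(lat1 * Math.PI / 180) * Math.cos(lat2 * Math.PI / 180) *
    Math.sin(dLon / 2) ** 2;
  return R * 2 * Math.asin(Math.sqrt(a));
}

// Lowercase, strip accents and punctuation, collapse whitespace (copied from the script).
function normalizeForMatch(str: string): string {
  return str.toLowerCase()
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '')
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical helper: share of the Wikidata label's words (3+ chars) found in the DB name.
function nameOverlapScore(wdLabel: string, dbName: string): number {
  const wdWords = normalizeForMatch(wdLabel).split(' ').filter(w => w.length >= 3);
  const dbWords = normalizeForMatch(dbName).split(' ').filter(w => w.length >= 3);
  if (wdWords.length === 0) return 0;
  const matching = wdWords.filter(w => dbWords.indexOf(w) !== -1).length;
  return matching / wdWords.length;
}

// Hypothetical helper: the accept/reject rule applied to each nearby candidate.
function isCandidateAccepted(score: number, distKm: number): boolean {
  if (distKm > MATCH_RADIUS_KM) return false;
  // Accept on >= 50% word overlap, or on proximity alone within 100 m.
  return !(score < 0.5 && distKm > 0.1);
}
```

One consequence worth noting: because punctuation is deleted rather than replaced by a space, a hyphenated label such as "Saint-Étienne" normalizes to the single word `saintetienne`, which will not match a database name written as two words.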
@@ -178,7 +178,6 @@ async function importFromBaidu(dryRun: boolean, resumeFromCell: number, jobId: s
|
||||
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -259,7 +258,6 @@ async function importFromBaidu(dryRun: boolean, resumeFromCell: number, jobId: s
|
||||
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'baidu',
       website: baiduChurch.website || null,
       phone: baiduChurch.phone || null,
|
||||
@@ -287,7 +287,6 @@ async function loadExistingCzechChurches(): Promise<ExistingChurch[]> {
|
||||
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -415,7 +414,6 @@ async function processChurch(
|
||||
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'bohosluzby',
       website: null,
       phone: null,
|
||||
@@ -30,6 +30,12 @@ import { findDuplicateChurch } from '../src/lib/church-matcher';
|
||||
import type { ExistingChurch } from '../src/lib/church-matcher';
|
||||
import { getDayNamesForCountry, buildDayPatterns } from '../src/scrapers/i18n/day-names';
|
||||
|
||||
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
|
||||
console.log(`Connecting to: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
|
||||
const pool = new Pool({ connectionString: dbUrl, ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined });
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
// ─── Site Config ─────────────────────────────────────────────────────────────
|
||||
|
||||
interface SiteConfig {
|
||||
@@ -218,6 +224,137 @@ function sleep(ms: number): Promise<void> {
|
||||
return new Promise(resolve => setTimeout(resolve, ms));
|
||||
}
|
||||
|
||||
// ─── DB Helpers ───────────────────────────────────────────────────────────────
|
||||
|
||||
async function loadExistingChurches(country: string): Promise<ExistingChurch[]> {
|
||||
console.log(`Loading existing ${country} churches from DB...`);
|
||||
const churches = await prisma.church.findMany({
|
||||
where: { country },
|
||||
select: {
|
||||
id: true, name: true, latitude: true, longitude: true,
|
||||
osmId: true, baiduId: true, masstimesId: true,
|
||||
orarimesseId: true, massSchedulesPhId: true, philmassId: true,
|
||||
horariosMisasId: true, mszeInfoId: true, weekdayMassesId: true,
|
||||
messesInfoId: true, bohosluzbyId: true, miserendId: true,
|
||||
kerknetId: true, gottesdienstzeitenId: true, discovermassId: true,
|
||||
buscarmisasNetworkId: true,
|
||||
source: true, website: true, phone: true, address: true, country: true,
|
||||
},
|
||||
});
|
||||
console.log(` Loaded ${churches.length} existing ${country} churches`);
|
||||
return churches as ExistingChurch[];
|
||||
}
|
||||
|
||||
// ─── Church Processing ────────────────────────────────────────────────────────
|
||||
|
||||
async function processChurch(
|
||||
url: string,
|
||||
domain: string,
|
||||
config: SiteConfig,
|
||||
existingChurches: ExistingChurch[],
|
||||
args: CLIArgs,
|
||||
stats: ImportStats,
|
||||
): Promise<void> {
|
||||
stats.total++;
|
||||
try {
|
||||
const html = await fetchWithRetry(url);
|
||||
const parsed = parseChurchPage(html, domain, url, config);
|
||||
if (!parsed) {
|
||||
console.log(` [skip] No name/coords: ${url}`);
|
||||
stats.skipped++;
|
||||
return;
|
||||
}
|
||||
|
||||
const masses = parseMassSchedule(html, config.country);
|
||||
|
||||
if (args.dryRun) {
|
||||
console.log(` [dry-run] ${parsed.name} — ${masses.length} masses`);
|
||||
return;
|
||||
}
|
||||
|
||||
const candidate = {
|
||||
name: parsed.name,
|
||||
lat: parsed.lat,
|
||||
lng: parsed.lng,
|
||||
buscarmisasNetworkId: parsed.externalId,
|
||||
};
|
||||
const duplicate = findDuplicateChurch(candidate, existingChurches);
|
||||
|
||||
if (duplicate) {
|
||||
const updateData: Record<string, unknown> = { buscarmisasNetworkId: parsed.externalId };
|
||||
if (!duplicate.phone && parsed.phone) updateData.phone = parsed.phone;
|
||||
if (parsed.lat !== 0 && duplicate.latitude === 0) {
|
||||
updateData.latitude = parsed.lat;
|
||||
updateData.longitude = parsed.lng;
|
||||
}
|
||||
|
||||
await prisma.$transaction(async (tx) => {
|
||||
await tx.church.update({ where: { id: duplicate.id }, data: updateData });
|
||||
if (masses.length > 0) {
|
||||
await tx.massSchedule.deleteMany({ where: { churchId: duplicate.id } });
|
||||
await tx.massSchedule.createMany({
|
||||
data: masses.map(m => ({ churchId: duplicate.id, dayOfWeek: m.dayOfWeek, time: m.time, language: config.language === 'pt' ? 'Portuguese' : 'Spanish', notes: null })),
|
||||
});
|
||||
}
|
||||
await tx.church.update({ where: { id: duplicate.id }, data: { lastScrapedAt: new Date() } });
|
||||
});
|
||||
duplicate.buscarmisasNetworkId = parsed.externalId;
|
||||
stats.updated++;
|
||||
} else {
|
||||
const church = await prisma.church.create({
|
||||
data: {
|
||||
name: parsed.name,
|
||||
address: parsed.address,
|
||||
city: parsed.city,
|
||||
state: parsed.state,
|
||||
country: parsed.country,
|
||||
phone: parsed.phone,
|
||||
latitude: parsed.lat,
|
||||
longitude: parsed.lng,
|
||||
buscarmisasNetworkId: parsed.externalId,
|
||||
source: 'buscarmisas-network',
|
||||
hasWebsite: false,
|
||||
},
|
||||
});
|
||||
|
||||
existingChurches.push({
|
||||
id: church.id, name: parsed.name, latitude: parsed.lat, longitude: parsed.lng,
|
||||
osmId: null, baiduId: null, masstimesId: null, orarimesseId: null,
|
||||
massSchedulesPhId: null, philmassId: null, horariosMisasId: null,
|
||||
mszeInfoId: null, weekdayMassesId: null, messesInfoId: null,
|
||||
bohosluzbyId: null, miserendId: null, kerknetId: null,
|
||||
gottesdienstzeitenId: null, discovermassId: null,
|
||||
buscarmisasNetworkId: parsed.externalId,
|
||||
source: 'buscarmisas-network', website: null, phone: parsed.phone,
|
||||
address: parsed.address, country: parsed.country,
|
||||
});
|
||||
|
||||
if (masses.length > 0) {
|
||||
await prisma.massSchedule.createMany({
|
||||
data: masses.map(m => ({
|
||||
churchId: church.id,
|
||||
dayOfWeek: m.dayOfWeek,
|
||||
time: m.time,
|
||||
language: config.language === 'pt' ? 'Portuguese' : 'Spanish',
|
||||
notes: null,
|
||||
})),
|
||||
});
|
||||
await prisma.church.update({ where: { id: church.id }, data: { lastScrapedAt: new Date() } });
|
||||
}
|
||||
stats.created++;
|
||||
}
|
||||
|
||||
stats.massSchedulesCreated += masses.length;
|
||||
console.log(
|
||||
` [${duplicate ? 'update' : 'create'}] ${parsed.name} — ${masses.length} masses — ` +
|
||||
`${stats.total} total (${stats.created}↑ ${stats.updated}↻ ${stats.errors}✗)`
|
||||
);
|
||||
} catch (err) {
|
||||
stats.errors++;
|
||||
console.error(` [error] ${url}: ${err instanceof Error ? err.message : err}`);
|
||||
}
|
||||
}
|
||||
|
||||
// ─── Sitemap Discovery ────────────────────────────────────────────────────────
|
||||
|
||||
/**
|
||||
@@ -257,3 +394,141 @@ export async function getChurchUrls(domain: string, config: SiteConfig): Promise
|
||||
console.log(` Total church URLs: ${unique.length}`);
|
||||
return unique;
|
||||
}
|
||||
|
||||
// ─── CLI ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
function parseCLIArgs(): CLIArgs {
|
||||
const argv = process.argv.slice(2);
|
||||
const result: CLIArgs = { domain: null, all: false, dryRun: false, resumeFrom: 0, limit: null, jobId: null };
|
||||
for (let i = 0; i < argv.length; i++) {
|
||||
switch (argv[i]) {
|
||||
case '--domain': result.domain = argv[++i]; break;
|
||||
case '--all': result.all = true; break;
|
||||
case '--dry-run': result.dryRun = true; break;
|
||||
case '--resume-from': result.resumeFrom = parseInt(argv[++i], 10); break;
|
||||
case '--limit': result.limit = parseInt(argv[++i], 10); break;
|
||||
case '--job-id': result.jobId = argv[++i]; break;
|
||||
}
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
function validateArgs(args: CLIArgs): void {
|
||||
if (!args.domain && !args.all) {
|
||||
console.error('Usage:');
|
||||
console.error(' npx tsx scripts/import-buscarmisas-network.ts --domain <domain>');
|
||||
console.error(' npx tsx scripts/import-buscarmisas-network.ts --all');
|
||||
console.error('\nValid domains:', Object.keys(NETWORK_SITES).join(', '));
|
||||
process.exit(1);
|
||||
}
|
||||
if (args.domain && !NETWORK_SITES[args.domain]) {
|
||||
console.error(`Unknown domain: ${args.domain}`);
|
||||
console.error('Valid domains:', Object.keys(NETWORK_SITES).join(', '));
|
||||
process.exit(1);
|
||||
}
|
||||
if (args.all && args.resumeFrom > 0) {
|
||||
console.error('--resume-from cannot be used with --all. Use --domain to resume a specific site.');
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
async function runDomain(domain: string, config: SiteConfig, args: CLIArgs): Promise<ImportStats> {
|
||||
const stats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };
|
||||
|
||||
const allUrls = await getChurchUrls(domain, config);
|
||||
const existingChurches = await loadExistingChurches(config.country);
|
||||
|
||||
// Build set of already-imported IDs for fast skip
|
||||
const importedIds = new Set(
|
||||
existingChurches.filter(c => c.buscarmisasNetworkId).map(c => c.buscarmisasNetworkId!)
|
||||
);
|
||||
|
||||
let candidateUrls = allUrls.slice(args.resumeFrom).filter(url => {
|
||||
const externalId = buildExternalId(domain, url);
|
||||
return !importedIds.has(externalId);
|
||||
});
|
||||
if (args.limit !== null) candidateUrls = candidateUrls.slice(0, args.limit);
|
||||
|
||||
console.log(`\n${domain}: ${allUrls.length} total | ${importedIds.size} already imported | ${candidateUrls.length} to process\n`);
|
||||
|
||||
for (let i = 0; i < candidateUrls.length; i++) {
|
||||
const url = candidateUrls[i];
|
||||
console.log(`[${i + 1}/${candidateUrls.length}] ${url}`);
|
||||
await processChurch(url, domain, config, existingChurches, args, stats);
|
||||
if (i < candidateUrls.length - 1) await sleep(REQUEST_DELAY_MS);
|
||||
}
|
||||
|
||||
return stats;
|
||||
}
|
||||
|
||||
// ─── Main ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
async function main() {
|
||||
const args = parseCLIArgs();
|
||||
validateArgs(args);
|
||||
|
||||
if (args.jobId) {
|
||||
try {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: args.jobId },
|
||||
data: { status: 'running', startedAt: new Date() },
|
||||
});
|
||||
} catch { /* job may not exist yet */ }
|
||||
}
|
||||
|
||||
const domainsToRun: [string, SiteConfig][] = args.all
|
||||
? Object.entries(NETWORK_SITES)
|
||||
: [[args.domain!, NETWORK_SITES[args.domain!]]];
|
||||
|
||||
const totalStats: ImportStats = { total: 0, created: 0, updated: 0, skipped: 0, errors: 0, massSchedulesCreated: 0 };
|
||||
|
||||
try {
|
||||
for (let d = 0; d < domainsToRun.length; d++) {
|
||||
const [domain, config] = domainsToRun[d];
|
||||
console.log(`\n${'─'.repeat(60)}`);
|
||||
console.log(`Domain ${d + 1}/${domainsToRun.length}: ${domain} (${config.country})`);
|
||||
console.log('─'.repeat(60));
|
||||
const stats = await runDomain(domain, config, args);
|
||||
totalStats.total += stats.total;
|
||||
totalStats.created += stats.created;
|
||||
totalStats.updated += stats.updated;
|
||||
totalStats.skipped += stats.skipped;
|
||||
totalStats.errors += stats.errors;
|
||||
totalStats.massSchedulesCreated += stats.massSchedulesCreated;
|
||||
if (d < domainsToRun.length - 1) await sleep(DOMAIN_DELAY_MS);
|
||||
}
|
||||
} finally {
|
||||
console.log('\n─── Import Complete ───────────────────────────────────────');
|
||||
console.log(`Total processed: ${totalStats.total}`);
|
||||
console.log(`Created: ${totalStats.created}`);
|
||||
console.log(`Updated: ${totalStats.updated}`);
|
||||
console.log(`Skipped: ${totalStats.skipped}`);
|
||||
console.log(`Errors: ${totalStats.errors}`);
|
||||
console.log(`Mass schedules: ${totalStats.massSchedulesCreated}`);
|
||||
|
||||
if (args.jobId) {
|
||||
const status = totalStats.errors > totalStats.total * 0.1 ? 'failed' : 'completed';
|
||||
try {
|
||||
await prisma.backgroundJob.update({
|
||||
where: { id: args.jobId },
|
||||
data: {
|
||||
status,
|
||||
completedAt: new Date(),
|
||||
processed: totalStats.total,
|
||||
succeeded: totalStats.created + totalStats.updated,
|
||||
failed: totalStats.errors,
|
||||
itemsFound: totalStats.massSchedulesCreated,
|
||||
},
|
||||
});
|
||||
} catch { /* ignore */ }
|
||||
}
|
||||
|
||||
await prisma.$disconnect();
|
||||
await pool.end();
|
||||
}
|
||||
}
|
||||
|
||||
main().catch(err => {
|
||||
console.error('Fatal error:', err);
|
||||
process.exit(1);
|
||||
});
|
||||
|
||||
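The candidate selection in `runDomain()` above applies three steps in a fixed order: `--resume-from` slices the full sitemap list, already-imported entries are filtered out, and only then is `--limit` applied. The following is a minimal sketch of that ordering as a pure function. `selectCandidates` is a hypothetical name, and the `toExternalId` parameter is an assumption standing in for the script's `buildExternalId(domain, url)`.

```typescript
// Hypothetical helper illustrating the selection order; not part of the script.
function selectCandidates(
  allUrls: string[],
  importedIds: Set<string>,
  toExternalId: (url: string) => string, // stands in for buildExternalId(domain, url)
  resumeFrom: number,
  limit: number | null,
): string[] {
  const unimported = allUrls
    .slice(resumeFrom)                                     // 1. resume offset into the full list
    .filter(url => !importedIds.has(toExternalId(url)));   // 2. drop already-imported entries
  return limit !== null ? unimported.slice(0, limit) : unimported; // 3. cap the run
}

// Usage: five sitemap entries, two already imported, resume at index 1, limit 2.
const picked = selectCandidates(
  ['a', 'b', 'c', 'd', 'e'],
  new Set(['b', 'd']),
  url => url,
  1,
  2,
);
// picked is ['c', 'e']
```

Because the limit is applied after filtering, `--limit N` means "N churches not yet imported" rather than "N sitemap entries", while `--resume-from` still indexes into the unfiltered sitemap.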
@@ -94,6 +94,7 @@ interface CLIArgs {
|
||||
   all: boolean;
   dryRun: boolean;
   resumeFrom?: number;
+  limit?: number;
   jobId?: string;
 }
|
||||
@@ -507,6 +508,7 @@ function parseCLIArgs(): CLIArgs {
|
||||
       case '--all': result.all = true; break;
       case '--dry-run': result.dryRun = true; break;
       case '--resume-from': result.resumeFrom = parseInt(args[++i], 10); break;
+      case '--limit': result.limit = parseInt(args[++i], 10); break;
       case '--job-id': result.jobId = args[++i]; break;
     }
   }
@@ -540,14 +542,25 @@ async function main() {
|
||||
   try {
     const urls = await getAllChurchUrls();
     const existingChurches = await loadExistingChurches();
+
+    // Skip already-imported churches — check discovermassId set in DB
+    const importedSlugs = new Set(
+      existingChurches.filter(c => c.discovermassId).map(c => c.discovermassId!)
+    );
+
+    // Apply --resume-from first, then filter to unimported, then apply --limit
     const startIdx = args.resumeFrom ?? 0;
-    const churchUrls = urls.slice(startIdx);
-    console.log(`\nProcessing ${churchUrls.length} churches (starting from index ${startIdx})...\n`);
+    const candidateUrls = urls.slice(startIdx).filter(url => {
+      const slug = url.replace('https://discovermass.com/church/', '').replace(/\/$/, '');
+      return !importedSlugs.has(slug);
+    });
+    const churchUrls = args.limit ? candidateUrls.slice(0, args.limit) : candidateUrls;
+
+    console.log(`\nSitemap total: ${urls.length} | Already imported: ${importedSlugs.size} | This run: ${churchUrls.length}${args.limit ? ` (limit ${args.limit})` : ''}\n`);
 
     for (let i = 0; i < churchUrls.length; i++) {
       const url = churchUrls[i];
-      const overallIdx = startIdx + i;
-      console.log(`[${overallIdx + 1}/${urls.length}] ${url}`);
+      console.log(`[${i + 1}/${churchUrls.length}] ${url}`);
       await processChurch(url, existingChurches, args, stats);
       if (i < churchUrls.length - 1) {
         await sleep(REQUEST_DELAY_MS);
|
||||
@@ -401,7 +401,6 @@ async function loadExistingChurches(): Promise<ExistingChurch[]> {
|
||||
miserendId: true,
|
||||
kerknetId: true,
|
||||
gottesdienstzeitenId: true,
|
||||
discovermassId: true,
|
||||
source: true,
|
||||
website: true,
|
||||
phone: true,
|
||||
@@ -516,7 +515,6 @@ async function importChurch(
|
||||
miserendId: null,
|
||||
kerknetId: null,
|
||||
gottesdienstzeitenId: null,
|
||||
discovermassId: null,
|
||||
source: 'gcatholic',
|
||||
website: church.website || null,
|
||||
phone: church.phone || null,
|
||||
|
||||
@@ -316,7 +316,6 @@ async function loadExistingGermanChurches(): Promise<ExistingChurch[]> {
|
||||
miserendId: true,
|
||||
kerknetId: true,
|
||||
gottesdienstzeitenId: true,
|
||||
discovermassId: true,
|
||||
source: true,
|
||||
website: true,
|
||||
phone: true,
|
||||
@@ -479,7 +478,6 @@ async function processDiocese(
|
||||
miserendId: null,
|
||||
kerknetId: null,
|
||||
gottesdienstzeitenId: gdzId,
|
||||
discovermassId: null,
|
||||
source: 'gottesdienstzeiten',
|
||||
website: church.website,
|
||||
phone: church.phone,
|
||||
|
||||
244
scripts/import-hk-parishes.test.ts
Normal file
@@ -0,0 +1,244 @@
|
||||
import { test } from 'node:test';
|
||||
import assert from 'node:assert/strict';
|
||||
import {
|
||||
splitEntries,
|
||||
extractNames,
|
||||
extractFields,
|
||||
normalizeTime,
|
||||
parseScheduleLine,
|
||||
parseWeekdayLine,
|
||||
parseEntry,
|
||||
normalizeName,
|
||||
findMatch,
|
||||
} from './import-hk-parishes.js';
|
||||
|
||||
// ─── Task 2: Entry splitter and name extractor ────────────────────────────────
|
||||
|
||||
test('splitEntries splits on Path/Close boundary', () => {
|
||||
const raw = `HONG KONG CHURCHES\n\nParish A\nChurch A\nPath\nClose\nAddress\n1 Main St\n\nParish B\nChurch B\nPath\nClose\nAddress\n2 Side St\n`;
|
||||
const entries = splitEntries(raw);
|
||||
assert.equal(entries.length, 2);
|
||||
assert.ok(entries[0].includes('Church A'));
|
||||
assert.ok(entries[1].includes('Church B'));
|
||||
});
|
||||
|
||||
test('extractNames returns locationName and parishName', () => {
|
||||
const pre = `Holy Cross Parish\nHOLY CROSS CHURCH`;
|
||||
const result = extractNames(pre);
|
||||
assert.equal(result.locationName, 'HOLY CROSS CHURCH');
|
||||
assert.equal(result.parishName, 'Holy Cross Parish');
|
||||
});
|
||||
|
||||
test('extractNames strips Share and leading-space artifacts', () => {
|
||||
const pre = `Share\n Carmelite Monastery\nSt. Anne's Parish\nCarmelite Monastery`;
|
||||
const result = extractNames(pre);
|
||||
assert.equal(result.locationName, 'Carmelite Monastery');
|
||||
assert.equal(result.parishName, "St. Anne's Parish");
|
||||
});
|
||||
|
||||
test('extractNames handles single name line', () => {
|
||||
const pre = `Cathedral Parish`;
|
||||
const result = extractNames(pre);
|
||||
assert.equal(result.locationName, 'Cathedral Parish');
|
||||
assert.equal(result.parishName, null);
|
||||
});
|
||||
|
||||
// ─── Task 3: Field extractor ──────────────────────────────────────────────────
|
||||
|
||||
test('extractFields parses address, phone, email', () => {
|
||||
const body = `Address\n1 Holy Cross Path, Shau Kei Wan, Hong Kong\n\nPhone\n(852)2560-1823\n\nFax\n(852)2535-8246\n\nEmail\nholycrosshk@gmail.com\n\nWebsite\nClick Here\n\nMass Time\n`;
|
||||
const f = extractFields(body);
|
||||
assert.equal(f.address, '1 Holy Cross Path, Shau Kei Wan, Hong Kong');
|
||||
assert.equal(f.phone, '(852)2560-1823');
|
||||
assert.equal(f.email, 'holycrosshk@gmail.com');
|
||||
});
|
||||
|
||||
test('extractFields handles missing fields gracefully', () => {
|
||||
const body = `Address\nSalesian School, 16 Chai Wan Road, Hong Kong.\n\nMass Time\n`;
|
||||
const f = extractFields(body);
|
||||
assert.equal(f.address, 'Salesian School, 16 Chai Wan Road, Hong Kong.');
|
||||
assert.equal(f.phone, null);
|
||||
assert.equal(f.email, null);
|
||||
});
|
||||
|
||||
test('extractFields strips full-width parens from phone', () => {
|
||||
const body = `Phone\n（852）2819-5777, 2819-5845\n\n`;
|
||||
const f = extractFields(body);
|
||||
assert.equal(f.phone, '(852)2819-5777, 2819-5845');
|
||||
});
|
||||
|
||||
// ─── Task 4: Time normalizer ──────────────────────────────────────────────────
|
||||
|
||||
test('normalizeTime handles am/pm with spaces', () => {
|
||||
assert.equal(normalizeTime('8:00am'), '08:00');
|
||||
assert.equal(normalizeTime('11:30 am'), '11:30');
|
||||
assert.equal(normalizeTime('6:00pm'), '18:00');
|
||||
assert.equal(normalizeTime('6:30 pm'), '18:30');
|
||||
});
|
||||
|
||||
test('normalizeTime handles a.m./p.m. format', () => {
|
||||
assert.equal(normalizeTime('7:00 a.m.'), '07:00');
|
||||
assert.equal(normalizeTime('7:45 a.m.'), '07:45');
|
||||
assert.equal(normalizeTime('6:00 p.m.'), '18:00');
|
||||
});
|
||||
|
||||
test('normalizeTime handles noon', () => {
|
||||
assert.equal(normalizeTime('12:00 noon'), '12:00');
|
||||
assert.equal(normalizeTime('12:30 pm'), '12:30');
|
||||
});
|
||||
|
||||
test('normalizeTime handles 12:00am as midnight', () => {
|
||||
assert.equal(normalizeTime('12:00am'), '00:00');
|
||||
});
|
||||
|
||||
test('normalizeTime returns null for unrecognised input', () => {
|
||||
assert.equal(normalizeTime('Monday'), null);
|
||||
assert.equal(normalizeTime(''), null);
|
||||
});
|
||||
|
||||
// ─── Task 5: Schedule line parser ────────────────────────────────────────────
|
||||
|
||||
test('parseScheduleLine parses single time with language', () => {
|
||||
const results = parseScheduleLine('9:30am (English)', 0);
|
||||
assert.equal(results.length, 1);
|
||||
assert.deepEqual(results[0], { dayOfWeek: 0, time: '09:30', language: 'English', notes: null });
|
||||
});
|
||||
|
||||
test('parseScheduleLine parses multiple comma-separated times', () => {
|
||||
const results = parseScheduleLine('8:00am,10:30 am (Cantonese)', 0);
|
||||
assert.equal(results.length, 2);
|
||||
assert.equal(results[0].time, '08:00');
|
||||
assert.equal(results[1].time, '10:30');
|
||||
assert.equal(results[1].language, 'Cantonese');
|
||||
});
|
||||
|
||||
test('parseScheduleLine handles missing closing paren', () => {
|
||||
const results = parseScheduleLine('9:30 am (Cantonese', 0);
|
||||
assert.equal(results[0].language, 'Cantonese');
|
||||
});
|
||||
|
||||
test('parseScheduleLine defaults language to English when not specified', () => {
|
||||
const results = parseScheduleLine('8:00am', 0);
|
||||
assert.equal(results[0].language, 'English');
|
||||
});
|
||||
|
||||
test('parseScheduleLine stores embedded note text', () => {
|
||||
const results = parseScheduleLine('9:00 am Sunday School & Family Mass,11:30am (English)', 0);
|
||||
assert.equal(results.length, 2);
|
||||
assert.equal(results[0].time, '09:00');
|
||||
assert.equal(results[0].notes, 'Sunday School & Family Mass');
|
||||
});
|
||||
|
||||
test('parseScheduleLine handles Saturday anticipated format variations', () => {
|
||||
const results = parseScheduleLine('Saturday 3:45 pm,Saturday 6:30 pm (Cantonese)', 6);
|
||||
assert.equal(results.length, 2);
|
||||
assert.equal(results[0].time, '15:45');
|
||||
assert.equal(results[1].time, '18:30');
|
||||
});
|
||||
|
||||
test('parseScheduleLine handles "on Saturday" suffix format', () => {
|
||||
const results = parseScheduleLine('6:00pm on Saturday (Cantonese)', 6);
|
||||
assert.equal(results.length, 1);
|
||||
assert.equal(results[0].time, '18:00');
|
||||
assert.equal(results[0].language, 'Cantonese');
|
||||
});
|
||||
|
||||
test('parseScheduleLine handles conditional prefix as notes', () => {
|
||||
const results = parseScheduleLine('5th Sunday of the month: 7:15 am (Tagalog)', 0);
|
||||
assert.equal(results.length, 1);
|
||||
assert.equal(results[0].time, '07:15');
|
||||
assert.equal(results[0].language, 'Tagalog');
|
||||
assert.ok(results[0].notes?.includes('5th Sunday'));
|
||||
});
|
||||
|
||||
// ─── Task 6: Weekday day-prefix parser ───────────────────────────────────────
|
||||
|
||||
test('parseWeekdayLine no prefix = all weekdays Mon-Fri', () => {
|
||||
const results = parseWeekdayLine('7:15 am (Cantonese)');
|
||||
assert.equal(results.length, 5);
|
||||
assert.ok(results.every(r => r.time === '07:15'));
|
||||
assert.ok(results.every(r => r.language === 'Cantonese'));
|
||||
assert.deepEqual(results.map(r => r.dayOfWeek), [1, 2, 3, 4, 5]);
|
||||
});
|
||||
|
||||
test('parseWeekdayLine abbreviation list', () => {
|
||||
const results = parseWeekdayLine('Mon., Tue., Thur. 8:00 a.m. (Cantonese)');
|
||||
assert.equal(results.length, 3);
|
||||
assert.deepEqual(results.map(r => r.dayOfWeek).sort(), [1, 2, 4]);
|
||||
});
|
||||
|
||||
test('parseWeekdayLine abbreviation range Mon. to Sat.', () => {
|
||||
const results = parseWeekdayLine('Mon. to Sat. 9:15 am (English)');
|
||||
assert.equal(results.length, 6);
|
||||
assert.deepEqual(results.map(r => r.dayOfWeek).sort(), [1, 2, 3, 4, 5, 6]);
|
||||
});
|
||||
|
||||
test('parseWeekdayLine full-word range Monday to Friday', () => {
|
||||
const results = parseWeekdayLine('Monday to Friday: 12:00 noon (English)');
|
||||
assert.equal(results.length, 5);
|
||||
assert.ok(results.every(r => r.time === '12:00'));
|
||||
});
|
||||
|
||||
test('parseWeekdayLine ampersand separator', () => {
|
||||
const results = parseWeekdayLine('Tue., Thur. & Sat. 9:45 a.m. (Cantonese)');
|
||||
assert.equal(results.length, 3);
|
||||
assert.deepEqual(results.map(r => r.dayOfWeek).sort(), [2, 4, 6]);
|
||||
});
|
||||
|
||||
test('parseWeekdayLine multiple time groups on one line', () => {
|
||||
const results = parseWeekdayLine('Monday to Saturday: 7:45 am,Monday to Friday: 12:00 noon,Monday to Friday: 6:00 pm (English)');
|
||||
assert.equal(results.length, 16);
|
||||
});
|
||||
|
||||
// ─── Task 7: Full entry parser ────────────────────────────────────────────────
|
||||
|
||||
test('parseEntry extracts names, fields, and schedules from a full entry', () => {
|
||||
const raw = `Holy Cross Parish\nHOLY CROSS CHURCH\nPath\nClose\nAddress\n1 Holy Cross Path, Shau Kei Wan, Hong Kong\n\nPhone\n(852)2560-1823\n\nEmail\nholycrosshk@gmail.com\n\nWebsite\nClick Here\n\nMass Time\nSunday Masses\n8:00am,9:30am (Cantonese)\n1:00 pm (English)\n\nAnticipated Sunday Masses\nSaturday 3:45 pm,Saturday 6:30 pm (Cantonese)\n\nWeekday Masses\n7:15 am (Cantonese)\n\nSpecial Masses\nSomething irrelevant\n`;
|
||||
const entry = parseEntry(raw);
|
||||
assert.equal(entry.locationName, 'HOLY CROSS CHURCH');
|
||||
assert.equal(entry.parishName, 'Holy Cross Parish');
|
||||
assert.equal(entry.address, '1 Holy Cross Path, Shau Kei Wan, Hong Kong');
|
||||
assert.equal(entry.phone, '(852)2560-1823');
|
||||
assert.equal(entry.email, 'holycrosshk@gmail.com');
|
||||
// Sunday: 2 Cantonese + 1 English = 3 entries
|
||||
const sunday = entry.schedules.filter(s => s.dayOfWeek === 0);
|
||||
assert.equal(sunday.length, 3);
|
||||
// Anticipated (Saturday): 2 entries
|
||||
const saturday = entry.schedules.filter(s => s.dayOfWeek === 6);
|
||||
assert.equal(saturday.length, 2);
|
||||
// Weekday: 5 entries (Mon–Fri)
|
||||
const weekday = entry.schedules.filter(s => s.dayOfWeek >= 1 && s.dayOfWeek <= 5);
|
||||
assert.equal(weekday.length, 5);
|
||||
});
|
||||
|
||||
// ─── Task 8: Name normalizer + matcher ───────────────────────────────────────
|
||||
|
||||
test('normalizeName strips noise words and lowercases', () => {
|
||||
assert.equal(normalizeName('HOLY CROSS CHURCH'), 'holy cross');
|
||||
assert.equal(normalizeName('Our Lady Of Mount Carmel Church'), 'mount carmel');
|
||||
assert.equal(normalizeName("St. Joseph's Parish"), 'joseph');
|
||||
assert.equal(normalizeName('Salesian Mass Centre'), 'salesian');
|
||||
});
|
||||
|
||||
test('findMatch matches by name overlap', () => {
|
||||
const existing = [
|
||||
{ id: '1', name: 'Holy Cross (Sai Wan Ho)', address: '1 Holy Cross Path', phone: null, email: null },
|
||||
{ id: '2', name: 'St Joseph (Central)', address: '37 Garden Road', phone: null, email: null },
|
||||
];
|
||||
assert.equal(findMatch('HOLY CROSS CHURCH', '1 Holy Cross Path', existing)?.id, '1');
|
||||
assert.equal(findMatch("St. Joseph's Church", '37 Garden Road', existing)?.id, '2');
|
||||
});
|
||||
|
||||
test('findMatch falls back to address prefix match', () => {
|
||||
const existing = [
|
||||
{ id: '3', name: '聖母聖衣堂 (Our Lady of Mount Carmel Wanchai)', address: 'No.1, Star Street', phone: null, email: null },
|
||||
];
|
||||
assert.equal(findMatch('Our Lady Of Mount Carmel Church', 'No.1, Star Street, Wan Chai', existing)?.id, '3');
|
||||
});
|
||||
|
||||
test('findMatch returns null for no match', () => {
|
||||
const existing = [
|
||||
{ id: '1', name: 'Holy Cross (Sai Wan Ho)', address: '1 Holy Cross Path', phone: null, email: null },
|
||||
];
|
||||
assert.equal(findMatch('Salesian Mass Centre', 'Salesian School, 16 Chai Wan Road', existing), null);
|
||||
});
|
||||
584
scripts/import-hk-parishes.ts
Normal file
@@ -0,0 +1,584 @@
|
||||
#!/usr/bin/env tsx
|
||||
/**
|
||||
* Import HK Diocese parish directory from plain-text paste.
|
||||
* Usage: npx tsx scripts/import-hk-parishes.ts [--dry-run] [--file scripts/hk-parishes.txt]
|
||||
*/
|
||||
|
||||
import dotenv from 'dotenv';
|
||||
import path from 'path';
|
||||
import fs from 'fs';
|
||||
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env.local') });
|
||||
dotenv.config({ path: path.resolve(process.cwd(), '.env') });
|
||||
|
||||
import { Pool } from 'pg';
|
||||
import { PrismaPg } from '@prisma/adapter-pg';
|
||||
import { PrismaClient } from '@prisma/client';
|
||||
|
||||
const dbUrl = process.env.DATABASE_URL || 'postgresql://postgres:postgres@localhost:5432/nearestmass';
|
||||
console.log(`Connecting to database: ${dbUrl.replace(/:[^:@]+@/, ':***@')}`);
|
||||
const pool = new Pool({
|
||||
connectionString: dbUrl,
|
||||
ssl: dbUrl.includes('neon') ? { rejectUnauthorized: false } : undefined,
|
||||
});
|
||||
const adapter = new PrismaPg(pool);
|
||||
const prisma = new PrismaClient({ adapter });
|
||||
|
||||
// ─── Types ────────────────────────────────────────────────────────────────────
|
||||
|
||||
export interface ParsedSchedule {
|
||||
dayOfWeek: number; // 0=Sun, 1=Mon, ..., 6=Sat
|
||||
time: string; // "HH:MM"
|
||||
language: string; // "English" | "Cantonese" | "Tagalog"
|
||||
notes: string | null;
|
||||
}
|
||||
|
||||
export interface ParsedEntry {
|
||||
locationName: string;
|
||||
parishName: string | null;
|
||||
address: string | null;
|
||||
phone: string | null;
|
||||
email: string | null;
|
||||
schedules: ParsedSchedule[];
|
||||
}
|
||||
|
||||
interface ExistingChurch {
|
||||
id: string;
|
||||
name: string;
|
||||
address: string | null;
|
||||
phone: string | null;
|
||||
email: string | null;
|
||||
}
|
||||
|
||||
interface ImportStats {
|
||||
matched: number;
|
||||
created: number;
|
||||
schedulesWritten: number;
|
||||
skipped: number;
|
||||
}
|
||||
|
||||
// ─── Parser ───────────────────────────────────────────────────────────────────
|
||||
|
||||
const ARTIFACT_LINES = new Set(['share', 'path', 'close', '']);
|
||||
|
||||
const LANG_PATTERN = /(Cantonese|English|Tagalog|Chinese)/i;
|
||||
|
||||
// ─── Task 2: Entry splitter and name extractor ────────────────────────────────
|
||||
|
||||
/**
|
||||
* Split raw file text into individual entry strings.
|
||||
* Entries are delimited by "Path\nClose" which appears in every entry.
|
||||
* The header segment ("HONG KONG CHURCHES\n\n...") before the first entry is discarded.
|
||||
*/
|
||||
export function splitEntries(raw: string): string[] {
|
||||
const text = raw.replace(/\r\n/g, '\n').replace(/\r/g, '\n');
|
||||
const parts = text.split('\nPath\nClose\n');
|
||||
const entries: string[] = [];
|
||||
for (let i = 1; i < parts.length; i++) {
|
||||
const pre = parts[i - 1];
|
||||
const body = parts[i];
|
||||
entries.push(pre + '\nPath\nClose\n' + body);
|
||||
}
|
||||
return entries;
|
||||
}
|
||||
|
||||
/**
|
||||
* Extract location name and parish name from the pre-marker text of an entry.
|
||||
*/
|
||||
export function extractNames(preMarker: string): { locationName: string; parishName: string | null } {
|
||||
const lines = preMarker
|
||||
.split('\n')
|
||||
.map(l => l.trimEnd())
|
||||
.filter(l => {
|
||||
const lower = l.trim().toLowerCase();
|
||||
return !ARTIFACT_LINES.has(lower) && !l.startsWith(' ');
|
||||
})
|
||||
.filter(l => l.trim().length > 0);
|
||||
|
||||
const nameLines = lines.slice(-2);
|
||||
if (nameLines.length === 0) return { locationName: 'Unknown', parishName: null };
|
||||
if (nameLines.length === 1) return { locationName: nameLines[0].trim(), parishName: null };
|
||||
return {
|
||||
locationName: nameLines[1].trim(),
|
||||
parishName: nameLines[0].trim(),
|
||||
};
|
||||
}
|
||||
|
||||
// ─── Task 3: Field extractor ──────────────────────────────────────────────────
|
||||
|
||||
/**
|
||||
* Extract address, phone, email from the entry body (text after Path/Close).
|
||||
* Full-width parentheses （ ） (U+FF08, U+FF09) are normalised to ASCII ( ).
|
||||
*/
|
||||
export function extractFields(body: string): { address: string | null; phone: string | null; email: string | null } {
|
||||
const normalise = (s: string) => s.replace(/\uFF08/g, '(').replace(/\uFF09/g, ')').trim();
|
||||
|
||||
function extractField(fieldName: string): string | null {
|
||||
const regex = new RegExp(`\\b${fieldName}\\n([\\s\\S]*?)(?:\\n\\n|\\nFax|\\nEmail|\\nWebsite|\\nChurch|\\nParish|\\nAssistant|\\nDeacon|\\nSister|\\nChairperson|\\nResident|\\nRector|\\nP\\.C|\\nPastoral|\\nMass Time|$)`, 'i');
|
||||
const m = body.match(regex);
|
||||
if (!m) return null;
|
||||
const value = m[1].replace(/\n/g, ' ').trim();
|
||||
return value || null;
|
||||
}
|
||||
|
||||
const address = extractField('Address');
|
||||
const rawPhone = extractField('Phone');
|
||||
const email = extractField('Email');
|
||||
|
||||
return {
|
||||
address: address ? normalise(address) : null,
|
||||
phone: rawPhone ? normalise(rawPhone) : null,
|
||||
email: email || null,
|
||||
};
|
||||
}
|
||||
|
||||
// ─── Task 4: Time normalizer ──────────────────────────────────────────────────
|
||||
|
||||
/**
|
||||
* Normalise a time string to "HH:MM" 24-hour format.
|
||||
* Accepts: "8:00am", "11:30 am", "7:00 a.m.", "12:00 noon", etc.
|
||||
* Returns null if no valid time found.
|
||||
*/
|
||||
export function normalizeTime(raw: string): string | null {
|
||||
const s = raw.trim().toLowerCase();
|
||||
if (s.includes('noon')) {
|
||||
if (s === 'noon') return '12:00';
|
||||
const m = s.match(/(\d{1,2}):(\d{2})\s*noon/);
|
||||
if (m) return `${String(parseInt(m[1], 10)).padStart(2, '0')}:${m[2]}`;
|
||||
}
|
||||
|
||||
const m = s.match(/(\d{1,2}):(\d{2})\s*(am|pm|a\.m\.|p\.m\.)/);
|
||||
if (!m) return null;
|
||||
|
||||
let h = parseInt(m[1], 10);
|
||||
const min = parseInt(m[2], 10);
|
||||
const period = m[3].replace(/\./g, '').toLowerCase();
|
||||
|
||||
if (period === 'am') {
|
||||
if (h === 12) h = 0;
|
||||
} else {
|
||||
if (h !== 12) h += 12;
|
||||
}
|
||||
|
||||
return `${String(h).padStart(2, '0')}:${String(min).padStart(2, '0')}`;
|
||||
}
|
||||
|
||||
// ─── Task 5: Schedule line parser ────────────────────────────────────────────
|
||||
|
||||
const CONDITIONAL_PATTERN = /^([\w\s]+(?:Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|month)[^:]*:)\s*/i;
|
||||
|
||||
/**
|
||||
* Parse a single schedule text line into 0-N ParsedSchedule records.
|
||||
* dayOfWeek: the fixed day for this section (0=Sun, 6=Sat for Anticipated).
|
||||
*/
|
||||
export function parseScheduleLine(line: string, dayOfWeek: number): ParsedSchedule[] {
|
||||
let remainder = line.trim();
|
||||
let language = 'English';
|
||||
let sectionNotes: string | null = null;
|
||||
|
||||
// Extract language tag (with or without closing paren)
|
||||
const langMatch = remainder.match(/\(?(Cantonese|English|Tagalog|Chinese)\)?/i);
|
||||
if (langMatch) {
|
||||
const raw = langMatch[1].toLowerCase();
|
||||
language = raw === 'cantonese' || raw === 'chinese' ? 'Cantonese'
|
||||
: raw === 'tagalog' ? 'Tagalog'
|
||||
: 'English';
|
||||
remainder = remainder.replace(langMatch[0], '').trim();
|
||||
}
|
||||
|
||||
// Strip "Saturday" / "on Saturday" anchors (Anticipated Sunday section)
|
||||
remainder = remainder.replace(/\bSaturday\b/gi, '').replace(/\bon\b/gi, '').trim();
|
||||
|
||||
// Extract conditional note prefix
|
||||
const condMatch = remainder.match(CONDITIONAL_PATTERN);
|
||||
if (condMatch) {
|
||||
sectionNotes = condMatch[1].replace(/:$/, '').trim();
|
||||
remainder = remainder.slice(condMatch[0].length);
|
||||
}
|
||||
|
||||
// Split by comma into time tokens
|
||||
const tokens = remainder.split(',').map(t => t.trim()).filter(Boolean);
|
||||
const results: ParsedSchedule[] = [];
|
||||
|
||||
for (const token of tokens) {
|
||||
const time = normalizeTime(token);
|
||||
if (!time) continue;
|
||||
|
||||
// Anything in the token that isn't the time or period is a note
|
||||
const noteText = token
|
||||
.replace(/\d{1,2}:\d{2}\s*(am|pm|a\.m\.|p\.m\.|noon)/i, '')
|
||||
.replace(/\s+/g, ' ')
|
||||
.trim() || null;
|
||||
|
||||
results.push({
|
||||
dayOfWeek,
|
||||
time,
|
||||
language,
|
||||
notes: noteText || sectionNotes,
|
||||
});
|
||||
}
|
||||
|
||||
return results;
|
||||
}
|
||||
|
||||
// ─── Task 6: Weekday day-prefix parser ───────────────────────────────────────
|
||||
|
||||
const DAY_ABBREV: Record<string, number> = {
|
||||
mon: 1, tue: 2, wed: 3, thur: 4, thu: 4, fri: 5, sat: 6, sun: 0,
|
||||
};
|
||||
const DAY_FULL: Record<string, number> = {
|
||||
monday: 1, tuesday: 2, wednesday: 3, thursday: 4, friday: 5, saturday: 6, sunday: 0,
|
||||
};
|
||||
|
||||
function parseDays(prefix: string): number[] {
|
||||
const s = prefix.toLowerCase().replace(/\./g, '').replace(/:/g, '').trim();
|
||||
|
||||
// Range: "monday to friday" or "mon to sat"
|
||||
const rangeMatch = s.match(/(\w+)\s+to\s+(\w+)/);
|
||||
if (rangeMatch) {
|
||||
const fromDay = DAY_FULL[rangeMatch[1]] ?? DAY_ABBREV[rangeMatch[1]];
|
||||
const toDay = DAY_FULL[rangeMatch[2]] ?? DAY_ABBREV[rangeMatch[2]];
|
||||
if (fromDay !== undefined && toDay !== undefined) {
|
||||
const days: number[] = [];
|
||||
let d = fromDay;
|
||||
while (d !== toDay) { days.push(d); d = (d + 1) % 7; }
|
||||
days.push(toDay);
|
||||
return days;
|
||||
}
|
||||
}
|
||||
|
||||
// List: "mon, tue, thur" or "tue & sat"
|
||||
const tokens = s.split(/[,&\s]+/).map(t => t.trim()).filter(Boolean);
|
||||
const days = tokens
|
||||
.map(t => DAY_FULL[t] ?? DAY_ABBREV[t])
|
||||
.filter((d): d is number => d !== undefined);
|
||||
return [...new Set(days)];
|
||||
}
|
||||
|
||||
// Matches a day-prefix at the start of a token (requires trailing space/colon)
|
||||
const DAY_PREFIX_RE = /^((?:(?:Mon|Tue|Wed|Thur|Thu|Fri|Sat|Sun)\w*\.?\s*(?:[,&]\s*(?:Mon|Tue|Wed|Thur|Thu|Fri|Sat|Sun)\w*\.?\s*)*(?:to\s+\w+\.?\s*)?)|(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(?:\s+to\s+\w+)?))[\s:]+/i;
|
||||
|
||||
// Matches a token that is ONLY a day (or day list) with no time — e.g. "Mon." "Tue."
|
||||
const PURE_DAY_RE = /^((?:Mon|Tue|Wed|Thur|Thu|Fri|Sat|Sun)\w*\.?|(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday))\.?$/i;
|
||||
|
||||
/**
|
||||
* Parse a weekday mass line that may have day prefixes.
|
||||
* Algorithm: split by comma, process each token; track current days across tokens.
|
||||
*/
|
||||
export function parseWeekdayLine(line: string): ParsedSchedule[] {
|
||||
let remainder = line.trim();
|
||||
let language = 'English';
|
||||
|
||||
const langMatch = remainder.match(/\(?(Cantonese|English|Tagalog|Chinese)\)?/i);
|
||||
if (langMatch) {
|
||||
const raw = langMatch[1].toLowerCase();
|
||||
language = raw === 'cantonese' || raw === 'chinese' ? 'Cantonese'
|
||||
: raw === 'tagalog' ? 'Tagalog' : 'English';
|
||||
remainder = remainder.replace(langMatch[0], '').replace(/\s*\(\s*$/, '').trim();
|
||||
}
|
||||
|
||||
const results: ParsedSchedule[] = [];
|
||||
const tokens = remainder.split(',').map(t => t.trim()).filter(Boolean);
|
||||
let currentDays: number[] = [1, 2, 3, 4, 5]; // default Mon–Fri
|
||||
let accumulatedDays: number[] = []; // day-only tokens accumulate here until a time appears
|
||||
|
||||
for (const token of tokens) {
|
||||
const prefixMatch = token.match(DAY_PREFIX_RE);
|
||||
if (prefixMatch) {
|
||||
const days = parseDays(prefixMatch[1]);
|
||||
const timePart = token.slice(prefixMatch[0].length);
|
||||
const time = normalizeTime(timePart);
|
||||
if (time) {
|
||||
// Merge any previously accumulated day-only tokens with this token's days
|
||||
const mergedDays = accumulatedDays.length > 0
|
||||
? [...new Set([...accumulatedDays, ...days])]
|
||||
: days.length > 0 ? days : currentDays;
|
||||
accumulatedDays = [];
|
||||
if (mergedDays.length > 0) currentDays = mergedDays;
|
||||
for (const day of currentDays) results.push({ dayOfWeek: day, time, language, notes: null });
|
||||
} else {
|
||||
// Day-only token via prefix match: accumulate
|
||||
if (days.length > 0) accumulatedDays.push(...days);
|
||||
}
|
||||
} else if (PURE_DAY_RE.test(token)) {
|
||||
// Pure day token like "Mon." "Tue." "Tuesday" — accumulate
|
||||
const days = parseDays(token);
|
||||
if (days.length > 0) accumulatedDays.push(...days);
|
||||
} else {
|
||||
const time = normalizeTime(token);
|
||||
if (time) {
|
||||
// Apply any accumulated days, then reset
|
||||
if (accumulatedDays.length > 0) {
|
||||
currentDays = [...new Set(accumulatedDays)];
|
||||
accumulatedDays = [];
|
||||
}
|
||||
for (const day of currentDays) results.push({ dayOfWeek: day, time, language, notes: null });
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return results;
|
||||
}
|
||||
|
||||
// ─── Task 7: Full entry parser ────────────────────────────────────────────────
|
||||
|
||||
const SKIP_SECTIONS = new Set(['special masses', 'eucharist adoration']);
|
||||
|
||||
/**
|
||||
* Parse a full raw entry string (including pre-marker names) into a ParsedEntry.
|
||||
*/
|
||||
export function parseEntry(raw: string): ParsedEntry {
|
||||
const markerIdx = raw.indexOf('\nPath\nClose\n');
|
||||
const pre = markerIdx >= 0 ? raw.slice(0, markerIdx) : '';
|
||||
const body = markerIdx >= 0 ? raw.slice(markerIdx + '\nPath\nClose\n'.length) : raw;
|
||||
|
||||
const { locationName, parishName } = extractNames(pre);
|
||||
const { address, phone, email } = extractFields(body);
|
||||
|
||||
const schedules: ParsedSchedule[] = [];
|
||||
|
||||
const massSectionMatch = body.match(/Mass Time\n([\s\S]*?)(?:Share\n|$)/i);
|
||||
if (massSectionMatch) {
|
||||
const massText = massSectionMatch[1];
|
||||
const lines = massText.split('\n');
|
||||
let currentSection: string | null = null;
|
||||
|
||||
for (const line of lines) {
|
||||
const trimmed = line.trim();
|
||||
if (!trimmed) continue;
|
||||
|
||||
const lower = trimmed.toLowerCase();
|
||||
|
||||
if (lower === 'sunday masses') { currentSection = 'sunday'; continue; }
|
||||
if (lower === 'anticipated sunday masses') { currentSection = 'anticipated'; continue; }
|
||||
if (lower === 'weekday masses') { currentSection = 'weekday'; continue; }
|
||||
if (SKIP_SECTIONS.has(lower)) { currentSection = 'skip'; continue; }
|
||||
|
||||
if (currentSection === 'skip') continue;
|
||||
if (currentSection === null) continue;
|
||||
|
||||
if (currentSection === 'sunday') {
|
||||
schedules.push(...parseScheduleLine(trimmed, 0));
|
||||
} else if (currentSection === 'anticipated') {
|
||||
schedules.push(...parseScheduleLine(trimmed, 6));
|
||||
} else if (currentSection === 'weekday') {
|
||||
schedules.push(...parseWeekdayLine(trimmed));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return { locationName, parishName, address, phone, email, schedules };
|
||||
}
|
||||
|
||||
// ─── Task 8: Name normalizer + matcher ───────────────────────────────────────
|
||||
|
||||
const NOISE_WORDS = new Set([
|
||||
'church', 'parish', 'chapel', 'centre', 'center', 'mass',
|
||||
'saint', 'st', 'our', 'lady', 'of', 'the', 'a', 'an',
|
||||
]);
|
||||
|
||||
/**
|
||||
* Normalise a church name for comparison:
|
||||
* lowercase, strip accents, remove noise words, collapse whitespace.
|
||||
*/
|
||||
export function normalizeName(name: string): string {
|
||||
return name
|
||||
.toLowerCase()
|
||||
.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
|
||||
.replace(/[^a-z0-9\s]/g, ' ')
|
||||
.split(/\s+/)
|
||||
.filter(w => w.length >= 2 && !NOISE_WORDS.has(w))
|
||||
.join(' ')
|
||||
.trim();
|
||||
}
|
||||
|
||||
function wordOverlap(a: string, b: string): number {
|
||||
const setA = new Set(a.split(' ').filter(Boolean));
|
||||
const setB = new Set(b.split(' ').filter(Boolean));
|
||||
if (setA.size === 0 || setB.size === 0) return 0;
|
||||
let intersection = 0;
|
||||
for (const w of setA) if (setB.has(w)) intersection++;
|
||||
const union = setA.size + setB.size - intersection;
|
||||
return intersection / union;
|
||||
}
|
||||
|
||||
/**
|
||||
* Find the best-matching existing church for a parsed entry.
|
||||
* Returns null if no match meets the threshold.
|
||||
*/
|
||||
export function findMatch(
|
||||
locationName: string,
|
||||
address: string | null,
|
||||
existing: ExistingChurch[]
|
||||
): ExistingChurch | null {
|
||||
const normTarget = normalizeName(locationName);
|
||||
let best: ExistingChurch | null = null;
|
||||
let bestScore = 0;
|
||||
|
||||
for (const church of existing) {
|
||||
const normExisting = normalizeName(church.name);
|
||||
const score = wordOverlap(normTarget, normExisting);
|
||||
|
||||
if (score > bestScore) {
|
||||
bestScore = score;
|
||||
best = church;
|
||||
}
|
||||
}
|
||||
|
||||
if (bestScore >= 0.4) return best;
|
||||
|
||||
// Fallback: address prefix match (first 12 chars)
|
||||
if (address && address.length >= 5) {
|
||||
const addrPrefix = address.slice(0, 12).toLowerCase();
|
||||
for (const church of existing) {
|
||||
if (church.address?.toLowerCase().includes(addrPrefix)) return church;
|
||||
}
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
// ─── DB Operations ────────────────────────────────────────────────────────────

async function upsertChurch(
  entry: ParsedEntry,
  matched: ExistingChurch | null,
  dryRun: boolean,
  stats: ImportStats
): Promise<void> {
  const tag = matched ? `[MATCH] ${matched.name} ← ${entry.locationName}` : `[NEW] ${entry.locationName}`;
  const schedCount = entry.schedules.length;

  if (dryRun) {
    console.log(tag);
    if (!matched && entry.address) console.log(` Address: ${entry.address}`);
    console.log(` ${schedCount} schedules`);
    if (matched) stats.matched++; else stats.created++;
    stats.schedulesWritten += schedCount;
    return;
  }

  if (matched) {
    const update: Record<string, string> = {};
    if (!matched.phone && entry.phone) update.phone = entry.phone;
    if (!matched.email && entry.email) update.email = entry.email;

    await prisma.$transaction(async tx => {
      if (Object.keys(update).length > 0) {
        await tx.church.update({ where: { id: matched.id }, data: update });
      }
      await tx.massSchedule.deleteMany({ where: { churchId: matched.id } });
      if (entry.schedules.length > 0) {
        await tx.massSchedule.createMany({
          data: entry.schedules.map(s => ({
            churchId: matched.id,
            dayOfWeek: s.dayOfWeek,
            time: s.time,
            language: s.language,
            notes: s.notes ?? null,
          })),
        });
      }
    });

    stats.matched++;
  } else {
    const newChurch = await prisma.church.create({
      data: {
        name: entry.locationName,
        country: 'HK',
        source: 'diocese-hk',
        address: entry.address ?? undefined,
        phone: entry.phone ?? undefined,
        email: entry.email ?? undefined,
        latitude: 0,
        longitude: 0,
        hasWebsite: false,
      },
    });

    if (entry.schedules.length > 0) {
      await prisma.massSchedule.createMany({
        data: entry.schedules.map(s => ({
          churchId: newChurch.id,
          dayOfWeek: s.dayOfWeek,
          time: s.time,
          language: s.language,
          notes: s.notes ?? null,
        })),
      });
    }

    stats.created++;
  }

  stats.schedulesWritten += schedCount;
  console.log(tag);
}

// ─── Main ─────────────────────────────────────────────────────────────────────

async function main() {
  const args = process.argv.slice(2);
  const dryRun = args.includes('--dry-run');
  const fileArgIdx = args.indexOf('--file');
  const filePath = fileArgIdx >= 0 ? args[fileArgIdx + 1] : path.resolve(process.cwd(), 'scripts/hk-parishes.txt');

  console.log(`\n${'='.repeat(60)}`);
  console.log(`HK Diocese Parish Import`);
  console.log(`File: ${filePath}`);
  console.log(`Dry run: ${dryRun ? 'Yes' : 'No'}`);
  console.log(`${'='.repeat(60)}\n`);

  const raw = fs.readFileSync(filePath, 'utf-8');
  const entryStrings = splitEntries(raw);
  console.log(`Found ${entryStrings.length} entries in file\n`);

  const existing = await prisma.church.findMany({
    where: { country: 'HK' },
    select: { id: true, name: true, address: true, phone: true, email: true },
  });
  console.log(`Loaded ${existing.length} existing HK churches\n`);

  const stats: ImportStats = { matched: 0, created: 0, schedulesWritten: 0, skipped: 0 };

  for (const entryStr of entryStrings) {
    let entry: ParsedEntry;
    try {
      entry = parseEntry(entryStr);
    } catch (err) {
      console.warn(`[SKIP] Failed to parse entry: ${(err as Error).message}`);
      stats.skipped++;
      continue;
    }

    if (!entry.locationName || entry.locationName === 'Unknown') {
      stats.skipped++;
      continue;
    }

    const matched = findMatch(entry.locationName, entry.address, existing);
    await upsertChurch(entry, matched, dryRun, stats);
  }

  console.log(`\n${'='.repeat(60)}`);
  console.log(`Import Summary`);
  console.log(`${'='.repeat(60)}`);
  console.log(`Matched existing: ${stats.matched}`);
  console.log(`New churches: ${stats.created}`);
  console.log(`Skipped: ${stats.skipped}`);
  console.log(`Schedules written: ${stats.schedulesWritten}`);
  console.log(`${'='.repeat(60)}\n`);

  await prisma.$disconnect();
  await pool.end();
}

// Only run when executed directly (not imported by tests)
import { fileURLToPath } from 'url';
if (process.argv[1] === fileURLToPath(import.meta.url)) {
  main().catch(err => {
    console.error('Fatal error:', err);
    process.exit(1);
  });
}
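Review note: for matched churches, `upsertChurch` only fills in `phone` and `email` when the stored row has none, so existing contact data is never overwritten by the import. A minimal sketch of that rule as a pure function follows; `ContactFields` and `buildContactUpdate` are hypothetical names used only for illustration.

```typescript
// Hypothetical helper mirroring the "fill only missing fields" rule in upsertChurch.
interface ContactFields { phone: string | null; email: string | null }

function buildContactUpdate(stored: ContactFields, parsed: ContactFields): Record<string, string> {
  const update: Record<string, string> = {};
  // A field is copied only when the stored value is empty and the parsed value is present.
  if (!stored.phone && parsed.phone) update.phone = parsed.phone;
  if (!stored.email && parsed.email) update.email = parsed.email;
  return update;
}
```

An empty result means the church row itself needs no update; the schedules are still replaced in the same transaction.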
@@ -570,7 +570,6 @@ async function loadExistingSpanishChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -783,7 +782,6 @@ async function processChurch(
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'horariosmisas',
       website: parsed.website,
       phone: parsed.phone,

@@ -343,7 +343,6 @@ async function loadExistingBelgianChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -490,8 +489,6 @@ async function processChurch(
       bohosluzbyId: null,
       miserendId: null,
       kerknetId,
-      gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'kerknet',
       website: church.website,
       phone: null,

@@ -290,7 +290,6 @@ async function loadExistingPhilippineChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -465,7 +464,6 @@ async function processChurch(
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'mass-schedules-ph',
       website: null,
       phone: parsed.phone,

@@ -398,7 +398,6 @@ async function loadExistingChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -596,7 +595,7 @@ async function main() {
         orarimesseId: null, massSchedulesPhId: null,
         philmassId: null, horariosMisasId: null,
         mszeInfoId: null, weekdayMassesId: null,
-        messesInfoId: null, bohosluzbyId: null, miserendId: null, kerknetId: null, gottesdienstzeitenId: null, discovermassId: null,
+        messesInfoId: null, bohosluzbyId: null, miserendId: null, kerknetId: null, gottesdienstzeitenId: null,
         source: 'masstimes', website: mc.url?.trim() || null,
         phone: mc.phone_number?.trim() || null, address, country,
       });

@@ -326,7 +326,6 @@ async function loadExistingFrenchChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -480,7 +479,6 @@ async function processDiocese(
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'messes-info',
       website: null,
       phone: null,

@@ -240,7 +240,6 @@ async function loadExistingChurches(countryCodes: string[]): Promise<ExistingChu
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -366,7 +365,6 @@ async function processChurch(
       miserendId,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'miserend',
       website: null,
       phone: null,

@@ -367,7 +367,6 @@ async function loadExistingPolishChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -538,7 +537,6 @@ async function processChurch(
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'msze-info',
       website: parsed.website,
       phone: parsed.phone,

@@ -283,7 +283,6 @@ async function loadExistingItalianChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -425,7 +424,6 @@ async function processChurchesForDiocese(
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'orarimesse',
       website: church.sito || null,
       phone: null,
@@ -493,10 +491,14 @@ async function processSchedulesForDiocese(
           })),
         });

-        // Mark church as scraped
+        // Update church metadata from detail (pastor, phone) if available
+        const churchUpdateData: Record<string, unknown> = { lastScrapedAt: new Date() };
+        if (detail.parroco) churchUpdateData.pastorName = detail.parroco;
+        if (detail.telefono) churchUpdateData.phone = detail.telefono;
+
         await tx.church.update({
           where: { id: dbId },
-          data: { lastScrapedAt: new Date() },
+          data: churchUpdateData,
         });
       });


@@ -204,7 +204,6 @@ async function importFromOSM(countryCode: string, dryRun: boolean = false): Prom
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -380,7 +379,6 @@ async function importFromOSM(countryCode: string, dryRun: boolean = false): Prom
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'osm',
       website: osmChurch.website || null,
       phone: osmChurch.phone || null,

@@ -152,7 +152,6 @@ async function importFromRegion(countryCode: string, regionName: string, dryRun:
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -256,7 +255,6 @@ async function importFromRegion(countryCode: string, regionName: string, dryRun:
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'osm',
       website: osmChurch.website || null,
       phone: osmChurch.phone || null,

@@ -301,7 +301,6 @@ async function loadExistingPhilippineChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,

@@ -822,7 +822,6 @@ async function loadExistingChurches(): Promise<ExistingChurch[]> {
       miserendId: true,
       kerknetId: true,
       gottesdienstzeitenId: true,
-      discovermassId: true,
       source: true,
       website: true,
       phone: true,
@@ -982,7 +981,6 @@ async function importAreaBlocks(
       miserendId: null,
       kerknetId: null,
       gottesdienstzeitenId: null,
-      discovermassId: null,
       source: 'weekdaymasses',
       website: church.website,
       phone: church.phone,

623
scripts/match-search-results.ts
Normal file
@@ -0,0 +1,623 @@
#!/usr/bin/env tsx
/**
 * Second-pass matching: analyze stored ChromaDB search results to find websites
 * that the FreeSearch first pass missed.
 *
 * Usage:
 *   npx tsx scripts/match-search-results.ts --dry-run
 *   npx tsx scripts/match-search-results.ts --country IT --limit 100
 *   npx tsx scripts/match-search-results.ts --threshold 0.3
 *
 * Algorithm:
 *   1. Get churches without websites that have been FreeSearch'd
 *   2. Query ChromaDB search_results collection for semantically similar results
 *   3. Cross-church matching: URLs from nearby churches may match
 *   4. URL frequency analysis: URLs appearing for multiple churches in same area
 *   5. Verify best candidates against page content
 *   6. Update church.website if verified
 */

import dotenv from 'dotenv';
import path from 'path';
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import { Collection } from 'chromadb';
import axios from 'axios';
import { getCollection, COLLECTION_NAMES } from '../src/chromadb/collections';
import { embedSingle } from '../src/chromadb/embeddings';

// Fresh DB connection
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

// --- Job Tracking ---
async function createOrResumeJob(args: string[]): Promise<string | null> {
  const jobIdIndex = args.indexOf('--job-id');
  if (jobIdIndex !== -1) {
    const jobId = args[jobIdIndex + 1];
    await prisma.backgroundJob.update({
      where: { id: jobId },
      data: { status: 'running', startedAt: new Date() },
    });
    return jobId;
  }
  return null;
}

async function createNewJob(config: Record<string, unknown>): Promise<string> {
  const job = await prisma.backgroundJob.create({
    data: {
      type: 'match-search-results',
      status: 'running',
      startedAt: new Date(),
      config,
    },
  });
  return job.id;
}

async function updateJobProgress(jobId: string, processed: number, found: number, total: number): Promise<void> {
  await prisma.backgroundJob.update({
    where: { id: jobId },
    data: { processed, succeeded: found, totalItems: total },
  });
}

async function checkJobStopping(jobId: string): Promise<boolean> {
  const job = await prisma.backgroundJob.findUnique({ where: { id: jobId } });
  return job?.status === 'stopping';
}

async function completeJob(jobId: string, error?: string): Promise<void> {
  await prisma.backgroundJob.update({
    where: { id: jobId },
    data: {
      status: error ? 'failed' : 'completed',
      error,
      completedAt: new Date(),
    },
  });
}

// --- Types ---

interface ChurchRecord {
  id: string;
  name: string;
  address: string | null;
  city: string | null;
  state: string | null;
  country: string;
  latitude: number;
  longitude: number;
}

interface MatchStats {
  processed: number;
  matched: number;
  noResults: number;
  verifyFailed: number;
  errors: number;
  startTime: number;
}

// --- Helpers ---

let shuttingDown = false;

function log(msg: string) {
  console.log(`[${new Date().toISOString()}] ${msg}`);
}

function logError(msg: string) {
  console.error(`[${new Date().toISOString()}] ${msg}`);
}

function normalizeForMatch(str: string): string {
  return str.toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

const CATHOLIC_KEYWORDS = [
  'parish', 'church', 'catholic', 'parroquia', 'paroisse', 'pfarrei',
  'parafia', 'paroquia', 'parrocchia', 'farnost', 'plebania', 'parochie',
  'župnija', 'farnosť', 'iglesia', 'église', 'kirche', 'kościół',
  'chiesa', 'kostel', 'templom', 'kerk',
];

const MASS_SCHEDULE_KEYWORDS = [
  'mass schedule', 'mass times', 'worship schedule', 'worship times',
  'service times', 'sunday mass', 'weekday mass',
  'horario de misas', 'horarios de misa', 'horaires des messes',
  'gottesdienst', 'gottesdienstzeiten', 'messzeiten',
  'msze święte', 'godziny mszy', 'msze św',
  'orari delle messe', 'orario messe',
  'horário das missas',
];

const TOURISM_KEYWORDS = [
  'tourism', 'turismo', 'tourisme', 'turisme', 'touristik', 'turistico',
  'attractions', 'things to do', 'sightseeing', 'sehenswürdigkeiten',
  'what to see', 'places to visit', 'travel guide', 'reiseführer',
  'patrimoine', 'heritage trail', 'cultural heritage',
  'punto de interés', 'point of interest', 'points of interest',
];

function getSignificantWords(name: string): string[] {
  const stopWords = new Set([
    'the', 'of', 'and', 'in', 'at', 'for', 'our', 'lady',
    'st', 'saint', 'saints', 'san', 'sant', 'santa', 'santo', 'sacred',
    'christ', 'jesus', 'mary', 'maria', 'king', 'lord', 'heart',
    'cross', 'lady', 'queen', 'angel', 'angels', 'good', 'star',
    'nome', 'pere', 'madre', 'notre', 'dame', 'bien',
    'onze', 'lieve', 'vrouw', 'heer',
    'rosa', 'paul', 'anne', 'jean', 'joan', 'luke', 'marc',
    'rita', 'jose', 'leon', 'pius', 'roch', 'yves', 'ines',
    'vita', 'fara', 'bona',
    'cristo', 'fatima', 'lourdes', 'perpetuo', 'socorro', 'calvario',
    'rosario', 'pilar', 'carmen', 'dolores', 'remedios', 'nieves',
    'grotte', 'mission', 'sagrada', 'sagrado', 'familia',
    'guadalupe', 'assumption', 'immaculate', 'perpetual', 'divine',
    'knights', 'columbus',
    'house', 'home', 'hall', 'center', 'centre', 'centro',
    'deacon', 'priest', 'bishop', 'father', 'sister', 'brother',
    'school', 'academy', 'college', 'seminary', 'rectory', 'retreat',
    'church', 'parish', 'catholic', 'roman', 'holy', 'chapel',
    'cathedral', 'basilica', 'shrine', 'convent', 'monastery',
    'chapelle', 'eglise', 'église', 'paroisse', 'couvent', 'grotte',
    'iglesia', 'parroquia', 'capilla', 'ermita', 'convento', 'basílica',
    'kirche', 'kapelle', 'pfarrei', 'kloster',
    'chiesa', 'parrocchia', 'cappella', 'oratorio',
    'igreja', 'capela', 'paroquia',
    'kościół', 'kaplica', 'parafia', 'droga',
    'kostel', 'kaple', 'farnost', 'templom', 'kápolna',
    'de', 'la', 'le', 'les', 'du', 'des', 'el', 'los', 'las',
    'di', 'del', 'della', 'delle', 'degli',
    'do', 'da', 'dos', 'das',
    'und', 'der', 'die', 'das', 'von',
    'nad', 'pod', 'przy',
  ]);

  return normalizeForMatch(name)
    .split(' ')
    .filter(w => w.length >= 3 && !stopWords.has(w));
}

function stripHtml(html: string): string {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/&[a-z]+;/gi, ' ')
    .replace(/\s+/g, ' ')
    .toLowerCase();
}

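Review note: `normalizeForMatch` deletes every character outside `a-z0-9` and whitespace, so accented letters vanish rather than fold to their base letter. Accented stop words such as `'église'` can therefore never equal a normalized token, and the name words it produces may not occur in page text that `stripHtml` only lowercases. The sketch below contrasts the current steps with a hypothetical accent-folding variant; `normalizeFoldingAccents` is an assumption for illustration, not part of the script.

```typescript
// Same steps as normalizeForMatch in the script.
function normalizeAsciiOnly(str: string): string {
  return str.toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical variant: decompose accented letters (NFD) and drop the combining
// marks first, so "é" becomes "e" instead of being removed.
function normalizeFoldingAccents(str: string): string {
  return str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

const sampleName = '\u00c9glise Saint-R\u00e9mi'; // "Église Saint-Rémi"
console.log(normalizeAsciiOnly(sampleName));      // glise saintrmi
console.log(normalizeFoldingAccents(sampleName)); // eglise saintremi
```

Letters without a canonical decomposition (for example the Polish `ł`) are still dropped by the folding variant, so it narrows the gap rather than closing it. Any change here would also need the page text to be normalized the same way.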
// --- URL Verification (same logic as enrich-with-freesearch.ts) ---

async function verifyUrl(url: string, church: ChurchRecord): Promise<boolean> {
  try {
    const response = await axios.get(url, {
      timeout: 10000,
      maxRedirects: 3,
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; NearestMass/1.0; +https://nearestmass.com)',
        'Accept': 'text/html',
      },
      maxContentLength: 200000,
      responseType: 'text',
    });

    if (typeof response.data !== 'string') return false;

    const text = stripHtml(response.data);
    const nameWords = getSignificantWords(church.name);

    let nameMatches = 0;
    for (const word of nameWords) {
      if (text.includes(word)) nameMatches++;
    }

    let cityMatch = false;
    if (church.city) {
      const cityNorm = normalizeForMatch(church.city);
      if (cityNorm.length > 2 && text.includes(cityNorm)) cityMatch = true;
    }

    let addressMatch = false;
    if (church.address) {
      const addrNorm = normalizeForMatch(church.address);
      const addrWords = addrNorm.split(' ').filter(w => w.length >= 4 && !/^\d+$/.test(w));
      let addrWordMatches = 0;
      for (const w of addrWords) {
        if (text.includes(w)) addrWordMatches++;
      }
      if (addrWordMatches >= 2) addressMatch = true;
    }

    let hasCatholicKeyword = false;
    for (const kw of CATHOLIC_KEYWORDS) {
      if (text.includes(kw)) { hasCatholicKeyword = true; break; }
    }

    let hasMassSchedule = false;
    for (const kw of MASS_SCHEDULE_KEYWORDS) {
      if (text.includes(kw)) { hasMassSchedule = true; break; }
    }

    let isTourismPage = false;
    for (const kw of TOURISM_KEYWORDS) {
      if (text.includes(kw)) { isTourismPage = true; break; }
    }

    let domainMatchesName = false;
    try {
      const hostname = new URL(url).hostname.toLowerCase();
      for (const word of nameWords) {
        if (word.length >= 4 && hostname.includes(word)) {
          domainMatchesName = true;
          break;
        }
      }
    } catch { /* ignore */ }

    if (isTourismPage && !hasMassSchedule) return false;

    let isDeepUrl = false;
    try {
      const pathSegments = new URL(url).pathname.split('/').filter(Boolean);
      isDeepUrl = pathSegments.length > 2;
    } catch { /* ignore */ }
    if (isDeepUrl && !domainMatchesName && !hasMassSchedule) return false;

    const hasCity = !!(church.city && church.city.trim());

    if (hasMassSchedule && nameMatches >= 1) return true;
    if (domainMatchesName && nameMatches >= 1 && hasCatholicKeyword) return true;

    if (hasCity) {
      if (nameMatches >= 2) return true;
      if (nameMatches >= 1 && cityMatch) return true;
      if (nameMatches >= 1 && addressMatch) return true;
    }

    if (!hasCity) {
      if (nameMatches >= 1 && addressMatch) return true;
      if (nameMatches >= 3) return true;
    }

    return false;
  } catch {
    return false;
  }
}

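Review note: once the page signals are collected, the accept/reject rules at the end of `verifyUrl` form a small decision table. The sketch below restates them as a pure function so the rule order is easier to check and to unit-test without HTTP; `PageSignals` and `acceptPage` are hypothetical names, and the signal extraction itself is assumed to happen elsewhere.

```typescript
// Hypothetical container for the signals verifyUrl derives from the fetched page.
interface PageSignals {
  nameMatches: number;
  cityMatch: boolean;
  addressMatch: boolean;
  hasCatholicKeyword: boolean;
  hasMassSchedule: boolean;
  isTourismPage: boolean;
  domainMatchesName: boolean;
  isDeepUrl: boolean;
  hasCity: boolean;
}

function acceptPage(s: PageSignals): boolean {
  // Hard rejections come first.
  if (s.isTourismPage && !s.hasMassSchedule) return false;
  if (s.isDeepUrl && !s.domainMatchesName && !s.hasMassSchedule) return false;
  // Strong positive signals.
  if (s.hasMassSchedule && s.nameMatches >= 1) return true;
  if (s.domainMatchesName && s.nameMatches >= 1 && s.hasCatholicKeyword) return true;
  // The name-plus-address rule applies whether or not the church has a city.
  if (s.nameMatches >= 1 && s.addressMatch) return true;
  // With a city, two name words (or one plus the city) suffice; without, three are needed.
  if (s.hasCity) return s.nameMatches >= 2 || (s.nameMatches >= 1 && s.cityMatch);
  return s.nameMatches >= 3;
}
```

Splitting signal collection from the decision keeps the thresholds in one place, which may help if this logic is meant to stay in sync with `enrich-with-freesearch.ts`.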
// --- ChromaDB Querying ---

interface ChromaResult {
  id: string;
  url: string;
  title: string;
  score: number;
  distance: number;
  churchId: string;
  churchName: string;
  churchCity: string;
  verified?: boolean;
}

async function findCandidatesForChurch(
  church: ChurchRecord,
  collection: Collection,
  threshold: number,
  nResults: number
): Promise<ChromaResult[]> {
  // Build identity text for semantic search
  const identityText = `${church.name} ${church.address || ''} ${church.city || ''} ${church.country}`.trim();
  const queryEmbedding = await embedSingle(identityText);

  const results = await collection.query({
    queryEmbeddings: [queryEmbedding],
    nResults,
    where: { churchCountry: church.country },
  });

  if (!results.ids[0]) return [];

  return results.ids[0]
    .map((id, i) => {
      const metadata = results.metadatas[0][i] as Record<string, unknown>;
      return {
        id,
        url: (metadata.resultUrl as string) || '',
        title: (metadata.resultTitle as string) || '',
        score: (metadata.score as number) || 0,
        distance: results.distances?.[0]?.[i] ?? 1,
        churchId: (metadata.churchId as string) || '',
        churchName: (metadata.churchName as string) || '',
        churchCity: (metadata.churchCity as string) || '',
        verified: (metadata.verified as boolean) || false,
      };
    })
    .filter(r => r.distance <= threshold && r.url);
}

function deduplicateByUrl(results: ChromaResult[]): ChromaResult[] {
  const seen = new Map<string, ChromaResult>();
  for (const r of results) {
    const existing = seen.get(r.url);
    if (!existing || r.distance < existing.distance) {
      seen.set(r.url, r);
    }
  }
  return [...seen.values()].sort((a, b) => a.distance - b.distance);
}

// --- Main Processing ---

async function processChurch(
  church: ChurchRecord,
  collection: Collection,
  stats: MatchStats,
  threshold: number,
  dryRun: boolean
): Promise<void> {
  const label = `${church.name} (${church.city || 'unknown'}, ${church.country})`;

  try {
    // 1. Semantic search for similar results in ChromaDB
    const candidates = await findCandidatesForChurch(church, collection, threshold, 20);

    if (candidates.length === 0) {
      log(` - ${label} => no ChromaDB results within threshold`);
      stats.noResults++;
      return;
    }

    // 2. Separate results: own church vs cross-church
    const ownResults = candidates.filter(r => r.churchId === church.id);
    const crossResults = candidates.filter(r => r.churchId !== church.id);

    // 3. URL frequency: URLs appearing for multiple churches are likely real parish/diocese sites
    const urlFrequency = new Map<string, number>();
    for (const r of candidates) {
      urlFrequency.set(r.url, (urlFrequency.get(r.url) || 0) + 1);
    }

    // 4. Prioritize: already-verified URLs from other churches, then high-frequency URLs,
    // then own-church results, then cross-church results
    const verifiedFromOthers = crossResults.filter(r => r.verified);
    const highFreqUrls = [...urlFrequency.entries()]
      .filter(([, count]) => count >= 2)
      .map(([url]) => url);

    // Build candidate list in priority order
    const urlsToTry: string[] = [];
    const addUrl = (url: string) => {
      if (!urlsToTry.includes(url)) urlsToTry.push(url);
    };

    // Verified URLs from nearby churches (highest priority)
    for (const r of verifiedFromOthers) addUrl(r.url);

    // High-frequency URLs (appear in results for multiple churches)
    for (const url of highFreqUrls) addUrl(url);

    // Own church results by distance (closest semantic match first)
    const dedupedOwn = deduplicateByUrl(ownResults);
    for (const r of dedupedOwn) addUrl(r.url);

    // Cross-church results from same city
    const sameCityCross = crossResults.filter(r =>
      church.city && r.churchCity &&
      normalizeForMatch(r.churchCity) === normalizeForMatch(church.city)
    );
    const dedupedCross = deduplicateByUrl(sameCityCross);
    for (const r of dedupedCross) addUrl(r.url);

    // Limit to top 5 candidates
    const topUrls = urlsToTry.slice(0, 5);

    log(` ? ${label} => ${candidates.length} results, trying ${topUrls.length} candidates`);

    // 5. Verify each candidate
    let verifiedUrl: string | null = null;
    for (const url of topUrls) {
      const ok = await verifyUrl(url, church);
      if (ok) {
        verifiedUrl = url;
        break;
      } else {
        stats.verifyFailed++;
      }
    }

    if (verifiedUrl) {
      log(` + ${label} => ${verifiedUrl}`);
      stats.matched++;
      if (!dryRun) {
        await prisma.church.update({
          where: { id: church.id },
          data: {
            website: verifiedUrl,
            hasWebsite: true,
          },
        });
        // Mark in ChromaDB (update replaces metadata, so include all fields)
        try {
          const matchingResult = candidates.find(r => r.url === verifiedUrl);
          if (matchingResult) {
            await collection.update({
              ids: [matchingResult.id],
              metadatas: [{
                churchId: matchingResult.churchId,
                churchName: matchingResult.churchName,
                churchCity: matchingResult.churchCity,
                churchCountry: church.country,
                searchQuery: '',
                resultUrl: verifiedUrl,
                resultTitle: matchingResult.title || '',
                score: matchingResult.score || 0,
                verified: true,
              }],
            });
          }
        } catch { /* ignore */ }
      }
    } else {
      log(` ~ ${label} => ${topUrls.length} candidates failed verification`);
      stats.noResults++;
    }
  } catch (error: any) {
    stats.errors++;
    logError(` ! ${label} => error: ${error.message}`);
  }
}

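Review note: the candidate ordering in `processChurch` (verified URLs from other churches, then URLs seen at least twice, then the church's own results by distance, then same-city cross-church results) can be checked without ChromaDB using a sketch like the one below. `Candidate` and `orderCandidateUrls` are hypothetical names, and the city comparison is simplified to a trimmed, lowercased equality instead of `normalizeForMatch`.

```typescript
// Hypothetical, trimmed-down view of a ChromaResult.
interface Candidate {
  url: string;
  distance: number;
  churchId: string;
  churchCity: string;
  verified: boolean;
}

function orderCandidateUrls(candidates: Candidate[], churchId: string, city: string | null, max = 5): string[] {
  const urls: string[] = [];
  const add = (url: string) => { if (!urls.includes(url)) urls.push(url); };
  const byDistance = (list: Candidate[]) => [...list].sort((a, b) => a.distance - b.distance);
  const sameCity = (a: string, b: string) => a.trim().toLowerCase() === b.trim().toLowerCase();

  const own = candidates.filter(c => c.churchId === churchId);
  const cross = candidates.filter(c => c.churchId !== churchId);

  // 1. URLs already verified for another church.
  cross.filter(c => c.verified).forEach(c => add(c.url));

  // 2. URLs that appear at least twice among the candidates.
  const freq = new Map<string, number>();
  candidates.forEach(c => freq.set(c.url, (freq.get(c.url) || 0) + 1));
  freq.forEach((count, url) => { if (count >= 2) add(url); });

  // 3. This church's own results, closest first.
  byDistance(own).forEach(c => add(c.url));

  // 4. Results stored for other churches in the same city, closest first.
  if (city) {
    byDistance(cross.filter(c => c.churchCity && sameCity(c.churchCity, city))).forEach(c => add(c.url));
  }

  return urls.slice(0, max);
}
```

Because `add` skips URLs that are already queued, sorting by distance before adding gives the same order as `deduplicateByUrl` followed by `addUrl`.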
// --- Main ---
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const countryIndex = args.indexOf('--country');
|
||||
const limitIndex = args.indexOf('--limit');
|
||||
const thresholdIndex = args.indexOf('--threshold');
|
||||
const dryRun = args.includes('--dry-run');
|
||||
|
||||
const countryCode = countryIndex !== -1 ? args[countryIndex + 1] : undefined;
|
||||
const limit = limitIndex !== -1 ? parseInt(args[limitIndex + 1]) : 500;
|
||||
const threshold = thresholdIndex !== -1 ? parseFloat(args[thresholdIndex + 1]) : 0.4;
|
||||
|
||||
// Graceful shutdown
|
||||
process.on('SIGTERM', () => { log('Received SIGTERM'); shuttingDown = true; });
|
||||
process.on('SIGINT', () => { log('Received SIGINT'); shuttingDown = true; });
|
||||
|
||||
log('============================================================');
|
||||
log('Second-Pass Search Result Matching');
|
||||
log('============================================================');
|
||||
log(`Country: ${countryCode || 'All'}`);
|
||||
log(`Limit: ${limit}`);
|
||||
log(`Threshold: ${threshold}`);
|
||||
log(`Dry run: ${dryRun ? 'Yes' : 'No'}`);
|
||||
log('============================================================');
|
||||
|
||||
// Connect to ChromaDB
|
||||
let collection: Collection;
|
||||
try {
|
||||
collection = await getCollection(COLLECTION_NAMES.SEARCH_RESULTS);
|
||||
log('ChromaDB search_results collection connected');
|
||||
} catch (e: any) {
|
||||
    logError(`ChromaDB unavailable: ${e.message}`);
    logError('This script requires ChromaDB. Make sure it is running.');
    process.exit(1);
  }

  // Check collection has data
  const count = await collection.count();
  log(`ChromaDB search_results: ${count} entries`);
  if (count === 0) {
    log('No search results stored yet. Run enrich-with-freesearch.ts first.');
    process.exit(0);
  }

  // Job tracking
  let jobId = await createOrResumeJob(args);
  if (!jobId) {
    jobId = await createNewJob({ countryCode, limit, threshold, dryRun });
  }
  log(`Job ID: ${jobId}`);

  // Get churches without websites that have been FreeSearch'd
  const whereClause: Record<string, unknown> = {
    source: 'osm',
    website: null,
    freeSearchedAt: { not: null },
  };
  if (countryCode) {
    whereClause.country = countryCode;
  }

  const churches = await prisma.church.findMany({
    where: whereClause as any,
    select: {
      id: true, name: true, address: true, city: true, state: true,
      country: true, latitude: true, longitude: true,
    },
    take: limit,
    orderBy: { updatedAt: 'asc' },
  });

  log(`Found ${churches.length} churches without websites (already FreeSearch'd)`);

  const stats: MatchStats = {
    processed: 0,
    matched: 0,
    noResults: 0,
    verifyFailed: 0,
    errors: 0,
    startTime: Date.now(),
  };

  for (const church of churches) {
    if (shuttingDown) break;
    stats.processed++;

    await processChurch(church, collection, stats, threshold, dryRun);

    // Job tracking every 10 items
    if (jobId && stats.processed % 10 === 0) {
      await updateJobProgress(jobId, stats.processed, stats.matched, churches.length);
      const stopping = await checkJobStopping(jobId);
      if (stopping) {
        log('Job stop requested via admin dashboard.');
        shuttingDown = true;
        break;
      }
    }

    // Progress logging every 50 items
    if (stats.processed % 50 === 0) {
      const elapsed = (Date.now() - stats.startTime) / 1000;
      const rate = Math.round((stats.processed / elapsed) * 3600);
      log(`Progress: ${stats.processed}/${churches.length} processed, ${stats.matched} matched, ${stats.noResults} no match, ${stats.errors} errors (~${rate}/hour)`);
    }
  }

  // Complete job
  if (jobId) {
    await updateJobProgress(jobId, stats.processed, stats.matched, churches.length);
    await completeJob(jobId);
  }

  // Print summary
  const elapsed = ((Date.now() - stats.startTime) / 1000).toFixed(1);
  const matchRate = stats.processed > 0
    ? ((stats.matched / stats.processed) * 100).toFixed(1)
    : '0.0';

  log('');
  log('============================================================');
  log('Second-Pass Matching Summary');
  log('============================================================');
  log(`Churches processed: ${stats.processed}`);
  log(`Websites matched: ${stats.matched}`);
  log(`No match found: ${stats.noResults}`);
  log(`Verify rejected: ${stats.verifyFailed}`);
  log(`Errors: ${stats.errors}`);
  log(`Match rate: ${matchRate}%`);
  log(`Elapsed: ${elapsed}s`);
  log('============================================================');

  await prisma.$disconnect();
  await pool.end();
}

main().catch((error) => {
  logError(`Fatal error: ${error.message}`);
  process.exit(1);
});
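The progress line and the summary in the script above come down to two small calculations: throughput per hour and a percentage with one decimal place. A minimal sketch follows, assuming hypothetical helper names `perHour` and `percent` (the script inlines both calculations and defines neither helper). Unlike the inlined throughput calculation, the sketch also guards the zero-elapsed case.

```typescript
// Hypothetical helpers for illustration; not part of the script above.

/** Items per hour, given a processed count and elapsed milliseconds. */
function perHour(processed: number, elapsedMs: number): number {
  if (elapsedMs <= 0) return 0; // avoid dividing by zero on the first tick
  return Math.round((processed / (elapsedMs / 1000)) * 3600);
}

/** Percentage with one decimal place; '0.0' when nothing was processed. */
function percent(part: number, total: number): string {
  return total > 0 ? ((part / total) * 100).toFixed(1) : '0.0';
}

console.log(perHour(50, 60000)); // 50 items in one minute -> 3000
console.log(percent(12, 50));    // "24.0"
console.log(percent(0, 0));      // "0.0"
```

Keeping the guards in one place would let the progress log and the final summary share the same zero-safe arithmetic.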
110
scripts/normalize-country-codes.ts
Normal file
@@ -0,0 +1,110 @@
/**
 * Normalize country codes in the database.
 * Converts full country names to ISO 3166-1 alpha-2 codes.
 *
 * Usage:
 *   npx tsx scripts/normalize-country-codes.ts --dry-run
 *   npx tsx scripts/normalize-country-codes.ts --execute
 */

import path from 'path';
import dotenv from 'dotenv';
dotenv.config({ path: path.resolve(process.cwd(), '.env') });

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import { normalizeCountryCode } from '../src/lib/country-normalize';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

async function main() {
  // Dry run is the default; changes are written only when --execute is passed.
  const dryRun = !process.argv.includes('--execute');

  if (dryRun) {
    console.log('DRY RUN: no changes will be made. Use --execute to apply.\n');
  }

  // Get all distinct country values
  const countries = await prisma.church.findMany({
    select: { country: true },
    distinct: ['country'],
    where: { country: { not: null } },
  });

  const countryValues = countries
    .map(c => c.country)
    .filter((c): c is string => c !== null);

  console.log(`Found ${countryValues.length} distinct country values.\n`);

  // Group by normalization result
  const changes: { original: string; normalized: string; count?: number }[] = [];
  const alreadyNormalized: string[] = [];
  const unknown: string[] = [];

  for (const country of countryValues) {
    const normalized = normalizeCountryCode(country);

    if (normalized === country) {
      // Unchanged: either already a two-letter uppercase code, or unmappable
      if (country.length === 2 && country === country.toUpperCase()) {
        alreadyNormalized.push(country);
      } else {
        unknown.push(country);
      }
    } else {
      changes.push({ original: country, normalized });
    }
  }

  // Get counts for changes
  for (const change of changes) {
    const count = await prisma.church.count({
      where: { country: change.original },
    });
    change.count = count;
  }

  // Report
  console.log(`Already normalized (${alreadyNormalized.length}): ${alreadyNormalized.sort().join(', ')}\n`);

  if (changes.length > 0) {
    console.log(`Changes to apply (${changes.length}):`);
    for (const { original, normalized, count } of changes) {
      console.log(` "${original}" → "${normalized}" (${count} churches)`);
    }
    console.log();
  } else {
    console.log('No changes needed: all country values are already normalized.\n');
  }

  if (unknown.length > 0) {
    console.log(`Unknown values (${unknown.length}): ${unknown.join(', ')}`);
    console.log(' These could not be mapped to ISO codes. Review manually.\n');
  }

  // Apply changes
  if (!dryRun && changes.length > 0) {
    let totalUpdated = 0;
    for (const { original, normalized } of changes) {
      const result = await prisma.church.updateMany({
        where: { country: original },
        data: { country: normalized },
      });
      totalUpdated += result.count;
      console.log(`Updated "${original}" → "${normalized}": ${result.count} churches`);
    }
    console.log(`\nTotal updated: ${totalUpdated} churches`);
  }

  await prisma.$disconnect();
  await pool.end();
}

main().catch(err => {
  console.error('Error:', err);
  process.exit(1);
});
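The script above depends on `normalizeCountryCode` from `src/lib/country-normalize`, which this diff does not show. The sketch below is only an assumption about how such a function could behave: the lookup table, the function name `normalizeCountryCodeSketch`, and the `classify` helper are all hypothetical. It illustrates the three-way split the script performs on each distinct value (already normalized, change, unknown).

```typescript
// Illustrative only: the real normalizeCountryCode is not shown in this diff.
// This tiny lookup table is a hypothetical stand-in.
const NAME_TO_ISO: Record<string, string> = {
  'france': 'FR',
  'germany': 'DE',
  'united states': 'US',
};

function normalizeCountryCodeSketch(value: string): string {
  const trimmed = value.trim();
  // Two-letter inputs are treated as codes and upper-cased.
  if (/^[A-Za-z]{2}$/.test(trimmed)) return trimmed.toUpperCase();
  // Known names map to their code; anything else is returned unchanged.
  return NAME_TO_ISO[trimmed.toLowerCase()] ?? value;
}

type Bucket = 'normalized' | 'change' | 'unknown';

// Same three-way classification the script applies to each distinct value.
function classify(value: string): Bucket {
  const normalized = normalizeCountryCodeSketch(value);
  if (normalized !== value) return 'change';
  return value.length === 2 && value === value.toUpperCase()
    ? 'normalized'
    : 'unknown';
}

for (const v of ['FR', 'fr', 'Germany', 'Atlantis']) {
  console.log(`${v} -> ${normalizeCountryCodeSketch(v)} (${classify(v)})`);
}
```

Returning the input unchanged for unmappable values is what lets the script tell "already a code" apart from "unknown" with a simple length and case check.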
197
scripts/populate-chromadb.ts
Normal file
@@ -0,0 +1,197 @@
/**
 * Bulk-populate ChromaDB collections from the database.
 *
 * Usage:
 *   npx tsx scripts/populate-chromadb.ts --collection church_identity
 *   npx tsx scripts/populate-chromadb.ts --collection page_classification
 *   npx tsx scripts/populate-chromadb.ts --all
 *   npx tsx scripts/populate-chromadb.ts --all --batch-size 50 --limit 1000
 */

import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';
import { getCollection, COLLECTION_NAMES, CollectionName } from '../src/chromadb/collections';
import { embed } from '../src/chromadb/embeddings';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const adapter = new PrismaPg(pool);
const prisma = new PrismaClient({ adapter });

const args = process.argv.slice(2);
const collectionArg = args.includes('--collection')
  ? args[args.indexOf('--collection') + 1]
  : null;
const populateAll = args.includes('--all');
const batchSize = args.includes('--batch-size')
  ? parseInt(args[args.indexOf('--batch-size') + 1], 10)
  : 100;
// A limit of 0 means "no limit"
const limit = args.includes('--limit')
  ? parseInt(args[args.indexOf('--limit') + 1], 10)
  : 0;

async function populateChurchIdentity() {
  console.log('\n=== Populating church_identity ===');
  const collection = await getCollection(COLLECTION_NAMES.CHURCH_IDENTITY);

  const totalCount = await prisma.church.count();
  const maxItems = limit > 0 ? Math.min(limit, totalCount) : totalCount;
  console.log(`Total churches: ${totalCount}, processing: ${maxItems}`);

  let processed = 0;
  let cursor: string | undefined = undefined;

  // Cursor-based pagination ordered by id
  while (processed < maxItems) {
    const currentBatch = Math.min(batchSize, maxItems - processed);
    const churches = await prisma.church.findMany({
      take: currentBatch,
      ...(cursor ? { skip: 1, cursor: { id: cursor } } : {}),
      orderBy: { id: 'asc' },
      select: {
        id: true,
        name: true,
        address: true,
        city: true,
        country: true,
        source: true,
        latitude: true,
        longitude: true,
      },
    });

    if (churches.length === 0) break;

    // Join only the fields that are present, so a missing value does not
    // end up in the document as the literal string "null".
    const documents = churches.map(
      (c) => [c.name, c.address, c.city, c.country].filter(Boolean).join(' ')
    );

    const embeddings = await embed(documents);

    await collection.upsert({
      ids: churches.map((c) => `church-${c.id}`),
      embeddings,
      documents,
      metadatas: churches.map((c) => ({
        churchId: c.id,
        country: c.country,
        source: c.source,
        lat: c.latitude,
        lng: c.longitude,
      })),
    });

    processed += churches.length;
    cursor = churches[churches.length - 1].id;
    console.log(` Processed ${processed}/${maxItems}`);
  }

  console.log(` Done: ${processed} churches indexed`);
}

async function populatePageClassification() {
  console.log('\n=== Populating page_classification ===');
  const collection = await getCollection(COLLECTION_NAMES.PAGE_CLASSIFICATION);

  // Index churches that have been successfully scraped (have mass schedules)
  const totalCount = await prisma.church.count({
    where: {
      lastScrapedAt: { not: null },
      massSchedules: { some: { isActive: true } },
    },
  });
  const maxItems = limit > 0 ? Math.min(limit, totalCount) : totalCount;
  console.log(`Scraped churches with schedules: ${totalCount}, processing: ${maxItems}`);

  let processed = 0;
  let indexed = 0; // churches that actually had raw HTML to index
  let cursor: string | undefined = undefined;

  while (processed < maxItems) {
    const currentBatch = Math.min(batchSize, maxItems - processed);
    const churches = await prisma.church.findMany({
      take: currentBatch,
      ...(cursor ? { skip: 1, cursor: { id: cursor } } : {}),
      where: {
        lastScrapedAt: { not: null },
        massSchedules: { some: { isActive: true } },
      },
      orderBy: { id: 'asc' },
      select: {
        id: true,
        massScheduleUrl: true,
        website: true,
        websiteLanguage: true,
        scraperConfig: { select: { rawHtml: true } },
      },
    });

    if (churches.length === 0) break;

    // Use stored raw HTML (truncated to 2000 characters) as the document
    const validChurches = churches.filter((c) => c.scraperConfig?.rawHtml);
    if (validChurches.length > 0) {
      const documents = validChurches.map(
        (c) => (c.scraperConfig?.rawHtml || '').slice(0, 2000)
      );

      const embeddings = await embed(documents);

      await collection.upsert({
        ids: validChurches.map((c) => `page-${c.id}`),
        embeddings,
        documents,
        metadatas: validChurches.map((c) => ({
          url: c.massScheduleUrl || c.website || '',
          isMassSchedulePage: true,
          language: c.websiteLanguage || 'unknown',
        })),
      });
      indexed += validChurches.length;
    }

    processed += churches.length;
    cursor = churches[churches.length - 1].id;
    console.log(` Processed ${processed}/${maxItems} (${validChurches.length} had raw HTML)`);
  }

  console.log(` Done: ${indexed} pages indexed (${processed} churches scanned)`);
}

async function main() {
  try {
    if (!populateAll && !collectionArg) {
      console.log('Usage:');
      console.log(' npx tsx scripts/populate-chromadb.ts --collection church_identity');
      console.log(' npx tsx scripts/populate-chromadb.ts --collection page_classification');
      console.log(' npx tsx scripts/populate-chromadb.ts --all');
      console.log(' npx tsx scripts/populate-chromadb.ts --all --batch-size 50 --limit 1000');
      return; // fall through to finally so connections are closed
    }

    const collectionsToPopulate: CollectionName[] = populateAll
      ? [COLLECTION_NAMES.CHURCH_IDENTITY, COLLECTION_NAMES.PAGE_CLASSIFICATION]
      : [collectionArg as CollectionName];

    for (const name of collectionsToPopulate) {
      switch (name) {
        case COLLECTION_NAMES.CHURCH_IDENTITY:
          await populateChurchIdentity();
          break;
        case COLLECTION_NAMES.PAGE_CLASSIFICATION:
          await populatePageClassification();
          break;
        default:
          console.log(`Collection '${name}' does not have a populate function yet.`);
          console.log('Available: church_identity, page_classification');
      }
    }

    console.log('\nPopulation complete!');
  } catch (error) {
    console.error('Error:', error);
    process.exitCode = 1; // set the exit code but let finally run first
  } finally {
    await prisma.$disconnect();
    await pool.end();
  }
}

main();
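Both populate functions above use the same cursor-based batching loop: take a batch ordered by `id`, remember the last `id`, and skip past it on the next fetch. The sketch below isolates that loop so it can run without a database. `fetchPage` and `processInBatches` are hypothetical names, and the in-memory array is a stand-in for `prisma.church.findMany`.

```typescript
interface Row { id: string }

// Hypothetical stand-in for findMany({ take, skip: 1, cursor, orderBy: { id: 'asc' } }):
// returns up to `take` rows that come strictly after the cursor id.
async function fetchPage<T extends Row>(rows: T[], take: number, cursor?: string): Promise<T[]> {
  const sorted = [...rows].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const start = cursor === undefined ? 0 : sorted.findIndex((r) => r.id === cursor) + 1;
  return sorted.slice(start, start + take);
}

// Same loop shape as the script: a limit of 0 means "no limit".
async function processInBatches<T extends Row>(
  rows: T[],
  batchSize: number,
  limit: number,
  handle: (batch: T[]) => Promise<void>,
): Promise<number> {
  const maxItems = limit > 0 ? Math.min(limit, rows.length) : rows.length;
  let processed = 0;
  let cursor: string | undefined;
  while (processed < maxItems) {
    const take = Math.min(batchSize, maxItems - processed);
    const batch = await fetchPage(rows, take, cursor);
    if (batch.length === 0) break; // nothing left to read
    await handle(batch);
    processed += batch.length;
    cursor = batch[batch.length - 1].id;
  }
  return processed;
}

const demoRows = ['a', 'b', 'c', 'd', 'e', 'f', 'g'].map((id) => ({ id }));
processInBatches(demoRows, 3, 5, async (batch) => {
  console.log(batch.map((r) => r.id).join(','));
}).then((n) => console.log(`processed ${n}`));
```

Shrinking the final batch to `maxItems - processed` is what keeps `--limit` exact even when it is not a multiple of `--batch-size`.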
54
scripts/populate-city-normalized.ts
Normal file
@@ -0,0 +1,54 @@
import { config } from 'dotenv';
import { Pool } from 'pg';
import { PrismaPg } from '@prisma/adapter-pg';
import { PrismaClient } from '@prisma/client';

// Load environment variables (.env.local is loaded first, so it takes precedence)
config({ path: '.env.local' });
config({ path: '.env' });

// Create connection pool
const connectionString = process.env.DATABASE_URL || '';
const pool = new Pool({ connectionString });

// Create Prisma adapter
const adapter = new PrismaPg(pool);

// Create Prisma client with adapter
const prisma = new PrismaClient({
  adapter,
  log: ['error'],
});

async function main() {
  console.log('Populating cityNormalized field using SQL...');

  // Use raw SQL for a much faster batch update.
  // Normalize: drop everything except ASCII letters, digits, and spaces
  // (accented letters are removed too), then trim and lowercase.
  const result = await prisma.$executeRaw`
    UPDATE churches
    SET city_normalized = LOWER(
      TRIM(
        REGEXP_REPLACE(
          COALESCE(city, ''),
          '[^a-zA-Z0-9 ]',
          '',
          'g'
        )
      )
    )
    WHERE city IS NOT NULL
  `;

  console.log(`✅ Updated ${result} churches with normalized cities`);
}

main()
  .then(async () => {
    await prisma.$disconnect();
    await pool.end();
  })
  .catch(async (e) => {
    console.error(e);
    await prisma.$disconnect();
    await pool.end();
    process.exit(1);
  });
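To see what the SQL normalization above does to real city names, here is an approximate TypeScript counterpart. `normalizeCity` is a hypothetical helper, not part of the repository, and it assumes the PostgreSQL character class matches by code point the same way the JavaScript regex does. Note one difference: the SQL only touches rows where `city` is not null, whereas the sketch maps `null` to an empty string.

```typescript
// Approximate counterpart of:
// LOWER(TRIM(REGEXP_REPLACE(COALESCE(city, ''), '[^a-zA-Z0-9 ]', '', 'g')))
function normalizeCity(city: string | null): string {
  return (city ?? '')
    .replace(/[^a-zA-Z0-9 ]/g, '') // keep only ASCII letters, digits, spaces
    .trim()
    .toLowerCase();
}

console.log(normalizeCity('  St. Louis '));        // "st louis"
// \u00C9 is "É": the accented letter is dropped, not transliterated.
console.log(normalizeCity('Saint-\u00C9tienne'));  // "sainttienne"
// \u00E3 is "ã": same effect.
console.log(normalizeCity('S\u00E3o Paulo'));      // "so paulo"
```

Because accented letters are dropped rather than mapped to their base letters, two spellings of the same city ("São Paulo" and "Sao Paulo") normalize to different strings. That is worth keeping in mind if `city_normalized` is used for matching.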
161
scripts/save-schedules-to-db.ts
Normal file
@@ -0,0 +1,161 @@
#!/usr/bin/env tsx
/**
 * Save mass schedules to database using scrapeChurch() service
 */

import { config } from 'dotenv';
config({ path: '.env.local' });
config({ path: '.env' });

import { scrapeChurch } from '../src/lib/scraper-service';
import { prisma } from '../src/lib/db';

const PRIORITY_COUNTRIES = ['FR', 'DE', 'ES', 'PL', 'BR'];
const CHURCHES_PER_COUNTRY = 5; // Start small to verify it works

interface ScrapeResult {
  churchId: string;
  churchName: string;
  country: string;
  success: boolean;
  schedulesCreated: number;
  error?: string;
}

async function saveSchedulesToDb() {
  console.log('Starting database save operation...\n');
  console.log(`Target: ${CHURCHES_PER_COUNTRY} churches per country`);
  console.log(`Countries: ${PRIORITY_COUNTRIES.join(', ')}\n`);

  const results: ScrapeResult[] = [];
  let totalChurches = 0;
  let totalSuccess = 0;
  let totalSchedules = 0;

  for (const country of PRIORITY_COUNTRIES) {
    console.log(`\n${'='.repeat(60)}`);
    console.log(`${country} - Finding churches to scrape...`);
    console.log('='.repeat(60));

    // Get churches with websites that haven't been scraped yet
    const churches = await prisma.church.findMany({
      where: {
        country,
        website: { not: null },
        source: 'osm',
        lastScrapedAt: null, // Only unscraped churches
      },
      take: CHURCHES_PER_COUNTRY,
      orderBy: { createdAt: 'asc' },
    });

    console.log(`Found ${churches.length} churches to scrape\n`);

    for (let i = 0; i < churches.length; i++) {
      const church = churches[i];
      totalChurches++;

      process.stdout.write(`[${i + 1}/${churches.length}] ${church.name.substring(0, 40).padEnd(40)} `);

      try {
        // Use the scrapeChurch service which saves to database
        const result = await scrapeChurch(church.id);

        if (result.success) {
          totalSuccess++;
          totalSchedules += result.schedulesCreated;
          process.stdout.write(`✅ ${result.schedulesCreated} schedules saved\n`);

          results.push({
            churchId: church.id,
            churchName: church.name,
            country,
            success: true,
            schedulesCreated: result.schedulesCreated,
          });
        } else {
          process.stdout.write(`❌ ${result.error}\n`);

          results.push({
            churchId: church.id,
            churchName: church.name,
            country,
            success: false,
            schedulesCreated: 0,
            error: result.error,
          });
        }
      } catch (err: any) {
        process.stdout.write(`❌ ERROR: ${err.message}\n`);

        results.push({
          churchId: church.id,
          churchName: church.name,
          country,
          success: false,
          schedulesCreated: 0,
          error: err.message,
        });
      }
    }
  }

  // Final summary
  // Guard against division by zero when no churches were found at all
  const successRate = totalChurches > 0
    ? ((totalSuccess / totalChurches) * 100).toFixed(1)
    : '0.0';

  console.log('\n\n');
  console.log('═'.repeat(80));
  console.log('DATABASE SAVE SUMMARY');
  console.log('═'.repeat(80));
  console.log('');
  console.log(`Total churches processed: ${totalChurches}`);
  console.log(`Successful scrapes: ${totalSuccess} (${successRate}%)`);
  console.log(`Total schedules saved to database: ${totalSchedules}`);
  console.log('');

  // Verify database records
  console.log('Verifying database records...\n');

  const dbScheduleCount = await prisma.massSchedule.count();
  const dbChurchesWithSchedules = await prisma.church.count({
    where: {
      massSchedules: {
        some: {},
      },
    },
  });

  console.log(`✓ Total mass schedules in database: ${dbScheduleCount}`);
  console.log(`✓ Churches with schedules: ${dbChurchesWithSchedules}`);
  console.log('');

  // Show sample of saved schedules
  console.log('Sample of saved schedules:\n');

  const sampleChurches = await prisma.church.findMany({
    where: {
      massSchedules: {
        some: {},
      },
    },
    include: {
      massSchedules: {
        take: 3,
        orderBy: { dayOfWeek: 'asc' },
      },
    },
    take: 3,
  });

  const dayNames = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];

  sampleChurches.forEach(church => {
    console.log(`${church.name} (${church.country}):`);
    church.massSchedules.forEach(schedule => {
      console.log(` ${dayNames[schedule.dayOfWeek]} ${schedule.time} - ${schedule.language} ${schedule.massType || ''}`);
    });
    console.log('');
  });

  await prisma.$disconnect();
}

saveSchedulesToDb().catch((err) => {
  console.error(err);
  process.exit(1);
});
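The script above collects a `ScrapeResult` for every church but only prints global totals. One possible use of that array is a per-country breakdown. The sketch below is an assumption about how that could look: `summarizeByCountry` and `CountrySummary` are hypothetical names, and the sample rows are made-up illustration data, not real scrape output.

```typescript
interface ScrapeResult {
  churchId: string;
  churchName: string;
  country: string;
  success: boolean;
  schedulesCreated: number;
  error?: string;
}

interface CountrySummary {
  attempted: number;
  succeeded: number;
  schedules: number;
}

// Hypothetical helper: fold the results array into one summary per country.
function summarizeByCountry(results: ScrapeResult[]): Map<string, CountrySummary> {
  const summary = new Map<string, CountrySummary>();
  for (const r of results) {
    const entry = summary.get(r.country) ?? { attempted: 0, succeeded: 0, schedules: 0 };
    entry.attempted++;
    if (r.success) {
      entry.succeeded++;
      entry.schedules += r.schedulesCreated;
    }
    summary.set(r.country, entry);
  }
  return summary;
}

// Made-up sample data for illustration only.
const sample: ScrapeResult[] = [
  { churchId: '1', churchName: 'Example A', country: 'FR', success: true, schedulesCreated: 4 },
  { churchId: '2', churchName: 'Example B', country: 'FR', success: false, schedulesCreated: 0, error: 'timeout' },
  { churchId: '3', churchName: 'Example C', country: 'DE', success: true, schedulesCreated: 2 },
];

summarizeByCountry(sample).forEach((s, country) => {
  console.log(`${country}: ${s.succeeded}/${s.attempted} scraped, ${s.schedules} schedules`);
});
```

A breakdown like this would show quickly whether failures cluster in one country, which the global success rate alone cannot reveal.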
Some files were not shown because too many files have changed in this diff.