From da4aa618603511365e64e96b1c156636839bcd33 Mon Sep 17 00:00:00 2001
From: albertfj114
Date: Sat, 28 Mar 2026 08:38:17 -0400
Subject: [PATCH] docs: add freesearch stability & scheduler healthcheck design spec

Co-Authored-By: Claude Sonnet 4.6
---
 .../2026-03-28-freesearch-stability-design.md | 103 ++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-03-28-freesearch-stability-design.md

diff --git a/docs/superpowers/specs/2026-03-28-freesearch-stability-design.md b/docs/superpowers/specs/2026-03-28-freesearch-stability-design.md
new file mode 100644
index 0000000..d427a22
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-28-freesearch-stability-design.md
@@ -0,0 +1,103 @@
+# FreeSearch Stability & Scheduler Healthcheck Fix
+
+**Date:** 2026-03-28
+**Status:** Approved
+**Scope:** `scripts/enrich-with-freesearch.ts`, `scripts/scheduler.ts`, `docker-compose.yml`
+
+---
+
+## Problem Summary
+
+Three related infrastructure reliability issues identified during health check:
+
+1. **FreeSearch crash loop** — `freesearch-enrichment` container restarts every ~60s because startup health check calls `process.exit(1)` when FreeSearch API is unreachable. The circuit breaker (which handles mid-run outages) lives inside `runContinuous()` and is never reached.
+
+2. **Stale running jobs** — Each container restart creates a new `freesearch-enrichment` DB job without cleaning up the previous `running` one. Two jobs from Mar 22 and Mar 26 are permanently stuck as `running`.
+
+3. **Scheduler healthcheck failing** — `node:20-bookworm-slim` does not include `procps`/`pgrep`. The healthcheck command `pgrep -f scheduler.ts` exits 1 silently → scheduler shows as `unhealthy` despite working correctly.
+
+---
+
+## Fix 1: FreeSearch Startup Resilience
+
+### Change
+
+Replace the `process.exit(1)` startup health check in `main()` with a `waitForFreeSearch()` function.
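A minimal sketch of what `waitForFreeSearch()` could look like, assuming the backoff schedule this spec gives for it (30s, doubling up to a 300s cap). Only the function name, the schedule, and the two log messages come from the spec; `backoffDelayMs`, the injected `isHealthy`, `sleep`, and `log` parameters, and the constants are hypothetical names introduced for illustration and testability.

```typescript
// Hypothetical constants mirroring the spec's schedule: 30s first delay, 300s cap.
const INITIAL_DELAY_MS = 30_000;
const MAX_DELAY_MS = 300_000;

/** Delay before retry `attempt` (0-based): 30s, 60s, 120s, 240s, then 300s. */
function backoffDelayMs(attempt: number): number {
  return Math.min(INITIAL_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

type HealthCheck = () => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Polls `isHealthy` until it resolves to true, sleeping with exponential
 * backoff between attempts. It never gives up, so the process stays alive
 * instead of exiting while FreeSearch is down.
 */
async function waitForFreeSearch(
  isHealthy: HealthCheck,
  sleep: Sleep = defaultSleep,
  log: (msg: string) => void = console.log,
): Promise<void> {
  let attempt = 0;
  while (!(await isHealthy())) {
    const delayMs = backoffDelayMs(attempt);
    log(`FreeSearch not reachable, retrying in ${delayMs / 1000}s...`);
    await sleep(delayMs);
    attempt += 1;
  }
  // Only announce recovery if at least one attempt failed.
  if (attempt > 0) log("FreeSearch is back, continuing...");
}
```

In `main()`, the `isHealthy` argument would presumably wrap the existing `GET /api/health` request and return `false` on any network error; that wiring is left out here because the current health check code is not shown in this spec.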
+
+### Behavior
+
+- Polls `GET /api/health` with exponential backoff: 30s → 60s → 120s → 240s → cap at 300s (5 min)
+- Waits indefinitely — container stays alive until FreeSearch comes back
+- Logs each attempt: `"FreeSearch not reachable, retrying in 120s..."`
+- Logs recovery: `"FreeSearch is back, continuing..."`
+- Proceeds to job setup and `runContinuous()` once health check passes
+
+### Stale job cleanup (same function)
+
+Before creating a new DB job in `main()`, run a cleanup:
+
+```typescript
+await prisma.backgroundJob.updateMany({
+  where: { type: 'freesearch-enrichment', status: 'running' },
+  data: { status: 'failed', error: 'Container restarted', completedAt: new Date() },
+});
+```
+
+This fixes the two existing stuck jobs and prevents the pattern from recurring on future restarts.
+
+### Files changed
+
+- `scripts/enrich-with-freesearch.ts`: ~25 lines
+
+---
+
+## Fix 2: Scheduler Healthcheck
+
+### Change
+
+Replace `pgrep`-based healthcheck with a heartbeat file approach.
+
+**In `scheduler.ts`:** Add `writeHeartbeat()` call inside the existing hourly cron handler. Writes current ISO timestamp to `/app/logs/scheduler.heartbeat`.
+
+**In `docker-compose.yml`:** Replace healthcheck:
+
+```yaml
+# Before
+test: ["CMD-SHELL", "pgrep -f scheduler.ts || exit 1"]
+interval: 60s
+timeout: 10s
+retries: 3
+start_period: 30s
+
+# After
+test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
+interval: 90s
+timeout: 10s
+retries: 3
+start_period: 90s
+```
+
+The `./logs` volume is already mounted. `start_period: 90s` avoids false alarms before the first cron tick.
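A minimal sketch of the heartbeat side of Fix 2, assuming `writeHeartbeat()` simply overwrites the file with the current ISO timestamp. The heartbeat path and the function name come from the spec; the optional parameters and the `isHeartbeatFresh()` helper are hypothetical additions that approximate the `find -mmin -120` check so the behavior can be exercised outside Docker.

```typescript
import { mkdirSync, statSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Path taken from the spec; the parameter default lets tests point elsewhere.
const HEARTBEAT_PATH = "/app/logs/scheduler.heartbeat";

/** Overwrites the heartbeat file with the current ISO timestamp. */
function writeHeartbeat(
  path: string = HEARTBEAT_PATH,
  now: Date = new Date(),
): void {
  // Create the logs directory if it does not exist yet.
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, now.toISOString() + "\n");
}

/**
 * Hypothetical helper approximating `find <path> -mmin -120`: true when the
 * file exists and was modified less than `maxAgeMinutes` ago.
 */
function isHeartbeatFresh(
  path: string = HEARTBEAT_PATH,
  maxAgeMinutes: number = 120,
  now: Date = new Date(),
): boolean {
  try {
    const ageMs = now.getTime() - statSync(path).mtimeMs;
    return ageMs < maxAgeMinutes * 60_000;
  } catch {
    // A missing or unreadable file counts as stale, like the shell check.
    return false;
  }
}
```

With an hourly write and a 120-minute window, one missed cron tick is tolerated before the container is marked unhealthy. Since the hourly handler may not fire until up to an hour after startup, one option (not part of this spec) would be to also call `writeHeartbeat()` once when the scheduler boots.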
+
+### Files changed
+
+- `scripts/scheduler.ts`: ~5 lines
+- `docker-compose.yml`: 4 lines
+
+---
+
+## Fix 3: Deploy
+
+```bash
+bash scripts/deploy-local.sh
+docker compose -f /opt/docker/scraper-control/docker-compose.yml restart freesearch-enrichment scheduler
+```
+
+---
+
+## Success Criteria
+
+- `freesearch-enrichment` container stays running even when FreeSearch is down, resumes enrichment when it comes back
+- No new stale `running` freesearch-enrichment jobs after container restarts
+- `scheduler` container shows as `healthy` in `docker ps`
+- No behavioral changes to enrichment logic itself