FreeSearch Stability & Scheduler Healthcheck Fix

Date: 2026-03-28
Status: Approved
Scope: scripts/enrich-with-freesearch.ts, scripts/scheduler.ts, docker-compose.yml


Problem Summary

Three related infrastructure reliability issues were identified during a health check:

  1. FreeSearch crash loop: the freesearch-enrichment container restarts every ~60s because the startup health check calls process.exit(1) when the FreeSearch API is unreachable. The circuit breaker (which handles mid-run outages) lives inside runContinuous() and is never reached.

  2. Stale running jobs: each container restart creates a new freesearch-enrichment DB job without cleaning up the previous running one. Two jobs from Mar 22 and Mar 26 are permanently stuck as running.

  3. Scheduler healthcheck failing: node:20-bookworm-slim does not include procps/pgrep, so the healthcheck command pgrep -f scheduler.ts fails on every run and the scheduler shows as unhealthy despite working correctly.


Fix 1: FreeSearch Startup Resilience

Change

Replace the process.exit(1) startup health check in main() with a waitForFreeSearch() function.

Behavior

  • Polls GET /api/health with exponential backoff: 30s → 60s → 120s → 240s → cap at 300s (5 min)
  • Waits indefinitely — container stays alive until FreeSearch comes back
  • Logs each attempt: "FreeSearch not reachable, retrying in 120s..."
  • Logs recovery: "FreeSearch is back, continuing..."
  • Proceeds to job setup and runContinuous() once health check passes
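
The behavior above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: only waitForFreeSearch() and GET /api/health come from this spec. The helpers backoffDelayMs() and httpHealthProbe(), the injectable sleep and log parameters, the base URL argument, and the 10s request timeout are all assumptions made so the sketch is self-contained and testable.

```typescript
// Sketch only. Helper names, parameters, and the 10s timeout are hypothetical.

/** Delay before retry number `attempt` (0-based): 30s, 60s, 120s, 240s, then capped at 300s. */
function backoffDelayMs(attempt: number, baseMs = 30_000, capMs = 300_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

type HealthProbe = () => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Probe built on the global fetch (Node 18+). Any error or timeout counts as "not reachable". */
function httpHealthProbe(baseUrl: string, timeoutMs = 10_000): HealthProbe {
  return async () => {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(`${baseUrl}/api/health`, { signal: controller.signal });
      return res.ok;
    } catch {
      return false;
    } finally {
      clearTimeout(timer);
    }
  };
}

/** Blocks until the probe succeeds. Returns how many failed attempts preceded success. */
async function waitForFreeSearch(
  probe: HealthProbe,
  sleep: Sleep = realSleep,
  log: (msg: string) => void = console.log,
): Promise<number> {
  let attempt = 0;
  while (!(await probe())) {
    const delayMs = backoffDelayMs(attempt);
    log(`FreeSearch not reachable, retrying in ${delayMs / 1000}s...`);
    await sleep(delayMs);
    attempt += 1;
  }
  if (attempt > 0) log("FreeSearch is back, continuing...");
  return attempt;
}
```

In main(), a call such as await waitForFreeSearch(httpHealthProbe(baseUrl)) would replace the process.exit(1) branch, where baseUrl stands for however the script already locates the FreeSearch API. Injecting the probe and sleep keeps the backoff schedule checkable without real network calls or real waiting.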

Stale job cleanup (same function)

Before creating a new DB job in main(), run a cleanup:

await prisma.backgroundJob.updateMany({
  where: { type: 'freesearch-enrichment', status: 'running' },
  data: { status: 'failed', error: 'Container restarted', completedAt: new Date() },
});

This fixes the two existing stuck jobs and prevents the pattern from recurring on future restarts.

Files changed

  • scripts/enrich-with-freesearch.ts: ~25 lines

Fix 2: Scheduler Healthcheck

Change

Replace pgrep-based healthcheck with a heartbeat file approach.

In scheduler.ts: Add a writeHeartbeat() call inside the existing hourly cron handler. It writes the current ISO timestamp to /app/logs/scheduler.heartbeat.
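
A minimal sketch of what writeHeartbeat() could look like. The function name and the heartbeat path come from this spec. The optional path and now parameters, the directory creation, and the readHeartbeat() helper are assumptions added only to make the sketch testable and easy to debug.

```typescript
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Path from the spec. The parameters below are hypothetical, added for testability.
const HEARTBEAT_PATH = "/app/logs/scheduler.heartbeat";

/** Writes the current ISO timestamp to the heartbeat file, creating the directory if needed. */
function writeHeartbeat(path: string = HEARTBEAT_PATH, now: Date = new Date()): void {
  mkdirSync(dirname(path), { recursive: true });
  // Rewriting the file refreshes its mtime, which is what a `find -mmin` check inspects.
  writeFileSync(path, now.toISOString() + "\n");
}

/** Hypothetical debugging helper: returns the last written timestamp, or null if absent or invalid. */
function readHeartbeat(path: string = HEARTBEAT_PATH): Date | null {
  try {
    const parsed = new Date(readFileSync(path, "utf8").trim());
    return Number.isNaN(parsed.getTime()) ? null : parsed;
  } catch {
    return null;
  }
}
```

One possible design choice, not stated in this spec, is to call writeHeartbeat() once at scheduler startup in addition to the hourly cron handler, so the file exists before the first tick.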

In docker-compose.yml: Replace healthcheck:

# Before
test: ["CMD-SHELL", "pgrep -f scheduler.ts || exit 1"]
interval: 60s
timeout: 10s
retries: 3
start_period: 30s

# After
test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
interval: 90s
timeout: 10s
retries: 3
start_period: 90s

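The ./logs volume is already mounted. start_period: 90s suppresses failures while the container starts. Note that the heartbeat is written from the hourly cron handler, so the first tick can be up to an hour away; unless writeHeartbeat() is also called once at startup, the check will report unhealthy until that first tick.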
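
For reasoning about (or unit-testing) the new check, the freshness rule can be approximated in TypeScript. This is only an illustrative sketch: the function names are hypothetical, and the comparison approximates find -mmin -120 rather than reproducing its exact rounding.

```typescript
import { statSync } from "node:fs";

/** True when the last modification is less than maxAgeMinutes old. */
function isFresh(mtimeMs: number, nowMs: number, maxAgeMinutes = 120): boolean {
  return nowMs - mtimeMs < maxAgeMinutes * 60_000;
}

/** Mirrors the healthcheck outcome: a missing or unreadable file counts as unhealthy. */
function heartbeatIsFresh(
  path: string = "/app/logs/scheduler.heartbeat",
  nowMs: number = Date.now(),
): boolean {
  try {
    return isFresh(statSync(path).mtimeMs, nowMs);
  } catch {
    return false; // e.g. ENOENT before the first heartbeat has been written
  }
}
```

With hourly heartbeats, the 120-minute window tolerates one missed tick before the container is marked unhealthy.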

Files changed

  • scripts/scheduler.ts: ~5 lines
  • docker-compose.yml: 4 lines

Fix 3: Deploy

bash scripts/deploy-local.sh
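# Use `up -d --force-recreate` rather than `restart`: restart does not apply docker-compose.yml changes such as the new healthcheck
docker compose -f /opt/docker/scraper-control/docker-compose.yml up -d --force-recreate freesearch-enrichment scheduler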

Success Criteria

  • freesearch-enrichment container stays running even when FreeSearch is down and resumes enrichment when it comes back
  • No new stale running freesearch-enrichment jobs after container restarts
  • scheduler container shows as healthy in docker ps
  • No behavioral changes to enrichment logic itself