# FreeSearch Stability & Scheduler Healthcheck Fix
**Date:** 2026-03-28
**Status:** Approved
**Scope:** `scripts/enrich-with-freesearch.ts`, `scripts/scheduler.ts`, `docker-compose.yml`
---
## Problem Summary
Three related infrastructure reliability issues were identified during a health check:
1. **FreeSearch crash loop:** The `freesearch-enrichment` container restarts every ~60s because the startup health check calls `process.exit(1)` when the FreeSearch API is unreachable. The circuit breaker (which handles mid-run outages) lives inside `runContinuous()` and is never reached.
2. **Stale running jobs:** Each container restart creates a new `freesearch-enrichment` DB job without cleaning up the previous `running` one. Two jobs from Mar 22 and Mar 26 are permanently stuck as `running`.
3. **Scheduler healthcheck failing:** `node:20-bookworm-slim` does not include `procps`, the package that provides `pgrep`. The healthcheck command `pgrep -f scheduler.ts` therefore always exits 1 without any visible error, so the scheduler shows as `unhealthy` despite working correctly.
---
## Fix 1: FreeSearch Startup Resilience
### Change
Replace the `process.exit(1)` startup health check in `main()` with a `waitForFreeSearch()` function.
### Behavior
- Polls `GET /api/health` with exponential backoff: 30s → 60s → 120s → 240s → cap at 300s (5 min)
- Waits indefinitely — container stays alive until FreeSearch comes back
- Logs each attempt: `"FreeSearch not reachable, retrying in 120s..."`
- Logs recovery: `"FreeSearch is back, continuing..."`
- Proceeds to job setup and `runContinuous()` once health check passes
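The polling behavior above can be sketched roughly as follows. This is a minimal illustration, not the final implementation: only the `waitForFreeSearch` name, the delay schedule, and the log messages come from this spec. The injected `isHealthy` and `sleep` parameters, the `backoffDelayMs` helper, and the constant names are assumptions made so the loop can be exercised without a real FreeSearch instance.

```typescript
// Hypothetical helper types; in the real script `isHealthy` would wrap GET /api/health.
type HealthCheck = () => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

const INITIAL_DELAY_MS = 30_000; // first retry after 30s
const MAX_DELAY_MS = 300_000;    // cap at 5 minutes

/** Delay before retry number `attempt` (0-based): 30s, 60s, 120s, 240s, then 300s. */
function backoffDelayMs(attempt: number): number {
  return Math.min(INITIAL_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Polls until the health check passes. Never gives up, so the container
 * stays alive instead of exiting. Resolves to the number of failed attempts.
 */
async function waitForFreeSearch(
  isHealthy: HealthCheck,
  sleep: Sleep = realSleep,
  log: (msg: string) => void = console.log,
): Promise<number> {
  let attempt = 0;
  while (!(await isHealthy())) {
    const delayMs = backoffDelayMs(attempt);
    log(`FreeSearch not reachable, retrying in ${delayMs / 1000}s...`);
    await sleep(delayMs);
    attempt += 1;
  }
  // Only announce recovery if there was at least one failed attempt.
  if (attempt > 0) log("FreeSearch is back, continuing...");
  return attempt;
}
```

Passing the health check in as a function keeps the backoff logic independent of how the HTTP request is made, and it lets a test substitute a fake `sleep` so the schedule can be verified instantly.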
### Stale job cleanup (same function)
Before creating a new DB job in `main()`, run a cleanup:
```typescript
await prisma.backgroundJob.updateMany({
  where: { type: 'freesearch-enrichment', status: 'running' },
  data: { status: 'failed', error: 'Container restarted', completedAt: new Date() },
});
```
This fixes the two existing stuck jobs and prevents the pattern from recurring on future restarts.
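One possible startup order for `main()` is sketched below. The spec only requires that the cleanup runs before the new job is created and that job setup follows a passing health check. Running the cleanup before the wait is an assumption of this sketch, chosen so stale rows are cleared even while FreeSearch stays down. The `StartupDeps` interface and its member names (other than `waitForFreeSearch` and `runContinuous`) are hypothetical stand-ins injected for illustration.

```typescript
// Hypothetical dependency bundle; the real script would call Prisma and
// the FreeSearch API directly instead of receiving these as parameters.
interface StartupDeps {
  failStaleJobs: () => Promise<number>;            // the updateMany cleanup; resolves to rows updated
  waitForFreeSearch: () => Promise<void>;          // blocks until the health check passes
  createJob: () => Promise<string>;                // inserts the new `running` job; resolves to its id
  runContinuous: (jobId: string) => Promise<void>; // existing enrichment loop
}

async function main(deps: StartupDeps): Promise<void> {
  // 1. Mark leftovers from previous container runs as failed.
  await deps.failStaleJobs();
  // 2. Wait (possibly for a long time) instead of calling process.exit(1).
  await deps.waitForFreeSearch();
  // 3. Only now create the new job, so no `running` row exists while waiting.
  const jobId = await deps.createJob();
  // 4. Hand over to the unchanged enrichment loop.
  await deps.runContinuous(jobId);
}
```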
### Files changed
- `scripts/enrich-with-freesearch.ts`: ~25 lines
---
## Fix 2: Scheduler Healthcheck
### Change
Replace the `pgrep`-based healthcheck with a heartbeat file approach.

**In `scheduler.ts`:** Add a `writeHeartbeat()` call inside the existing hourly cron handler. It writes the current ISO timestamp to `/app/logs/scheduler.heartbeat`.

**In `docker-compose.yml`:** Replace the healthcheck:
```yaml
# Before
test: ["CMD-SHELL", "pgrep -f scheduler.ts || exit 1"]
interval: 60s
timeout: 10s
retries: 3
start_period: 30s
# After
test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
interval: 90s
timeout: 10s
retries: 3
start_period: 90s
```
The `./logs` volume is already mounted. `start_period: 90s` avoids false alarms before the first cron tick.
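A minimal sketch of the heartbeat writer, assuming Node's built-in `fs` module, might look like this. `writeHeartbeat` and the file path come from this spec. The `HEARTBEAT_PATH` constant, the optional parameters, and the `isHeartbeatFresh` helper are hypothetical additions; the helper only mirrors the `find ... -mmin -120` check in TypeScript so the freshness rule can be tested locally, and it is not part of the planned change.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Path from the spec; the constant name is made up for this sketch.
const HEARTBEAT_PATH = "/app/logs/scheduler.heartbeat";

/** Overwrites the heartbeat file with an ISO timestamp, which also updates its mtime. */
function writeHeartbeat(file: string = HEARTBEAT_PATH, now: Date = new Date()): void {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, now.toISOString() + "\n");
}

/**
 * Rough equivalent of `find <file> -mmin -120`: true if the file exists
 * and was modified less than `maxAgeMinutes` ago.
 */
function isHeartbeatFresh(
  file: string = HEARTBEAT_PATH,
  maxAgeMinutes: number = 120,
  now: Date = new Date(),
): boolean {
  try {
    const ageMs = now.getTime() - fs.statSync(file).mtimeMs;
    return ageMs < maxAgeMinutes * 60_000;
  } catch {
    // A missing file counts as unhealthy, like the `|| exit 1` branch.
    return false;
  }
}
```

With an hourly handler, a 120-minute window tolerates one missed tick before the container is marked unhealthy. Calling `writeHeartbeat()` once at process start as well would be an assumption beyond this spec, but it would ensure the file exists before the first hourly tick.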
### Files changed
- `scripts/scheduler.ts`: ~5 lines
- `docker-compose.yml`: 4 lines
---
## Fix 3: Deploy
```bash
bash scripts/deploy-local.sh
docker compose -f /opt/docker/scraper-control/docker-compose.yml restart freesearch-enrichment scheduler
```
---
## Success Criteria
- `freesearch-enrichment` container stays running even when FreeSearch is down and resumes enrichment when it comes back
- No new stale `running` freesearch-enrichment jobs after container restarts
- `scheduler` container shows as `healthy` in `docker ps`
- No behavioral changes to enrichment logic itself