docs: add freesearch stability implementation plan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/superpowers/plans/2026-03-28-freesearch-stability.md
# FreeSearch Stability & Scheduler Healthcheck Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make the `freesearch-enrichment` container stay alive when FreeSearch is down, clean up stale running jobs on restart, and fix the scheduler's perpetually-failing Docker healthcheck.

**Architecture:** Three targeted edits across two scripts and docker-compose. `enrich-with-freesearch.ts` gets a `waitForFreeSearch()` startup loop and a stale-job cleanup before job creation. `scheduler.ts` writes a heartbeat file on each hourly cron tick. `docker-compose.yml` swaps the `pgrep` healthcheck for a file-age check on that heartbeat file.

**Tech Stack:** TypeScript/tsx, Prisma, Docker Compose, node-cron, bash (healthcheck command)

---

## Files

- Modify: `scripts/enrich-with-freesearch.ts:872-880` (add `waitForFreeSearch()` function)
- Modify: `scripts/enrich-with-freesearch.ts:1272-1296` (replace startup exit with wait call + stale job cleanup)
- Modify: `scripts/scheduler.ts:747-758` (write heartbeat file in hourly cron)
- Modify: `docker-compose.yml:275-280` (replace scheduler healthcheck)

---

### Task 1: Add `waitForFreeSearch()` to the enrichment script

**Files:**
- Modify: `scripts/enrich-with-freesearch.ts`

The existing `healthCheck()` function (line 872) returns a boolean. We add `waitForFreeSearch()` directly below it: a loop that calls `healthCheck()` and sleeps with exponential backoff (starting at 30 seconds, doubling up to a 5-minute cap) until the check succeeds or the script is shutting down.

- [ ] **Step 1: Add `waitForFreeSearch()` after `healthCheck()`**

In `scripts/enrich-with-freesearch.ts`, find this block (around line 872):

```typescript
async function healthCheck(): Promise<boolean> {
  try {
    const resp = await axios.get(`${FREESEARCH_URL}/api/health`, { timeout: 5000 });
    return resp.status === 200;
  } catch {
    return false;
  }
}
```

Add the following function immediately after it:

```typescript
async function waitForFreeSearch(): Promise<void> {
  let backoffMs = 30_000;
  const maxBackoffMs = 300_000; // 5 minutes
  let attempt = 0;

  while (!shuttingDown) {
    attempt++;
    const healthy = await healthCheck();
    if (healthy) {
      if (attempt > 1) log('FreeSearch is back. Continuing...');
      return;
    }
    const waitSec = Math.round(backoffMs / 1000);
    logError(`FreeSearch not reachable at ${FREESEARCH_URL} (attempt ${attempt}). Retrying in ${waitSec}s...`);
    await sleep(backoffMs);
    backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
  }
}
```
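
To make the retry cadence of `waitForFreeSearch()` concrete, the sketch below lists the delays that a doubling backoff with a cap produces. `backoffSchedule` is a hypothetical helper written only for illustration; it is not part of `enrich-with-freesearch.ts`, and it reuses the same two constants (30 s start, 5 min cap) as the function above.

```typescript
// Hypothetical helper (not part of the script): returns the delays a doubling
// backoff with a cap would sleep for across the first `attempts` failures.
function backoffSchedule(initialMs: number, maxMs: number, attempts: number): number[] {
  const delays: number[] = [];
  let backoffMs = initialMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(backoffMs);
    // Same update rule as waitForFreeSearch(): double, then clamp to the cap.
    backoffMs = Math.min(backoffMs * 2, maxMs);
  }
  return delays;
}

// Same constants as waitForFreeSearch(): start at 30 s, cap at 5 min.
const delaysSec = backoffSchedule(30_000, 300_000, 6).map((ms) => ms / 1000);
console.log(delaysSec.join(', ')); // 30, 60, 120, 240, 300, 300
```

With these values the loop settles at one health check every 5 minutes from the fifth failed attempt onward, which also bounds how long a shutdown request can wait on a pending `sleep()`.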

- [ ] **Step 2: Replace the startup health check block in `main()`**

Find this block in `main()` (around line 1272):

```typescript
// Health check
log('Checking FreeSearch health...');
const healthy = await healthCheck();
if (!healthy) {
  logError(`FreeSearch not reachable at ${FREESEARCH_URL}`);
  logError('Make sure FreeSearch is running and accessible.');
  process.exit(1);
}
log('FreeSearch health check: OK');
```

Replace with:

```typescript
// Wait for FreeSearch to be reachable (indefinite retry with backoff)
log('Waiting for FreeSearch to be reachable...');
await waitForFreeSearch();
if (shuttingDown) return;
log('FreeSearch health check: OK');
```

- [ ] **Step 3: Add stale job cleanup before job creation**

Find this block in `main()` (around line 1291):

```typescript
// Job tracking
let jobId = await createOrResumeJob(args);
if (!jobId) {
  jobId = await createNewJob({ countryCode, limit, continuous, dryRun, reSearch });
}
log(`Job ID: ${jobId}`);
```

Replace with:

```typescript
// Job tracking — clean up any running jobs left by a previous container restart
await prisma.backgroundJob.updateMany({
  where: { type: 'freesearch-enrichment', status: 'running' },
  data: { status: 'failed', error: 'Container restarted', completedAt: new Date() },
});

let jobId = await createOrResumeJob(args);
if (!jobId) {
  jobId = await createNewJob({ countryCode, limit, continuous, dryRun, reSearch });
}
log(`Job ID: ${jobId}`);
```
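
As a rough illustration of what the `updateMany()` call in Step 3 does to the table, here is an in-memory sketch. The `Job` interface and `failStaleJobs` function are hypothetical and exist only for this example; the field names simply mirror those in the Prisma call, and the real schema may carry more columns.

```typescript
// Hypothetical in-memory model of the job rows; field names mirror the
// updateMany() call above, everything else is an assumption.
interface Job {
  type: string;
  status: 'running' | 'completed' | 'failed';
  error?: string;
  completedAt?: Date;
}

// Marks every running job of the given type as failed and returns how many
// rows changed, analogous to the `count` in the result of Prisma's updateMany().
function failStaleJobs(jobs: Job[], type: string, now: Date): number {
  let count = 0;
  for (const job of jobs) {
    if (job.type === type && job.status === 'running') {
      job.status = 'failed';
      job.error = 'Container restarted';
      job.completedAt = now;
      count++;
    }
  }
  return count;
}

const jobs: Job[] = [
  { type: 'freesearch-enrichment', status: 'running' },
  { type: 'freesearch-enrichment', status: 'completed' },
  { type: 'other-job', status: 'running' },
];
const changed = failStaleJobs(jobs, 'freesearch-enrichment', new Date());
console.log(changed); // 1
console.log(jobs.map((j) => j.status).join(', ')); // failed, completed, running
```

The filter matches on `type` and `status` only, so it would also mark a job that belongs to another live instance of the same script, if one were running. The plan appears to assume a single `freesearch-enrichment` container.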

- [ ] **Step 4: Verify the script compiles**

```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```

Expected: no errors (or only pre-existing errors unrelated to this change).

- [ ] **Step 5: Commit**

```bash
git add scripts/enrich-with-freesearch.ts
git commit -m "fix: wait for FreeSearch on startup instead of exiting; clean stale jobs"
```

---

### Task 2: Write heartbeat file in scheduler

**Files:**
- Modify: `scripts/scheduler.ts`

The scheduler already has an hourly cron that logs a heartbeat message (lines 747-758). We add a single `fs.writeFileSync` call inside it to write the timestamp to `/app/logs/scheduler.heartbeat`. The `logs/` directory is already created by `ensureLogsDir()` at startup.

- [ ] **Step 1: Add heartbeat file write inside the hourly cron**

Find this block in `scripts/scheduler.ts` (around line 747):

```typescript
// Heartbeat every hour — logs cycle state
cron.schedule('0 * * * *', () => {
  const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
    ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
    : 'none';
  const jobs = runningJobs.size > 0
    ? `Running: ${[...runningJobs.keys()].join(', ')}`
    : 'No jobs running';
  const state = cycleState.waitingForCooldown
    ? 'cooldown'
    : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
  log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
}, { timezone: 'UTC' });
log('Registered cron job: heartbeat (hourly)');
```

Replace with:

```typescript
// Heartbeat every hour — logs cycle state and writes heartbeat file for Docker healthcheck
cron.schedule('0 * * * *', () => {
  const currentGroup = cycleState.currentGroupIndex < PIPELINE_GROUPS.length
    ? PIPELINE_GROUPS[cycleState.currentGroupIndex].name
    : 'none';
  const jobs = runningJobs.size > 0
    ? `Running: ${[...runningJobs.keys()].join(', ')}`
    : 'No jobs running';
  const state = cycleState.waitingForCooldown
    ? 'cooldown'
    : `group ${cycleState.currentGroupIndex + 1}/${PIPELINE_GROUPS.length} (${currentGroup})`;
  log(`Heartbeat: Cycle ${cycleState.cycleNumber + 1}, ${state}. ${jobs}`);
  fs.writeFileSync(path.join(LOGS_DIR, 'scheduler.heartbeat'), new Date().toISOString());
}, { timezone: 'UTC' });
log('Registered cron job: heartbeat (hourly)');
```

`fs` and `path` are already imported in `scheduler.ts`. `LOGS_DIR` is already defined as `'/app/logs'`.
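
One possible refinement, offered as a sketch rather than as part of the plan: pull the write into a small helper so it can also be called once at scheduler startup, and so a failed write is logged instead of thrown from inside the cron callback. `writeHeartbeat` is a hypothetical name, and the temp-file-plus-rename step is an added assumption meant to keep a reader from ever seeing a half-written file.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: writes an ISO timestamp to <logsDir>/scheduler.heartbeat.
// The file is written under a temporary name and then renamed into place.
function writeHeartbeat(logsDir: string, now: Date = new Date()): string {
  const target = path.join(logsDir, 'scheduler.heartbeat');
  const tmp = `${target}.tmp`;
  try {
    fs.writeFileSync(tmp, now.toISOString());
    fs.renameSync(tmp, target);
  } catch (err) {
    // A failed heartbeat write is reported but does not stop the scheduler.
    console.error(`Failed to write heartbeat: ${(err as Error).message}`);
  }
  return target;
}

// Demo against a throwaway directory instead of /app/logs.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'heartbeat-'));
const written = writeHeartbeat(demoDir, new Date('2026-03-28T14:00:00.000Z'));
console.log(fs.readFileSync(written, 'utf8')); // 2026-03-28T14:00:00.000Z
```

In `scheduler.ts` this would be called as `writeHeartbeat(LOGS_DIR)` inside the hourly cron and, optionally, once right after `ensureLogsDir()`, so that a heartbeat file exists before the first hourly tick.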

- [ ] **Step 2: Verify the script compiles**

```bash
cd /home/albert/Documents/ScraperControl
npx tsc --noEmit
```

Expected: no errors.

- [ ] **Step 3: Commit**

```bash
git add scripts/scheduler.ts
git commit -m "fix: write heartbeat file for Docker healthcheck"
```

---

### Task 3: Fix scheduler healthcheck in docker-compose.yml

**Files:**
- Modify: `docker-compose.yml`

- [ ] **Step 1: Replace the scheduler healthcheck**

Find this block in `docker-compose.yml` (around line 275):

```yaml
healthcheck:
  test: ["CMD-SHELL", "pgrep -f scheduler.ts || exit 1"]
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 30s
```

Replace with:

```yaml
healthcheck:
  test: ["CMD-SHELL", "find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . || exit 1"]
  interval: 90s
  timeout: 10s
  retries: 3
  start_period: 90s
```

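The `find ... -mmin -120` check passes only if the file exists and was modified within the last 120 minutes (2 hours), which leaves room for roughly one missed hourly tick. Note that `start_period: 90s` does not cover the wait for the first hourly cron tick: after a fresh start the heartbeat file may not exist for up to 60 minutes, and the container can report `unhealthy` during that window. Task 4, Step 6 shows a manual write that forces an immediate check.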
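
For readers who prefer to see the file-age rule in code, the sketch below expresses roughly the same check in TypeScript. `isHeartbeatFresh` is a hypothetical function for illustration only; it approximates `find -mmin -120` with a plain "modified less than N minutes ago" comparison and does not reproduce `find`'s own rounding rules.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: true when the file exists and its modification time is
// less than maxAgeMinutes old. A missing or unreadable file counts as unhealthy.
function isHeartbeatFresh(file: string, maxAgeMinutes: number, nowMs: number = Date.now()): boolean {
  try {
    const ageMs = nowMs - fs.statSync(file).mtimeMs;
    return ageMs < maxAgeMinutes * 60_000;
  } catch {
    return false;
  }
}

// Demo in a throwaway directory instead of /app/logs.
const demoFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'hc-')), 'scheduler.heartbeat');
console.log(isHeartbeatFresh(demoFile, 120)); // false: file does not exist yet
fs.writeFileSync(demoFile, new Date().toISOString());
console.log(isHeartbeatFresh(demoFile, 120)); // true: just written
// Pretend three hours have passed since the write.
console.log(isHeartbeatFresh(demoFile, 120, Date.now() + 3 * 60 * 60_000)); // false: stale
```

The first `false` corresponds to the window right after a fresh start, before any heartbeat has been written.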

- [ ] **Step 2: Commit**

```bash
git add docker-compose.yml
git commit -m "fix: replace pgrep healthcheck with heartbeat file check"
```

---

### Task 4: Deploy and verify

- [ ] **Step 1: Sync dev directory to Docker deployment**

```bash
cd /home/albert/Documents/ScraperControl
bash scripts/deploy-local.sh
```

Expected: rsync output showing the three changed files transferred to `/opt/docker/scraper-control/`.

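- [ ] **Step 2: Recreate the two affected containers**

```bash
docker compose -f /opt/docker/scraper-control/docker-compose.yml up -d --force-recreate freesearch-enrichment scheduler
```

`docker compose restart` does not apply changes made to `docker-compose.yml`, so the new scheduler healthcheck would not take effect with a plain restart. `up -d --force-recreate` recreates both containers with the updated configuration.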

- [ ] **Step 3: Verify freesearch-enrichment is stable**

```bash
docker logs scraper-control-freesearch-enrichment-1 --tail 30 -f
```

Expected: logs showing "Waiting for FreeSearch to be reachable..." with retry messages if FreeSearch is still down, OR "FreeSearch health check: OK" and normal enrichment if FreeSearch is up. Container should NOT exit. Wait 2 minutes to confirm no restart.

- [ ] **Step 4: Confirm stale jobs were cleaned up**

```bash
docker exec scraper-control-db-1 psql -U postgres -d nearestmass \
  -c "SELECT type, status, started_at, completed_at, error FROM background_jobs WHERE type = 'freesearch-enrichment' ORDER BY started_at DESC LIMIT 5;"
```

Expected: the two previously-stuck `running` jobs from Mar 22 and Mar 26 now show `status = 'failed'` with `error = 'Container restarted'`.

- [ ] **Step 5: Verify scheduler heartbeat file is written**

The heartbeat file is new, so it will not exist until the scheduler's first hourly cron tick. Wait for the next tick (at most 60 minutes), then check:

```bash
docker exec scraper-control-scheduler-1 cat /app/logs/scheduler.heartbeat
```

Expected: an ISO timestamp, e.g. `2026-03-28T14:00:00.000Z`

- [ ] **Step 6: Verify scheduler becomes healthy**

```bash
docker ps --format "table {{.Names}}\t{{.Status}}" | grep scheduler
```

Expected: `scraper-control-scheduler-1 Up X hours (healthy)`, but only once the first heartbeat has been written. Until the next hourly cron tick writes the file, the status shows `starting` and then, once the check has failed `retries` (3) times in a row after the 90s `start_period`, `unhealthy`.

To force an immediate test without waiting for the cron:

```bash
docker exec scraper-control-scheduler-1 bash -c \
  "date -u +%Y-%m-%dT%H:%M:%S.000Z > /app/logs/scheduler.heartbeat && echo 'written'"
docker exec scraper-control-scheduler-1 \
  find /app/logs/scheduler.heartbeat -mmin -120 2>/dev/null | grep -q . && echo "PASS" || echo "FAIL"
```

Expected: `written` then `PASS`.