FLM Proxy

A Node.js HTTP proxy that sits in front of FastFlowLM (FLM) to serve LLM inference on an AMD NPU. The proxy starts the model lazily on the first request and stops it automatically after an idle timeout to free RAM.

How It Works

1. External clients hit the proxy on port 8000.
2. On the first request, the proxy spawns flm.exe, which serves an OpenAI-compatible API on port 8001.
3. All subsequent requests are proxied through to FLM.
4. After 5 minutes of inactivity, the model process is killed to reclaim memory.
5. The next request cold-starts the model again (roughly 10-15 seconds).
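
The lifecycle above can be sketched as a small state holder. This is a minimal illustration, not the actual `flm-proxy.js` code: `IdleLifecycle`, `touch`, `shutdown`, and the injected `start`/`stop` callbacks are hypothetical names. In a real proxy, `start` would presumably spawn `flm.exe` and `stop` would kill it.

```javascript
// Hypothetical sketch of lazy start plus idle shutdown (not the real flm-proxy.js).
class IdleLifecycle {
  constructor({ start, stop, idleMs, timers }) {
    this.start = start;   // called when a request arrives and nothing is running
    this.stop = stop;     // called when the idle timer fires or on manual shutdown
    this.idleMs = idleMs; // e.g. 300000 for the 5-minute default
    // Timer functions are injectable so the logic can be exercised without waiting.
    this.timers = timers || {
      set: (fn, ms) => setTimeout(fn, ms),
      clear: (id) => clearTimeout(id),
    };
    this.running = false;
    this.timer = null;
  }

  // Call once per incoming request: starts the backend if needed, then restarts the idle clock.
  touch() {
    if (!this.running) {
      this.start();
      this.running = true;
    }
    if (this.timer !== null) this.timers.clear(this.timer);
    this.timer = this.timers.set(() => this.shutdown(), this.idleMs);
  }

  // Stops the backend and cancels any pending idle timer.
  shutdown() {
    if (this.timer !== null) {
      this.timers.clear(this.timer);
      this.timer = null;
    }
    if (this.running) {
      this.stop();
      this.running = false;
    }
  }
}
```

Each proxied request would call `touch()` before forwarding, so the idle countdown restarts on every request and a request that arrives after shutdown triggers a fresh start.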

Configuration

| Setting | Default | Description |
| --- | --- | --- |
| `MODEL` | `qwen2.5vl-it:3b` | Model to serve (see `FastFlowLM/model_list.json`) |
| `PROXY_PORT` | `8000` | External-facing port |
| `FLM_PORT` | `8001` | Internal FLM server port |
| `IDLE_TIMEOUT_MS` | `300000` (5 min) | Idle time before stopping the model |
| `HOST` | `0.0.0.0` | Listen address |
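
As an illustration of how these settings and defaults fit together, here is a minimal sketch of a config loader. It assumes the settings can be overridden through environment variables, which this README does not state; they may simply be constants in `flm-proxy.js`. The names `DEFAULTS` and `loadConfig` are hypothetical.

```javascript
// Hypothetical config loader; assumes environment-variable overrides (not confirmed by the README).
const DEFAULTS = {
  MODEL: 'qwen2.5vl-it:3b',
  PROXY_PORT: 8000,
  FLM_PORT: 8001,
  IDLE_TIMEOUT_MS: 300000, // 5 minutes
  HOST: '0.0.0.0',
};

function loadConfig(env = process.env) {
  // Parse a non-negative integer setting, falling back to the default when unset or empty.
  const num = (name) => {
    const raw = env[name];
    if (raw === undefined || raw === '') return DEFAULTS[name];
    const n = Number(raw);
    if (!Number.isInteger(n) || n < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    return n;
  };
  return {
    MODEL: env.MODEL || DEFAULTS.MODEL,
    PROXY_PORT: num('PROXY_PORT'),
    FLM_PORT: num('FLM_PORT'),
    IDLE_TIMEOUT_MS: num('IDLE_TIMEOUT_MS'),
    HOST: env.HOST || DEFAULTS.HOST,
  };
}
```

Validating numeric settings up front means a mistyped port fails at startup rather than on the first request.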

Endpoints

| Endpoint | Description |
| --- | --- |
| `/v1/chat/completions` | OpenAI-compatible chat (proxied to FLM) |
| `/v1/models` | List available models (proxied to FLM) |
| `/status` | Proxy status: whether the model is ready or starting, and its PID |
| `/stop` | Manually stop the model and free RAM |
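
A client can talk to the chat endpoint with a standard OpenAI-style request body. The sketch below assumes the proxy is reachable at `http://localhost:8000` (the default `PROXY_PORT`) and that Node.js 18 or later is available for the global `fetch`. The helper names `buildChatRequest` and `chat` are hypothetical.

```javascript
// Hypothetical client helpers; the base URL and default model are assumptions taken from the defaults above.
function buildChatRequest(prompt, { baseUrl = 'http://localhost:8000', model = 'qwen2.5vl-it:3b' } = {}) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // OpenAI-compatible chat payload: a model name plus a list of role/content messages.
      body: JSON.stringify({ model, messages: [{ role: 'user', content: prompt }] }),
    },
  };
}

async function chat(prompt, options) {
  const { url, init } = buildChatRequest(prompt, options);
  const res = await fetch(url, init); // global fetch requires Node.js 18+
  if (!res.ok) throw new Error(`Proxy returned HTTP ${res.status}`);
  const data = await res.json();
  // OpenAI-compatible responses carry the reply under choices[0].message.content.
  return data.choices[0].message.content;
}
```

For example, `chat('Hello').then(console.log)` prints the model's reply. The first call after an idle period can take noticeably longer because the model has to cold-start.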

Usage

# Install dependencies
npm install

# Run in foreground
node flm-proxy.js

# Install as a Windows service
node flm-service-install.js

# Uninstall Windows service
node flm-service-uninstall.js

Service Logs

When running as a Windows service, logs are written to:

~/daemon/flmvisionproxy.out.log
~/daemon/flmvisionproxy.err.log

Environment

OS: Windows 11, AMD NPU hardware
Runtime: Node.js
FLM binary: C:\Users\sshuser\FastFlowLM\flm.exe
Dependencies: node-windows (for service install)

Available Models

See `FastFlowLM/model_list.json` for the full catalog. Model identifiers use the format `family:size` (e.g., `qwen3:4b`, `llama3.2:3b`). Vision models are marked with `"vlm": true`, and thinking models with `"think": true`.
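
The `family:size` convention can be checked before a model name is handed to FLM. The following is a minimal sketch; `parseModelId` is a hypothetical helper, and it makes no assumptions about the layout of `model_list.json` itself.

```javascript
// Hypothetical helper: split a "family:size" identifier into its two parts.
function parseModelId(id) {
  const idx = id.lastIndexOf(':');
  // Reject identifiers with no colon, an empty family, or an empty size.
  if (idx <= 0 || idx === id.length - 1) {
    throw new Error(`Expected "family:size", got "${id}"`);
  }
  return { family: id.slice(0, idx), size: id.slice(idx + 1) };
}
```

With this helper, `parseModelId('llama3.2:3b')` yields a family of `llama3.2` and a size of `3b`.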
