FLM Proxy
Node.js HTTP proxy that sits in front of FastFlowLM to serve LLM inference on an AMD NPU. The proxy lazily starts the model on the first request and automatically stops it after an idle timeout to free RAM.
How It Works
- External clients hit the proxy on port 8000
- On the first request, the proxy spawns flm.exe, which serves an OpenAI-compatible API on port 8001
- All subsequent requests are proxied through to FLM
- After 5 minutes of inactivity, the model process is killed to reclaim memory
- The next request will cold-start the model again (~10-15 seconds)
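The lifecycle above (lazy start, idle shutdown, cold restart) can be sketched roughly as follows. This is a minimal illustration, not the code in flm-proxy.js: the class name IdleManagedBackend and its start/stop callbacks are hypothetical, and the real proxy would spawn and kill flm.exe inside them.

```javascript
// Hypothetical helper: names and structure are illustrative, not taken from flm-proxy.js.
class IdleManagedBackend {
  constructor({ start, stop, idleTimeoutMs }) {
    this.start = start;               // async function that launches the backend
    this.stop = stop;                 // function that kills the backend
    this.idleTimeoutMs = idleTimeoutMs;
    this.startPromise = null;         // shared by concurrent first requests
    this.idleTimer = null;
  }

  // Call before proxying each request: starts the backend if needed,
  // then resets the idle countdown.
  async ensureRunning() {
    if (!this.startPromise) {
      this.startPromise = Promise.resolve().then(() => this.start());
      // If startup fails, forget the promise so the next request can retry.
      this.startPromise.catch(() => { this.startPromise = null; });
    }
    await this.startPromise;
    this.touch();
  }

  // Restart the idle countdown.
  touch() {
    clearTimeout(this.idleTimer);
    this.idleTimer = setTimeout(() => this.shutdown(), this.idleTimeoutMs);
  }

  // Stop the backend now (idle timeout or manual stop).
  shutdown() {
    clearTimeout(this.idleTimer);
    this.idleTimer = null;
    if (this.startPromise) {
      this.startPromise = null;
      this.stop();
    }
  }
}
```

Sharing one start promise means that several requests arriving during a cold start trigger only one launch, and each completed request pushes the idle deadline back.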
Configuration
| Setting | Default | Description |
|---|---|---|
| MODEL | qwen2.5vl-it:3b | Model to serve (see FastFlowLM/model_list.json) |
| PROXY_PORT | 8000 | External-facing port |
| FLM_PORT | 8001 | Internal FLM server port |
| IDLE_TIMEOUT_MS | 300000 (5 min) | Idle time before stopping the model |
| HOST | 0.0.0.0 | Listen address |
Endpoints
| Endpoint | Description |
|---|---|
| /v1/chat/completions | OpenAI-compatible chat (proxied to FLM) |
| /v1/models | List available models (proxied to FLM) |
| /status | Proxy status: model ready, starting, PID |
| /stop | Manually stop the model and free RAM |
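A client call to the chat endpoint might look like the sketch below. It assumes the proxy is reachable at http://localhost:8000 (the default port) and that FLM accepts the standard OpenAI chat request body; the helper names buildChatRequest and chat are hypothetical. The global fetch used here requires Node.js 18 or newer.

```javascript
// Hypothetical client helper; the base URL assumes a local proxy on the default port.
const PROXY_BASE_URL = "http://localhost:8000";

// Build the URL and fetch options for an OpenAI-style chat completion request.
function buildChatRequest(model, userText, baseUrl = PROXY_BASE_URL) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userText }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
// The first call after an idle period may take longer because of the cold start.
async function chat(model, userText) {
  const { url, options } = buildChatRequest(model, userText);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Proxy returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Because the first request after an idle period waits for the model to load, clients with short request timeouts may need to raise them.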
Usage
# Install dependencies
npm install
# Run in foreground
node flm-proxy.js
# Install as a Windows service
node flm-service-install.js
# Uninstall Windows service
node flm-service-uninstall.js
Service Logs
When running as a Windows service, logs are written to:
- ~/daemon/flmvisionproxy.out.log
- ~/daemon/flmvisionproxy.err.log
Environment
- OS: Windows 11, AMD NPU hardware
- Runtime: Node.js
- FLM binary: C:\Users\sshuser\FastFlowLM\flm.exe
- Dependencies: node-windows (for service install)
Available Models
See FastFlowLM/model_list.json for the full catalog. Model identifiers use the format family:size (e.g., qwen3:4b, llama3.2:3b). Vision models have "vlm": true, thinking models have "think": true.
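The identifier format and capability flags can be handled with a few lines of code. The sketch below is illustrative only: the layout of model_list.json is not described here, so the sample entries use a made-up flat shape with an id field, and the helper names parseModelId and filterByFlag are hypothetical.

```javascript
// Split a "family:size" identifier into its two parts.
function parseModelId(id) {
  const sep = id.indexOf(":");
  if (sep <= 0 || sep === id.length - 1) {
    throw new Error(`Expected family:size, got "${id}"`);
  }
  return { family: id.slice(0, sep), size: id.slice(sep + 1) };
}

// Keep only entries whose capability flag (e.g. "vlm" or "think") is true.
function filterByFlag(entries, flag) {
  return entries.filter((entry) => entry[flag] === true);
}

// Illustrative entries only: the real model_list.json may be structured
// differently, and these flag values are placeholders, not catalog data.
const sampleEntries = [
  { id: "qwen2.5vl-it:3b", vlm: true },
  { id: "qwen3:4b", think: true },
  { id: "llama3.2:3b" },
];

const visionIds = filterByFlag(sampleEntries, "vlm").map((entry) => entry.id);
console.log(visionIds);
console.log(parseModelId("llama3.2:3b"));
```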