FLM Proxy

Node.js HTTP proxy that sits in front of FastFlowLM to serve LLM inference on an AMD NPU. The proxy lazily starts the model on the first request and automatically stops it after an idle timeout to free RAM.

How It Works

  • External clients hit the proxy on port 8000
  • On the first request, the proxy spawns flm.exe which serves an OpenAI-compatible API on port 8001
  • All subsequent requests are proxied through to FLM
  • After 5 minutes of inactivity, the model process is killed to reclaim memory
  • The next request will cold-start the model again (~10-15 seconds)
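The lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual code: `ModelLifecycle`, `startModel`, and `stopModel` are hypothetical names, and the real process handling (spawning flm.exe, waiting for port 8001) is assumed to live inside the injected callbacks.

```javascript
// Sketch of lazy start plus idle shutdown. ModelLifecycle and the injected
// startModel/stopModel callbacks are hypothetical, not the proxy's real code.
class ModelLifecycle {
  constructor({ startModel, stopModel, idleTimeoutMs = 300000 }) {
    this.startModel = startModel; // assumed to spawn the model and resolve once it is serving
    this.stopModel = stopModel;   // assumed to kill the model process
    this.idleTimeoutMs = idleTimeoutMs;
    this.starting = null;         // in-flight start promise, shared by concurrent requests
    this.ready = false;
    this.idleTimer = null;
  }

  // Call at the start of every proxied request.
  async ensureRunning() {
    this.touch();
    if (this.ready) return;
    if (!this.starting) {
      this.starting = Promise.resolve(this.startModel())
        .then(() => { this.ready = true; })
        .finally(() => { this.starting = null; });
    }
    await this.starting;
  }

  // Restart the idle countdown.
  touch() {
    clearTimeout(this.idleTimer);
    this.idleTimer = setTimeout(() => this.stop(), this.idleTimeoutMs);
  }

  // Stop the model if it is running and cancel the countdown.
  stop() {
    clearTimeout(this.idleTimer);
    this.idleTimer = null;
    if (this.ready) {
      this.ready = false;
      this.stopModel();
    }
  }
}
```

Sharing one start promise means that several requests arriving during a cold start trigger only a single spawn, and each request resets the idle countdown.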

Configuration

| Setting | Default | Description |
| --- | --- | --- |
| MODEL | qwen2.5vl-it:3b | Model to serve (see FastFlowLM/model_list.json) |
| PROXY_PORT | 8000 | External-facing port |
| FLM_PORT | 8001 | Internal FLM server port |
| IDLE_TIMEOUT_MS | 300000 (5 min) | Idle time before stopping the model |
| HOST | 0.0.0.0 | Listen address |
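One way such settings could be resolved is shown below. It assumes the values can be overridden through environment variables of the same name, which this README does not confirm; `loadConfig` and `DEFAULTS` are hypothetical names used only for illustration.

```javascript
// Assumption: settings are overridden through environment variables of the
// same name. loadConfig and DEFAULTS are hypothetical, not the proxy's real code.
const DEFAULTS = {
  MODEL: 'qwen2.5vl-it:3b',
  PROXY_PORT: 8000,
  FLM_PORT: 8001,
  IDLE_TIMEOUT_MS: 300000,
  HOST: '0.0.0.0',
};

function loadConfig(env = process.env) {
  // Parse a numeric setting, falling back to the default when unset.
  const toInt = (name) => {
    const raw = env[name];
    if (raw === undefined || raw === '') return DEFAULTS[name];
    const n = Number.parseInt(raw, 10);
    if (!Number.isInteger(n) || n < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    return n;
  };
  return {
    model: env.MODEL || DEFAULTS.MODEL,
    proxyPort: toInt('PROXY_PORT'),
    flmPort: toInt('FLM_PORT'),
    idleTimeoutMs: toInt('IDLE_TIMEOUT_MS'),
    host: env.HOST || DEFAULTS.HOST,
  };
}
```

Under that assumption, setting `IDLE_TIMEOUT_MS=60000` before launching would shorten the idle window to one minute.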

Endpoints

| Endpoint | Description |
| --- | --- |
| /v1/chat/completions | OpenAI-compatible chat (proxied to FLM) |
| /v1/models | List available models (proxied to FLM) |
| /status | Proxy status: whether the model is ready or starting, and its PID |
| /stop | Manually stop the model and free RAM |
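A client call to the chat endpoint might look like the sketch below. It assumes Node.js 18+ (for the global `fetch`), that the proxy is reachable at the given base URL, and that FLM returns the standard OpenAI chat response shape; `buildChatRequest` and `chat` are hypothetical helper names.

```javascript
// Hypothetical client helpers; assume Node.js 18+ (global fetch) and a
// reachable proxy. Not part of the proxy itself.
function buildChatRequest(baseUrl, model, prompt) {
  return {
    url: new URL('/v1/chat/completions', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}

async function chat(baseUrl, model, prompt) {
  const { url, options } = buildChatRequest(baseUrl, model, prompt);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Proxy returned HTTP ${res.status}`);
  const data = await res.json();
  // Assumes the standard OpenAI chat completion response shape.
  return data.choices[0].message.content;
}
```

For example, `chat('http://localhost:8000', 'qwen2.5vl-it:3b', 'Hello').then(console.log)` would send one message through the proxy. The first call after an idle period can take noticeably longer because the model has to cold-start.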

Usage

# Install dependencies
npm install

# Run in foreground
node flm-proxy.js

# Install as a Windows service
node flm-service-install.js

# Uninstall Windows service
node flm-service-uninstall.js

Service Logs

When running as a Windows service, logs are written to:

  • ~/daemon/flmvisionproxy.out.log
  • ~/daemon/flmvisionproxy.err.log

Environment

  • OS: Windows 11, AMD NPU hardware
  • Runtime: Node.js
  • FLM binary: C:\Users\sshuser\FastFlowLM\flm.exe
  • Dependencies: node-windows (for service install)

Available Models

See FastFlowLM/model_list.json for the full catalog. Model identifiers use the format family:size (e.g., qwen3:4b, llama3.2:3b). Vision models are marked with "vlm": true, and thinking models with "think": true.
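The family:size format can be split with a small helper such as the one below. `parseModelId` is a hypothetical name, and splitting at the last colon is an assumption about how identifiers are structured; the layout of model_list.json itself is not shown here, so the sketch does not read that file.

```javascript
// Hypothetical helper: splits a "family:size" identifier at the last colon.
function parseModelId(id) {
  const idx = id.lastIndexOf(':');
  // Reject identifiers with no colon, an empty family, or an empty size.
  if (idx <= 0 || idx === id.length - 1) {
    throw new Error(`Expected "family:size", got "${id}"`);
  }
  return { family: id.slice(0, idx), size: id.slice(idx + 1) };
}
```

With this helper, the default model `qwen2.5vl-it:3b` splits into the family `qwen2.5vl-it` and the size `3b`.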