# FLM Proxy

Node.js HTTP proxy that sits in front of [FastFlowLM](https://github.com/amd/FastFlowLM) to serve LLM inference on an AMD NPU. The proxy lazily starts the model on the first request and automatically stops it after an idle timeout to free RAM.

## How It Works

- External clients hit the proxy on **port 8000**
- On the first request, the proxy spawns `flm.exe` which serves an OpenAI-compatible API on port 8001
- All subsequent requests are proxied through to FLM
- After 5 minutes of inactivity, the model process is killed to reclaim memory
- The next request will cold-start the model again (~10-15 seconds)

## Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| `MODEL` | `qwen2.5vl-it:3b` | Model to serve (see `FastFlowLM/model_list.json`) |
| `PROXY_PORT` | `8000` | External-facing port |
| `FLM_PORT` | `8001` | Internal FLM server port |
| `IDLE_TIMEOUT_MS` | `300000` (5 min) | Idle time before stopping the model |
| `HOST` | `0.0.0.0` | Listen address |

## Endpoints

| Endpoint | Description |
|----------|-------------|
| `/v1/chat/completions` | OpenAI-compatible chat (proxied to FLM) |
| `/v1/models` | List available models (proxied to FLM) |
| `/status` | Proxy status: model ready, starting, PID |
| `/stop` | Manually stop the model and free RAM |

## Usage

```bash
# Install dependencies
npm install

# Run in foreground
node flm-proxy.js

# Install as a Windows service
node flm-service-install.js

# Uninstall Windows service
node flm-service-uninstall.js
```

## Service Logs

When running as a Windows service, logs are written to:

- `~/daemon/flmvisionproxy.out.log`
- `~/daemon/flmvisionproxy.err.log`

## Environment

- **OS:** Windows 11, AMD NPU hardware
- **Runtime:** Node.js
- **FLM binary:** `C:\Users\sshuser\FastFlowLM\flm.exe`
- **Dependencies:** `node-windows` (for service install)

## Available Models

See `FastFlowLM/model_list.json` for the full catalog.
Model identifiers use the format `family:size` (e.g., `qwen3:4b`, `llama3.2:3b`). Vision models have `"vlm": true`; thinking models have `"think": true`.
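
As a rough illustration of how a `family:size` identifier might be used from a client, the sketch below validates an identifier and builds a chat request aimed at the proxy. The names `PROXY_URL`, `parseModelId`, and `buildChatRequest` are hypothetical helpers, not part of this repository. The base URL assumes the default `PROXY_PORT` on the local machine, and the request body follows the usual OpenAI chat completions shape, which FLM is assumed to accept.

```javascript
// Sketch only: these helpers are illustrative and not part of flm-proxy.js.
// Assumes the proxy runs locally on the default PROXY_PORT (8000).
const PROXY_URL = "http://localhost:8000";

// Split an identifier of the form `family:size` into its two parts.
// Throws if the identifier does not contain exactly one colon with text on both sides.
function parseModelId(id) {
  const parts = String(id).split(":");
  if (parts.length !== 2 || parts[0] === "" || parts[1] === "") {
    throw new Error(`Expected "family:size", got "${id}"`);
  }
  return { family: parts[0], size: parts[1] };
}

// Build the URL and fetch() options for an OpenAI-style chat completion call.
function buildChatRequest(modelId, prompt) {
  parseModelId(modelId); // reject malformed identifiers before sending anything
  return {
    url: `${PROXY_URL}/v1/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: modelId,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (needs Node.js 18+ for the global fetch, and a running proxy):
//   const { url, options } = buildChatRequest("qwen3:4b", "Hello");
//   const res = await fetch(url, options);
//   console.log(await res.json());
```

Because the proxy starts the model lazily, the first such request after an idle period should be expected to take the extra cold-start time noted above, so a client may want a generous timeout.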