diff --git a/README.md b/README.md
new file mode 100644
index 0000000..b9cb9d2
--- /dev/null
+++ b/README.md
@@ -0,0 +1,63 @@
+# FLM Proxy
+
+Node.js HTTP proxy that sits in front of [FastFlowLM](https://github.com/amd/FastFlowLM) to serve LLM inference on an AMD NPU. The proxy lazily starts the model on the first request and automatically stops it after an idle timeout to free RAM.
+
+## How It Works
+
+- External clients hit the proxy on **port 8000**
+- On the first request, the proxy spawns `flm.exe` which serves an OpenAI-compatible API on port 8001
+- All subsequent requests are proxied through to FLM
+- After 5 minutes of inactivity, the model process is killed to reclaim memory
+- The next request will cold-start the model again (~10-15 seconds)
+
+## Configuration
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `MODEL` | `qwen2.5vl-it:3b` | Model to serve (see `FastFlowLM/model_list.json`) |
+| `PROXY_PORT` | `8000` | External-facing port |
+| `FLM_PORT` | `8001` | Internal FLM server port |
+| `IDLE_TIMEOUT_MS` | `300000` (5 min) | Idle time before stopping the model |
+| `HOST` | `0.0.0.0` | Listen address |
+
+## Endpoints
+
+| Endpoint | Description |
+|----------|-------------|
+| `/v1/chat/completions` | OpenAI-compatible chat (proxied to FLM) |
+| `/v1/models` | List available models (proxied to FLM) |
+| `/status` | Proxy status: model ready, starting, PID |
+| `/stop` | Manually stop the model and free RAM |
+
+## Usage
+
+```bash
+# Install dependencies
+npm install
+
+# Run in foreground
+node flm-proxy.js
+
+# Install as a Windows service
+node flm-service-install.js
+
+# Uninstall Windows service
+node flm-service-uninstall.js
+```
+
+## Service Logs
+
+When running as a Windows service, logs are written to:
+- `~/daemon/flmvisionproxy.out.log`
+- `~/daemon/flmvisionproxy.err.log`
+
+## Environment
+
+- **OS:** Windows 11, AMD NPU hardware
+- **Runtime:** Node.js
+- **FLM binary:** `C:\Users\sshuser\FastFlowLM\flm.exe`
+- **Dependencies:** `node-windows` (for service install)
+
+## Available Models
+
+See `FastFlowLM/model_list.json` for the full catalog. Model identifiers use the format `family:size` (e.g., `qwen3:4b`, `llama3.2:3b`). Vision models have `"vlm": true`, thinking models have `"think": true`.
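
The lazy-start and idle-stop behavior described under "How It Works" in the README could be sketched roughly as follows. This is a minimal illustration, not the code in `flm-proxy.js`: the names `createLifecycle`, `ensureStarted`, `startFn`, and `stopFn` are hypothetical, and the real proxy would presumably spawn `flm.exe` inside `startFn` and kill it inside `stopFn`. Only the 300000 ms default comes from the Configuration table.

```javascript
// Sketch of lazy start plus idle shutdown.
// All names here are hypothetical; they are not taken from flm-proxy.js.
function createLifecycle({ startFn, stopFn, idleTimeoutMs = 300000 }) {
  let running = false;
  let starting = null; // pending start promise, shared by concurrent requests
  let idleTimer = null;

  function resetIdleTimer() {
    clearTimeout(idleTimer);
    idleTimer = setTimeout(stop, idleTimeoutMs);
    // Do not keep the Node.js event loop alive just for the idle timer.
    if (typeof idleTimer.unref === "function") idleTimer.unref();
  }

  async function ensureStarted() {
    if (!running) {
      // Cold start: only the first caller triggers startFn,
      // later callers wait on the same promise.
      if (!starting) {
        starting = Promise.resolve()
          .then(startFn)
          .then(() => { running = true; })
          .finally(() => { starting = null; });
      }
      await starting;
    }
    // Every request pushes the idle deadline back.
    resetIdleTimer();
  }

  function stop() {
    clearTimeout(idleTimer);
    idleTimer = null;
    if (running) {
      running = false;
      stopFn();
    }
  }

  function status() {
    return { ready: running, starting: starting !== null };
  }

  return { ensureStarted, stop, status };
}
```

In a proxy built this way, each request handler would `await ensureStarted()` before forwarding to the FLM port, a `/stop` handler would call `stop()`, and a `/status` handler would return `status()`. Sharing one pending start promise keeps a burst of requests during the cold start from spawning the model more than once.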