Add README with project documentation

2026-03-29 22:04:41 -04:00
parent a5dcb56f7d
commit 57ed2f5505

README.md Normal file

@@ -0,0 +1,63 @@
# FLM Proxy
A Node.js HTTP proxy that sits in front of [FastFlowLM](https://github.com/amd/FastFlowLM) to serve LLM inference on an AMD NPU. The proxy starts the model lazily on the first request and stops it automatically after an idle timeout to free RAM.
## How It Works
- External clients hit the proxy on **port 8000**
- On the first request, the proxy spawns `flm.exe`, which serves an OpenAI-compatible API on port 8001
- All subsequent requests are proxied through to FLM
- After 5 minutes of inactivity, the model process is killed to reclaim memory
- The next request will cold-start the model again (~10-15 seconds)
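
The lifecycle above can be sketched as a small state holder with an idle timer. This is a minimal illustration, not the code in `flm-proxy.js`: the class name `IdleLifecycle` and the `startModel` / `stopModel` callbacks are hypothetical stand-ins for spawning and killing `flm.exe`, and waiting for the model to become ready during a cold start is left out.

```javascript
// Sketch of the lazy-start / idle-stop lifecycle.
// ASSUMPTION: startModel and stopModel are hypothetical callbacks; in a real
// proxy they would spawn and kill the flm.exe child process.
class IdleLifecycle {
  constructor({ startModel, stopModel, idleTimeoutMs }) {
    this.startModel = startModel;
    this.stopModel = stopModel;
    this.idleTimeoutMs = idleTimeoutMs;
    this.running = false;
    this.timer = null;
  }

  // Call on every incoming request: start the model if it is not running,
  // then push the idle deadline back by a full timeout.
  onRequest() {
    if (!this.running) {
      this.startModel();
      this.running = true;
    }
    this.resetTimer();
  }

  resetTimer() {
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.stop(), this.idleTimeoutMs);
    // Do not let a pending idle timer alone keep the Node.js process alive.
    this.timer.unref();
  }

  // Stop the model (idle timeout or manual stop) and cancel the timer.
  stop() {
    clearTimeout(this.timer);
    this.timer = null;
    if (this.running) {
      this.stopModel();
      this.running = false;
    }
  }
}
```

Because every request resets the timer, the model is stopped only after a full timeout with no traffic, and the first request after that triggers a new cold start.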
## Configuration
| Setting | Default | Description |
|---------|---------|-------------|
| `MODEL` | `qwen2.5vl-it:3b` | Model to serve (see `FastFlowLM/model_list.json`) |
| `PROXY_PORT` | `8000` | External-facing port |
| `FLM_PORT` | `8001` | Internal FLM server port |
| `IDLE_TIMEOUT_MS` | `300000` (5 min) | Idle time before stopping the model |
| `HOST` | `0.0.0.0` | Listen address |
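
The README does not say how these settings are changed. As one possible approach, the sketch below loads the defaults from the table and lets environment variables of the same name override them. The environment-variable override and the `loadConfig` helper are assumptions for illustration, not confirmed behavior of `flm-proxy.js`.

```javascript
// Defaults copied from the configuration table.
const DEFAULTS = {
  MODEL: "qwen2.5vl-it:3b",
  PROXY_PORT: 8000,
  FLM_PORT: 8001,
  IDLE_TIMEOUT_MS: 300000, // 5 minutes
  HOST: "0.0.0.0",
};

// ASSUMPTION: overrides come from environment variables with the same names.
// loadConfig is a hypothetical helper, not part of the project.
function loadConfig(env = process.env) {
  const config = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS)) {
    const raw = env[key];
    if (raw === undefined || raw === "") continue;
    if (typeof DEFAULTS[key] === "number") {
      // Ports and timeouts must be positive integers.
      const n = Number(raw);
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error(`${key} must be a positive integer, got "${raw}"`);
      }
      config[key] = n;
    } else {
      config[key] = raw;
    }
  }
  return config;
}
```

Validating numeric values up front makes a mistyped port fail at startup rather than on the first proxied request.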
## Endpoints
| Endpoint | Description |
|----------|-------------|
| `/v1/chat/completions` | OpenAI-compatible chat (proxied to FLM) |
| `/v1/models` | List available models (proxied to FLM) |
| `/status` | Proxy status: whether the model is ready or starting, and its PID |
| `/stop` | Manually stop the model and free RAM |
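
As a rough illustration of calling the chat endpoint from a Node.js client, the sketch below builds a request in the standard OpenAI chat format and sends it with the global `fetch` (Node.js 18 or later). The helper names `buildChatRequest` and `ask` are hypothetical, and the response shape (`choices[0].message.content`) is assumed to follow the OpenAI convention because the README describes the API as OpenAI-compatible.

```javascript
// Build the URL and fetch options for a chat completion request.
// buildChatRequest is a hypothetical helper, not part of the project.
function buildChatRequest(baseUrl, model, prompt) {
  return {
    url: new URL("/v1/chat/completions", baseUrl).href,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Send a prompt through the proxy and return the reply text.
// ASSUMPTION: the proxy runs on localhost:8000 with the default model and
// the response follows the OpenAI chat completion shape.
async function ask(prompt) {
  const { url, options } = buildChatRequest(
    "http://localhost:8000",
    "qwen2.5vl-it:3b",
    prompt
  );
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`Proxy returned HTTP ${res.status}`);
  }
  const data = await res.json();
  return data.choices[0].message.content;
}
```

The first call after an idle period will include the cold-start delay, so clients should allow a generous request timeout.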
## Usage
```bash
# Install dependencies
npm install
# Run in foreground
node flm-proxy.js
# Install as a Windows service
node flm-service-install.js
# Uninstall Windows service
node flm-service-uninstall.js
```
## Service Logs
When running as a Windows service, logs are written to:
- `~/daemon/flmvisionproxy.out.log`
- `~/daemon/flmvisionproxy.err.log`
## Environment
- **OS:** Windows 11, AMD NPU hardware
- **Runtime:** Node.js
- **FLM binary:** `C:\Users\sshuser\FastFlowLM\flm.exe`
- **Dependencies:** `node-windows` (for service install)
## Available Models
See `FastFlowLM/model_list.json` for the full catalog. Model identifiers use the format `family:size` (e.g., `qwen3:4b`, `llama3.2:3b`). Vision models are marked with `"vlm": true`, and thinking models with `"think": true`.
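
To check a `MODEL` value before starting the proxy, an identifier can be split into its two parts. The `parseModelId` helper below is a hypothetical sketch based only on the `family:size` format described above; it does not read or validate against `model_list.json`.

```javascript
// Split a model identifier of the form "family:size" into its parts.
// parseModelId is a hypothetical helper, not part of the project.
function parseModelId(id) {
  const idx = id.lastIndexOf(":");
  // Reject identifiers with no colon, an empty family, or an empty size.
  if (idx <= 0 || idx === id.length - 1) {
    throw new Error(`Expected "family:size", got "${id}"`);
  }
  return { family: id.slice(0, idx), size: id.slice(idx + 1) };
}

console.log(parseModelId("qwen3:4b")); // { family: 'qwen3', size: '4b' }
```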