# FLM Proxy

Node.js HTTP proxy that sits in front of [FastFlowLM](https://github.com/amd/FastFlowLM) to serve LLM inference on an AMD NPU. The proxy lazily starts the model on the first request and automatically stops it after an idle timeout to free RAM.

## How It Works

- External clients connect to the proxy on **port 8000**.
- On the first request, the proxy spawns `flm.exe`, which serves an OpenAI-compatible API on port 8001.
- While FLM is running, the proxy forwards incoming API requests to it.
- After 5 minutes of inactivity, the model process is killed to reclaim memory.
- The next request cold-starts the model again (roughly 10 to 15 seconds).
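
The lifecycle above can be sketched as a small wrapper that starts a resource on first use and shuts it down after a period of inactivity. This is a minimal illustration, not the code in `flm-proxy.js`: the `LazyProcess` class and its `touch`/`shutdown` methods are hypothetical names, and the demo uses stand-in functions where the real proxy would spawn and kill `flm.exe`.

```javascript
// Hypothetical sketch: lazy start plus idle shutdown around any start/stop pair.
class LazyProcess {
  constructor({ start, stop, idleTimeoutMs }) {
    this.start = start;               // e.g. a function that spawns the FLM binary
    this.stop = stop;                 // e.g. a function that kills the child process
    this.idleTimeoutMs = idleTimeoutMs;
    this.handle = null;               // null means "not running"
    this.timer = null;
  }

  // Call on every incoming request: start if needed, then restart the idle clock.
  touch() {
    if (this.handle === null) {
      this.handle = this.start();
    }
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.shutdown(), this.idleTimeoutMs);
    this.timer.unref(); // the idle timer alone should not keep Node.js alive
    return this.handle;
  }

  // Stop the resource (if running) and cancel any pending idle timer.
  shutdown() {
    clearTimeout(this.timer);
    this.timer = null;
    if (this.handle !== null) {
      this.stop(this.handle);
      this.handle = null;
    }
  }
}

// Demo with stand-in functions instead of a real child process.
let starts = 0;
let stops = 0;
const model = new LazyProcess({
  start: () => ({ id: ++starts }),
  stop: () => { stops += 1; },
  idleTimeoutMs: 300000,
});
model.touch();    // first request: starts the resource
model.touch();    // second request: reuses it and resets the timer
model.shutdown(); // what the idle timer (or a manual stop) would do
console.log({ starts, stops }); // prints { starts: 1, stops: 1 }
```

A real implementation would also need to wait until FLM is accepting connections before forwarding the first request, which this sketch leaves out.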

## Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| `MODEL` | `qwen2.5vl-it:3b` | Model to serve (see `FastFlowLM/model_list.json`) |
| `PROXY_PORT` | `8000` | External-facing port |
| `FLM_PORT` | `8001` | Internal FLM server port |
| `IDLE_TIMEOUT_MS` | `300000` (5 min) | Idle time before stopping the model |
| `HOST` | `0.0.0.0` | Listen address |
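
For illustration, the defaults in the table can be expressed as a plain object with optional overrides. How `flm-proxy.js` actually reads its settings is not stated here, so the environment-variable overrides and the `loadConfig` helper below are assumptions, not a description of the script.

```javascript
// Defaults copied from the table above.
const DEFAULTS = {
  MODEL: 'qwen2.5vl-it:3b',
  PROXY_PORT: 8000,
  FLM_PORT: 8001,
  IDLE_TIMEOUT_MS: 300000,
  HOST: '0.0.0.0',
};

// Hypothetical helper: overlay same-named environment variables on the defaults.
// Overriding via environment variables is an assumption made for this sketch.
function loadConfig(env = process.env) {
  const config = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS)) {
    const raw = env[key];
    if (raw === undefined || raw === '') continue;
    if (typeof DEFAULTS[key] === 'number') {
      const n = Number(raw);
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error(`${key} must be a positive integer, got "${raw}"`);
      }
      config[key] = n;
    } else {
      config[key] = raw;
    }
  }
  return config;
}

// Example: shorten the idle timeout to one minute, keep everything else.
const config = loadConfig({ IDLE_TIMEOUT_MS: '60000' });
console.log(config.IDLE_TIMEOUT_MS, config.PROXY_PORT); // prints 60000 8000
```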

## Endpoints

| Endpoint | Description |
|----------|-------------|
| `/v1/chat/completions` | OpenAI-compatible chat (proxied to FLM) |
| `/v1/models` | List available models (proxied to FLM) |
| `/status` | Proxy status: whether the model is ready or starting, and its PID |
| `/stop` | Manually stop the model and free RAM |
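
As a rough client-side example, the chat endpoint can be called with the `fetch` API built into Node.js 18 and later. The base URL assumes the proxy runs on the same machine with the default port, the request and response shapes follow the usual OpenAI chat completions format, and the `buildChatRequest` and `ask` helpers are hypothetical names introduced for this sketch.

```javascript
// Assumption: proxy reachable locally on the default PROXY_PORT.
const BASE_URL = 'http://localhost:8000';

// Build the URL and fetch options for a single-turn chat request.
function buildChatRequest(prompt, model = 'qwen2.5vl-it:3b') {
  return {
    url: `${BASE_URL}/v1/chat/completions`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
// The first call after an idle period may take longer while the model cold-starts.
async function ask(prompt) {
  const { url, options } = buildChatRequest(prompt);
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`Proxy returned HTTP ${res.status}`);
  }
  const data = await res.json();
  return data.choices[0].message.content;
}

// Uncomment to try against a running proxy:
// ask('Say hello in one sentence.').then(console.log, console.error);
```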

## Usage

```bash
# Install dependencies
npm install

# Run in foreground
node flm-proxy.js

# Install as a Windows service
node flm-service-install.js

# Uninstall Windows service
node flm-service-uninstall.js
```

## Service Logs

When running as a Windows service, logs are written to:
- `~/daemon/flmvisionproxy.out.log`
- `~/daemon/flmvisionproxy.err.log`

## Environment

- **OS:** Windows 11 on AMD NPU hardware
- **Runtime:** Node.js
- **FLM binary:** `C:\Users\sshuser\FastFlowLM\flm.exe`
- **Dependencies:** `node-windows` (for service install)

## Available Models

See `FastFlowLM/model_list.json` for the full catalog. Model identifiers use the format `family:size` (e.g., `qwen3:4b`, `llama3.2:3b`). Vision models are marked with `"vlm": true`, and thinking models with `"think": true`.
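
To make the `family:size` format concrete, here is a small sketch that splits an identifier into its two parts and rejects anything else. The `parseModelId` helper is hypothetical and not part of FastFlowLM or this proxy.

```javascript
// Hypothetical helper: split a "family:size" model identifier into its parts.
function parseModelId(id) {
  const parts = id.split(':');
  if (parts.length !== 2 || parts[0] === '' || parts[1] === '') {
    throw new Error(`Expected "family:size", got "${id}"`);
  }
  return { family: parts[0], size: parts[1] };
}

console.log(parseModelId('qwen2.5vl-it:3b')); // prints { family: 'qwen2.5vl-it', size: '3b' }
```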