# cLLMHub Documentation

cLLMHub is an OpenAI-compatible inference hub. Subscribe to a plan, get an API key, and call open-weight models through one endpoint. Publishing your own LLM backend (Ollama, vLLM, llama.cpp, MLX, LM Studio) with the cLLMHub CLI is optional.

---

# Getting Started

## Introduction

### What is cLLMHub?

cLLMHub is a hosted gateway for self-hosted LLMs. You run an open-weight model on your own hardware (Ollama, vLLM, llama.cpp, MLX, LM Studio, or any OpenAI-compatible server) and we expose it as a stable, OpenAI-compatible API. You subscribe to a monthly plan; your plan determines daily request quota and how many API keys and models you can register.

### Three things to know

1. **You bring the model.** cLLMHub does not host model weights — you do. Publish your local backend with the CLI to make it callable through the hub.
2. **The API is OpenAI-compatible.** Any tool, library, or app that already works with OpenAI works with cLLMHub by changing the base URL and the API key.
3. **Pricing is a flat monthly subscription, not per-token.** The Free and Pro plans each set a daily request quota and the Max plan is unlimited, so there are no surprise bills.

---

## Quickstart

### 1. Sign up

Sign up at https://cllmhub.com/login with email or Google. New accounts start on the Free tier.

### 2. Create an API key

Go to [API Keys](https://cllmhub.com/api-keys) and create a key. The full key (`sk-...`) is shown once — copy it immediately.

### 3. Make your first call

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cllmhub.com/v1",
    api_key="sk-your-api-key",
)

response = client.chat.completions.create(
    model="llama3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

---

## Subscriptions

| Plan | Price   | API Keys  | Requests/day | Models    |
|------|---------|-----------|--------------|-----------|
| Free | $0/mo   | 3         | 2,000        | 3         |
| Pro  | $3/mo   | 10        | 20,000       | 10        |
| Max  | $20/mo  | Unlimited | Unlimited    | Unlimited |

Plans are billed monthly via PayPal. Cancel anytime — access continues through the end of the current billing period. Upgrade or change plans at [https://cllmhub.com/subscribe](https://cllmhub.com/subscribe).

The `models` limit caps the total number of distinct models you have **published** through the CLI; it does not limit which hub-served models you can call.

---

## API Keys

### Bearer tokens

All API requests authenticate via the `Authorization` header:

```
Authorization: Bearer sk-your-api-key
```
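
For clients that do not go through the OpenAI SDK, the header can be set by hand. The following is a minimal sketch using only Python's standard library; the helper name `build_request` is illustrative and not part of any cLLMHub tooling. It builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://api.cllmhub.com/v1"


def build_request(api_key: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    "sk-your-api-key",
    "/chat/completions",
    {"model": "llama3-70b", "messages": [{"role": "user", "content": "Hello!"}]},
)
# To send it: urllib.request.urlopen(request)
```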

### Key restrictions

When creating a key at [https://cllmhub.com/api-keys](https://cllmhub.com/api-keys), you can optionally set:

- **Model scoping** — limit the key to specific model IDs. Other models return `403 Forbidden`.
- **IP allowlist** (`allowed_ips`) — limit the key to a list of source IPs. Other IPs return `403`.
- **Daily request cap** (`max_requests_per_day`): set a daily ceiling. Exceeding it returns `429 Too Many Requests` until the quota resets at midnight UTC.

### Model scoping example

```bash
# Works: the key is scoped to llama3-70b
curl https://api.cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-abc..." \
  -d '{"model": "llama3-70b", "messages": [...]}'

# Fails with 403: the key is NOT scoped to mixtral-8x7b
curl https://api.cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-abc..." \
  -d '{"model": "mixtral-8x7b", "messages": [...]}'
```

### Security best practices

- Treat API keys like passwords — never commit them to version control
- Use separate keys for different environments (dev, staging, prod)
- Rotate keys if you suspect a leak by deleting the old key and creating a new one

---

# Management API

## Management keys (`cmk-...`)

Management keys are separate from inference keys. They authorize the `/v1/management/*` REST surface for programmatic account management: creating and revoking inference keys, listing models, disabling your own published models, and pulling request logs. Management keys **cannot** call `/v1/chat/completions` or any other inference endpoint.

Mint one from the avatar menu (top-right) → **Management API Keys**. The plaintext key is shown exactly once.

### REST endpoints

All endpoints take `Authorization: Bearer cmk-...`.

| Method   | Path                                    | Purpose                          |
| -------- | --------------------------------------- | -------------------------------- |
| `GET`    | `/v1/management/api-keys`               | List inference keys              |
| `POST`   | `/v1/management/api-keys`               | Create inference key             |
| `GET`    | `/v1/management/api-keys/{key_or_hash}` | Inspect a single key             |
| `PATCH`  | `/v1/management/api-keys/{key_or_hash}` | Update rate limit / IP allowlist |
| `DELETE` | `/v1/management/api-keys/{key_or_hash}` | Revoke                           |
| `GET`    | `/v1/management/models`                 | Full model catalog               |
| `GET`    | `/v1/management/my-models`              | Your published models            |
| `PATCH`  | `/v1/management/models/{model}`         | Disable/enable a model           |
| `GET`    | `/v1/management/logs`                   | Request logs                     |

### Example: create an inference key

```bash
curl https://api.cllmhub.com/v1/management/api-keys \
  -H "Authorization: Bearer cmk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ci-deploy",
    "models": ["llama3-70b"],
    "max_requests_per_day": 5000,
    "allowed_ips": ["203.0.113.5"]
  }'
```

The response includes the plaintext `sk-...` key once — store it immediately.
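
When keys are created from a script, it can help to assemble and sanity-check the body before sending it. The sketch below is one way to do that in Python; the function name `inference_key_payload` is hypothetical, and two behaviors are assumptions rather than documented facts: that optional restrictions may simply be omitted from the body, and that `allowed_ips` entries are single addresses (CIDR ranges are not handled here).

```python
import ipaddress
import json


def inference_key_payload(name, models=None, max_requests_per_day=None, allowed_ips=None):
    """Assemble the JSON body for POST /v1/management/api-keys.

    Restrictions that are not supplied are left out of the body (assumed
    to be acceptable to the API).
    """
    body = {"name": name}
    if models:
        body["models"] = list(models)
    if max_requests_per_day is not None:
        if max_requests_per_day <= 0:
            raise ValueError("max_requests_per_day must be positive")
        body["max_requests_per_day"] = max_requests_per_day
    if allowed_ips:
        # ipaddress.ip_address raises ValueError on malformed input.
        body["allowed_ips"] = [str(ipaddress.ip_address(ip)) for ip in allowed_ips]
    return body


payload = inference_key_payload("ci-deploy", ["llama3-70b"], 5000, ["203.0.113.5"])
print(json.dumps(payload, indent=2))
```

The resulting dictionary matches the body of the curl example above and can be sent with any HTTP client using the `cmk-...` bearer token.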

---

# API Reference

## Base URL

```
https://api.cllmhub.com/v1
```

## GET /v1/models

List every model currently available.

```json
{
  "object": "list",
  "data": [
    {
      "id": "llama3-70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "cllmhub"
    }
  ]
}
```
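
As an illustration of consuming this response, the sketch below pulls the model IDs out of a listing that has already been decoded from JSON. The helper name `model_ids` is illustrative; the sample data is the response shown above.

```python
import json

SAMPLE_LISTING = """
{
  "object": "list",
  "data": [
    {"id": "llama3-70b", "object": "model", "created": 1700000000, "owned_by": "cllmhub"}
  ]
}
"""


def model_ids(listing: dict) -> list[str]:
    """Return the ID of every entry in a /v1/models response."""
    return [entry["id"] for entry in listing.get("data", [])]


ids = model_ids(json.loads(SAMPLE_LISTING))
print(ids)
```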

## GET /v1/models/{id}

Get details for a specific model.

## POST /v1/completions

Generate a text completion from a prompt.

**Request**

```json
{
  "model": "llama3-70b",
  "prompt": "Once upon a time",
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```

**Response**

```json
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "model": "llama3-70b",
  "choices": [{
    "text": " there was a curious fox...",
    "index": 0,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 12,
    "total_tokens": 16
  }
}
```

## POST /v1/chat/completions

Generate a chat completion from a list of messages. Messages support text and vision content — see Vision & Multimodal for image inputs.

**Request**

```json
{
  "model": "llama3-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 256,
  "stream": false
}
```

**Response**

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "llama3-70b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 9,
    "total_tokens": 29
  }
}
```

## Request parameters

| Parameter     | Type             | Description                                                               |
|---------------|------------------|---------------------------------------------------------------------------|
| `model`        | string, required | Model ID                                                                  |
| `messages`     | array            | Required for chat. Role + content; content can be string or content parts  |
| `prompt`       | string           | Required for /v1/completions                                              |
| `max_tokens`   | integer          | Max tokens to generate (default 512)                                      |
| `temperature`  | float            | Sampling temperature 0-2 (default 0.7)                                    |
| `top_p`        | float            | Nucleus sampling threshold (default 1.0)                                  |
| `stream`       | boolean          | Enable SSE streaming (default false)                                      |
| `stop`         | string or array  | Stop sequence(s)                                                          |
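
The table can be turned into a quick client-side check that catches some `400 invalid_request` cases before a request is sent. This is a sketch only: the function name `validate_chat_request` is hypothetical, it covers just the constraints stated in the table, and the hub's own validation may be stricter or more lenient.

```python
def validate_chat_request(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the body looks valid."""
    problems = []

    model = body.get("model")
    if not isinstance(model, str) or not model:
        problems.append("model: required string")

    if not isinstance(body.get("messages"), list):
        problems.append("messages: required array for chat requests")

    temperature = body.get("temperature", 0.7)  # documented default
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        problems.append("temperature: must be a number")
    elif not 0 <= temperature <= 2:
        problems.append("temperature: must be between 0 and 2")

    max_tokens = body.get("max_tokens", 512)  # documented default
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        problems.append("max_tokens: must be an integer")

    stop = body.get("stop")
    if stop is not None and not isinstance(stop, (str, list)):
        problems.append("stop: must be a string or an array")

    return problems


ok = {"model": "llama3-70b", "messages": [{"role": "user", "content": "Hello!"}]}
bad = {"messages": "Hello!", "temperature": 3}
print(validate_chat_request(ok))
print(validate_chat_request(bad))
```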

---

## Streaming

### How it works

Set `"stream": true` in any completion or chat completion request. The response is delivered as Server-Sent Events (SSE) — each token arrives as a JSON chunk prefixed with `data: `. The stream ends with a `data: [DONE]` sentinel.

### Chat streaming example

```bash
curl https://api.cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "llama3-70b",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```

### SSE response format

```
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":"stop"}]}

data: [DONE]
```

### Mid-stream errors

If an error occurs after the response stream has begun, it is sent as a `data: [ERROR] ...` event instead of a regular chunk. Your client should handle this gracefully and may retry the request.
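
A client that reads the stream without an SDK needs to handle three cases: regular chunks, the `[DONE]` sentinel, and a mid-stream `[ERROR]` event. The following is a minimal sketch of such a parser, operating on lines that have already been read from the response. The names `iter_content` and `StreamError` are illustrative, and because the text that follows `[ERROR]` is not specified here, the sketch treats it as an opaque message.

```python
import json
from typing import Iterable, Iterator


class StreamError(RuntimeError):
    """Raised when the stream carries a `data: [ERROR] ...` event."""


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from SSE lines until `data: [DONE]` is seen."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        if data.startswith("[ERROR]"):
            raise StreamError(data[len("[ERROR]"):].strip())
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
story = "".join(iter_content(sample))
print(story)
```

Raising an exception on `[ERROR]` lets the caller decide whether to keep the partial text already received or discard it and retry.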

---

## Vision & Multimodal

### Image inputs

Vision-capable models accept image input through the same `/v1/chat/completions` endpoint. Pass an array of content parts in the message instead of a plain string. Each part is either a text part or an `image_url` part.

### Example: image + text

```json
{
  "model": "llava-v1.6",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {
          "type": "image_url",
          "image_url": {"url": "data:image/png;base64,..."}
        }
      ]
    }
  ],
  "max_tokens": 256
}
```

### Supported formats

PNG, JPEG, GIF, and WebP. Images can be passed as a data URI (base64-encoded inline) or as a public URL. Oversized images may be downscaled automatically before being passed to the model.
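
To pass a local image inline, it has to be base64-encoded into a data URI. The sketch below builds an `image_url` content part from raw bytes; the helper name `image_part` and the extension-to-MIME mapping are illustrative, limited to the four formats listed above.

```python
import base64

MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_part(image_bytes: bytes, extension: str) -> dict:
    """Wrap raw image bytes as an `image_url` content part with a data URI."""
    mime = MIME_TYPES.get(extension.lower())
    if mime is None:
        raise ValueError(f"unsupported image format: {extension}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


# Placeholder bytes stand in for a real file read with open(path, "rb").
part = image_part(b"\x89PNG\r\n\x1a\n", ".png")
message = {
    "role": "user",
    "content": [{"type": "text", "text": "What is in this image?"}, part],
}
```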

---

## SDK Examples

### Drop-in replacement

cLLMHub exposes an OpenAI-compatible API. Any tool, library, or framework that works with OpenAI can be pointed at cLLMHub by changing the base URL and the API key. No code changes are needed beyond that configuration.

### Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cllmhub.com/v1",
    api_key="sk-your-api-key",
)

response = client.chat.completions.create(
    model="llama3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3-70b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no choices or no text delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Node.js

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.cllmhub.com/v1",
  apiKey: "sk-your-api-key",
});

const response = await client.chat.completions.create({
  model: "llama3-70b",
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 128,
});
console.log(response.choices[0].message.content);
```

### curl

```bash
curl https://api.cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "llama3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```

---

## Errors & Rate Limits

### Error format

```json
{
  "error": {
    "message": "Description of what went wrong",
    "type": "invalid_request_error",
    "code": "invalid_request"
  }
}
```

### HTTP status codes

| Status | Code              | Description                                                       |
|--------|-------------------|-------------------------------------------------------------------|
| 400    | invalid_request   | Malformed JSON or missing required fields                         |
| 401    | unauthorized      | Missing or invalid API key                                        |
| 402    | payment_required  | Subscription required; upgrade at /subscribe                      |
| 403    | forbidden         | API key not scoped to the requested model, or model is disabled   |
| 429    | rate_limit        | Daily quota exceeded (plan-wide or per-key cap)                   |
| 502    | provider_error    | The upstream backend returned an error                            |
| 503    | providers_busy    | All backends for this model are busy, retry later                 |

### Retry guidance

- **429** errors persist until the quota resets at midnight UTC, unless you raise the per-key cap or upgrade your plan. Retrying before then will fail again.
- **503** errors are transient and safe to retry with exponential backoff.
- **502** errors usually mean the backend is having trouble; retrying may route to a different backend.
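
The guidance above can be sketched as a small retry policy. The function names are illustrative, and the 1-second base delay and 30-second cap are assumed values rather than recommendations from cLLMHub. In practice some random jitter is usually added to each delay; it is left out here to keep the arithmetic easy to follow.

```python
from datetime import datetime, timedelta, timezone

# 502 and 503 are worth retrying; 429 is not, because the quota only
# resets at midnight UTC.
RETRYABLE_STATUSES = {502, 503}


def should_retry(status: int) -> bool:
    """Return True for statuses the retry guidance marks as retryable."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, limited to `cap`."""
    return min(cap, base * (2 ** attempt))


def seconds_until_quota_reset(now: datetime) -> float:
    """Seconds from a timezone-aware `now` until the next midnight UTC."""
    now_utc = now.astimezone(timezone.utc)
    next_midnight = (now_utc + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now_utc).total_seconds()


print([backoff_delay(n) for n in range(6)])
print(seconds_until_quota_reset(datetime.now(timezone.utc)))
```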

---

# Publishing Models (optional)

> Register your own LLM backend with the cLLMHub CLI. Skip this section if you only want to consume hub-served models.

## What publishing does

If you run an LLM backend yourself (Ollama, vLLM, llama.cpp, MLX, LM Studio, or any OpenAI-compatible server), the cLLMHub CLI can register it with the hub. Requests to that model name then route through the gateway to your backend over a persistent connection — your backend never needs an inbound port.

### When you would use this

- You have a fine-tuned model you want to call from anywhere without setting up your own API stack.
- You have spare GPU on a workstation and want to use it from a deployed app.
- You want to test a backend behind your firewall through a public OpenAI-compatible endpoint.

## Plan limits

Free supports up to 3 published models per account, Pro 10, Max unlimited. Each unique model name counts as one.

---

## Install the CLI

### macOS / Linux

```bash
curl -fsSL https://raw.githubusercontent.com/cllmhub/cllmhub-cli/main/install.sh | sh
```

### Homebrew

```bash
brew tap cllmhub/tap
brew install cllmhub
```

### npm

```bash
npm install -g cllmhub
# or run without installing
npx cllmhub --help
```

### Windows

```powershell
Invoke-WebRequest -Uri https://github.com/cllmhub/cllmhub-cli/releases/latest/download/cllmhub-windows-amd64.exe -OutFile cllmhub.exe
```

### Build from source

The CLI is open source under the Apache 2.0 license. Requires Go 1.22+:

```bash
git clone https://github.com/cllmhub/cllmhub-cli.git
cd cllmhub-cli
make build
```

---

## Log in

```bash
cllmhub login    # OAuth device flow — opens a browser to authenticate
cllmhub whoami   # Confirm which account is currently authenticated
cllmhub logout   # Revoke credentials on the server
```

---

## Publish your first model

### How publishing works

When you run `cllmhub publish`, the CLI discovers models from your local backends (Ollama, vLLM, llama.cpp, MLX, LM Studio). Pick one to publish, and the CLI registers it with cLLMHub, opens a persistent connection, and starts sending heartbeats every 30 seconds. As long as the CLI is running, your model is "Active" and can receive requests.

### Quickest path: Ollama

```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3

# Authenticate the CLI
cllmhub login

# Publish to cLLMHub
cllmhub publish
```

### Direct publish with flags

```bash
cllmhub publish -m llama3-70b -b ollama
cllmhub publish -m mixtral-8x7b -b vllm
cllmhub publish -m my-model -b mlx --api-key sk-xxx
```

### Model lifecycle

- **Active** — The CLI is running and sending heartbeats; the model is reachable
- **Inactive** — The CLI stopped or lost connection. Models are removed after roughly 120 seconds of missed heartbeats
- **Disabled** — You manually toggled the model off from [https://cllmhub.com/my-models](https://cllmhub.com/my-models). Disabled models reject all requests with 403 until you re-enable them

---

## Supported backends

- **Ollama** — default, expected at http://localhost:11434
- **vLLM** — expected at http://localhost:8000
- **llama.cpp** — expected at http://localhost:8080
- **MLX (Apple Silicon)** — optimized via mlx-lm, expected at http://localhost:8080
- **LM Studio** — expected at http://localhost:1234
- **Custom OpenAI-compatible server** — pass URL with `--backend-url` and key with `--api-key`

---

## Troubleshooting

### "My model is not showing up"

Check that the CLI is still running and that `cllmhub status` shows the model as published. Models are auto-removed after ~120 seconds without a heartbeat.

### "Requests are failing with 502"

The gateway reached your CLI but your backend (Ollama, vLLM, etc.) returned an error. Common causes: the backend crashed, ran out of GPU memory, or timed out on a long generation.

### "The CLI keeps disconnecting"

Update to the latest CLI (`cllmhub update`) — recent versions have improved auto-reconnect. Check your network for intermittent drops.

### "How do I migrate to a different machine?"

Run `cllmhub unpublish` on the old machine, install the CLI on the new machine, run `cllmhub login` with the same account, then `cllmhub publish`. The model name stays the same, so existing API keys keep working without changes.
