# cLLMHub Documentation

## What is cLLMHub?

cLLMHub turns any local LLM into a production-ready, OpenAI-compatible API. If you have a GPU (or even a CPU) running a model through Ollama, vLLM, llama.cpp, or MLX, cLLMHub lets you publish that model to a shared hub so anyone with an API key can use it — from anywhere. You can also download GGUF models directly from Hugging Face and run them with the built-in daemon.

### Why cLLMHub?

Most hosted inference platforms charge per token and lock you into their infrastructure. cLLMHub is different: you own the hardware, you own the model, and you decide who gets access. There is no token-based billing — pricing is a flat monthly fee based on how many models and keys you need, not how many tokens you produce.

### Who is it for?

- Developers who want to self-host models and expose them as APIs
- Teams that need a private inference gateway without sending data to third parties
- Hobbyists who want to share a model with friends
- Anyone who wants OpenAI SDK compatibility without OpenAI

### How it works

1. You run the `cllmhub` CLI on your machine — either with a downloaded GGUF model (via the built-in daemon) or alongside an external backend (Ollama, vLLM, MLX, etc.)
2. The CLI registers your model with the hub and keeps a live connection
3. Consumers create API keys scoped to your model and call the standard `/v1/chat/completions` endpoint
4. The hub routes the request to your machine, your backend generates the response, and it streams back to the consumer

---

## Quick Start

### 1. Create an account

Sign up at https://cllmhub.com/login with your email or Google account. No credit card required for the free tier.

### 2. Install the CLI

Install with npm, Homebrew, or the shell script, or download a pre-built binary from Settings:

```bash
# npm
npm install -g cllmhub

# Homebrew
brew install cllmhub/tap/cllmhub

# Shell script (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/cllmhub/cllmhub-cli/main/install.sh | sh
```

### 3. Log in

Authenticate the CLI with your cLLMHub account:

```bash
cllmhub login
```

This opens a browser where you approve the device.

### 4. Publish your model

Run `cllmhub publish` to open the interactive publish flow. It discovers downloaded models and configured external engines, letting you choose what to publish.

```bash
cllmhub publish
```

### 5. Create an API key

Go to Settings > API Keys, select the model you just published, and click "Create Key". Copy the key — it is only shown once.

### 6. Make your first request

```bash
curl https://cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "llama3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```

---

## CLI Reference

The CLI is open source under Apache 2.0: [github.com/cllmhub/cllmhub-cli](https://github.com/cllmhub/cllmhub-cli)

Install with npm, Homebrew, or the shell script, build from source, or download a pre-built binary from Settings:

```bash
# npm (installs the correct binary for your platform)
npm install -g cllmhub

# Or run without installing
npx cllmhub --help

# Homebrew
brew tap cllmhub/tap
brew install cllmhub

# Shell script (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/cllmhub/cllmhub-cli/main/install.sh | sh

# Build from source (requires Go 1.22+)
git clone https://github.com/cllmhub/cllmhub-cli.git
cd cllmhub-cli
make build
```

Available for macOS (Apple Silicon & Intel), Linux (x86_64 & ARM64), and Windows (x86_64).

### cllmhub login

Authenticate with your cLLMHub account using the OAuth device flow. Opens a browser to approve the device. After login, the CLI discovers models from local backends and lets you select one to publish immediately.

```bash
cllmhub login
```

### cllmhub whoami

Show the currently logged-in user.

```bash
cllmhub whoami
```

### cllmhub models

List downloaded models, or search Hugging Face for GGUF models. Search results include only text-generation models.

```bash
cllmhub models                    # List downloaded models
cllmhub models --search mistral   # Search Hugging Face
cllmhub models --search "llama 7b"
```

Flags:
- `--search, -s` — Search Hugging Face for GGUF models

### cllmhub download

Download GGUF model files from Hugging Face repositories. Lists available GGUF files and lets you pick which quantization to download. For faster downloads and access to gated models, pass a Hugging Face token with `--hf-token` (it will be saved for future use).

```bash
cllmhub download TheBloke/Mistral-7B-v0.1-GGUF
cllmhub download --hf-token <token> TheBloke/Mistral-7B-v0.1-GGUF
cllmhub download TheBloke/Mistral-7B-v0.1-GGUF TheBloke/Llama-2-7B-GGUF
```

Flags:
- `--hf-token` — Hugging Face token (saved for future use)

### cllmhub delete

Delete one or more downloaded models. The command refuses to delete models that are currently published, and it reports how much disk space was freed.

```bash
cllmhub delete mistral-7b
cllmhub delete m1 m2   # Use aliases
```

### cllmhub start

Start the cLLMHub daemon. Runs a local llama-server instance for downloaded GGUF models. Engine settings are auto-detected based on your hardware (Apple Silicon, NVIDIA GPU, or CPU) but can be overridden.

```bash
cllmhub start                                          # Auto-detect everything
cllmhub start --ctx-size 8192 --flash-attn --slots 2   # Custom settings
cllmhub start --n-gpu-layers 0 --ctx-size 2048          # CPU-only
```

Flags:
- `--ctx-size` — Context size for inference (0 = auto-detect)
- `--flash-attn` — Enable flash attention (auto-enabled on Apple Silicon/NVIDIA)
- `--slots` — Number of concurrent inference slots (0 = auto-detect)
- `--n-gpu-layers` — Number of layers to offload to GPU (-1 = auto, 0 = CPU only)
- `--batch-size` — Batch size for prompt processing (0 = auto-detect)

### cllmhub stop / status / logs

Stop the running daemon, show daemon status (PID, uptime, published models), or view daemon logs.

```bash
cllmhub stop
cllmhub status
cllmhub logs            # Show recent logs
cllmhub logs -f         # Follow log output
cllmhub logs -n 100     # Show last 100 lines
```

### cllmhub publish

Publish models to the cLLMHub network. All publishing goes through the background daemon. Use flags to specify a model and backend directly, or run without flags for interactive selection from detected backends.

```bash
# Direct publish
cllmhub publish -m llama3-70b -b ollama
cllmhub publish -m mixtral-8x7b -b vllm
cllmhub publish -m my-model -b mlx --api-key sk-xxx

# Interactive selection
cllmhub publish
```

Features: auto-reconnect on WebSocket disconnect (up to 5 retries), model server health monitoring, heartbeat to keep your provider registered, and concurrency control.

Flags:
- `--model, -m` — Model name to publish
- `--backend, -b` — Backend type: ollama | vllm | lmstudio | llamacpp | mlx (default: ollama)
- `--backend-url` — Backend endpoint URL (overrides default for the backend type)
- `--api-key` — API key for the backend server
- `--description, -d` — Model description
- `--max-concurrent, -c` — Maximum concurrent requests (0 = auto-detect, default: 0)

### cllmhub unpublish

Stop serving one or more published models. The models remain downloaded locally.

```bash
cllmhub unpublish mistral-7b
cllmhub unpublish m1 m2
```

### cllmhub logout

Revoke credentials on the server and remove the local credentials file.

```bash
cllmhub logout
```

### cllmhub update

Update the CLI to the latest version. The CLI also checks for updates automatically after each command.

```bash
cllmhub update
```

### Supported backends

| Backend    | Default endpoint          | Use case                              |
|------------|---------------------------|---------------------------------------|
| ollama     | http://localhost:11434    | Most common, simple setup             |
| vllm       | http://localhost:8000     | High throughput, GPU optimized        |
| lmstudio   | http://localhost:1234     | Desktop app for local LLMs            |
| llamacpp   | http://localhost:8080     | CPU-friendly, quantized models        |
| mlx        | http://localhost:8080     | Apple Silicon optimized via mlx-lm    |
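
The table above can be read as a simple lookup. The following is a minimal sketch, not the CLI's actual code: `resolve_backend_url` is a hypothetical helper that mirrors how `--backend-url` is described as overriding the default endpoint for a backend type.

```python
from typing import Optional

# Default endpoints copied from the table above.
DEFAULT_ENDPOINTS = {
    "ollama": "http://localhost:11434",
    "vllm": "http://localhost:8000",
    "lmstudio": "http://localhost:1234",
    "llamacpp": "http://localhost:8080",
    "mlx": "http://localhost:8080",
}


def resolve_backend_url(backend: str = "ollama", backend_url: Optional[str] = None) -> str:
    """Return the explicit URL if one is given, else the default for the backend type."""
    if backend not in DEFAULT_ENDPOINTS:
        raise ValueError(f"unknown backend: {backend}")
    return backend_url or DEFAULT_ENDPOINTS[backend]
```

Note that `llamacpp` and `mlx` share port 8080 by default, so running both on one machine means moving one of them to another port and passing that address with `--backend-url`.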

---

## Publishing Models

### How publishing works

When you run `cllmhub publish`, the CLI opens an interactive flow that discovers your available models — both downloaded GGUF models (internal engine) and models from external engines like Ollama or vLLM. Select a model to publish and the CLI registers it with the hub, starts sending heartbeats every 30 seconds, and keeps your model online. As long as the CLI is running, your model appears as "Active" and can receive requests.

### Authentication

Run `cllmhub login` before publishing. The CLI uses an OAuth device flow — it opens a browser where you sign in with your cLLMHub account and approve the device. Your session persists until you log out or revoke it.

### Model lifecycle

- **Active** — CLI is running and sending heartbeats
- **Inactive** — CLI stopped or lost connection (removed after ~120 seconds of missed heartbeats)
- **Disabled** — You manually toggled the model off from the My Models page; all requests return 403
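
The three states can be sketched as a function of the last heartbeat time and the manual toggle. This is an illustration only: the hub's real logic is not documented here, and `model_status` is a hypothetical name. It assumes that "Disabled" takes precedence and that a model counts as "Active" until roughly 120 seconds pass without a heartbeat.

```python
import time
from typing import Optional

HEARTBEAT_INTERVAL_S = 30  # the CLI sends a heartbeat every 30 seconds
MISSED_TIMEOUT_S = 120     # approximate window before the model is dropped


def model_status(last_heartbeat: float, disabled: bool, now: Optional[float] = None) -> str:
    """Classify a model as 'disabled', 'active', or 'inactive'.

    Assumption: the manual toggle wins over heartbeat state.
    """
    if disabled:
        return "disabled"
    current = time.time() if now is None else now
    if current - last_heartbeat < MISSED_TIMEOUT_S:
        return "active"
    return "inactive"
```

With a 30-second interval and a 120-second window, a model under these assumptions would tolerate about three missed heartbeats before it stops receiving requests.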

### Example: Ollama

```bash
# Start Ollama
ollama serve

# Pull a model
ollama pull llama3

# Publish to the hub
cllmhub publish
```

### Example: vLLM

```bash
# Start vLLM server
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-70b

# Publish to the hub
cllmhub publish
```

---

## API Keys & Access Control

### Creating keys

Go to Settings > API Keys. Select one or more models, optionally give the key a name, and click Create. The full key (`sk-...`) is shown once — copy it immediately.

### Model scoping

Each API key is scoped to specific models. If a consumer sends a request to a model not in the key's scope, the hub returns `403 Forbidden`. This lets you hand out narrowly-scoped keys to different users or services.
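
As a rough sketch of the scoping rule, the check reduces to set membership. The function name `scope_status` is hypothetical and the hub's implementation may differ; only the `403` outcome for out-of-scope models comes from the behavior described above.

```python
from typing import Iterable


def scope_status(key_models: Iterable[str], requested_model: str) -> int:
    """Return 200 if the model is in the key's scope, else 403 Forbidden."""
    return 200 if requested_model in set(key_models) else 403
```

For example, a key scoped to `["llama3-70b"]` yields 200 for `llama3-70b` and 403 for `mixtral-8x7b`.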

### Example

```bash
# Works — key is scoped to llama3-70b
curl https://cllmhub.com/v1/chat/completions \
  -H "Authorization: Bearer sk-abc..." \
  -d '{"model": "llama3-70b", "messages": [...]}'

# Fails with 403 — key is NOT scoped to mixtral
curl https://cllmhub.com/v1/chat/completions \
  -H "Authorization: Bearer sk-abc..." \
  -d '{"model": "mixtral-8x7b", "messages": [...]}'
```

### Key limits

- Free tier: up to 3 API keys
- Pro tier: unlimited
- Max tier: unlimited

### Security best practices

- Treat API keys like passwords — never commit them to version control
- Use separate keys for different environments (dev, staging, prod)
- Rotate keys if you suspect a leak by deleting the old key and creating a new one

---

## Hives

Hives are collaborative model pools that let multiple users share their API keys and models through a single unified endpoint.

### What are Hives?

A hive owner creates the hive and invites members, and members contribute their personal API keys, which makes their models available to the group. The owner then creates hive keys for applications to use, and the hub automatically distributes requests across all contributing members in round-robin order.

### Creating a hive

Go to the Hives page and click "Create Hive". Give it a name and select one of your personal API keys to contribute. You become the owner and first member.

### Inviting members

As the hive owner, invite members by their email address. They will see a pending invitation on their Hives page and can accept to join.

### Contributing keys

Once a member accepts an invitation, they can contribute one of their personal API keys to the hive. The models on that key become available in the hive pool. Members can change their contributed key at any time.

### Hive keys

The hive owner creates hive keys — API keys that route requests through the pooled member keys. When creating a hive key, the owner selects which models (from the pool) the key can access. Hive keys work exactly like regular API keys in API calls.

### How routing works

When a request comes in with a hive key, the hub finds all members whose contributed key supports the requested model, checks per-member rate limits, and selects one using round-robin. The request is then routed through that member's provider.
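
The routing steps can be sketched as follows. This is a simplified model under stated assumptions, not the hub's implementation: `Member` and `HiveRouter` are hypothetical names, the per-member rate limit is modeled as the optional daily request limit, and the round-robin cursor is kept per model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Member:
    name: str
    models: Set[str]                   # models on the member's contributed key
    daily_limit: Optional[int] = None  # None means no limit (assumption)
    used_today: int = 0

    def has_capacity(self) -> bool:
        return self.daily_limit is None or self.used_today < self.daily_limit


class HiveRouter:
    def __init__(self, members: List[Member]) -> None:
        self.members = members
        self._cursor: Dict[str, int] = {}  # next round-robin position per model

    def pick(self, model: str) -> Optional[Member]:
        """Choose the next eligible member for a model, or None if nobody can serve it."""
        candidates = [m for m in self.members if model in m.models and m.has_capacity()]
        if not candidates:
            return None
        index = self._cursor.get(model, 0) % len(candidates)
        self._cursor[model] = index + 1
        chosen = candidates[index]
        chosen.used_today += 1
        return chosen
```

With two members who both serve the same model, successive calls to `pick` alternate between them; a member whose limit is exhausted drops out of the candidate list until the limit resets.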

### Key behavior

When a key is contributed to a hive, only the hive key's restrictions apply (model scope, rate limits, IP allowlists). The individual key's personal restrictions are bypassed. All requests are logged under the hive, not individual contributors.

### Member management

- Owners can remove members at any time
- Members can leave a hive voluntarily
- Each member can have an optional daily request limit on their contribution
- Owners see all hive request logs; members see only their own contributed key's logs

### Hive key limits

- Free tier: up to 3 hive keys per hive
- Pro tier: unlimited
- Max tier: unlimited

---

## API Reference

### Base URL

```
https://cllmhub.com
```

### Authentication

All API requests require a Bearer token:

```
Authorization: Bearer sk-your-api-key
```

### GET /v1/models

List all models available on the hub.

**Response:**

```json
{
  "object": "list",
  "data": [
    {
      "id": "llama3-70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "provider-name"
    }
  ]
}
```
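
A minimal sketch of calling this endpoint from Python with only the standard library. The helper names `parse_model_ids` and `list_models` are illustrative, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from typing import List

BASE_URL = "https://cllmhub.com"


def parse_model_ids(body: str) -> List[str]:
    """Extract the model IDs from a /v1/models response body."""
    payload = json.loads(body)
    return [item["id"] for item in payload.get("data", [])]


def list_models(api_key: str) -> List[str]:
    """Fetch /v1/models with a Bearer token and return the model IDs."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_model_ids(response.read().decode("utf-8"))
```

Calling `list_models("sk-your-api-key")` performs the network request; `parse_model_ids` can be used on its own when the response body is already in hand.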

### POST /v1/completions

Generate a text completion from a prompt.

**Request:**

```json
{
  "model": "llama3-70b",
  "prompt": "Once upon a time",
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```

**Response:**

```json
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "model": "llama3-70b",
  "choices": [{
    "text": " there was a curious fox...",
    "index": 0,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 12,
    "total_tokens": 16
  }
}
```

### POST /v1/chat/completions

Generate a chat completion from a list of messages.

**Request:**

```json
{
  "model": "llama3-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 256,
  "stream": false
}
```

**Response:**

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "llama3-70b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 9,
    "total_tokens": 29
  }
}
```

### Request Parameters

| Parameter     | Type             | Required | Description                                      |
|---------------|------------------|----------|--------------------------------------------------|
| model         | string           | Yes      | Model ID to use for the request                  |
| messages      | array            | Yes*     | Chat messages (chat completions only)             |
| prompt        | string           | Yes*     | Text prompt (completions only)                    |
| max_tokens    | integer          | No       | Maximum tokens to generate (default: 512)         |
| temperature   | float            | No       | Sampling temperature 0-2 (default: 0.7)           |
| top_p         | float            | No       | Nucleus sampling threshold (default: 1.0)          |
| stream        | boolean          | No       | Enable SSE streaming (default: false)              |
| stop          | string or array  | No       | Stop sequence(s) to end generation                |
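
The parameters and defaults in the table can be captured in a small request builder. This is a client-side convenience sketch, and `build_chat_request` is a hypothetical helper; the validation rules are inferred from the table rather than taken from the hub's code.

```python
from typing import Any, Dict, List, Optional, Union


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: int = 512,
    temperature: float = 0.7,
    top_p: float = 1.0,
    stream: bool = False,
    stop: Optional[Union[str, List[str]]] = None,
) -> Dict[str, Any]:
    """Build a /v1/chat/completions body using the documented defaults."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required for chat completions")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    body: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
    }
    if stop is not None:
        body["stop"] = stop  # only sent when the caller sets it
    return body
```

Serialize the result with `json.dumps` before sending it as the request body.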

---

## Streaming

Set `"stream": true` in any completion or chat completion request. The response is delivered as Server-Sent Events (SSE) — each token arrives as a JSON chunk prefixed with `data: `. The stream ends with a `data: [DONE]` sentinel.

### Chat streaming example

```bash
curl https://cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "llama3-70b",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```

### SSE response format

```
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"llama3-70b","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":"stop"}]}

data: [DONE]
```

The first chunk includes the role (`assistant`), subsequent chunks include content tokens, and the final chunk has a `finish_reason`.
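
For clients that do not use an SDK, the chunk format can be parsed by hand. The sketch below assumes the stream has already been split into text lines; `iter_content` is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from chat-completion SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        content = delta.get("content")
        if content:  # the first chunk carries only the role
            yield content
```

Joining the yielded fragments with `"".join(...)` reconstructs the full assistant message.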

---

## OpenAI SDK Compatibility

cLLMHub exposes an OpenAI-compatible API. Any tool, library, or framework that works with OpenAI can be pointed at your hub by changing the base URL and API key. No code changes are needed beyond configuration.

### Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://cllmhub.com/v1",
    api_key="sk-your-api-key",
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3-70b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Node.js

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://cllmhub.com/v1",
  apiKey: "sk-your-api-key",
});

const response = await client.chat.completions.create({
  model: "llama3-70b",
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 128,
});
console.log(response.choices[0].message.content);
```

### curl

```bash
curl https://cllmhub.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "llama3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```

---

## Pricing & Limits

### Free tier — $0/month

- 1 model
- 3 API keys
- 5,000 requests per month
- No credit card required

### Pro tier — $3/month

- All models
- Unlimited API keys
- 50,000 requests per month
- Hives (team pools)

### Max tier — $20/month

- Unlimited everything
- Priority routing
- Unlimited hives

### No token-based billing

Unlike hosted inference platforms, cLLMHub does not charge per token. You run the hardware, you pay a flat fee for the platform. Your costs are predictable regardless of how much text you generate.

### What counts as a request?

Every call to `/v1/completions` or `/v1/chat/completions` counts as one request, whether streaming or not. Calls to `/v1/models` do not count against your limit.

### What happens at the limit?

When an account reaches its monthly request limit, subsequent requests return `429 Too Many Requests` with a message to upgrade. The counter resets at the start of each billing period.
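
The counting and limit rules can be summarized in a few lines. This is an illustrative model of the documented behavior, not the hub's code: `status_for_request` is a hypothetical name, and the Max tier is treated as having no request limit based on "Unlimited everything".

```python
# Monthly request limits from the tier list above; None means unlimited.
MONTHLY_LIMITS = {"free": 5_000, "pro": 50_000, "max": None}

# Only these endpoints count against the monthly limit.
COUNTED_PATHS = {"/v1/completions", "/v1/chat/completions"}


def status_for_request(tier: str, path: str, used_this_period: int) -> int:
    """Return 429 once a counted endpoint is called at or past the tier limit, else 200."""
    limit = MONTHLY_LIMITS[tier]
    if path not in COUNTED_PATHS or limit is None:
        return 200
    return 429 if used_this_period >= limit else 200
```

Under this model, a free-tier account that has already made 5,000 counted requests gets `429` on the next completion call but can still list models.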

Manage your subscription from the [Subscription page](https://cllmhub.com/subscription).

---

## Request Logs & Monitoring

The Logs page (https://cllmhub.com/logs) shows every API request made with your keys. Each entry includes:

- Timestamp
- Model name
- API key name or suffix
- Input/output character counts
- Prompt and completion token counts
- Tokens per second
- Latency in milliseconds
- Status (success or error with message)

Filter by API key name, model, or provider using the dropdowns. The My Models page (https://cllmhub.com/my-models) shows your published models, their status, and a preview of recent requests.

---

## Model Visibility

All published models are listed in `/v1/models` and visible to anyone on the hub. Access is still controlled by API keys — a consumer needs a key scoped to your model to actually use it.

From the My Models page, toggle a model to "Disabled" to immediately block all inference requests. The model remains registered but returns 403 to any consumer. Re-enable it at any time.

---

## Error Handling

### Error format

```json
{
  "error": {
    "message": "Description of what went wrong",
    "type": "invalid_request_error",
    "code": "invalid_request"
  }
}
```

### HTTP status codes

| Status | Code              | Description                                    |
|--------|-------------------|------------------------------------------------|
| 400    | invalid_request   | Malformed JSON or missing required fields      |
| 401    | unauthorized      | Missing or invalid API key                     |
| 403    | forbidden         | API key not scoped to the requested model, or model disabled |
| 429    | rate_limit        | Monthly request limit exceeded                 |
| 502    | provider_error    | The upstream provider returned an error        |
| 503    | providers_busy    | All providers are busy, retry later            |
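
A client can use the status code and error body to decide what to do next. The sketch below is one possible policy, stated as an assumption: only `503` is treated as retryable because the table says to retry later, while `429` is not, since the monthly limit does not clear until the next billing period. The helper names are illustrative.

```python
import json
from typing import List, Tuple

RETRYABLE_STATUSES = {503}  # assumption: add 502 if your provider restarts quickly


def parse_error(body: str) -> Tuple[str, str]:
    """Return (code, message) from an error response body."""
    error = json.loads(body)["error"]
    return error["code"], error["message"]


def should_retry(status: int) -> bool:
    """Decide whether sending the same request again could succeed."""
    return status in RETRYABLE_STATUSES


def backoff_delays(attempts: int, base_seconds: float = 1.0) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_seconds * (2 ** i) for i in range(attempts)]
```

A retry loop would sleep for each value in `backoff_delays(3)` between attempts and stop as soon as `should_retry` returns `False`.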

### Streaming errors

If an error occurs mid-stream, it is sent as a `data: [ERROR] ...` event. The client should handle this gracefully and may retry the request.
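
A streaming client can check each event payload for the error marker before parsing it as JSON. In this sketch, `StreamError` and `check_event` are hypothetical names, and the text after `[ERROR]` is treated as an opaque message because its exact format is not specified here.

```python
class StreamError(Exception):
    """Raised when the hub reports an error in the middle of a stream."""


def check_event(data: str) -> str:
    """Return the payload unchanged, or raise StreamError for an [ERROR] event.

    `data` is the text that follows the "data: " prefix of one SSE line.
    """
    if data.startswith("[ERROR]"):
        raise StreamError(data[len("[ERROR]"):].strip())
    return data
```

Catching `StreamError` around the read loop lets the caller discard the partial output and decide whether to retry the request.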

### Common troubleshooting

- **"No models available"** — Your CLI may have disconnected. Check that `cllmhub publish` is still running. Models are removed after ~120 seconds of missed heartbeats.
- **"403 Forbidden"** — Your API key is not scoped to the requested model, or the model owner has disabled it.
- **"502 Bad Gateway"** — The provider's backend (Ollama, vLLM, etc.) may have crashed or be unresponsive. Check the provider machine.
- **"429 Too Many Requests"** — You have exceeded your tier's monthly request limit. Upgrade your plan or wait for the next billing period.