    API Reference

    OpenAI-compatible REST API. One endpoint for chat completions with auth, PII protection, memory, and rate limiting built in.

    Base URL

    https://your-project.behest.app

    Each project gets a dedicated subdomain. Find yours in the Behest dashboard.

    Authentication

    All requests require a Bearer token in the Authorization header.

    Authorization: Bearer your-api-key

    API keys are generated per-project in the dashboard. Keys are hashed with Argon2id and cannot be retrieved after creation — store them securely.
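
    A small sketch of building the required headers once and reusing them across requests. The helper name `authHeaders` is illustrative, not part of the API, and `"your-api-key"` is a placeholder:

```javascript
// Build the headers every Behest request needs.
// "your-api-key" is a placeholder — substitute your real project key.
function authHeaders(apiKey) {
  return {
    "Authorization": `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
}

const headers = authHeaders("your-api-key");
console.log(headers.Authorization);
```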

    POST /v1/chat/completions

    Create a chat completion. OpenAI-compatible request and response format.

    Request Body

    | Parameter     | Type    | Required | Description |
    |---------------|---------|----------|-------------|
    | `model`       | string  | Yes      | The model to use. Currently supported: `gemini-2.5-flash`, `gemini-2.5-pro` |
    | `messages`    | array   | Yes      | Array of message objects with `role` ("user", "assistant", or "system") and `content` (string) |
    | `stream`      | boolean | No       | Enable server-sent events streaming. Default: `false` |
    | `temperature` | number  | No       | Sampling temperature (0–2). Higher values increase randomness. Default: `1.0` |
    | `max_tokens`  | integer | No       | Maximum tokens in the response. Defaults to the model's maximum. |

    Custom Headers

    | Header          | Required | Description |
    |-----------------|----------|-------------|
    | `Authorization` | Yes      | Bearer token: `Bearer your-api-key` |
    | `Content-Type`  | Yes      | Must be `application/json` |
    | `X-End-User-Id` | No       | Unique identifier for the end user. Enables per-user memory, rate limiting, token budgets, and analytics. |

    Full Request Example

    const response = await fetch(
      "https://your-project.behest.app/v1/chat/completions",
      {
        method: "POST",
        headers: {
          "Authorization": "Bearer your-api-key",
          "Content-Type": "application/json",
          "X-End-User-Id": "user-12345",
        },
        body: JSON.stringify({
          model: "gemini-2.5-flash",
          messages: [
            { role: "user", content: "Summarize this contract" }
          ],
          temperature: 0.7,
          max_tokens: 1024,
        }),
      }
    );
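
    When `stream: true` is set, the body arrives as server-sent events rather than a single JSON document. A minimal sketch of consuming that stream, assuming OpenAI-style framing (`data: {...}` lines with `choices[0].delta.content`, terminated by `data: [DONE]`), which this API's OpenAI compatibility implies but the reference above does not spell out. `parseSSELine` and `streamCompletion` are illustrative names:

```javascript
// Parse one SSE line from a streamed completion.
// Returns the parsed JSON chunk, or null for blanks and the [DONE] sentinel.
function parseSSELine(line) {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  return JSON.parse(payload);
}

// Read a streaming fetch Response and invoke onDelta for each text fragment.
async function streamCompletion(response, onDelta) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep any partial line for the next chunk
    for (const line of lines) {
      const chunk = parseSSELine(line);
      const delta = chunk?.choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);
    }
  }
}
```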

    Response Format

    {
      "id": "chatcmpl-abc123",
      "object": "chat.completion",
      "created": 1709000000,
      "model": "gemini-2.5-flash",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Here is a summary of the contract..."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 150,
        "total_tokens": 175
      }
    }
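
    Pulling the assistant's reply and token usage out of that payload is a one-liner each; a small helper (the name `extractCompletion` is illustrative) keeps the field paths in one place:

```javascript
// Extract the assistant message and token usage from a
// (non-streaming) chat completion response body.
function extractCompletion(body) {
  const choice = body.choices[0];
  return {
    content: choice.message.content,
    finishReason: choice.finish_reason,
    totalTokens: body.usage.total_tokens,
  };
}

// Using the documented sample response:
const sample = {
  choices: [
    {
      index: 0,
      message: { role: "assistant", content: "Here is a summary of the contract..." },
      finish_reason: "stop",
    },
  ],
  usage: { prompt_tokens: 25, completion_tokens: 150, total_tokens: 175 },
};
console.log(extractCompletion(sample));
```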

    Rate Limit Headers

    Every response includes rate limit headers so your app can handle limits gracefully.

    | Header                  | Description |
    |-------------------------|-------------|
    | `X-RateLimit-Limit`     | Maximum requests per minute for this tier |
    | `X-RateLimit-Remaining` | Requests remaining in the current window |
    | `X-RateLimit-Reset`     | Unix timestamp when the current window resets |
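
    One way to use those headers is to compute how long to pause before the next request once the window is exhausted. A sketch (the helper name `msUntilReset` is illustrative; `getHeader` stands in for `response.headers.get`):

```javascript
// Milliseconds to wait before retrying, based on the rate limit headers.
// Returns 0 while requests remain in the current window.
function msUntilReset(getHeader, nowMs = Date.now()) {
  const remaining = Number(getHeader("X-RateLimit-Remaining"));
  if (remaining > 0) return 0; // budget left, no need to wait
  const resetSec = Number(getHeader("X-RateLimit-Reset")); // Unix seconds
  return Math.max(0, resetSec * 1000 - nowMs);
}
```

    In practice you would call it as `msUntilReset((k) => response.headers.get(k))` after each response.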

    Error Codes

    | Code | Meaning | Common Cause |
    |------|---------|--------------|
    | 400  | Bad Request | Missing required fields, invalid model name, malformed JSON |
    | 401  | Unauthorized | Missing or invalid API key, expired token |
    | 403  | Forbidden | CORS origin not allowed, project kill switch active, prompt blocked by Sentinel |
    | 429  | Too Many Requests | Rate limit exceeded (per-IP, per-project, or per-user). Check the `X-RateLimit-Reset` header for retry timing. |
    | 503  | Service Unavailable | Upstream LLM provider error or temporary service disruption. Retry with exponential backoff. |
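
    The retry advice for 429 and 503 can be sketched as a small wrapper with exponentially growing delays. The helper names (`backoffDelay`, `completeWithRetry`) and the base delay are illustrative choices, not prescribed by the API:

```javascript
// Delay doubles each attempt: base, 2*base, 4*base, ...
function backoffDelay(attempt, baseMs = 500) {
  return baseMs * 2 ** attempt;
}

// Retry transient errors (429, 503) with exponential backoff;
// doRequest is any function returning a fetch Response promise.
async function completeWithRetry(doRequest, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 && res.status !== 503) return res;
    if (attempt >= maxRetries) return res; // give up, surface the error
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
}
```

    For 429s specifically, the `X-RateLimit-Reset` header gives a more precise wait than a blind backoff.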