
    Rate Limiting

    Behest enforces rate limits at multiple levels, all handled in the Kong behest-tenant-auth plugin before a request reaches LiteLLM. Limits are checked in this order:

    1. Per-IP safety limit
    2. Per-project RPM limit
    3. Per-user RPM limit
    4. Per-user daily token budget
    5. Per-user monthly token budget (if configured)
    6. Per-project aggregate daily token budget

    All counters are stored in Redis. Keys expire automatically: 2 minutes for RPM counters, about 25 hours for daily token counters, and about 33 days for monthly token counters.


    Limit Types and Defaults

    Per-IP Safety Limit

    Default: 120 requests/minute per IP address

    Redis key: rpm:ip:{clientIP}:{YYYYMMDDHHMM}

    This limit applies to all requests regardless of authentication status. It is a safety valve against traffic floods from a single IP. The limit is read from conf.default_ip_rpm_limit (Kong plugin configuration). When exceeded, the request is rejected before JWT validation runs.
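When inspecting counters in Redis, it can help to rebuild the key for a given moment. The following is a minimal sketch, assuming the window suffix is the UTC minute (the X-RateLimit-Reset header is described in terms of the UTC minute). `minuteBucket` and `ipRpmKey` are hypothetical helper names, not part of Behest.

```javascript
// Hypothetical helper: format a Date as the YYYYMMDDHHMM window suffix.
// Assumption: windows are aligned to UTC minutes.
function minuteBucket(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return (
    String(date.getUTCFullYear()) +
    pad(date.getUTCMonth() + 1) +
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes())
  );
}

// Hypothetical helper: build the per-IP counter key for a given moment.
function ipRpmKey(clientIp, date) {
  return `rpm:ip:${clientIp}:${minuteBucket(date)}`;
}
```

For example, a request from 203.0.113.7 at 09:03:41 UTC on 5 January 2025 would count against `rpm:ip:203.0.113.7:202501050903`.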

    Per-Project RPM Limit

    Default: 60 requests/minute per project

    Redis key: rpm:{tenantId}:{projectId}:{YYYYMMDDHHMM}
    Config key: config:{projectId}:rpm_limit

    The project RPM limit is stored in Redis at config:{projectId}:rpm_limit after each deploy. Kong reads this key on every request. If the key is missing, the plugin falls back to conf.default_rpm_limit (default 60).

    All requests to a project count toward this limit, regardless of which end user sends them.

    Per-User RPM Limit

    Default: 1/10th of the project RPM limit, minimum 3

    Redis key: rpm:{projectId}:{userId}:{YYYYMMDDHHMM}

    The per-user limit is derived dynamically:

    lua
    user_limit = math.max(3, math.floor(rpm_limit / per_user_fraction))
    -- where per_user_fraction defaults to 10

    At the default 60 RPM project limit: per-user limit = max(3, floor(60/10)) = 6 RPM.
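To plan client-side concurrency, the same formula can be mirrored in application code. This is a sketch that restates the Lua expression above in JavaScript; `perUserRpmLimit` is a hypothetical name.

```javascript
// JavaScript rendering of the Lua formula above, for client-side planning.
// perUserFraction defaults to 10 and the minimum is 3, as documented.
function perUserRpmLimit(projectRpmLimit, perUserFraction = 10) {
  return Math.max(3, Math.floor(projectRpmLimit / perUserFraction));
}

console.log(perUserRpmLimit(60));  // 6
console.log(perUserRpmLimit(20));  // 3 (floor(20/10) = 2, raised to the minimum)
console.log(perUserRpmLimit(125)); // 12
```

Note that the minimum of 3 means projects with an RPM limit below 30 all share the same per-user limit.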

    End users are identified by the uid claim in the Behest JWT, passed through as the X-End-User-Id header. Requests without a user ID (service accounts, API keys) skip the per-user check.

    Per-User Daily Token Budget

    Default: 1,000,000 tokens/day per end user

    Redis key: tokens:{projectId}:{userId}:{YYYYMMDD} (UTC)
    Config key: config:{projectId}:tokens_per_day

    Kong reads the token count for the pre-request check, and LiteLLM's token budget hook writes it after the response (INCRBY with the actual token count). Because the check and the increment are separate steps, there is a small race window under high concurrency: several simultaneous requests can all pass the pre-check before any of them adds to the counter. Overshoot is bounded by approximately concurrent_requests * avg_tokens_per_request.
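The race window is easier to see with numbers. The sketch below is an in-memory simulation, not the plugin's code: it assumes the pre-check admits a request while the stored count is below the budget, and that every request in the burst is checked before any response is recorded. `simulateBurst` and its parameters are hypothetical.

```javascript
// In-memory stand-in for the Redis token counter (illustration only).
function simulateBurst({ budget, used, concurrent, tokensPerRequest }) {
  let counter = used;

  // Phase 1: every in-flight request runs the pre-check against the same value.
  let admitted = 0;
  for (let i = 0; i < concurrent; i++) {
    if (counter < budget) admitted++;
  }

  // Phase 2: responses complete and each admitted request adds its actual usage.
  counter += admitted * tokensPerRequest;

  return { admitted, finalCount: counter, overshoot: Math.max(0, counter - budget) };
}

const result = simulateBurst({
  budget: 1000000,
  used: 999000,
  concurrent: 20,
  tokensPerRequest: 1500,
});
// result.admitted === 20, result.finalCount === 1029000, result.overshoot === 29000
```

Here the overshoot of 29,000 tokens stays within the bound of 20 * 1,500 = 30,000 tokens.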

    The check is fail-open for Redis errors — if Redis is unavailable when reading the token count, the request proceeds. This is intentional: blocking all requests because of a Redis read failure would be worse than allowing temporary budget overshoot.

    Per-User Monthly Token Budget

    Default: Not enforced (no limit) unless explicitly configured

    Redis key: tokens:{projectId}:{userId}:{YYYYMM} (UTC)
    Config key: config:{projectId}:tokens_per_month

    If config:{projectId}:tokens_per_month is not set in Redis, the monthly check is skipped entirely. Configure this via project settings to enforce a hard monthly cap per end user.

    Per-Project Aggregate Daily Token Budget

    Default: 10,000,000 tokens/day for the entire project

    Redis key: tokens:{tenantId}:{projectId}:{YYYYMMDD} (UTC)
    Config key: config:{projectId}:project_tokens_per_day

    This is the total token budget across all end users of a project. If config:{projectId}:project_tokens_per_day is not in Redis, the plugin falls back to the default 10M. Same fail-open behavior as per-user budget for Redis errors.
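The fallback rules described in this section can be summarized in one function. This is a sketch under stated assumptions: a `Map` stands in for Redis, values are stored as integer strings, and an unparseable value is treated like a missing key (the plugin's actual handling of malformed values is not documented here). `effectiveLimits` and `FALLBACKS` are hypothetical names.

```javascript
// Documented defaults used when a config key is missing.
const FALLBACKS = {
  rpm_limit: 60,
  tokens_per_day: 1000000,
  project_tokens_per_day: 10000000,
};

// config: a Map standing in for Redis (key -> string value).
function effectiveLimits(config, projectId) {
  const read = (name) => {
    const raw = config.get(`config:${projectId}:${name}`);
    const value = raw === undefined ? NaN : parseInt(raw, 10);
    // Assumption: malformed values fall back like missing keys.
    return Number.isNaN(value) ? undefined : value;
  };

  const projectRpm = read("rpm_limit") ?? FALLBACKS.rpm_limit;
  return {
    projectRpm,
    userRpm: Math.max(3, Math.floor(projectRpm / 10)),
    userTokensPerDay: read("tokens_per_day") ?? FALLBACKS.tokens_per_day,
    userTokensPerMonth: read("tokens_per_month") ?? null, // null: monthly check skipped
    projectTokensPerDay: read("project_tokens_per_day") ?? FALLBACKS.project_tokens_per_day,
  };
}
```

With only `config:p1:rpm_limit` set to "120", this yields a project limit of 120 RPM, a per-user limit of 12 RPM, the default daily budgets, and no monthly cap.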


    Configuring Limits

    Limits are stored as part of project settings. They take effect when you deploy your project.

    Dashboard

    Go to Projects → [your project] → Settings → Limits

    Fields:

    • Requests per minute: maps to config:{projectId}:rpm_limit
    • Tokens per day (per user): maps to config:{projectId}:tokens_per_day
    • Tokens per month (per user): maps to config:{projectId}:tokens_per_month

    API

    http
    PUT /v1/projects/:projectId/settings
    Authorization: Bearer <service-JWT>
    Content-Type: application/json
     
    {
      "rpm_limit": 120,
      "tokens_per_day": 500000
    }

    Then deploy to push to Redis:

    http
    POST /v1/projects/:projectId/settings/deploy
    Authorization: Bearer <service-JWT>
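The two calls above can be chained from a script. The following is a minimal sketch, assuming a fetch-compatible client (global `fetch` in Node 18+). `baseUrl` is a placeholder for your Behest API origin, and `updateAndDeployLimits` and `fetchImpl` are hypothetical names; the response bodies are not inspected because their shape is not documented here.

```javascript
// Sketch: update project limits, then deploy so they reach Redis.
// Assumptions: baseUrl is your API origin; fetchImpl is fetch-compatible.
async function updateAndDeployLimits({ baseUrl, projectId, serviceJwt, limits, fetchImpl = fetch }) {
  const settingsUrl = `${baseUrl}/v1/projects/${encodeURIComponent(projectId)}/settings`;
  const auth = { Authorization: `Bearer ${serviceJwt}` };

  // Step 1: store the new limits in project settings.
  const put = await fetchImpl(settingsUrl, {
    method: "PUT",
    headers: { ...auth, "Content-Type": "application/json" },
    body: JSON.stringify(limits),
  });
  if (!put.ok) throw new Error(`Settings update failed: HTTP ${put.status}`);

  // Step 2: deploy so the stored values are pushed to Redis.
  const deploy = await fetchImpl(`${settingsUrl}/deploy`, { method: "POST", headers: auth });
  if (!deploy.ok) throw new Error(`Deploy failed: HTTP ${deploy.status}`);
}
```

Stopping on a failed PUT avoids deploying settings that were never saved.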

    Rate Limit Headers

    Kong sets rate limit headers on every response (both allowed and rate-limited):

    Header                 Value
    X-RateLimit-Limit      The project RPM limit
    X-RateLimit-Remaining  Requests remaining in the current minute window
    X-RateLimit-Reset      Seconds until the current 1-minute window resets

    Example response headers:

    X-RateLimit-Limit: 60
    X-RateLimit-Remaining: 47
    X-RateLimit-Reset: 23
    

    The reset value is 60 - current_second within the UTC minute (e.g., if you're at second 37, reset = 23).
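The same arithmetic in application code, as a small sketch (`secondsUntilWindowReset` is a hypothetical name). It can serve as a local estimate when a response carries no reset header, assuming the client clock is reasonably close to the server's.

```javascript
// Seconds until the current UTC minute window ends: 60 - current second.
function secondsUntilWindowReset(now = new Date()) {
  return 60 - now.getUTCSeconds();
}

// At second 37 of the minute, the window resets in 23 seconds.
console.log(secondsUntilWindowReset(new Date(Date.UTC(2025, 0, 1, 12, 0, 37)))); // 23
```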


    When Rate Limited

    A rate-limited request returns:

    http
    HTTP/1.1 429 Too Many Requests
    X-RateLimit-Limit: 60
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 12
     
    {"message": "Rate limit exceeded"}

    For daily token budget exhaustion:

    http
    HTTP/1.1 429 Too Many Requests
     
    {"message": "Daily token budget exceeded"}

    There is no Retry-After header currently. Use X-RateLimit-Reset to determine when to retry for RPM limits.
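Since the two 429 variants call for different waits, a client may want to tell them apart. The sketch below makes several assumptions: `headers` is a plain object with lowercase names, `body` is the parsed JSON response, the message strings match the examples above exactly, and the daily budget becomes available again at the next UTC midnight (inferred from the per-UTC-day counter key, not stated as a guarantee). `retryDelayMs` is a hypothetical name.

```javascript
// Decide how long to wait after a 429, in milliseconds.
function retryDelayMs(headers, body, now = new Date()) {
  if (body?.message === "Daily token budget exceeded") {
    // Assumption: the budget resets when the UTC calendar day rolls over.
    const nextMidnight = Date.UTC(
      now.getUTCFullYear(),
      now.getUTCMonth(),
      now.getUTCDate() + 1
    );
    return nextMidnight - now.getTime();
  }

  const reset = parseInt(headers?.["x-ratelimit-reset"], 10);
  // Fall back to a full 60-second window if the header is missing or malformed.
  return (Number.isNaN(reset) ? 60 : reset) * 1000;
}
```

For an RPM 429 with `X-RateLimit-Reset: 12` this returns 12,000 ms; for a daily budget 429 received at 23:00 UTC it returns one hour.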


    Kill Switches

    Kill switches are separate from rate limits but checked by the same Kong plugin. They return 503 Service Temporarily Unavailable (not 429). Kill switches exist at three granularities:

    Scope    Redis key
    Global   killswitch:global
    Tenant   killswitch:tenant:{tenantId}
    Project  killswitch:project:{projectId}

    A kill switch is active when the Redis key exists and its value is "1". Kill switches are checked before rate limits in the request flow.
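The activation rule can be sketched as follows. A `Map` stands in for Redis GET here, and the global, tenant, project evaluation order is an assumption made for illustration; the source only states that all three are checked before rate limits. `activeKillSwitch` is a hypothetical name.

```javascript
// store: a Map standing in for Redis (missing keys return undefined).
// Returns the scope of the first active kill switch, or null if none is active.
function activeKillSwitch(store, tenantId, projectId) {
  const candidates = [
    ["global", "killswitch:global"],
    ["tenant", `killswitch:tenant:${tenantId}`],
    ["project", `killswitch:project:${projectId}`],
  ];
  for (const [scope, key] of candidates) {
    // Active only when the key exists and holds exactly "1".
    if (store.get(key) === "1") return scope;
  }
  return null;
}
```

A key that exists with any other value, such as "0", does not trigger the 503.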


    Redis Key Summary

    Counter                   Redis key                       Window                 TTL
    Per-IP RPM                rpm:ip:{ip}:{YYYYMMDDHHMM}      1 minute               120s
    Per-project RPM           rpm:{tid}:{pid}:{YYYYMMDDHHMM}  1 minute               120s
    Per-user RPM              rpm:{pid}:{uid}:{YYYYMMDDHHMM}  1 minute               120s
    Per-user daily tokens     tokens:{pid}:{uid}:{YYYYMMDD}   Calendar day (UTC)     ~25h
    Per-user monthly tokens   tokens:{pid}:{uid}:{YYYYMM}     Calendar month (UTC)   ~33d
    Per-project daily tokens  tokens:{tid}:{pid}:{YYYYMMDD}   Calendar day (UTC)     ~25h

    In this table, {ip}, {tid}, {pid}, and {uid} abbreviate {clientIP}, {tenantId}, {projectId}, and {userId}.

    The RPM counter TTL is 120 seconds (2 minutes) to handle clock drift between Kong workers. The counter for a given minute window is still accurate within the minute because the INCR is atomic.
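A fixed-window counter with a TTL longer than the window can be illustrated in memory. This is a sketch of the general technique, not the plugin's code: a `Map` stands in for Redis INCR plus EXPIRE, the key is simplified to a project ID and a minute index, and a request is allowed while the incremented count is at or below the limit (the plugin's exact comparison is not documented here). All names are hypothetical.

```javascript
// In-memory stand-in for Redis INCR + EXPIRE (illustration only).
class FixedWindowCounter {
  constructor(ttlMs = 120000) {
    this.ttlMs = ttlMs;
    this.entries = new Map(); // key -> { count, expiresAt }
  }

  // Increment the counter for `key`, creating it with a TTL on first use.
  incr(key, nowMs) {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= nowMs) {
      this.entries.set(key, { count: 1, expiresAt: nowMs + this.ttlMs });
      return 1;
    }
    entry.count += 1;
    return entry.count;
  }
}

// Minute-aligned window index, so each new minute uses a fresh key.
const windowIndex = (nowMs) => Math.floor(nowMs / 60000);

function allowRequest(counter, projectId, limit, nowMs) {
  const key = `rpm:${projectId}:${windowIndex(nowMs)}`;
  return counter.incr(key, nowMs) <= limit;
}
```

Because the window is part of the key, the previous minute's counter is never consulted once the minute rolls over; the longer TTL only delays cleanup.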


    Best Practices for Your Application

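    Use exponential backoff with jitter when you receive a 429, waiting at least as long as X-RateLimit-Reset:

    javascript
    // Retry on 429: wait for the window reset or an exponential backoff
    // delay, whichever is longer, plus random jitter.
    // Assumes err.status and err.headers (plain object, lowercase names).
    async function callWithRetry(fn, maxRetries = 3, baseDelayMs = 1000) {
      for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
          return await fn();
        } catch (err) {
          // Rethrow non-429 errors and the final failed attempt.
          if (err.status !== 429 || attempt === maxRetries - 1) throw err;

          // Fall back to 5 seconds if the header is missing or unparseable.
          const reset = parseInt(err.headers?.["x-ratelimit-reset"], 10);
          const resetMs = (Number.isNaN(reset) ? 5 : reset) * 1000;
          const backoffMs = baseDelayMs * 2 ** attempt;
          const jitterMs = Math.random() * 1000;
          await new Promise((r) =>
            setTimeout(r, Math.max(resetMs, backoffMs) + jitterMs)
          );
        }
      }
    }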

    Respect X-RateLimit-Remaining — if remaining is 0 or close to 0, pause before the next request without waiting for a 429.
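One way to act on the remaining count is to compute a pause from the headers of the previous response. This is a minimal sketch, assuming `headers` is a plain object with lowercase names; `pauseBeforeNextRequestMs` and its `threshold` parameter are hypothetical, and the threshold is yours to tune.

```javascript
// Returns how long to pause before the next request, in milliseconds.
// threshold: pause when remaining requests are at or below this value.
function pauseBeforeNextRequestMs(headers, threshold = 0) {
  const remaining = parseInt(headers["x-ratelimit-remaining"], 10);
  const reset = parseInt(headers["x-ratelimit-reset"], 10);

  // Without usable header values there is nothing to base a pause on.
  if (Number.isNaN(remaining) || Number.isNaN(reset)) return 0;

  // Wait out the rest of the window once the quota is (nearly) used up.
  return remaining <= threshold ? reset * 1000 : 0;
}
```

With `X-RateLimit-Remaining: 0` and `X-RateLimit-Reset: 23` this suggests a 23-second pause; with 47 requests remaining it suggests none.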

    Distribute end user requests — the per-user limit is 1/10th of the project limit. If you expect bursts from individual users, raise the project RPM limit to keep the per-user limit proportionally high.

    Set realistic token budgets — the default 1M tokens/day per user is generous for interactive use. For cost control, lower this value and raise it per tier using the tier overrides system.

    Use tier-based overrides — Behest supports project tiers (e.g., free, pro, enterprise), each with their own rpm_limit and tokens_per_day overrides. Configure tiers to give different users different limits within the same project without separate deployments.
