GC AI rate-limits the API in two tiers. Inference endpoints (the model-backed calls) are limited tightly. Everything else (listing, creating, and updating projects, playbooks, files, and so on) gets a far more forgiving limit.
Limits are intentionally conservative during the free beta, where the rate limit is the main guardrail on usage. They will loosen as the API moves toward general availability.
The two tiers
| Tier | Endpoints | Limit (beta) |
|---|---|---|
| Inference | POST /chat/completions, POST /playbooks/{id}/run | ~1 request / minute per organization, with a burst of 3 |
| Everything else | All other endpoints (files, folders, projects, playbooks CRUD, profiles, …) | 120 requests / minute per organization |
The inference limit is a window of 3 requests per 180 seconds: you can spend a burst of 3 right away, after which it averages out to roughly one per minute.
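One way to stay under the inference window from the client side is to track your own send times and wait when the window is full. The following is a minimal sketch, not GC AI's implementation: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and it assumes that never sending more than 3 requests in any 180-second span is enough to stay within the server's window, however the server actually counts it.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side pacing: at most `limit` requests in any `window` seconds.

    Hypothetical helper; the defaults mirror the beta inference limit
    (3 requests per 180 seconds) described above.
    """

    def __init__(self, limit=3, window=180.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of recent requests, oldest first

    def wait_time(self):
        """Seconds until another request fits in the window (0.0 if it fits now)."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def acquire(self):
        """Block until a request is allowed, then record its send time."""
        while (delay := self.wait_time()) > 0:
            self.sleep(delay)
        self.sent.append(self.clock())
```

Calling `limiter.acquire()` before each inference request lets the first three through immediately and holds the fourth until the oldest one has left the window. Because the limit is shared across the whole organization, a limiter like this only helps when it sees all of the organization's inference traffic; several independent processes would each need a smaller share.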
What counts against the limit
Limits apply per organization and, within each organization, per API key. GC AI checks both on every request, and whichever is exhausted first blocks the request:
- The per-organization limit is the binding ceiling. It is shared across every key the organization holds, so minting extra keys does not raise your total throughput.
- The per-key limit keeps one integration from monopolizing the organization’s budget. During the beta it matches the org limit; as limits rise toward GA it becomes the per-integration sub-limit.
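The interaction between the two checks can be modeled in a few lines. This is an illustrative sketch only: the function `admit` and the limit values passed to it are hypothetical, and it models a single window without any claim about how GC AI implements the check internally.

```python
from collections import Counter


def admit(requests, org_limit, key_limit):
    """Return the requests admitted in one window.

    `requests` is a sequence of key names in arrival order. A request is
    admitted only if both the shared organization budget and that key's
    own budget still have room.
    """
    per_key = Counter()
    admitted = []
    for key in requests:
        if len(admitted) < org_limit and per_key[key] < key_limit:
            per_key[key] += 1
            admitted.append(key)
    return admitted


# Two keys sharing an organization budget of 3: the extra key adds no throughput.
print(admit(["a", "b", "a", "b", "a"], org_limit=3, key_limit=3))

# A hypothetical later setup where the per-key limit is below the org limit:
# one key alone cannot use up the whole organization budget.
print(admit(["a"] * 5, org_limit=6, key_limit=3))
```

In the first call the organization budget runs out after three requests regardless of which key sent them. In the second, key `a` stops at its own sub-limit while the organization still has room for other keys.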
When you exceed a limit
A throttled request returns 429 Too Many Requests:
{
  "error": "Rate limit exceeded",
  "code": "RATE_LIMITED"
}
It also carries headers describing the limit and when to retry:
| Header | Meaning |
|---|---|
| Retry-After | Seconds to wait before retrying. |
| RateLimit-Limit | The request quota for the window that was hit. |
| RateLimit-Remaining | Requests remaining in the current window. |
| RateLimit-Reset | Seconds until the quota resets. |
Honor Retry-After: wait that many seconds before sending the next request. For batch workloads, use fire-and-forget and pace your enqueues rather than retrying in a tight loop.
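Honoring `Retry-After` might look like the sketch below. It is transport-agnostic and makes several assumptions: `send` is a hypothetical callable that performs one request with whatever HTTP client you use and returns `(status, headers, body)`, the fallback to `RateLimit-Reset` and the 60-second default are choices made here rather than documented behavior, and `max_attempts` is an arbitrary cap.

```python
import time


def retry_delay(headers, default=60.0):
    """Seconds to wait: Retry-After if usable, else RateLimit-Reset, else `default`."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in ("retry-after", "ratelimit-reset"):
        try:
            return max(0.0, float(lowered[name]))
        except (KeyError, TypeError, ValueError):
            continue  # header missing or not a plain number of seconds
    return default


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    Returns the last (status, body) seen.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait as long as the server asked before the next attempt.
            sleep(retry_delay(headers))
    return status, body
```

The loop sleeps for the full advertised delay instead of retrying immediately, so a throttled client sends at most one request per wait period. If the final attempt is still throttled, the caller gets the 429 back and can decide whether to queue the work for later.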