Aerostack
Available Now

One endpoint. Every model.

The AI Gateway
built for builders.

Ship an LLM gateway — or a full AI API — with a built-in RAG knowledge base, content moderation, an LLM router with multi-provider fallback, and per-consumer rate limiting. Your users get an OpenAI-compatible LLM proxy endpoint. Configure it all from the admin — no code required.

AI Gateway · LLM Router · LLM Proxy · RAG Knowledge Base · Moderation · Billing Plans · Consumer Keys · BYO-JWT
// Pipeline

Your AI API gateway pipeline. Toggle each stage.

Every stage is optional except the LLM call. Enable what you need and skip what you don't. These optional stages are what set an AI gateway apart from a plain LLM proxy.

Moderation

AI-powered content safety check on every request before it reaches the LLM.

Classifies input as SAFE or UNSAFE using a dedicated model
Block mode returns 400 immediately for unsafe content
Flag mode lets the request continue but marks its metadata for your review
Fail-open: if the moderation check itself errors, the request still proceeds
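As a minimal sketch of how block mode, flag mode, and fail-open combine, assume the moderation model is exposed as a classify callable that returns "SAFE" or "UNSAFE". The moderate function, the ModerationResult type, and the metadata keys are hypothetical illustrations, not Aerostack's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ModerationResult:
    status: Optional[int]  # HTTP status to return immediately, or None to continue
    metadata: dict = field(default_factory=dict)


def moderate(text: str, mode: str, classify: Callable[[str], str]) -> ModerationResult:
    """Apply block or flag mode to one input, failing open if the classifier errors."""
    if mode not in ("block", "flag"):
        raise ValueError(f"unknown moderation mode: {mode!r}")
    try:
        verdict = classify(text)  # hypothetical: returns "SAFE" or "UNSAFE"
    except Exception as exc:
        # Fail-open: a broken safety check lets the request through, noting why.
        return ModerationResult(None, {"moderation": "error", "detail": str(exc)})
    if verdict == "UNSAFE":
        if mode == "block":
            return ModerationResult(400, {"moderation": "blocked"})
        return ModerationResult(None, {"moderation": "flagged"})
    return ModerationResult(None, {"moderation": "safe"})
```

Only block mode ever produces a status; flag mode and classifier failures both return None, so the pipeline moves on to the next stage.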

RAG

Retrieve relevant context from your knowledge base and inject it into the prompt.

Upload documents — auto-chunked and embedded
Vector search finds relevant context for each query
Context injected as a system message before the user's question
Configurable similarity threshold and top-k results
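The placement rule can be sketched as a pure function over an OpenAI-style message list. This assumes "before the user's question" means immediately ahead of the latest user turn; the inject_context name and the "Context:" header are hypothetical.

```python
from __future__ import annotations


def inject_context(messages: list[dict], chunks: list[str]) -> list[dict]:
    """Return a new message list with retrieved chunks placed as a system message
    immediately before the most recent user turn."""
    if not chunks:
        return list(messages)
    context = {"role": "system", "content": "Context:\n" + "\n\n".join(chunks)}
    user_turns = [i for i, m in enumerate(messages) if m["role"] == "user"]
    at = user_turns[-1] if user_turns else len(messages)
    return messages[:at] + [context] + messages[at:]
```

A prompt shaped [system, user] becomes [system, system (context), user], and the caller's original list is left unchanged.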

Pre-Hook

Run custom logic before the LLM call — re-rank chunks, add user context, enforce rules.

Dispatches to your deployed edge function
Receives messages, metadata, and retrieved chunks
Can modify messages and inject custom context
Perfect for business rules and personalization
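A pre-hook's job can be pictured as a function from payload to payload. The real hook is a deployed edge function; this plain Python function and its payload keys (messages, metadata, chunks, score, user_name) are assumptions for illustration only.

```python
from __future__ import annotations


def pre_hook(payload: dict, keep: int = 3) -> dict:
    """Hypothetical pre-hook: keep the best-scoring chunks and add per-user context."""
    # Re-rank: highest similarity first, then cap how many chunks reach the prompt.
    chunks = sorted(payload.get("chunks", []), key=lambda c: c["score"], reverse=True)[:keep]
    messages = list(payload["messages"])  # copy so the caller's list is untouched
    user_name = payload.get("metadata", {}).get("user_name")
    if user_name:
        messages.insert(0, {"role": "system", "content": f"The user's name is {user_name}."})
    return {**payload, "messages": messages, "chunks": chunks}
```

Returning a new payload rather than mutating the input keeps the hook easy to test in isolation.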

LLM (required)

Call any supported LLM provider, with automatic fallback chains and streaming.

OpenAI, Anthropic, Gemini, Groq, Azure, Workers AI
Fallback chains — if provider A fails, auto-route to B
Configure fallbacks per HTTP status code (for example, 429 or 503)
Full SSE streaming to your consumers
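On the consumer side, an OpenAI-compatible stream can be read with a small line parser. This sketch assumes the gateway emits OpenAI's chat-completion chunk format (data: lines carrying choices[].delta.content, ending with data: [DONE]); check it against real responses before relying on it.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style chat-completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, ": keep-alive" comments, other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        event = json.loads(data)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Feed it the decoded lines of the response body and join the yielded pieces to rebuild the full reply.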

Post-Hook

Run custom logic after streaming begins — log, transform metadata, trigger side effects.

Same edge function dispatch as pre-hook
Runs after the response starts streaming
Can modify metadata but not the response body
Useful for logging, analytics, and billing events
// Knowledge Base

An LLM gateway with a built-in RAG knowledge base.

Upload documents, pick an embedding model, and ground every response in your own content. No external vector database to wire up.

BGE Small · 384-d · fastest

BGE Base · 768-d · balanced

BGE Large · 1024-d · most accurate

Upload your docs

PDF, txt, md, json, csv, html, xml, yml, toml, and more — up to 5 MB each. Auto-chunked and indexed with live per-document status.
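The page does not document how auto-chunking splits a document. For intuition only, here is a generic fixed-size chunker with overlap, a common technique; the window and overlap sizes are arbitrary assumptions, not Aerostack defaults.

```python
from __future__ import annotations


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters, each sharing `overlap`
    characters with the previous one."""
    if size <= 0 or not (0 <= overlap < size):
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk.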

Live RAG test chat

Query your live vector index and see the exact chunk matches with similarity scores — so you know precisely what context the LLM will receive.

Tune retrieval

Top-k and score-threshold are configurable per pipeline stage, so you control how much context gets injected into each prompt.
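To see how the two settings interact, here is a toy retrieval step: the score threshold drops weak matches, then top-k caps how many survive. Cosine similarity as the score, the in-memory index of (text, vector) pairs, and the 2-d vectors are simplifying assumptions; real BGE embeddings have 384, 768, or 1024 dimensions.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], index: list[tuple[str, list[float]]],
             top_k: int = 3, threshold: float = 0.75) -> list[tuple[float, str]]:
    """Return up to top_k (score, text) pairs scoring at or above threshold, best first."""
    scored = [(cosine(query, vec), text) for text, vec in index]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

Raising the threshold trades recall for precision; lowering top-k shrinks how much context each prompt carries.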

// LLM Router & LLM Proxy

LLM router + OpenAI-compatible LLM proxy.

Consumers call your AI gateway's OpenAI-compatible LLM proxy endpoint — you control what runs behind it. Configure a chain of providers: if your primary fails, the next one picks up automatically, per status code. Or run Cloudflare Workers AI — 50+ edge-native models, no API key required.

Primary: Claude Sonnet · active
fails → Fallback 1: GPT-4o · on 429, 503 · standby
fails → Fallback 2: Gemini Flash · on any error · standby
fails → Fallback 3: Cloudflare Workers AI · zero-cost last resort, no API key · standby

Per-status-code routing

Route 429 (rate limit) to one provider, 503 (outage) to another. Fine-grained control.
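The chain can be modeled as an ordered walk. This is a sketch under two assumptions the page does not spell out: any status of 400 or higher counts as a failure, and a fallback's status list (such as "on 429, 503") refers to the status of the most recent failed attempt. Provider and route are hypothetical names; in Aerostack the chain is configured in the admin, not in code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Provider:
    name: str
    call: Callable[[dict], tuple[int, str]]  # returns (HTTP status, body)
    on_status: Optional[set[int]] = None     # statuses that trigger this fallback; None = any error


def route(request: dict, chain: list[Provider]) -> tuple[Optional[str], int, str]:
    """Try providers in order and return (provider name, status, body)."""
    last_status: Optional[int] = None
    for position, provider in enumerate(chain):
        is_fallback = position > 0
        if is_fallback and provider.on_status is not None and last_status not in provider.on_status:
            continue  # this fallback only handles specific upstream failures
        status, body = provider.call(request)
        if status < 400:
            return provider.name, status, body
        last_status = status  # remember the failure for the next fallback's rule
    return None, (last_status if last_status is not None else 502), "all providers failed"
```

With a 500 from the primary, a GPT-4o fallback limited to 429 and 503 is skipped, and a Gemini Flash fallback that accepts any error takes the request.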

Workers AI or BYOK

Use Cloudflare Workers AI with zero keys, or bring your own keys (BYOK) for OpenAI, Anthropic, Gemini, and Groq — stored as encrypted gateway secrets.

OpenAI-compatible proxy

Consumers send requests in the OpenAI format to your LLM proxy endpoint. You run Claude, Gemini, Groq, or Workers AI behind it.

// Plans & Limits

Define plans and rate limits per consumer.

Four billing models with full parameter control — plus one-click templates to deploy a configured gateway in seconds.

Free

Token-limited, no charge. The frictionless on-ramp for new consumers.

Flat rate

A fixed monthly price for a defined token allowance.

Metered

A base price plus a configurable overage per 1k tokens.

Tiered

A token allowance, then per-token overage once it is used up.
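The metered model is simple arithmetic. A minimal sketch, assuming overage applies only to tokens beyond the plan's allowance (as the FAQ describes) and is prorated rather than rounded up to whole 1k blocks; the function name and all prices are hypothetical.

```python
from decimal import Decimal


def metered_charge(tokens_used: int, base: Decimal, allowance: int, per_1k: Decimal) -> Decimal:
    """Base price plus prorated overage, priced per 1,000 tokens beyond the allowance."""
    overage_tokens = max(0, tokens_used - allowance)
    return base + Decimal(overage_tokens) / 1000 * per_1k
```

For example, a $20.00 base with a 1,000,000-token allowance and $0.002 per 1k overage bills 1,250,000 tokens as 20.00 + 250 × 0.002 = $20.50. Decimal keeps the currency math exact.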

Trial days

Offer a free trial window before a plan starts charging.

RPM limit

Cap requests per minute, per plan — not just globally.

TPM limit

Cap tokens per minute, per plan, to protect upstream cost.

Hard limits

Set a token allowance and a hard cutoff to stop runaway usage.
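Per-plan RPM and TPM caps can be sketched with fixed one-minute windows per consumer. Fixed windows are one common approach (sliding windows and token buckets are others), not necessarily what Aerostack uses; the sketch also assumes the token count is known or estimated before the call, and PlanLimits and Limiter are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlanLimits:
    rpm: int  # max requests per minute
    tpm: int  # max tokens per minute


@dataclass
class Window:
    minute: int = -1
    requests: int = 0
    tokens: int = 0


class Limiter:
    """Counts requests and tokens per consumer in fixed one-minute windows."""

    def __init__(self) -> None:
        self.windows: dict[str, Window] = {}

    def allow(self, consumer: str, plan: PlanLimits, tokens: int, now: float) -> bool:
        minute = int(now // 60)  # index of the current wall-clock minute
        w = self.windows.setdefault(consumer, Window())
        if w.minute != minute:
            w.minute, w.requests, w.tokens = minute, 0, 0  # new minute: reset counters
        if w.requests + 1 > plan.rpm or w.tokens + tokens > plan.tpm:
            return False  # over the plan's RPM or TPM; typically surfaced as HTTP 429
        w.requests += 1
        w.tokens += tokens
        return True
```

Because the limits come from the plan passed in, a free plan can be stricter than a paid one on the same gateway.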

Deploy from a template in seconds

Start from a library of one-click templates — including RAG-powered and function-backed presets with suggested plans and pipeline configs built in. Ship a free public endpoint first, then add a key-gated plan with usage limits when you are ready.

// Function-backed

Beyond routing — function-backed gateway APIs.

Not every AI API fits a standard LLM pipeline. Wire a gateway API directly to a deployed Cloudflare Worker for fully custom business logic — the AI API gateway becomes a configurable proxy to your own code.

Cloudflare Worker backend

Dispatch every gateway request to your deployed edge function. Receive the full request body, headers, and consumer metadata. Return any response shape you need.

Embeddable chat widget

Drop the hosted chat widget onto any site — it connects to your gateway API and inherits all your pipeline stages: RAG, moderation, rate limits, and billing.

One-click templates

Start from a function-backed gateway template with billing plans and pipeline config pre-wired. Go from zero to a live, custom AI API in minutes.

RAG works at three layers — not just the gateway

The gateway's built-in RAG knowledge base is one entry point. Aerostack also supports RAG in bot freestyle mode via the enable_rag flag — so every conversation is grounded in your docs automatically — and as a knowledge_retrieval workflow node you can wire anywhere in a multi-step agent graph. One knowledge base, three integration points.

// Consumer keys

Your API. Their keys.

Each consumer gets a unique API key. You control access, track usage, and bill per token — all automatic.

consumer-call.sh
# Your consumer calls your API — not OpenAI's
curl -X POST https://gateway.aerostack.dev/my-api/v1/chat/completions \
  -H "Authorization: Bearer ask_live_7f3a9c2e4b1d..." \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize Q4 revenue"}],"stream":true}'

# OpenAI-compatible response — any SDK works
# RAG context, moderation, and billing all happen transparently
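The same call can be made from Python with only the standard library; this sketch builds the request without sending it. The consumer key is a placeholder, and unlike the curl example it asks for a non-streaming response to keep reading simple.

```python
import json
import urllib.request

GATEWAY_URL = "https://gateway.aerostack.dev/my-api/v1/chat/completions"


def build_chat_request(api_key: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-format chat completion request for the gateway (not sent here)."""
    body = {"messages": [{"role": "user", "content": prompt}], "stream": stream}
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # JSON body, as in the curl call
        },
        method="POST",
    )
```

Send it with urllib.request.urlopen(req); if the gateway mirrors OpenAI's response shape, the reply text is at choices[0].message.content. With the official openai package, the equivalent is constructing OpenAI(base_url="https://gateway.aerostack.dev/my-api/v1", api_key="ask_live_...").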

One key per consumer

Issue API keys with an ask_live_ prefix. Keys are stored SHA-256 hashed, so the raw key is shown only once. Revoke or regenerate anytime.
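Issuing and checking prefixed, hashed keys takes a few standard-library calls. A minimal sketch: the ask_live_ prefix and SHA-256 come from the page, while the 32-hex-character random suffix and the function names are assumptions.

```python
from __future__ import annotations

import hashlib
import hmac
import secrets

PREFIX = "ask_live_"


def issue_key() -> tuple[str, str]:
    """Return (raw_key, sha256_hex): show raw_key to the consumer once, store only the hash."""
    raw_key = PREFIX + secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    return raw_key, hashlib.sha256(raw_key.encode()).hexdigest()


def verify_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare against the stored digest in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Storing only the digest means a leaked database does not reveal usable keys, which is why the raw key can be shown only once.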

Token wallet billing

Each consumer has a token balance, and every request deducts the tokens it uses. Set hard limits to prevent overspend.
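A token wallet can be sketched as a balance with a pre-call check and a post-call deduction. Assumptions: the hard limit means refusing calls the remaining balance cannot cover, usage is estimated before the call and charged from the actual count afterwards, and the balance is clamped at zero; Wallet is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class Wallet:
    balance: int  # tokens remaining for this consumer

    def can_afford(self, estimated_tokens: int) -> bool:
        """Hard limit: refuse a call up front if the balance cannot cover the estimate."""
        return 0 <= estimated_tokens <= self.balance

    def deduct(self, used_tokens: int) -> int:
        """Charge actual usage after the response, clamping the balance at zero."""
        self.balance = max(0, self.balance - used_tokens)
        return self.balance
```

Checking before the call and deducting after it lets the wallet bill real usage even when a response runs slightly past the estimate.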

OpenAI-compatible endpoint

Your consumers use the same /v1/chat/completions format they already know. Any OpenAI SDK works out of the box.

BYO-JWT — bring your own auth

Already have an auth system? Validate your own JWTs against your JWKS endpoint. No migration needed.

// Use Cases

What you can build.

AI Customer Support API

RAG pipeline answers from your docs. Moderation catches toxic inputs. Fallback switches providers if one goes down.

Knowledge Base Query API

Embed your docs, connect vector search, and expose an OpenAI-compatible endpoint. Consumers search with natural language.

Moderated Content Generation

Pre-flight moderation blocks unsafe prompts. Post-flight moderation filters unsafe responses. All before your user sees them.

OpenAI-compatible LLM proxy

Drop-in LLM proxy: consumers send the OpenAI format, you route to the cheapest provider that meets your latency SLA. Auto-fallback: OpenAI → Anthropic → Gemini → Groq.
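The routing rule described here, the cheapest provider within a latency SLA, can be sketched as a filter followed by a minimum. The page does not say how prices or latencies are measured; ProviderStats, its fields, and pick_provider are hypothetical, and using P95 latency as the SLA metric is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderStats:
    name: str
    usd_per_1k_tokens: float
    p95_latency_ms: float


def pick_provider(stats: list[ProviderStats], sla_ms: float) -> Optional[str]:
    """Cheapest provider whose observed P95 latency meets the SLA, or None if none does."""
    eligible = [s for s in stats if s.p95_latency_ms <= sla_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s.usd_per_1k_tokens).name
```

The selected provider could then head the fallback chain, with the remaining providers behind it.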

Function-backed custom API

Back a gateway with a deployed Cloudflare Worker function for fully custom business logic, or drop the embeddable chat widget onto any site.

// Why Aerostack

Not another API gateway.

Traditional gateways route traffic. This one adds intelligence.

Feature Kong / AWS API GW OpenRouter Aerostack
Built-in RAG pipeline
Content moderation stage
Pre/post processing hooks ~
Multi-provider fallback chains
Per-consumer plans & token limits
Consumer API key provisioning
BYO-JWT (your own auth)
Edge-deployed (300+ locations) ~
OpenAI-compatible endpoint
One-click deploy templates

Frequently asked questions

What is an LLM gateway, and why do I need one?
An LLM gateway is a managed layer that sits between your application and one or more LLM providers, giving you a single endpoint to handle routing, rate limiting, RAG retrieval, content moderation, and per-consumer usage tracking — without scattering that logic across every client. Without a gateway, provider error handling lives in application code, rate-limiting is rebuilt per project, and there is no central place to swap models or add billing. The Aerostack LLM gateway is part of the platform, so the same workspace that hosts your MCP servers, bots, and workflows also holds the gateway. Routing rules, BYOK secrets, and RAG context travel with the workspace rather than a separate config plane. It runs on Cloudflare Workers, is available now, and requires no code to configure.
What is the difference between an LLM proxy and an LLM gateway?
An LLM proxy forwards requests to a model provider and normalizes the API shape so different providers look the same to your code, but it adds no logic of its own. An LLM gateway wraps that proxy in a full pipeline: before the model call you can run RAG retrieval to inject knowledge-base context, apply a content moderation check to flag or block unsafe input, and enforce per-consumer rate limits; after the response you log token count, latency, and stream type. Aerostack exposes an OpenAI-compatible LLM proxy endpoint so existing SDK clients work without code changes, then layers routing, RAG, billing plans, and consumer keys on top of that same endpoint.
How does the built-in LLM router and automatic fallback work?
The LLM router lets you configure an ordered provider chain for each gateway API. Requests go to the first provider; if it returns a retriable error such as a 429 or 503, the router automatically retries the request on the next provider in the chain without the consumer seeing a failure. The fallback is transparent to callers and is recorded in the request log alongside the original attempt. You can chain as many providers as needed, for example GPT-4o primary, Claude secondary, and Cloudflare Workers AI as a zero-cost last resort. Because the routing config lives in the gateway rather than in application code, you reorder or swap providers without a deployment. This makes the Aerostack LLM router both a reliability layer and a cost-control tool.
Which LLM providers does the Aerostack LLM gateway support?
The LLM gateway supports OpenAI, Anthropic Claude, Google Gemini, and Groq via bring-your-own-key (BYOK). Keys are stored as encrypted gateway-scoped secrets, decrypted at the edge only at request time and never exposed to consumers. Cloudflare Workers AI is also available with no API key required: the model browser shows the full catalogue — text, code, image, embedding, and speech models — with cost tier and context window for each. You can build an entirely key-free LLM proxy using Workers AI, or place it as a fallback behind commercial providers in the router chain. Gateway APIs can also be function-backed, wired to a deployed Cloudflare Worker for custom handling beyond standard LLM routing.
Does the LLM gateway include a built-in RAG knowledge base?
Yes. Each gateway API can have its own RAG knowledge base. You upload documents (PDF, txt, md, json, csv, html, xml — up to 5 MB each) and the gateway auto-chunks and indexes them with live per-document status: indexing, ready, or error. Before going live, a built-in vector test chat lets you query the index and see the exact chunks that would be injected along with similarity scores, so you can verify retrieval quality before any consumer calls the endpoint. At setup you choose one of three embedding models: BGE Small (384-d, fastest), BGE Base (768-d, balanced), or BGE Large (1024-d, most accurate). The model locks after the first upload to keep the vector space consistent. Top-k and score-threshold are configurable per pipeline stage. No external vector database is required.
What billing plans and rate limits can I configure per gateway?
Each gateway supports four billing models. Free is token-limited at no charge, useful for public access or developer trials. Flat rate charges a fixed monthly price. Metered charges a base price plus a configurable overage per 1k tokens once the allowance is consumed. Tiered provides a fixed token allowance then switches to per-token overage. Every plan — including free — carries its own RPM (requests per minute) and TPM (tokens per minute) limits, so a free tier can be stricter than a paid one on the same gateway. Each plan also supports trial days and a hard token cap. This is what makes the LLM gateway pay for itself: metered and tiered plans capture revenue proportional to model usage with no custom billing code.
Can I deploy a gateway from a template, and what is the Free vs. Gateway deploy path?
Yes. A template library provides one-click gateway presets including RAG-powered knowledge APIs, function-backed gateways, and hosted AI APIs — each with suggested billing plans and a pre-built pipeline so you skip the blank-config step. When deploying from a template you choose one of two paths. The free public URL path creates an open endpoint with no consumer key required, suitable for public tools or internal use. The gateway path creates a key-gated endpoint with full consumer management, billing plans, and analytics. You can start on the free path to get a working LLM proxy endpoint immediately and add the key-gated billing layer later without changing the URL. This is the fastest way to go from zero to a live gateway.
How do I authenticate consumers, and what observability does the LLM gateway provide?
Authentication supports three modes: gateway-issued keys (ask_live_ prefix, SHA-256 hashed, shown once), BYO-JWT (JWKS URL plus a custom user-id claim, validated at the edge against your own auth system), and IP allowlists or blocklists enforced before any auth check. Content moderation runs in block mode (request rejected) or flag mode (logged and passed through). For observability, the request log stores the last 200 requests with live-tail polling: each row shows consumer ID, model and provider, tokens consumed, stream type (SSE or WebSocket), and latency. Filters cover consumer, model, provider, and a slow-only view for requests over two seconds. Derived stats include total tokens, average and P95 latency. Analytics breaks usage by consumer and by model-provider over 24h, 7d, and 30d.
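The derived latency stats are straightforward to reproduce from a list of request latencies. The page does not state which percentile method it uses; this sketch assumes nearest-rank P95, and latency_stats is a hypothetical helper, not part of the product.

```python
from __future__ import annotations

from statistics import mean


def latency_stats(latencies_ms: list[float], slow_ms: float = 2000) -> dict:
    """Average latency, nearest-rank P95, and the slow-only subset of a request log."""
    if not latencies_ms:
        return {"avg_ms": 0.0, "p95_ms": 0.0, "slow": []}
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return {
        "avg_ms": mean(ordered),
        "p95_ms": ordered[rank - 1],
        "slow": [ms for ms in latencies_ms if ms > slow_ms],
    }
```

With 18 requests at 100 ms plus one at 2,500 ms and one at 3,000 ms, the average is 365 ms, yet P95 lands on 2,500 ms. That gap is why a P95 figure and a slow-only view surface tail problems that an average hides.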
What is an AI gateway, and how does it differ from an AI API gateway?
An AI gateway is the AI-specific evolution of a traditional API gateway. Where a conventional AI API gateway handles routing and authentication, an AI gateway adds LLM-native intelligence: RAG retrieval to inject knowledge-base context into every prompt, content moderation to block or flag unsafe inputs, provider-aware fallback chains that understand LLM-specific errors like rate limits (429) and model overloads (503), and per-consumer billing tied to token consumption rather than raw request count. The Aerostack AI gateway is deployed at the edge on Cloudflare Workers (300+ locations) so every request is handled close to the consumer. It exposes an OpenAI-compatible endpoint, so existing SDK clients require no code changes to point at it. You configure the entire pipeline — RAG, moderation, routing, billing — from the admin without writing or deploying any gateway code.
What is a function-backed gateway API, and when would I use one?
A function-backed gateway API replaces the standard LLM pipeline stage with a deployed Cloudflare Worker — your own edge function — as the backend handler. Instead of routing the request to OpenAI, Claude, or another model provider, the gateway dispatches it to your Worker, which receives the full request body, consumer metadata, and any pipeline context, then returns any response shape it wants. This is the right choice when your AI API needs custom business logic that does not fit neatly into the standard moderation → RAG → LLM chain: for example, aggregating multiple model calls, applying proprietary scoring, calling internal services, or returning structured data rather than a chat stream. Function-backed APIs still benefit from all gateway features — consumer key auth, BYO-JWT, rate limiting, billing plans, and request logging — so you get the full AI API gateway infrastructure without giving up control of the core logic. The template library includes function-backed presets to get started in minutes.

Launch your AI gateway.
Configured, not coded.

LLM proxy endpoint. RAG knowledge base. Moderation. LLM router. Plans & limits. All configured from the admin.