Every team building on AI eventually asks the same question: what is the cheapest LLM API? Token costs are small per call but add up fast at scale, and the wrong setup can quietly double your bill. The catch is that “cheapest” is not simply the lowest sticker price on a single model — it depends on markups you may not see, whether you send every task to an expensive model, and how efficiently you use the tokens you pay for.
This is where a tool like
OrcaRouter changes the math. As a zero-markup
AI gateway, it lets you pay each provider’s real token price with nothing added on top, then route cheap, high-volume work to inexpensive models and reserve premium models for the hard tasks. Below, we break down what actually makes an LLM API cheap, the hidden costs to watch for, and four concrete ways to cut your bill without sacrificing quality.
What actually makes an LLM API cheap
Sticker price per token is only one input. The real cost of an LLM API is the price you pay to complete a task well — and that is shaped by three things beyond the headline rate: whether a markup is added on top of the provider’s price, whether you are using a model that is the right size for each task, and how many tokens you burn getting a usable answer. Optimize all three and a “more expensive” model can end up cheaper per finished job than a bargain one you have to
The hidden markup
Many API aggregators quietly add a margin — often 5–20% — on top of the provider’s token price. On a hobby project that is invisible; at production volume it is a recurring tax on every single call, for a service you could get without the surcharge. The first move toward the cheapest LLM API is simply to stop paying markup: choose a gateway that charges nothing on tokens and passes the provider’s published rate straight through.
Four ways to actually cut cost
Once the markup is gone, four levers do the heavy lifting:
Zero markup — pay the provider’s real token rate, not a marked-up one
Route to the right model — send simple, high-volume tasks (classification, extraction, short summaries) to small, cheap models and escalate only hard tasks to premium models
Cache repeated context — reuse stable prompt prefixes so you are not billed to re-process the same tokens on every call
Right-size the effort — use lighter reasoning settings for easy tasks and heavier ones only where they change the answer
A gateway makes all four a matter of configuration rather than custom engineering, which is why cost-conscious teams standardize on one.
Compare cost fairly: per completed task
The fairest way to compare LLM APIs is not price per token but cost per completed task — the total spend, including retries and failed attempts, to get an answer you can ship. A cheaper model that needs three tries can cost more than a pricier model that nails it once. Measure the finished-job cost across a representative sample before you commit, and let routing send each task to whichever model wins on that metric.
How OrcaRouter keeps it cheap
OrcaRouter is built for exactly this. It adds zero markup, so you pay providers’ published rates directly; it reaches 200+ models through one OpenAI-compatible endpoint, so you can always pick the cheapest capable model; it supports smart routing and prompt caching to trim spend automatically; and it starts on a free plan. The result is the practical definition of the cheapest LLM API: real provider prices, no surcharge, and the freedom to route every task to its most cost-effective model.
How to get started
1. Create a free account and generate an API key
2. Point your existing OpenAI SDK at the gateway’s base URL
3. Route high-volume, simple tasks to cheap models; reserve premium models for hard ones
4. Turn on caching for repeated context, and track cost per task in one dashboard
A simple cost-control playbook
If you want a starting point, here is a playbook that works for most teams. First, remove markup by choosing a zero-markup gateway, so every rate you see is the provider’s real price. Second, classify your traffic: most apps have a large volume of simple calls (routing, tagging, short summaries) and a smaller volume of hard ones. Send the simple majority to a small, cheap model and let only the hard minority reach a premium model. Third, turn on caching for any prompt with a stable prefix — system instructions, few-shot examples, retrieved context — so you stop paying to re-read the same tokens on every call. Fourth, review cost per completed task weekly and adjust the routing rules; the cheapest setup is rarely static as your traffic mix shifts.
Teams that follow this playbook routinely cut their AI bill by a large margin without users noticing any drop in quality — because the premium models are still doing the premium work, just not the cheap work they were quietly overpaying for. One more habit helps: benchmark before you assume. Run a representative sample of your real tasks across a few candidate models through the same endpoint, compare cost per completed task, and let the data pick your default and your escalation model. Because a gateway makes swapping models trivial, this experiment costs an afternoon and often pays for itself immediately.
The bottom line
The cheapest LLM API isn’t a model — it’s a setup: real provider rates with zero markup, cheap models doing the cheap work, caching doing the repetitive work, and premium models reserved for the calls that earn their price. Build that once and your AI bill stops being a mystery and starts being a dial you control.
Want the cheapest path to every model? Start free with
OrcaRouter — zero markup, 200+ models through one OpenAI-compatible endpoint, with smart routing and caching to cut costs.
FAQ
How much of a typical LLM bill is output vs input tokens?
Output usually dominates even though there are fewer output tokens, because they’re priced several times higher than input. That’s why capping response length and telling models to answer concisely are two of the cheapest optimizations available.
Are batch or off-peak discounts worth using?
For non-urgent workloads, yes — several providers offer significant discounts for asynchronous batch processing where results come back within hours instead of seconds. It suits nightly summarization or bulk classification, not user-facing chat.
Does prompt length affect cost even if the answer is short?
Yes — every input token is billed on every call, so a bloated system prompt is a recurring tax. Trim instructions you don’t need, and cache the stable prefix so repeated context bills at the cheaper cached rate instead of full price.