"The prompt is too long" / Model Context Length Exceeded
What you're seeing
Errors like:
The prompt is too long: 207601, model maximum context length: 202751This model's maximum context length is 128000 tokens. However, your messages resulted in …Input is too long for the modelcontext length exceeded
These come from the model provider (OpenAI, Anthropic, Google, your Ollama server, GLM-4/5.x, etc.), not from Open WebUI. The provider counted the tokens of everything you sent and rejected the request because it exceeds the model's context window.
Why it happens
The "prompt" a model sees is the entire conversation — not just the message you just typed. Every time you send a new message, Open WebUI forwards:
- Your system prompt
- The full chat history (every previous user/assistant turn in that conversation)
- Any attached files that are inlined into context (not retrieved via RAG)
- Any tool definitions and prior tool call results
- Any inlet-injected context (from filters, RAG, web search, memories, etc.)
- Your newest user message
As a chat grows, the history grows. Large attachments or long tool-call outputs can eat the entire window in a single turn. Once the sum of all of that exceeds the model's context window, the provider rejects the request.
Why Open WebUI doesn't auto-truncate for you
Open WebUI intentionally does not ship a built-in context trimmer. This is a design choice, not an oversight, and it is unlikely to change. Here's why:
- Every model uses a different tokenizer. The token count for the same text differs between OpenAI (tiktoken), Anthropic, Gemini, GLM, Llama-family, Mistral, Qwen, and so on. A truly correct trimmer would need a per-model tokenizer for every provider in existence. Getting that wrong ships silent data corruption.
- Every model has a different context window. 8k, 32k, 128k, 200k, 1M — and that's before you factor in reserved output tokens, provider-side overhead, and multimodal content.
- Everyone wants a different truncation policy. We have seen users ask for all of the following, and all of them are reasonable:
- Trim by token count.
- Trim by number of messages.
- Trim by number of conversational turns.
- Trim only non-system, non-assistant messages.
- Trim file attachments first, keep the dialogue.
- Trim tool-call results first, keep everything else.
- Set a hard ceiling on chat length (block further messages beyond N turns).
- Summarize older messages instead of dropping them, and replace the dropped block with the summary.
- Per-model policies (keep 1M tokens for Gemini, 400k for GPT-5, 32k for smaller local models).
There is no single policy that is correct for every deployment, every user, and every model. A built-in implementation would be wrong for most users by definition, and would hide the much better option: give the user the hook and let them pick.
The supported way: use a filter Function
Context management in Open WebUI is done with filter Functions. inlet() runs on every request before the payload is sent to the model — it receives the full body (including body["messages"]) and can modify it freely. That is the hook you use.
Typical approaches, in increasing order of sophistication:
- Hard chat-length cap. Refuse or error if
len(body["messages"]) > N. Simple and predictable; no tokenization needed. - Newest-N-turns window. Keep the system prompt and only the most recent N user/assistant turns; drop the older ones.
- Token-budget window, per model. Estimate tokens per message (e.g., with
tiktokenfor OpenAI-family or a char/4 heuristic for others) and trim from the oldest non-system message until the total fits the model's window. - Summarize-and-replace. When the window is about to overflow, call a cheap model to summarize the oldest block of messages, then replace that block with a single assistant-authored summary message. Preserves long-running context without busting the window.
- Attachment- or tool-output-first trimming. Strip large file contents or tool results from old turns before touching the dialogue.
Community filters for most of these already exist on the Open WebUI community site. Install one, configure its valves, and you're done. If none fits your policy exactly, copy the closest one into the Functions admin page and edit it — filters are pure Python and easy to tweak.
Minimal example: "newest N turns" filter
Show the full filter code (keeps the last N non-system messages)
from pydantic import BaseModel, Field
class Filter:
class Valves(BaseModel):
priority: int = Field(
default=0,
description="Run before other filters that depend on the final message list.",
)
max_turns: int = Field(
default=20,
description="Maximum number of non-system messages to keep (older are dropped).",
)
def __init__(self):
self.valves = self.Valves()
async def inlet(self, body: dict) -> dict:
messages = body.get("messages", [])
if not messages:
return body
system_msgs = [m for m in messages if m.get("role") == "system"]
other_msgs = [m for m in messages if m.get("role") != "system"]
if len(other_msgs) > self.valves.max_turns:
other_msgs = other_msgs[-self.valves.max_turns :]
# Tool-call repair: after slicing, the new leading messages
# might be orphaned tool-call results or an assistant whose
# tool_calls reference tool messages that got dropped.
# Providers (OpenAI / Anthropic / …) 400 on those — so prune
# until the window starts on something the provider accepts.
while other_msgs and other_msgs[0].get("role") == "tool":
other_msgs.pop(0)
if (
other_msgs
and other_msgs[0].get("role") == "assistant"
and other_msgs[0].get("tool_calls")
):
expected = {tc.get("id") for tc in other_msgs[0]["tool_calls"]}
seen = {
m.get("tool_call_id")
for m in other_msgs[1:]
if m.get("role") == "tool"
}
if not expected.issubset(seen):
other_msgs.pop(0)
body["messages"] = system_msgs + other_msgs
return bodyEnable this filter globally or attach it to specific models in Admin Panel → Functions. The max_turns valve is configurable per-model via the model card, so you can set a smaller window for local 8k models and a larger one for Gemini 1M.
With tool calling on, an assistant message that invokes tools is paired with one or more tool messages carrying results that share the same tool_call_id. If max_turns happens to slice the conversation in the middle of that pair — keeping the orphan half — the upstream provider returns a 400 because the tool call / result structure is invalid. The repair block drops the orphans so the window always starts on a clean boundary. This matches what production community filters for context management do; the rest of the filter is the generic trimming logic.
tool_use ids were found without tool_result blocksA second source of the same 400 is stored output that is already incomplete — a tool result never got written (the call was interrupted, or a knowledge base changed mid-chat), or a tool call is missing while its result survived. Strict providers (Anthropic, AWS Bedrock Converse) reject this with 400 ... tool_use ids were found without tool_result blocks (or the mirror case, a tool_result with no matching tool_use).
As of v0.9.6 Open WebUI reconciles these when it reconstructs a conversation: unpaired tool_use / tool_result entries are dropped before the request is sent, so resuming a chat with an interrupted tool call no longer hard-fails. Well-formed history is untouched. This is independent of the filter above — the filter still matters because trimming can create fresh orphans after reconstruction, which the server-side pass (run earlier, on the stored output) does not see. If you still hit this error, confirm you are on v0.9.6 or later.
Slightly more involved: per-model token budget
Counting turns is easy to reason about but wrong in practice — 40 turns of one-liners fit in 8k tokens, five turns with a 200-page PDF attachment do not. The more useful policy is "keep everything until we're about to bust the model's context window, then drop the oldest non-system messages until we fit."
This second example does that. It:
- Counts tokens with
tiktoken, which ships with Open WebUI so there is no extra dependency to install. It falls back to a char/4 estimate only if the encoding can't be loaded. - Adds a worst-case token allowance for every uploaded image, so a chat full of 4K screenshots still gets trimmed before it busts the window.
- Reads per-model budgets from a valve, so a single instance of the filter works for your 8k local model and your 1M Gemini at the same time.
- Leaves a configurable headroom for the response.
- Re-applies the tool-call repair from the first example after trimming.
Show the full filter code (per-model token-budget trimmer)
import json
import os
from pydantic import BaseModel, Field
try:
import tiktoken # ships with Open WebUI, no extra install needed
except ImportError:
tiktoken = None
class Filter:
class Valves(BaseModel):
priority: int = Field(
default=0,
description="Run before other filters that depend on the final message list.",
)
default_budget_tokens: int = Field(
default=8000,
description="Fallback input-token budget for any model not listed in model_budgets.",
)
response_headroom_tokens: int = Field(
default=2000,
description="Tokens to reserve for the model's reply. Trimmed from the budget before fitting.",
)
tiktoken_encoding: str = Field(
default=os.getenv("TIKTOKEN_ENCODING_NAME", "cl100k_base"),
description=(
"Fallback tiktoken encoding when the model's own is unknown. "
"cl100k_base ships pre-cached with Open WebUI and works offline; "
"o200k_base covers recent OpenAI models but is fetched on first use."
),
)
tokens_per_image: int = Field(
default=1600,
description=(
"Worst-case tokens charged per image part (assume a 4K, high-detail "
"upload). OpenAI high-detail tops out near 1445, Claude near 1590; "
"raise it for Gemini high-resolution tiling, lower it for low-detail."
),
)
model_budgets_json: str = Field(
default=(
'{\n'
' "gpt-5.5": 1000000,\n'
' "claude-opus-4-8": 1000000,\n'
' "claude-sonnet-4-6": 1000000,\n'
' "gemini-3.5-flash": 1000000,\n'
' "qwen3.6": 262144,\n'
' "deepseek-v4": 1000000\n'
'}'
),
description="JSON mapping of model id (or prefix) to input-token budget.",
)
def __init__(self):
self.valves = self.Valves()
self._encoders = {} # cache loaded tiktoken encoders
# ---- helpers -----------------------------------------------------------
def _encoder(self, model_id: str):
# Cached per-model encoder; None if tiktoken or the vocab can't load.
if tiktoken is None:
return None
key = model_id or self.valves.tiktoken_encoding
if key not in self._encoders:
enc = None
try:
enc = tiktoken.encoding_for_model(model_id)
except Exception:
try:
enc = tiktoken.get_encoding(self.valves.tiktoken_encoding)
except Exception:
enc = None
self._encoders[key] = enc
return self._encoders[key]
def _estimate_tokens(self, content, model_id: str = "") -> int:
if content is None:
return 0
if isinstance(content, str):
enc = self._encoder(model_id)
if enc is not None:
# disallowed_special=() so literal "<|endoftext|>" text can't raise.
return len(enc.encode(content, disallowed_special=()))
return max(1, len(content) // 4) # fallback if tiktoken can't load
# Some providers deliver multimodal content as a list of parts.
if isinstance(content, list):
total = 0
for part in content:
if not isinstance(part, dict):
continue
if part.get("type") == "image_url" or "image_url" in part:
total += self.valves.tokens_per_image # worst-case per image
else:
total += self._estimate_tokens(part.get("text", ""), model_id)
return total
return 0
def _message_tokens(self, msg: dict, model_id: str = "") -> int:
# Content + a small per-message overhead for role/formatting.
tokens = self._estimate_tokens(msg.get("content"), model_id)
# Tool calls carry arguments in JSON; count them too.
for tc in msg.get("tool_calls") or []:
args = tc.get("function", {}).get("arguments", "")
tokens += self._estimate_tokens(args, model_id)
return tokens + 4
def _budget_for(self, model_id: str) -> int:
try:
budgets = json.loads(self.valves.model_budgets_json or "{}")
except Exception:
budgets = {}
if model_id in budgets:
return int(budgets[model_id])
# Prefix match: "claude-sonnet-4-6-20260514" uses the "claude-sonnet-4-6"
# budget. Longest key first so "gpt-5.5-mini" beats "gpt-5.5".
for key, value in sorted(budgets.items(), key=lambda kv: -len(kv[0])):
if model_id.startswith(key):
return int(value)
return self.valves.default_budget_tokens
@staticmethod
def _repair_tool_calls(other_msgs: list[dict]) -> list[dict]:
while other_msgs and other_msgs[0].get("role") == "tool":
other_msgs.pop(0)
if (
other_msgs
and other_msgs[0].get("role") == "assistant"
and other_msgs[0].get("tool_calls")
):
expected = {tc.get("id") for tc in other_msgs[0]["tool_calls"]}
seen = {
m.get("tool_call_id")
for m in other_msgs[1:]
if m.get("role") == "tool"
}
if not expected.issubset(seen):
other_msgs.pop(0)
return other_msgs
# ---- inlet -------------------------------------------------------------
async def inlet(self, body: dict) -> dict:
messages = body.get("messages", [])
if not messages:
return body
model_id = body.get("model", "") or ""
budget = self._budget_for(model_id) - self.valves.response_headroom_tokens
if budget <= 0:
return body # Misconfigured — don't mangle the request, let the provider reject.
system_msgs = [m for m in messages if m.get("role") == "system"]
other_msgs = [m for m in messages if m.get("role") != "system"]
used = sum(self._message_tokens(m, model_id) for m in system_msgs + other_msgs)
# Drop oldest non-system messages one at a time until we're under budget
# or nothing is left to drop. System messages stay put; if they alone
# already exceed the budget, the provider will reject the request and
# that's the right signal (the admin needs to shrink the system prompt).
while used > budget and other_msgs:
dropped = other_msgs.pop(0)
used -= self._message_tokens(dropped, model_id)
other_msgs = self._repair_tool_calls(other_msgs)
body["messages"] = system_msgs + other_msgs
return bodyA few things worth noticing:
- Configure once, run everywhere. Set this filter as a global filter in Admin Panel → Functions. The
model_budgets_jsonvalve lets you enumerate every model you care about; anything else falls back todefault_budget_tokens. Admins can tune budgets at runtime without touching code. - Prefix match on model id, longest-first.
claude-sonnet-4-6-20260514transparently uses theclaude-sonnet-4-6budget. If you list nested ids likegpt-5.5andgpt-5.5-mini,_budget_forsorts keys by length descending before the prefix loop so the more specific one wins; otherwise dict insertion order would decide andgpt-5.5could shadowgpt-5.5-minifor anyone who listed it first. - One tokenizer, every model. Counts come from
tiktoken, loaded once and cached.encoding_for_model(model_id)returns the model's own encoding when the installed tiktoken recognizes the id, otherwise it falls back totiktoken_encoding(defaultcl100k_base). tiktoken only ships OpenAI encodings, and only for models older than its release, so Claude, Gemini, Qwen, DeepSeek and any brand-new OpenAI model all use the fallback. That is an approximation, which is more than good enough for a trim budget with headroom.cl100k_baseis the one Open WebUI pre-caches, so it works offline; the newero200k_baseis not pre-cached and is fetched on first use, so an air-gapped box either needs it inTIKTOKEN_CACHE_DIRor falls back tocl100k_baseautomatically. - Images get a worst-case allowance; audio and files do not. Open WebUI hands images to the model as
image_urlcontent parts, and the filter can't see their resolution without decoding the base64, so it charges a flattokens_per_image(default 1600) for each one rather than undercounting and busting the window. That default is sized for a 4K, high-detail upload: OpenAI high-detail tops out near 1445 tokens (85 + 170 per 512px tile), Claude near 1590 (about width × height / 750 after its resize). Gemini high-resolution tiling can run higher, so raise the valve if you lean on Gemini with many images. Audio and file parts still count as zero; add their own allowance the same way if you need it. - Same tool-call repair. Reused from the first example. This is the block that keeps the request valid after trimming.
- Fail-open when misconfigured. If you somehow set the headroom larger than the budget, the filter passes the request through untouched rather than wiping the conversation. The provider's error is better than a silent delete.
Open WebUI doesn't always present the raw provider id to body["model"]. If an admin sets a connection prefix_id, every model is wrapped as {prefix}.{raw_id} (e.g. openai.gpt-5.5). Pipe-function manifolds wrap their sub-models as {pipe.id}.{sub_id} (e.g. anthropic.claude-sonnet-4-6-20260514). Custom Workspace models can have arbitrary ids, often UUIDs.
Copy the exact id shown in the model picker into model_budgets_json — not the upstream provider's id. If you get the format wrong, requests silently land on default_budget_tokens and you won't notice until a chat that fits a real budget fails to fit the fallback.
This filter runs in inlet(), which is before Open WebUI's RAG retrieval (chat_completion_files_handler) and before native-tool definitions are attached to the payload. Both can add non-trivial bytes to the request after the filter has trimmed. If you rely on Knowledge bases or if your models have heavy built-in tool specs (web search + memory + code interpreter + MCP servers + …), reserve extra headroom by bumping response_headroom_tokens — it doubles as a general "leave room for post-filter additions" budget.
This example already counts with tiktoken (via encoding_for_model(model_id) and a cached fallback encoding), which is exact for the OpenAI models tiktoken recognizes and a solid approximation for everything else. If you need higher-fidelity counts for a non-OpenAI provider, swap _estimate_tokens for that provider's own tokenizer (Anthropic's, Gemini's, or a transformers tokenizer for a local model). For a trim budget the tiktoken approximation stays close enough to keep you safely under the limit, as long as you've left enough headroom for the RAG / tool additions above.
You almost certainly want a community filter, not this one
The two examples on this page are deliberately minimal — they exist to show the shape of the inlet() hook and to teach the one non-obvious detail (tool-call repair). For a real deployment, don't write your own from scratch and don't ship these as-is. Go browse the Open WebUI Community and pick a context-management filter someone else has already battle-tested.
Production-grade community filters typically handle things the minimal examples above skip:
- Real tokenizers per provider —
tiktokenfor OpenAI, Anthropic's tokenizer for Claude, Gemini's for Google,transformerstokenizers for local models. The example on this page uses one tiktoken encoding for every model (exact for OpenAI, approximate elsewhere); a production filter uses each provider's own tokenizer. - Proper image / audio / file token accounting — exact per-provider, per-resolution counts for every content-part type, instead of one worst-case constant for images and zero for audio and files.
- Summarize-and-replace strategies — when the window is about to overflow, call a cheap model to summarize the oldest block and replace it with one summary message, preserving long-running context instead of silently forgetting.
- Per-user / per-role policies — power users get larger budgets than free users; service accounts get different defaults than humans.
- Per-model-family policies — more intelligent than a prefix match (e.g. recognize all Claude 3.x Sonnet variants via a regex or metadata).
- Tool-result-first or attachment-first trimming — drop the giant scraped web pages and RAG citations from old turns before touching dialogue.
- Sliding-window summarization with checkpoints — keep running summaries stored in
__metadata__across turns so you don't re-summarize on every request. - Hard message caps and user-facing errors — refuse a request with a friendly "this chat is too long, please start a new one" event-emitter message instead of silently dropping context.
- Observability hooks — log every trim decision to Langfuse, OpenLit, or your stack of choice so you can audit what the filter actually did.
- Configurable valves for everything — admins tune everything at runtime without touching code.
None of that is hard to do, but all of it together is a week of work if you're starting from one of the minimal examples above. Someone on the community site has almost certainly already done it. Search first.
When you're shopping for a context-management filter, look for names like context window, trim, summarize, conversation length, token budget, history limiter, and the provider name of the models you use. Sort by popularity on the community site — the top-downloaded filters tend to be the ones that already solved the edge cases you haven't hit yet.
What users will experience
- With a filter in place, old turns are silently removed / summarized / replaced before the request reaches the model. The user keeps chatting as normal. The model simply "forgets" older history according to your policy.
- Without a filter, long conversations will eventually hit the provider's context limit and return the "prompt is too long" error. Users will need to start a new chat.
Both are valid UX choices. Pick the one that matches your deployment.
Related
- Filter Functions — the full reference for
inlet()/stream()/outlet() - Open WebUI Community — browse and install community-built filters, including context-management ones
- Chat Parameters — per-chat, per-user, and per-model parameter precedence