# "The prompt is too long" / Model Context Length Exceeded
## What you're seeing
Errors like:
```text
The prompt is too long: 207601, model maximum context length: 202751
This model's maximum context length is 128000 tokens. However, your messages resulted in …
Input is too long for the model
context length exceeded
```
These come from the model provider (OpenAI, Anthropic, Google, your Ollama server, GLM-4/5.x, etc.), not from Open WebUI. The provider counted the tokens of everything you sent and rejected the request because it exceeds the model's context window.
## Why it happens
The "prompt" a model sees is the entire conversation — not just the message you just typed. Every time you send a new message, Open WebUI forwards:
- Your system prompt
- The full chat history (every previous user/assistant turn in that conversation)
- Any attached files that are inlined into context (not retrieved via RAG)
- Any tool definitions and prior tool call results
- Any inlet-injected context (from filters, RAG, web search, memories, etc.)
- Your newest user message
As a chat grows, the history grows. Large attachments or long tool-call outputs can eat the entire window in a single turn. Once the sum of all of that exceeds the model's context window, the provider rejects the request.
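Concretely, the forwarded payload is one ever-growing `messages` array in the OpenAI-style chat format. The model id and message contents in this sketch are made up for illustration:

```python
# Illustrative sketch of the body Open WebUI forwards on each turn.
# Every prior turn rides along; only the last entry is "new".
body = {
    "model": "gpt-4o",  # hypothetical model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the attached report: <inlined file text…>"},
        {"role": "assistant", "content": "Here is a summary of the report…"},
        {"role": "user", "content": "Now translate the summary to German."},  # newest message
    ],
}

# Rough char/4 token estimate; the history (and any inlined files) dominate it:
approx_tokens = sum(len(m["content"]) for m in body["messages"]) // 4
```

Once that estimate creeps past the model's window, every subsequent request in the chat fails until something is trimmed.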
## Why Open WebUI doesn't auto-truncate for you
Open WebUI intentionally does not ship a built-in context trimmer. This is a design choice, not an oversight, and it is unlikely to change. Here's why:
- Every model uses a different tokenizer. The token count for the same text differs between OpenAI (tiktoken), Anthropic, Gemini, GLM, Llama-family, Mistral, Qwen, and so on. A truly correct trimmer would need a per-model tokenizer for every provider in existence. Getting that wrong ships silent data corruption.
- Every model has a different context window. 8k, 32k, 128k, 200k, 1M — and that's before you factor in reserved output tokens, provider-side overhead, and multimodal content.
- Everyone wants a different truncation policy. We have seen users ask for all of the following, and all of them are reasonable:
  - Trim by token count.
  - Trim by number of messages.
  - Trim by number of conversational turns.
  - Trim only non-system, non-assistant messages.
  - Trim file attachments first, keep the dialogue.
  - Trim tool-call results first, keep everything else.
  - Set a hard ceiling on chat length (block further messages beyond N turns).
  - Summarize older messages instead of dropping them, and replace the dropped block with the summary.
  - Per-model policies (keep 1M tokens for Gemini, 128k for GPT-4, 32k for smaller local models).
There is no single policy that is correct for every deployment, every user, and every model. A built-in implementation would be wrong for most users by definition, and would hide the much better option: give the user the hook and let them pick.
## The supported way: use a filter Function
Context management in Open WebUI is done with filter Functions. `inlet()` runs on every request before the payload is sent to the model — it receives the full body (including `body["messages"]`) and can modify it freely. That is the hook you use.
Typical approaches, in increasing order of sophistication:
- Hard chat-length cap. Refuse or error if `len(body["messages"]) > N`. Simple and predictable; no tokenization needed.
- Newest-N-turns window. Keep the system prompt and only the most recent N user/assistant turns; drop the older ones.
- Token-budget window, per model. Estimate tokens per message (e.g., with `tiktoken` for OpenAI-family models or a char/4 heuristic for others) and trim from the oldest non-system message until the total fits the model's window.
- Summarize-and-replace. When the window is about to overflow, call a cheap model to summarize the oldest block of messages, then replace that block with a single assistant-authored summary message. Preserves long-running context without busting the window.
- Attachment- or tool-output-first trimming. Strip large file contents or tool results from old turns before touching the dialogue.
Community filters for most of these already exist on the Open WebUI community site. Install one, configure its valves, and you're done. If none fits your policy exactly, copy the closest one into the Functions admin page and edit it — filters are pure Python and easy to tweak.
## Minimal example: "newest N turns" filter
Full filter code (keeps the last N non-system messages):
```python
from pydantic import BaseModel, Field


class Filter:
    class Valves(BaseModel):
        priority: int = Field(
            default=0,
            description="Run before other filters that depend on the final message list.",
        )
        max_turns: int = Field(
            default=20,
            description="Maximum number of non-system messages to keep (older are dropped).",
        )

    def __init__(self):
        self.valves = self.Valves()

    async def inlet(self, body: dict) -> dict:
        messages = body.get("messages", [])
        if not messages:
            return body

        system_msgs = [m for m in messages if m.get("role") == "system"]
        other_msgs = [m for m in messages if m.get("role") != "system"]

        if len(other_msgs) > self.valves.max_turns:
            other_msgs = other_msgs[-self.valves.max_turns :]

            # Tool-call repair: after slicing, the new leading messages
            # might be orphaned tool-call results or an assistant whose
            # tool_calls reference tool messages that got dropped.
            # Providers (OpenAI / Anthropic / …) 400 on those — so prune
            # until the window starts on something the provider accepts.
            while other_msgs and other_msgs[0].get("role") == "tool":
                other_msgs.pop(0)
            if (
                other_msgs
                and other_msgs[0].get("role") == "assistant"
                and other_msgs[0].get("tool_calls")
            ):
                expected = {tc.get("id") for tc in other_msgs[0]["tool_calls"]}
                seen = {
                    m.get("tool_call_id")
                    for m in other_msgs[1:]
                    if m.get("role") == "tool"
                }
                if not expected.issubset(seen):
                    other_msgs.pop(0)

        body["messages"] = system_msgs + other_msgs
        return body
```

Enable this filter globally or attach it to specific models in Admin Panel → Functions. The `max_turns` valve is configurable per-model via the model card, so you can set a smaller window for local 8k models and a larger one for Gemini 1M.
With tool calling on, an assistant message that invokes tools is paired with one or more `tool` messages carrying results that share the same `tool_call_id`. If `max_turns` happens to slice the conversation in the middle of that pair — keeping the orphan half — the upstream provider returns a 400 because the tool call / result structure is invalid. The repair block drops the orphans so the window always starts on a clean boundary. This matches what production community filters for context management do; the rest of the filter is the generic trimming logic.
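To see the repair in isolation, here is the orphan-pruning step applied to a synthetic window that was sliced right after an assistant tool call (roles follow the OpenAI chat format; ids and contents are made up):

```python
# The slice dropped the assistant message that issued tool call "call_1",
# leaving its result orphaned at the front of the window.
window = [
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp": 21}'},
    {"role": "user", "content": "Thanks. And tomorrow?"},
    {"role": "assistant", "content": "Tomorrow looks similar."},
]

# Same pruning rule as the filter: drop leading tool results until the
# window starts on a message the provider will accept.
while window and window[0].get("role") == "tool":
    window.pop(0)
```

After pruning, the window starts on the `user` message and the request validates.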
## Slightly more involved: per-model token budget
Counting turns is easy to reason about but wrong in practice — 40 turns of one-liners fit in 8k tokens, five turns with a 200-page PDF attachment do not. The more useful policy is "keep everything until we're about to bust the model's context window, then drop the oldest non-system messages until we fit."
This second example does that. It:
- Estimates tokens from character length (cheap heuristic, no dependencies; swap in `tiktoken` or a real tokenizer if you want strict counts).
- Reads per-model budgets from a valve, so a single instance of the filter works for your 8k local model and your 1M Gemini at the same time.
- Leaves a configurable headroom for the response.
- Re-applies the tool-call repair from the first example after trimming.
Full filter code (per-model token-budget trimmer):
```python
import json

from pydantic import BaseModel, Field


class Filter:
    class Valves(BaseModel):
        priority: int = Field(
            default=0,
            description="Run before other filters that depend on the final message list.",
        )
        default_budget_tokens: int = Field(
            default=8000,
            description="Fallback input-token budget for any model not listed in model_budgets.",
        )
        response_headroom_tokens: int = Field(
            default=2000,
            description="Tokens to reserve for the model's reply. Trimmed from the budget before fitting.",
        )
        model_budgets_json: str = Field(
            default=(
                '{\n'
                '  "gpt-4o": 120000,\n'
                '  "gpt-4o-mini": 120000,\n'
                '  "claude-3-5-sonnet": 180000,\n'
                '  "gemini-1.5-pro": 900000,\n'
                '  "llama3.1:8b": 6000\n'
                '}'
            ),
            description="JSON mapping of model id (or prefix) to input-token budget.",
        )

    def __init__(self):
        self.valves = self.Valves()

    # ---- helpers -----------------------------------------------------------

    @staticmethod
    def _estimate_tokens(content) -> int:
        """~4 chars per token is close enough for a trim budget.
        For strict counts, replace with tiktoken or a provider tokenizer."""
        if content is None:
            return 0
        if isinstance(content, str):
            return max(1, len(content) // 4)
        # Some providers deliver multimodal content as a list of parts.
        if isinstance(content, list):
            return sum(
                Filter._estimate_tokens(part.get("text", "")) if isinstance(part, dict) else 0
                for part in content
            )
        return 0

    def _message_tokens(self, msg: dict) -> int:
        # Content + a small per-message overhead for role/formatting.
        tokens = self._estimate_tokens(msg.get("content"))
        # Tool calls carry arguments in JSON; count them too.
        for tc in msg.get("tool_calls") or []:
            args = tc.get("function", {}).get("arguments", "")
            tokens += self._estimate_tokens(args)
        return tokens + 4

    def _budget_for(self, model_id: str) -> int:
        try:
            budgets = json.loads(self.valves.model_budgets_json or "{}")
        except Exception:
            budgets = {}
        if model_id in budgets:
            return int(budgets[model_id])
        # Allow prefix match — "gpt-4o-2024-11-20" uses the "gpt-4o" budget.
        # Sort by key length descending so more specific prefixes win:
        # "gpt-4o-mini" must match before "gpt-4o".
        for key, value in sorted(budgets.items(), key=lambda kv: -len(kv[0])):
            if model_id.startswith(key):
                return int(value)
        return self.valves.default_budget_tokens

    @staticmethod
    def _repair_tool_calls(other_msgs: list[dict]) -> list[dict]:
        while other_msgs and other_msgs[0].get("role") == "tool":
            other_msgs.pop(0)
        if (
            other_msgs
            and other_msgs[0].get("role") == "assistant"
            and other_msgs[0].get("tool_calls")
        ):
            expected = {tc.get("id") for tc in other_msgs[0]["tool_calls"]}
            seen = {
                m.get("tool_call_id")
                for m in other_msgs[1:]
                if m.get("role") == "tool"
            }
            if not expected.issubset(seen):
                other_msgs.pop(0)
        return other_msgs

    # ---- inlet -------------------------------------------------------------

    async def inlet(self, body: dict) -> dict:
        messages = body.get("messages", [])
        if not messages:
            return body

        model_id = body.get("model", "") or ""
        budget = self._budget_for(model_id) - self.valves.response_headroom_tokens
        if budget <= 0:
            return body  # Misconfigured — don't mangle the request, let the provider reject.

        system_msgs = [m for m in messages if m.get("role") == "system"]
        other_msgs = [m for m in messages if m.get("role") != "system"]

        used = sum(self._message_tokens(m) for m in system_msgs + other_msgs)

        # Drop oldest non-system messages one at a time until we're under budget
        # or nothing is left to drop. System messages stay put; if they alone
        # already exceed the budget, the provider will reject the request and
        # that's the right signal (the admin needs to shrink the system prompt).
        while used > budget and other_msgs:
            dropped = other_msgs.pop(0)
            used -= self._message_tokens(dropped)

        other_msgs = self._repair_tool_calls(other_msgs)

        body["messages"] = system_msgs + other_msgs
        return body
```

A few things worth noticing:
- Configure once, run everywhere. Set this filter as a global filter in Admin Panel → Functions. The `model_budgets_json` valve lets you enumerate every model you care about; anything else falls back to `default_budget_tokens`. Admins can tune budgets at runtime without touching code.
- Prefix match on model id, longest-first. `gpt-4o-2024-11-20` transparently uses the `gpt-4o` budget, and `gpt-4o-mini-2024-07-18` correctly uses the `gpt-4o-mini` budget (more specific wins). The `_budget_for` helper sorts keys by length descending before the prefix loop — otherwise dict insertion order would decide, and `"gpt-4o"` would shadow `"gpt-4o-mini"` for anyone who listed it first.
- Multimodal content is partially counted. The estimator walks list-of-parts content and sums text parts. Image / audio / file parts count as zero. For a char/4 heuristic that's fine for a trim budget, but if you rely heavily on image inputs with small providers (e.g. a local 8k vision model), add a per-image allowance inside `_estimate_tokens` (something like 255 tokens per image is a reasonable start).
- Same tool-call repair. Reused from the first example. This is the block that keeps the request valid after trimming.
- Fail-open when misconfigured. If you somehow set the headroom larger than the budget, the filter passes the request through untouched rather than wiping the conversation. The provider's error is better than a silent delete.
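The longest-prefix-first lookup is easy to verify on its own. Here is a condensed standalone sketch of the `_budget_for` logic with a made-up budget table:

```python
import json

# Made-up budget table: the "mini" key is longer and must win for mini variants.
budgets = json.loads('{"gpt-4o": 120000, "gpt-4o-mini": 60000}')

def budget_for(model_id: str, default: int = 8000) -> int:
    if model_id in budgets:
        return int(budgets[model_id])
    # Longest key first, so "gpt-4o-mini" beats "gpt-4o" for mini variants.
    for key, value in sorted(budgets.items(), key=lambda kv: -len(kv[0])):
        if model_id.startswith(key):
            return int(value)
    return default

budget_for("gpt-4o-2024-11-20")       # 120000, via the "gpt-4o" prefix
budget_for("gpt-4o-mini-2024-07-18")  # 60000, the longer prefix wins
budget_for("some-workspace-uuid")     # 8000, fallback
```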
Open WebUI doesn't always present the raw provider id in `body["model"]`. If an admin sets a connection `prefix_id`, every model is wrapped as `{prefix}.{raw_id}` (e.g. `openai.gpt-4o`). Pipe-function manifolds wrap their sub-models as `{pipe.id}.{sub_id}` (e.g. `anthropic.claude-3-5-sonnet-20241022`). Custom Workspace models can have arbitrary ids, often UUIDs.
Copy the exact id shown in the model picker into `model_budgets_json` — not the upstream provider's id. If you get the format wrong, requests silently land on `default_budget_tokens` and you won't notice until a chat that fits a real budget fails to fit the fallback.
This filter runs in `inlet()`, which is before Open WebUI's RAG retrieval (`chat_completion_files_handler`) and before native-tool definitions are attached to the payload. Both can add non-trivial bytes to the request after the filter has trimmed. If you rely on Knowledge bases or if your models have heavy built-in tool specs (web search + memory + code interpreter + MCP servers + …), reserve extra headroom by bumping `response_headroom_tokens` — it doubles as a general "leave room for post-filter additions" budget.
If you need higher-fidelity token counting, swap `_estimate_tokens` for `tiktoken.encoding_for_model(model_id).encode(text)` (OpenAI-family) or your provider's own tokenizer. For everything else — Anthropic, Gemini, local models — the char/4 heuristic is close enough to keep you safely under the limit, as long as you've left enough headroom for the RAG / tool additions above.
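A drop-in replacement along those lines might look like this: a sketch that uses `tiktoken` when it's installed and falls back to the char/4 heuristic when the library is missing, the model id is unknown, or the encoding data can't be loaded:

```python
try:
    import tiktoken  # optional dependency; OpenAI-family tokenizer
except ImportError:
    tiktoken = None

def estimate_tokens(text: str, model_id: str = "gpt-4o") -> int:
    if tiktoken is not None:
        try:
            try:
                enc = tiktoken.encoding_for_model(model_id)
            except KeyError:
                # Unknown model id: fall back to a general-purpose encoding.
                enc = tiktoken.get_encoding("cl100k_base")
            return len(enc.encode(text))
        except Exception:
            pass  # e.g. encoding data unavailable; use the heuristic instead
    # char/4 heuristic, same as the filter above
    return max(1, len(text) // 4)
```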
## You almost certainly want a community filter, not this one
The two examples on this page are deliberately minimal — they exist to show the shape of the `inlet()` hook and to teach the one non-obvious detail (tool-call repair). For a real deployment, don't write your own from scratch and don't ship these as-is. Go browse the Open WebUI Community and pick a context-management filter someone else has already battle-tested.
Production-grade community filters typically handle things the minimal examples above skip:
- Real tokenizers per provider — `tiktoken` for OpenAI, Anthropic's tokenizer for Claude, Gemini's for Google, `transformers` tokenizers for local models. Not char/4 heuristics.
- Proper image / audio / file token accounting — provider-specific allowances for every content-part type, not "zero."
- Summarize-and-replace strategies — when the window is about to overflow, call a cheap model to summarize the oldest block and replace it with one summary message, preserving long-running context instead of silently forgetting.
- Per-user / per-role policies — power users get larger budgets than free users; service accounts get different defaults than humans.
- Per-model-family policies — more intelligent than a prefix match (e.g. recognize all Claude 3.x Sonnet variants via a regex or metadata).
- Tool-result-first or attachment-first trimming — drop the giant scraped web pages and RAG citations from old turns before touching dialogue.
- Sliding-window summarization with checkpoints — keep running summaries stored in `__metadata__` across turns so you don't re-summarize on every request.
- Hard message caps and user-facing errors — refuse a request with a friendly "this chat is too long, please start a new one" event-emitter message instead of silently dropping context.
- Observability hooks — log every trim decision to Langfuse, OpenLit, or your stack of choice so you can audit what the filter actually did.
- Configurable valves for everything — admins tune everything at runtime without touching code.
None of that is hard to do, but all of it together is a week of work if you're starting from one of the minimal examples above. Someone on the community site has almost certainly already done it. Search first.
When you're shopping for a context-management filter, look for names like context window, trim, summarize, conversation length, token budget, history limiter, and the provider name of the models you use. Sort by popularity on the community site — the top-downloaded filters tend to be the ones that already solved the edge cases you haven't hit yet.
## What users will experience
- With a filter in place, old turns are silently removed / summarized / replaced before the request reaches the model. The user keeps chatting as normal. The model simply "forgets" older history according to your policy.
- Without a filter, long conversations will eventually hit the provider's context limit and return the "prompt is too long" error. Users will need to start a new chat.
Both are valid UX choices. Pick the one that matches your deployment.
## Related
- Filter Functions — the full reference for `inlet()` / `stream()` / `outlet()`
- Open WebUI Community — browse and install community-built filters, including context-management ones
- Chat Parameters — per-chat, per-user, and per-model parameter precedence