✨ Autocomplete
Open WebUI offers an AI-powered Autocomplete feature that suggests text completions in real time as you type your prompt. It acts like a "Copilot" for your chat input, helping you craft prompts faster using your configured Task Model.
How It Works
When enabled, Open WebUI monitors your input in the chat box. When you pause typing, it sends your current text to a lightweight Task Model. This model predicts the likely next words or sentences, which appear as "ghost text" overlaying your input.
- Accept Suggestion: Press Tab (or the Right Arrow key) to accept the suggestion.
- Reject/Ignore: Simply keep typing to overwrite the suggestion.
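Under the hood, this behaves like a debounced request loop: each keystroke resets a timer, and only a sustained pause triggers a call to the Task Model. The TypeScript sketch below is purely illustrative; the function names, the 500 ms delay, and the endpoint are assumptions, not Open WebUI's actual implementation.

```typescript
// Illustrative sketch only: the names, the 500 ms delay, and the
// endpoint below are assumptions, not Open WebUI's real code.
let debounceTimer: ReturnType<typeof setTimeout> | undefined;

function onPromptInput(currentText: string): void {
  // Each keystroke restarts the countdown, so a request only fires
  // after the user has paused typing for the full delay.
  if (debounceTimer !== undefined) clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => void fetchSuggestion(currentText), 500);
}

async function fetchSuggestion(text: string): Promise<void> {
  // Ask the (small, fast) Task Model to predict the next few words.
  const response = await fetch("/api/autocomplete", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: text }),
  });
  const { suggestion } = await response.json();
  renderGhostText(suggestion);
}

function renderGhostText(suggestion: string): void {
  // The real UI paints dimmed "ghost text" after the caret; a Tab or
  // Right Arrow press then commits it into the input box.
  console.log(`ghost: ${suggestion}`);
}
```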
Performance Recommendation
Autocomplete functionality relies heavily on the response speed of your Task Model. We recommend using a small, fast, non-reasoning model to ensure suggestions appear instantly.
Recommended Models:
- Llama 3.2 (1B or 3B)
- Qwen 3 (0.6B or 1.7B)
- Gemma 3 (1B or 4B)
- GPT-5 Nano (Optimized for low latency)
Avoid using "Reasoning" models (e.g., o1, o3) or heavy Chain-of-Thought models for this feature, as the latency will make the autocomplete experience sluggish.
Configuration
The Autocomplete feature is controlled by a two-layer system: Global availability and User preference.
1. Global Configuration (Admin)
Admins control whether the autocomplete feature is available on the server.
Admin Panel Settings: Go to Admin Settings > Interface > Task Model and toggle Autocomplete Generation.
2. User Configuration (Personal)
Even if enabled globally, individual users can turn it off for themselves if they find it distracting.
- Go to Settings > Interface.
- Toggle Autocomplete Generation.
If the Admin has disabled Autocomplete globally, users will not be able to enable it in their personal settings.
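In effect, the two layers combine as a simple AND gate: suggestions are generated only when both the admin toggle and the user preference are on. A hypothetical sketch of that logic (the flag names are made up for illustration):

```typescript
// Hypothetical flag names, for illustration only.
interface AutocompleteFlags {
  adminEnabled: boolean; // Admin Settings > Interface toggle (server-wide)
  userEnabled: boolean;  // Settings > Interface toggle (per user)
}

function autocompleteActive({ adminEnabled, userEnabled }: AutocompleteFlags): boolean {
  // The admin switch is authoritative: when it is off, the user
  // preference is ignored entirely.
  return adminEnabled && userEnabled;
}

console.log(autocompleteActive({ adminEnabled: false, userEnabled: true })); // false
console.log(autocompleteActive({ adminEnabled: true, userEnabled: false })); // false
console.log(autocompleteActive({ adminEnabled: true, userEnabled: true }));  // true
```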
Performance & Troubleshooting
Why aren't suggestions appearing?
- Check Settings: Ensure it is enabled in both Admin and User settings.
- Task Model: Go to Admin Settings > Interface and verify a Task Model is selected. If no model is selected, the feature cannot generate predictions.
- Latency: If your Task Model is large or running on slow hardware, predictions might arrive too late to be useful. Switch to a smaller model.
- Reasoning Models: Ensure you are not using a "Reasoning" model (like o1 or o3), as their internal thought process creates excessive latency that breaks real-time autocomplete.
Performance Impact
Autocomplete sends a request to your Task Model essentially every time you pause typing; requests are debounced, so a burst of keystrokes collapses into a single call.
- Local Models: This can consume significant GPU/CPU resources on the host machine.
- API Providers: This will generate a high volume of API calls (though usually with very short token counts). Be mindful of your provider's Rate Limits (Requests Per Minute/RPM and Tokens Per Minute/TPM) to avoid interruptions; the sketch after this list shows a rough way to estimate the volume.
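A quick way to gauge the load is to multiply the number of people actively composing prompts by how often each one pauses while typing. The figures in this sketch are assumptions for illustration, not measured Open WebUI statistics; measure your own traffic.

```typescript
// Back-of-the-envelope only: the pause rate is an assumed figure,
// not a measured Open WebUI statistic.
function estimateAutocompleteRpm(
  activeTypists: number,
  pausesPerUserPerMinute: number,
): number {
  // Each typing pause that survives the debounce becomes one request.
  return activeTypists * pausesPerUserPerMinute;
}

// e.g. 20 people composing prompts, each pausing ~8 times a minute:
const rpm = estimateAutocompleteRpm(20, 8);
console.log(`~${rpm} autocomplete requests/minute against the task model`); // ~160
```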
For multi-user instances running on limited local hardware, we recommend disabling Autocomplete to prioritize resources for actual chat generation.