Multi-Replica, High Availability & Concurrency Troubleshooting
This guide addresses common issues encountered when deploying Open WebUI in multi-replica environments (e.g., Kubernetes, Docker Swarm) or when using multiple workers (UVICORN_WORKERS > 1) for increased concurrency.
Core Requirements Checklist
Before troubleshooting specific errors, ensure your deployment meets these absolute requirements for a multi-replica setup. Missing any of these will cause instability, login loops, or data loss.
- Shared Secret Key: WEBUI_SECRET_KEY MUST be identical on all replicas.
- External Database: You MUST use an external PostgreSQL database (see DATABASE_URL). SQLite is NOT supported for multiple instances.
- Redis for WebSockets: ENABLE_WEBSOCKET_SUPPORT=True and WEBSOCKET_MANAGER=redis with a valid WEBSOCKET_REDIS_URL are required.
- Shared Storage: A persistent volume (RWX / ReadWriteMany if possible, or ensuring all replicas map to the same underlying storage for data/) is critical for RAG (uploads/vectors) and generated images.
- External Vector Database (Recommended): While embedded Chroma works with shared storage, using a dedicated external Vector DB (e.g., PGVector, Milvus, Qdrant) is highly recommended to avoid file locking issues and improve performance.
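The environment-variable part of this checklist can be verified mechanically before a rollout. The following is a minimal sketch, not part of Open WebUI: the function name check_multi_replica_env is hypothetical, and the PostgreSQL test (a URL starting with "postgres") is an assumption about how the connection string is written.

```python
from typing import List, Mapping


def check_multi_replica_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of checklist problems; an empty list means all checks passed.

    Hypothetical helper: it only inspects the variables named in the checklist.
    """
    problems = []
    if not env.get("WEBUI_SECRET_KEY"):
        problems.append("WEBUI_SECRET_KEY is not set")
    # Assumption: a PostgreSQL URL starts with "postgres" (e.g. postgresql://).
    if not env.get("DATABASE_URL", "").startswith("postgres"):
        problems.append("DATABASE_URL does not point to PostgreSQL")
    if env.get("ENABLE_WEBSOCKET_SUPPORT", "").lower() != "true":
        problems.append("ENABLE_WEBSOCKET_SUPPORT is not True")
    if env.get("WEBSOCKET_MANAGER") != "redis":
        problems.append("WEBSOCKET_MANAGER is not redis")
    if not env.get("WEBSOCKET_REDIS_URL"):
        problems.append("WEBSOCKET_REDIS_URL is not set")
    return problems
```

Running check_multi_replica_env(os.environ) inside each container shows which items are missing there. It cannot confirm that WEBUI_SECRET_KEY is the same everywhere; that still requires comparing the value across replicas.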
Common Issues
1. Login Loops / 401 Unauthorized Errors
Symptoms:
- You log in successfully, but the next click logs you out.
- You see "Unauthorized" or "401" errors in the browser console immediately after login.
- "Error decrypting tokens" appears in logs.
Cause:
Each replica is using a different WEBUI_SECRET_KEY. When Replica A issues a session token (JWT), Replica B rejects it because it cannot verify the signature with its own different key.
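The failure mode can be reproduced in a few lines. This is a simplified illustration that assumes an HMAC-based signature scheme (such as the HS256 algorithm commonly used for JWTs); it is not Open WebUI's token code, and the key names and payload are made up.

```python
import hashlib
import hmac


def sign(payload: bytes, secret_key: str) -> str:
    """Sign a payload with HMAC-SHA256 and return the hex digest."""
    return hmac.new(secret_key.encode(), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, secret_key: str) -> bool:
    """Recompute the signature with the local key and compare in constant time."""
    return hmac.compare_digest(sign(payload, secret_key), signature)


# Replica A issues a token signed with its own key (hypothetical values).
token_payload = b'{"id": "user-123"}'
signature = sign(token_payload, "key-of-replica-a")

# The same key verifies the token; a different key rejects it.
same_key_ok = verify(token_payload, signature, "key-of-replica-a")
different_key_ok = verify(token_payload, signature, "key-of-replica-b")
```

Here same_key_ok is True and different_key_ok is False, which is what a user experiences as a login loop when requests alternate between replicas.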
Solution:
Set the WEBUI_SECRET_KEY environment variable to the same strong, random string on all backend replicas.
# Example (Kubernetes container spec; in Docker Compose, set the same variable under environment:)
env:
  - name: WEBUI_SECRET_KEY
    value: "your-super-secure-static-key-here"
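One way to produce a strong, random value is Python's standard secrets module. This is a sketch; the function name generate_secret_key and the 32-byte length are assumptions, not requirements stated by Open WebUI.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


# Generate the key once, then store it (e.g. in a Secret) and reuse it everywhere.
key = generate_secret_key()
```

Generate the key a single time and distribute that one value to every replica. Generating it at container startup would give each replica its own key and recreate the problem.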
2. WebSocket 403 Errors / Connection Failures
Symptoms:
- Chat stops responding or hangs.
- Browser console shows WebSocket connection failed: 403 Forbidden or Connection closed.
- Logs show engineio.server: https://your-domain.com is not an accepted origin.
Cause:
- CORS: The load balancer or ingress origin does not match the allowed origins.
- Missing Redis: WebSockets are defaulting to in-memory, so events on Replica A (e.g., LLM generation finish) are not broadcast to the user connected to Replica B.
Solution:
- Configure CORS: Ensure CORS_ALLOW_ORIGIN includes your public domain and its http/https variations. If you see logs like engineio.base_server:_log_error_once:354 - https://yourdomain.com is not an accepted origin, you must update this variable. It accepts a semicolon-separated list of allowed origins.
  Example:
  CORS_ALLOW_ORIGIN="https://chat.yourdomain.com;http://chat.yourdomain.com;https://yourhostname;http://localhost:3000"
  Add all valid IPs, domains, and hostnames that users might use to access your Open WebUI.
- Enable Redis for WebSockets: Ensure these variables are set on all replicas:
ENABLE_WEBSOCKET_SUPPORT=True
WEBSOCKET_MANAGER=redis
WEBSOCKET_REDIS_URL=redis://your-redis-host:6379/0
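To check whether the origin reported in an "is not an accepted origin" log line is covered by your CORS_ALLOW_ORIGIN value, the semicolon-separated list can be parsed as below. This is a sketch: the helper names are hypothetical, and the exact string comparison is an assumption about how origins are matched.

```python
from typing import Set


def parse_allowed_origins(value: str) -> Set[str]:
    """Split a semicolon-separated origin list, ignoring blanks and whitespace."""
    return {origin.strip() for origin in value.split(";") if origin.strip()}


def is_origin_allowed(origin: str, cors_allow_origin: str) -> bool:
    """Return True if the origin appears verbatim in the configured list."""
    return origin.strip() in parse_allowed_origins(cors_allow_origin)
```

Note that scheme, host, and port all count: https://chat.yourdomain.com and http://chat.yourdomain.com are different origins and each needs its own entry.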
3. "Model Not Found" or Configuration Mismatch
Symptoms:
- You enable a model or change a setting in the Admin UI, but other users (or you, after a refresh) don't see the change.
- Chats fail with "Model not found" intermittently.
Cause:
- Configuration Sync: Replicas are not synced. Open WebUI uses Redis Pub/Sub to broadcast configuration changes (like toggling a model) to all other instances.
- Missing Redis: If REDIS_URL is not set, configuration changes stay local to the instance where the change was made.
Solution:
Set REDIS_URL to point to your shared Redis instance. This enables the Pub/Sub mechanism for real-time config syncing.
REDIS_URL=redis://your-redis-host:6379/0
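The difference between local and broadcast configuration can be pictured with a toy model. Everything here is illustrative: ToyBroker is an in-process stand-in for a Redis Pub/Sub channel, Replica is a hypothetical class, and none of it reflects Open WebUI's actual implementation.

```python
class ToyBroker:
    """In-process stand-in for a shared Pub/Sub channel."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # Deliver the message to every subscriber, including the sender.
        for callback in self.subscribers:
            callback(message)


class Replica:
    """Hypothetical replica holding its own in-memory copy of the config."""

    def __init__(self, broker=None):
        self.config = {}
        self.broker = broker
        if broker is not None:
            broker.subscribe(self._apply)

    def _apply(self, change):
        self.config.update(change)

    def set_config(self, key, value):
        if self.broker is None:
            self._apply({key: value})  # no broker: the change stays local
        else:
            self.broker.publish({key: value})  # broker: every replica applies it
```

Two replicas created without a broker diverge after the first set_config call. Two replicas sharing one ToyBroker end up with the same config, which is the role REDIS_URL plays in a real deployment.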
4. Database Corruption / "Locked" Errors
Symptoms:
- Logs show database is locked or severe SQL errors.
- Data saved on one instance disappears on another.
Cause:
Using SQLite with multiple replicas. SQLite is a file-based database that allows only one writer at a time, and its file locking is unreliable on network filesystems, so it cannot safely handle concurrent writes from multiple containers.
Solution:
Migrate to PostgreSQL. Update your connection string:
DATABASE_URL=postgresql://user:password@postgres-host:5432/openwebui
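The "database is locked" error is easy to reproduce locally with Python's built-in sqlite3 module. This sketch uses two connections to one temporary file to stand in for two replicas. The table name and the zero-second timeout are choices made for the demonstration.

```python
import os
import sqlite3
import tempfile


def demo_sqlite_lock() -> str:
    """Open a write transaction on one connection, then try to write on another."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None lets us issue BEGIN/ROLLBACK explicitly;
    # timeout=0 makes the second writer fail immediately instead of waiting.
    first = sqlite3.connect(path, timeout=0, isolation_level=None)
    second = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE t (x INTEGER)")
        first.execute("BEGIN IMMEDIATE")  # first writer takes the write lock
        first.execute("INSERT INTO t VALUES (1)")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer is refused
            return "no error"
        except sqlite3.OperationalError as exc:
            return str(exc)
    finally:
        first.execute("ROLLBACK")
        first.close()
        second.close()
```

Calling demo_sqlite_lock() returns the error text raised for the second writer. With several replicas writing continuously, such collisions become routine, which is why a client/server database such as PostgreSQL is required.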
5. Uploaded Files or RAG Knowledge Inaccessible
Symptoms:
- You upload a file (for RAG) on one instance, but the model cannot find it later.
- Generated images appear as broken links.
Cause:
The /app/backend/data directory is not shared or is not consistent across replicas. If User A uploads a file to Replica 1, and the next request hits Replica 2, Replica 2 won't have the file physically on disk.
Solution:
- Kubernetes: Use a PersistentVolumeClaim with ReadWriteMany (RWX) access mode if your storage provider supports it (e.g., NFS, CephFS, AWS EFS).
- Docker Swarm/Compose: Mount a shared volume (e.g., NFS mount) to /app/backend/data on all containers.
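A quick way to confirm that replicas really see the same storage is to write a marker file from one container and read it from another. This is a diagnostic sketch only; the marker file name and both helper functions are hypothetical.

```python
import socket
import uuid
from pathlib import Path
from typing import Optional

MARKER_NAME = ".replica-storage-check"  # hypothetical file name


def write_marker(data_dir: str) -> str:
    """Write a unique token (hostname plus random UUID) into data_dir."""
    token = f"{socket.gethostname()}-{uuid.uuid4()}"
    Path(data_dir, MARKER_NAME).write_text(token)
    return token


def read_marker(data_dir: str) -> Optional[str]:
    """Return the token stored in data_dir, or None if no marker exists."""
    marker = Path(data_dir, MARKER_NAME)
    return marker.read_text() if marker.exists() else None
```

Run write_marker("/app/backend/data") in one replica and read_marker("/app/backend/data") in another. If the second call returns None or a different token, the replicas are not backed by the same volume. Delete the marker file afterwards.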
Deployment Best Practices
Updates and Migrations
Always scale down to 1 replica (and 1 worker) before upgrading Open WebUI versions.
Database migrations run automatically on startup. If multiple replicas (or multiple workers within a single container) start simultaneously with a new version, they may try to run migrations concurrently, leading to race conditions or database schema corruption.
Safe Update Procedure:
- Scale Down: Set replicas to 1 (and ensure UVICORN_WORKERS=1 if you customized it).
- Update Image: Deploy the new image version; the application restarts on it.
- Wait for Health Check: Ensure the single instance starts up fully and completes DB migrations.
- Scale Up: Increase replicas (or UVICORN_WORKERS) back to your desired count.
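For a Kubernetes Deployment, the procedure above maps onto four kubectl invocations. The sketch below only builds the command lists and does not run them. The function name and all argument values are hypothetical, and it assumes the Deployment has a readiness probe so that kubectl rollout status waits for a healthy pod.

```python
from typing import List


def safe_update_commands(
    deployment: str, container: str, new_image: str, replicas: int
) -> List[List[str]]:
    """Return the kubectl commands for a scale-down, update, wait, scale-up cycle."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "scale", target, "--replicas=1"],  # 1. scale down
        ["kubectl", "set", "image", target, f"{container}={new_image}"],  # 2. update
        ["kubectl", "rollout", "status", target],  # 3. wait for the rollout
        ["kubectl", "scale", target, f"--replicas={replicas}"],  # 4. scale up
    ]
```

Each inner list can be passed to subprocess.run(cmd, check=True) in order, so that a failed step stops the sequence before scaling back up. If UVICORN_WORKERS was customized, set it to 1 before step 2 and restore it after step 3.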
Session Affinity (Sticky Sessions)
While Open WebUI is designed to be stateless with proper Redis configuration, enabling Session Affinity (Sticky Sessions) at your Load Balancer / Ingress level can improve performance and reduce occasional jitter in WebSocket connections.
- Nginx Ingress: nginx.ingress.kubernetes.io/affinity: "cookie"
- AWS ALB: Enable Target Group Stickiness.