Self-Hosted Local AI Ecosystem Guide (2026)
Running AI locally is still best understood as two layers:
- Backend: the inference engine that loads model weights and exposes an API.
- Frontend: the interface/workspace you interact with (chat, files, tools, and RAG workflows).
In 2025–2026, many tools blurred the old line by bundling lightweight runners and RAG layers into desktop apps. But the practical architecture is still the same: backends determine performance; frontends determine workflow.
Updated April 2026 with 2025–2026 ecosystem guidance (bundled for beginners, modular for power users).
2026 strategy in one minute: bundled vs modular
The default strategy is now:
- Beginners: start with all-in-one apps (LM Studio, Jan, GPT4All).
- Power users: separate layers (for example, vLLM or Ollama as backend + Open WebUI or LibreChat as frontend).
If you remember one thing: bundled apps are excellent for day one; modular stacks win for upgrade flexibility, remote access, and serving one model to multiple devices.
Quick start
- Pick a backend that fits your hardware and model format.
- Pick a frontend to chat or manage models.
- Or choose an all-in-one app that bundles both.
- Download a model from a catalog you trust.
- Check licenses (model + app). Bind services to `localhost` unless you add auth.
Windows vs Linux: what to install first
Windows
Discrete NVIDIA GPU (desktop/server)
- Use Ollama, LM Studio, llama.cpp, or text-generation-webui.
- Ollama now ships an official desktop app for Windows/macOS with GUI chat, history, and drag-and-drop files, while still exposing the same local API/CLI.
- For high-throughput multi-model serving, run vLLM on WSL2 or a Linux host and expose an OpenAI-style endpoint.
Laptops / mini PCs with iGPU (AMD/Intel) or Intel Arc
- Prefer LM Studio or llama.cpp / KoboldCpp builds that use Vulkan. LM Studio can offload layers to AMD and Intel iGPUs via Vulkan, which is significantly faster than CPU-only on integrated-graphics systems.
- Start with 7B–13B GGUF models; 20B (e.g., gpt-oss-20b) is realistic on 16 GB-class machines with good offload.
Ryzen AI laptops / NPUs
- Consider AMD Gaia: GUI + CLI that runs local LLM agents on Ryzen AI NPU + iGPU, using a RAG pipeline and an OpenAI-compatible REST API. It also runs on non-Ryzen systems (at lower performance).
Frontends
- Open WebUI or LibreChat pointed at your local endpoint. Open WebUI is now closer to a self-hosted “ChatGPT + team chat” for your own models (channels, DMs, knowledge bases, tools) and uses a custom BSD-3-based license with a branding clause in v0.6.x.
- Page-level browser extensions (e.g., Page Assist) can talk to Ollama/LM Studio APIs if you prefer in-browser chat.
Linux
NVIDIA (server / homelab)
- vLLM (v1) and Text Generation Inference (TGI) 3.x are the standard high-throughput OpenAI-style servers. vLLM focuses on efficient serving; recent releases add architectural speed-ups and improved multimodal support. TGI adds multi-backend support (TensorRT-LLM, vLLM, etc.).
- For simpler setups, Ollama or llama.cpp remain practical single-node servers.
AMD GPUs
- Use ROCm builds when available (vLLM/TGI/llama.cpp), or KoboldCpp / llama.cpp with Vulkan.
- Gaia has Linux support as well, but its sweet spot is Ryzen AI laptops with NPU + iGPU.
Use Docker for servers. Bind to `127.0.0.1` and put a UI (Open WebUI, LibreChat, etc.) in front.
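As a concrete example of that loopback binding, here is a minimal sketch of running Ollama under Docker (image name, volume, and port follow the official Ollama Docker image; add `--gpus=all` for NVIDIA):

```bash
# Ollama in Docker, reachable only from this machine (loopback bind)
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```

A UI container can then reach it over the Docker network or via host.docker.internal while nothing is exposed to the LAN.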
Rule of thumb: VRAM caps the model size. Smaller quantized models still beat larger high-precision models on weak hardware. Start with 7B, then move up.
Frontends (interfaces)
Frontends are thin clients from the user’s perspective. They connect to whatever backend you run (local engines, remote APIs, or both). Some now bundle light backends, RAG, and multi-user features, but you still point them at LLM endpoints.
| App | OS | Connects to | Notes | Links |
|---|---|---|---|---|
| Open WebUI | Win, macOS, Linux | OpenAI-style endpoints, Ollama, llama.cpp, vLLM, TGI, LM Studio | “Gold standard” self-hosted ChatGPT-style UI: multi-user accounts, RAG/knowledge bases, and Functions/plugins. | (Open WebUI ) |
| SillyTavern | Win, macOS, Linux | KoboldAI/KoboldCpp, text-gen-webui, Ollama, OpenAI-style | Roleplay/creative focus with character cards and world/lore tooling. | |
| LibreChat | Win, macOS, Linux | OpenAI-style endpoints | Power-user friendly; broad local/cloud API support and high customization. | (LibreChat ) |
| Kobold Lite | Any browser | KoboldAI/KoboldCpp, AI Horde | Zero-install client | (lite.koboldai.net , GitHub ) |
| KoboldAI Client | Win, macOS, Linux | Local or remote LLM backends | Story-writing UI | (GitHub ) |
| AnythingLLM | Win, macOS, Linux | Ollama or APIs | RAG-first workflow for “chat with your docs” with built-in vector workflows. | (anythingllm.com , GitHub ) |
| Enchanted | iOS, macOS | Ollama, OpenAI-style APIs | Clean native mobile/desktop Apple client for local servers over home networks. | |
| LM Studio (UI) | Win, macOS | Built-in local server, OpenAI-style | Catalog for GPT-OSS/Qwen3/Gemma3/DeepSeek; Vulkan iGPU offload; exposes local OpenAI API; SDKs | (LM Studio ) |
| Jan | Win, macOS, Linux | Built-in local server, OpenAI-style | Offline-first desktop app, supports modern open-weight models | (Jan ) |
| GPT4All Desktop | Win, macOS, Linux | Built-in local server | Private, on-device; large local model catalog | (Nomic AI , docs.gpt4all.io ) |
Backends (engines)
Engines that load models and expose a local API.
| App | OS | GPU accel | VRAM (typical) | Models / Formats |
|---|---|---|---|---|
| llama.cpp (llama-server) | Win, macOS, Linux | CUDA, Metal, HIP/ROCm, Vulkan, SYCL | 7B q4 ≈ 4 GB; 13B q4 ≈ 8 GB | GGUF, OpenAI-style server. GGUF is the native format. (GitHub ) |
| vLLM | Linux, Win WSL | CUDA, ROCm | Model dependent | Transformers; high-throughput OpenAI-style server; 2025 v1 architecture improves throughput + multimodal. (VLLM Documentation ) |
| Text Generation Inference (TGI) | Linux | CUDA, ROCm | Model dependent | HF production server; 3.x adds multi-backend support (TensorRT-LLM, vLLM) and mature deployment tooling. (Hugging Face ) |
| KoboldCpp | Win, macOS, Linux | CUDA, ROCm, Metal, Vulkan | 7B q4 ≈ 4 GB | GGUF, Kobold API; focus on story/RP workloads. (GitHub ) |
| MLX-LM | macOS (Apple Silicon) | Apple MLX | Model dependent | MLX or GGUF-converted |
| TensorRT-LLM | Linux | NVIDIA TensorRT | High for fp16 | Transformers; max-throughput NVIDIA deployment |
| LocalAI | Win, macOS, Linux | CPU/GPU (runtime dependent) | Model dependent | OpenAI-compatible local server with multi-modal support (LLM, image, audio) in one deployable stack. |
| ExLlamaV2 | Linux, Win (community) | NVIDIA CUDA | Model dependent | High-performance 4/8-bit inference for EXL2/GPTQ-style workflows on NVIDIA GPUs. |
Backend picks in 2026 (quick map)
- Ollama: easiest general-purpose local engine with CLI + service + desktop UX.
- llama.cpp: broadest compatibility baseline across CPU/GPU vendors and operating systems.
- vLLM: best fit for high-throughput, multi-user Linux/WSL2 GPU serving.
- ExLlama (EXL2): fastest NVIDIA-centric route for Llama-family models in EXL2 workflows.
- MLX-LM / MLX stack: strongest local-native path on Apple Silicon.
Both: frontend + backend in one
| App | Form | OS | GPU accel | VRAM (typical) | Models / Formats |
|---|---|---|---|---|---|
| Ollama | CLI + API + GUI | Win, macOS, Linux | CUDA, ROCm, Metal | Follows llama.cpp | GGUF, local API; official desktop app; one-command pulls via Library ; optional Turbo cloud for large GPT-OSS models. (GitHub , Ollama ) |
| LM Studio | Standalone UI | Win, macOS | CUDA, Metal, Vulkan | Model dependent | GGUF; catalog for GPT-OSS, Qwen3, Gemma3, DeepSeek; local OpenAI-style API; JS/Python SDKs. (LM Studio ) |
| GPT4All Desktop | Standalone UI | Win, macOS, Linux | Embedded llama.cpp | Model dependent | GGUF, local API. (Nomic AI ) |
| Jan | Standalone UI | Win, macOS, Linux | Embedded | Model dependent | GGUF / other formats via runners; local API. (Jan ) |
| text-generation-webui | Standalone UI | Win, macOS, Linux | CUDA, CPU, AMD, Apple Silicon | Model dependent | Transformers, ExLlamaV2/V3, AutoGPTQ, AWQ, GGUF. (GitHub ) |
| Llamafile | Standalone UI | Win, macOS, Linux | Via embedded llama.cpp | Follows llama.cpp | Single-file executables, local API. (GitHub ) |
| Tabby (TabbyML) | Standalone UI | Win, macOS, Linux | CUDA, ROCm, Vulkan | ~8 GB for 7B int8 | Self-hosted code assistant; IDE plugins; REST API. (tabbyml.com , tabby.tabbyml.com ) |
| AMD Gaia | Standalone UI | Win, Linux | Ryzen AI NPU + AMD iGPU/CPU | Model dependent | Multi-agent RAG app around local LLMs (Llama, Phi, etc.), optimized for Ryzen AI PCs; exposes OpenAI-style API and MCP. |
Image / video UIs
| App | Form | OS | GPU accel | VRAM (typical) | Models / Formats |
|---|---|---|---|---|---|
| ComfyUI | Standalone UI | Win, macOS, Linux | CUDA, ROCm, Apple MPS | SD1.5 ≈ 8 GB, SDXL ≈ 12 GB | Node-graph pipelines; 2025 Node 2.0 UI and rich video flows. (GitHub ) |
| AUTOMATIC1111 SD WebUI | Standalone UI | Win, Linux (macOS unofficial) | CUDA, ROCm, DirectML | 4–6 GB workable; more for SDXL | SD1.5/SDXL, many extensions. (GitHub ) |
| InvokeAI | Standalone UI | Win, macOS, Linux | CUDA, AMD via Docker, Apple MPS | 4 GB+ | SD1.5, SDXL, node workflows. (Invoke AI ) |
| Fooocus | Standalone UI | Win, Linux, macOS | CUDA, AMD, Apple MPS | ≥4 GB (NVIDIA) | SDXL presets |
| Stable Video Diffusion | Model + demo | Win, Linux | CUDA | ~14–24 GB common | SVD and SVD-XT image-to-video. (Hugging Face ) |
Hardware sizing (rules of thumb)
These are still rough, but align with 2025 open-weight releases:
- 8 GB VRAM: comfortable for most 7B–8B class models at good speed.
- 16–24 GB VRAM: practical for 12B–20B models and some aggressively quantized 30B+ options.
- 64 GB+ unified memory (Apple Silicon): common sweet spot for local 70B-class quantized runs.
- Bigger context windows still increase memory demand; VRAM/unified memory is usually the limiting resource.
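To turn those bullets into a back-of-envelope check: weight memory is roughly parameters × bits-per-weight ÷ 8, plus overhead for KV cache and runtime buffers. A small illustrative sketch (the 15% overhead factor is an assumption, not a measured constant):

```python
def approx_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Very rough memory estimate for a quantized model:
    weights = params * bits / 8, plus ~15% for KV cache and buffers."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * 1.15

print(round(approx_memory_gb(7, 4), 1))   # 7B at q4  -> ~4 GB (fits 8 GB cards)
print(round(approx_memory_gb(20, 4), 1))  # 20B at q4 -> ~11.5 GB (16-24 GB class)
print(round(approx_memory_gb(70, 4), 1))  # 70B at q4 -> ~40 GB (64 GB+ unified memory)
```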
Model formats you’ll see (2026)
| Format | Use with | Notes |
|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio, KoboldCpp | Most versatile local format; CPU + GPU friendly; native for llama.cpp. (GitHub ) |
| GPTQ | ExLlama, text-generation-webui | GPU-focused format, usually faster than GGUF when full weights fit in VRAM. |
| AWQ | vLLM, TGI, text-generation-webui | Activation-aware quantization |
| EXL2 | ExLlamaV2/V3 | NVIDIA-focused high-speed format for Llama-family models. |
| Safetensors | vLLM, TGI, transformers stacks | Raw model weights format used by many upstream model releases. |
| ONNX | Gaia, custom runtimes, some TGI/vLLM | Framework-agnostic; common in NPU / DirectML / Ryzen AI / edge deployments. |
Modern “flagship” open-weight families like GPT-OSS-20B/120B, Qwen3, Gemma 3, and DeepSeek R-series/V-series usually ship HF safetensors plus community quantizations in GGUF/GPTQ/AWQ/EXL2.
Local RAG building blocks
| Tool | Type | Local friendly |
|---|---|---|
| Chroma | Embedded vector DB | Yes |
| Qdrant | Vector DB | Yes |
| LanceDB | Vector DB on Arrow | Yes |
| SQLite + sqlite-vec | Embedded | Yes |
Tip: keep chunks ~500–1000 tokens, store sources, and version your indexes. Many frontends (Open WebUI, AnythingLLM, Gaia) now have built-in RAG layers using one of these patterns plus an embedding model.
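A minimal sketch of that pattern with Chroma and its default embedding model (file name, chunk size, and collection name are illustrative; ~2,000 characters is a rough stand-in for 500 tokens):

```python
import chromadb

client = chromadb.PersistentClient(path="./rag_index")   # on-disk index, easy to version and back up
collection = client.get_or_create_collection("docs_v1")  # bump the name when you re-index

text = open("manual.txt", encoding="utf-8").read()
chunk_size = 2000                                         # ~500 tokens of English text
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

collection.add(
    ids=[f"manual-{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=[{"source": "manual.txt", "chunk": i} for i in range(len(chunks))],  # keep sources
)

hits = collection.query(query_texts=["How do I reset the device?"], n_results=3)
print(hits["documents"][0])
```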
Speech and media blocks
| Task | Tool | Notes |
|---|---|---|
| ASR | faster-whisper | CPU or GPU. Local. |
| TTS | Piper | Small, offline. |
| Diarization | pyannote.audio | Multi-speaker audio. |
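If you want to script the ASR piece yourself, a minimal faster-whisper sketch looks like this (model size, device, and file name are illustrative; use `device="cuda"` if a GPU is available):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")  # small + int8 runs fine on CPU

segments, info = model.transcribe("meeting.wav", vad_filter=True)  # VAD trims silence
print("Detected language:", info.language)
for seg in segments:
    print(f"[{seg.start:6.1f}s -> {seg.end:6.1f}s] {seg.text}")
```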
60-second installs
Windows (NVIDIA, beginner-friendly)
- Install Ollama (desktop app includes CLI + GUI).
- Open a terminal and run `ollama run llama3.2` to verify it works.
- Install Open WebUI or LibreChat and point it to `http://localhost:11434` (a Docker example follows below). (GitHub , Ollama , Open WebUI , LibreChat )
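If you take the Docker route for Open WebUI, one common pattern looks like this (flags follow the Open WebUI quick-start docs; check them for current options):

```powershell
docker run -d --name open-webui -p 127.0.0.1:3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
```

Then open http://localhost:3000 and set the Ollama connection to http://host.docker.internal:11434, since the container cannot see the host's localhost directly.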
Windows (laptop / mini PC, no big GPU)
- Install LM Studio .
- Use its model browser to download a 7B–20B GGUF model (e.g., GPT-OSS-20B, Gemma 3 12B, Qwen3-Coder).
- In the model settings, enable GPU offload to your AMD/Intel iGPU, then enable the local API if you want to connect Open WebUI/LibreChat.
Windows (Ryzen AI)
- Install AMD Gaia using the Hybrid installer on a Ryzen AI PC.
- Choose a built-in agent (chat, YouTube Q&A, code) and attach your documents or repos.
- Optionally call Gaia via its REST API or MCP interface from tools that speak OpenAI-style APIs.
Linux (NVIDIA, server)
- Run vLLM or TGI via Docker to expose an OpenAI-style endpoint.
- Put Open WebUI or LibreChat in front for your UI. (VLLM Documentation , Hugging Face , Open WebUI , LibreChat )
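A minimal sketch of the first step with vLLM's OpenAI-compatible image (the model ID is just an example; requires the NVIDIA Container Toolkit):

```bash
docker run -d --gpus all --name vllm \
  -p 127.0.0.1:8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct
```

The server then speaks the OpenAI chat/completions API at http://localhost:8000/v1, which Open WebUI and LibreChat can consume directly.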
Windows or Linux (desktop GUI)
Use LM Studio or GPT4All. Download a 7B GGUF, enable the local API, then connect your frontend if needed. (LM Studio , Nomic AI )
API interop map
- OpenAI-style servers: llama.cpp server, vLLM, TGI, Ollama, LM Studio, AMD Gaia. (GitHub , VLLM Documentation , Hugging Face , LM Studio , Ollama )
- Kobold API: KoboldCpp, KoboldAI backends. (GitHub )
- text-generation-webui: adapters for multiple backends. (GitHub )
Most current frontends expect an OpenAI-style API; if your backend exposes one, you can usually swap it in without changing the UI.
MCP is increasingly used alongside (not instead of) OpenAI-style chat APIs: OpenAI-compatible endpoints handle generation, while MCP standardizes tool/context wiring between clients and services.
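In code, "OpenAI-style" means any OpenAI SDK client works against a local backend once you change the base URL. A minimal sketch against Ollama's `/v1` route (the model name is an example; the API key is a placeholder because local servers generally ignore it):

```python
from openai import OpenAI

# Works the same against llama.cpp server, vLLM, TGI, or LM Studio -- only base_url changes
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(resp.choices[0].message.content)
```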
Hybrid local + cloud routing (the practical “best of both worlds”)
If you want both privacy/cost control and access to top-tier reasoning models, combine:
- Local models (Ollama/llama.cpp/vLLM) for routine requests.
- Cloud models (Anthropic/OpenAI/OpenRouter) for complex tasks.
Two patterns work well:
Approach A: frontend-level routing (fastest setup)
Modern chat frontends can act as lightweight routers by storing multiple provider connections:
- Local connection: `http://localhost:11434` (Ollama)
- Cloud connection: Anthropic/OpenAI API key
- Aggregator connection: OpenRouter (`https://openrouter.ai/api/v1`)
In practice, this gives one model dropdown where you can switch per-chat between local models (for low cost/private data) and cloud models (for heavy reasoning).
Approach B: dedicated API gateway with LiteLLM (recommended for multi-app stacks)
Use this when you want one endpoint shared by:
- a chat UI,
- IDE assistants (VS Code/JetBrains),
- scripts/agents/automation.
LiteLLM sits in front of all providers and exposes one OpenAI-style endpoint (for example http://localhost:4000). Your clients only integrate once; LiteLLM handles routing by model name.
Benefits:
- Centralized key management.
- Unified logs/cost tracking.
- Easier fallback policy (for example, fail over from cloud to local or vice versa).
Docker Compose example (Linux/macOS host)
version: "3.8"
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
volumes:
- ./litellm_config.yaml:/app/config.yaml
command: [ "--config", "/app/config.yaml", "--detailed_debug" ]
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
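On a Linux host, host.docker.internal does not resolve inside containers by default, so if Ollama runs on the host, add a host-gateway mapping to the litellm service (the same trick the Windows compose below uses):

```yaml
  litellm:
    # add to the service above so the container can reach a host-side Ollama on Linux
    extra_hosts:
      - "host.docker.internal:host-gateway"
```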
Kubernetes deployment example (gateway service pattern)
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm-gateway
labels:
app: litellm
spec:
replicas: 1
selector:
matchLabels:
app: litellm
template:
metadata:
labels:
app: litellm
spec:
containers:
- name: litellm
image: ghcr.io/berriai/litellm:main-latest
ports:
- containerPort: 4000
env:
- name: LITELLM_CONFIG
value: "/app/config.yaml"
volumeMounts:
- name: config-volume
mountPath: /app/config.yaml
subPath: config.yaml
volumes:
- name: config-volume
configMap:
name: litellm-config
Store API keys in Kubernetes Secrets (or external secret manager), not in plain ConfigMaps.
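One way to do that (a sketch; the secret and key names are illustrative) is to create a Secret and reference it via `secretKeyRef` in the container's `env` block:

```bash
kubectl create secret generic litellm-keys \
  --from-literal=ANTHROPIC_API_KEY='your-anthropic-key' \
  --from-literal=OPENROUTER_API_KEY='your-openrouter-key'
```

```yaml
        env:
          - name: ANTHROPIC_API_KEY
            valueFrom:
              secretKeyRef:
                name: litellm-keys
                key: ANTHROPIC_API_KEY
          - name: OPENROUTER_API_KEY
            valueFrom:
              secretKeyRef:
                name: litellm-keys
                key: OPENROUTER_API_KEY
```

Merge these entries into the Deployment's existing container spec so LiteLLM reads the keys from the Secret rather than from a ConfigMap.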
Example LiteLLM routing config
model_list:
# Local model via Ollama
- model_name: llama3
litellm_params:
model: ollama/llama3
api_base: http://host.docker.internal:11434
# Direct cloud model (Anthropic)
- model_name: claude-3-5-sonnet
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY
# Cloud aggregator (OpenRouter)
- model_name: openrouter/wizardlm
litellm_params:
model: openrouter/microsoft/wizardlm-2-8x22b
api_key: os.environ/OPENROUTER_API_KEY
api_base: https://openrouter.ai/api/v1
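Once the gateway is running, a quick smoke test from Linux/macOS (the Authorization value is a placeholder unless you configure LiteLLM virtual keys):

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-key" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello from the gateway!"}]}'
```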
Windows guide (Docker Desktop or Podman Desktop)
For Windows hosts, this pattern gives you one endpoint for local + cloud models.
Open WebUI direct connection mode (no router)
In Open WebUI settings:
- Set the Ollama URL to `http://host.docker.internal:11434` (with native Podman, `http://host.containers.internal:11434` may be required).
- Add OpenRouter as an OpenAI-compatible provider with base URL `https://openrouter.ai/api/v1`.
LiteLLM gateway mode on Windows
Create C:\ai-gateway\docker-compose.yml:
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
volumes:
- .\config.yaml:/app/config.yaml
command: [ "--config", "/app/config.yaml", "--detailed_debug" ]
extra_hosts:
- "host.docker.internal:host-gateway"
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
Create C:\ai-gateway\config.yaml:
model_list:
- model_name: llama3
litellm_params:
model: ollama/llama3
api_base: http://host.docker.internal:11434
- model_name: claude-3-5-sonnet
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: openrouter/wizardlm
litellm_params:
model: openrouter/microsoft/wizardlm-2-8x22b
api_key: os.environ/OPENROUTER_API_KEY
api_base: https://openrouter.ai/api/v1
Start in PowerShell:
cd C:\ai-gateway
$env:ANTHROPIC_API_KEY="your-anthropic-key"
$env:OPENROUTER_API_KEY="your-openrouter-key"
# Docker Desktop
docker compose up -d
# Podman Desktop
podman-compose up -d
Test the unified endpoint from PowerShell
$headers = @{
"Content-Type" = "application/json"
"Authorization" = "Bearer dummy-key"
}
$body = @{
"model" = "claude-3-5-sonnet"
"messages" = @(@{ "role" = "user"; "content" = "Hello from Windows!" })
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:4000/v1/chat/completions" -Method Post -Headers $headers -Body $body
If the request succeeds, all compatible apps can now target http://localhost:4000 and switch models by model_name.
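For scripts and agents, the same endpoint works with the standard OpenAI SDK, and switching between local and cloud is just a different model string (a sketch; model names match the config above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="dummy-key")

for model in ("llama3", "claude-3-5-sonnet"):  # local first, then cloud
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with one short sentence."}],
    )
    print(model, "->", resp.choices[0].message.content)
```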
Quick start recommendation (modular)
- Install Ollama.
- Run `ollama run llama3.2` to verify your local runtime is working.
- Install Open WebUI (Docker or Python install).
- Point Open WebUI to `http://localhost:11434`.
That gives you a private local “ChatGPT-like” experience with backend/frontend separation and easy upgrades.
For a walkthrough of modern local stacks, see Local LLM Hosting: Complete 2025 Guide.
Developer workflows: local coding assistants with Ollama
If your goal is code generation, refactors, and repo-aware agents, the easiest modern pattern is:
- Run Ollama locally as your inference backend.
- Plug it into a coding client (Claude Code, Continue, or Aider).
- Start with a strong coding model and a larger context window (64k is a good target for agent workflows).
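Ollama's default context window is usually well below 64k, so one way to raise it for agent workflows (a sketch; the base model tag follows the examples below and the new tag name is arbitrary) is a Modelfile override:

```bash
# Create a 64k-context variant via a Modelfile, then run/point tools at the new tag
cat > Modelfile <<'EOF'
FROM gemma4:31b
PARAMETER num_ctx 65536
EOF
ollama create gemma4-31b-64k -f Modelfile
ollama run gemma4-31b-64k
```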
Claude Code (terminal-first, agentic)
As of early 2026, Ollama ships a first-party Claude Code integration via `ollama launch`, so local setup is now mostly one command.
# if needed
ollama pull gemma4:31b
# guided setup + launch
ollama launch claude --model gemma4:31b
You can also run interactive setup without forcing a model:
ollama launch claude
If you prefer manual wiring, Claude Code can talk to Ollama via Anthropic-compatible env vars:
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_API_KEY=""
claude --model gemma4:31b
This is a clean way to keep code, prompts, and model inference fully local while preserving a familiar Claude Code workflow. (Ollama , Ollama Claude Code integration )
Continue.dev (editor-native Copilot-style UX)
Continue works with Ollama in VS Code and JetBrains. Minimal config:
models:
- name: Gemma 4 31B
provider: ollama
model: gemma4:31b
If inline completion feels slow, add a smaller secondary model for autocomplete and keep the larger model for chat/edit actions. Continue also documents context-length tuning for Ollama-backed models when you hit token-window limits. (Continue Docs , Continue Ollama provider )
Aider (git-aware terminal pair programmer)
Aider connects directly to local Ollama models and is strong for repo edits + commit-oriented workflows:
python -m pip install aider-install
aider-install
aider --model ollama_chat/gemma4:31b
`ollama_chat/<model>` is the recommended prefix in current docs, and context-window settings matter a lot for larger multi-file changes. (Aider Ollama docs)
Practical recommendation order
- Want a terminal agent flow close to Copilot CLI-style usage? Start with Claude Code + `ollama launch`.
- Want IDE autocomplete + chat? Use Continue.
- Want git-focused CLI pair programming? Use Aider.
All three can run with the same local Ollama backend, so you can switch interfaces without changing your model host.
Model catalogs
| Site | Focus | What you get | Links |
|---|---|---|---|
| Hugging Face Hub | LLMs (GPT-OSS, Qwen3, Gemma 3, DeepSeek, etc.), vision, audio, SD/SVD | Large model zoo with tooling (Spaces, Inference, datasets) | (Hugging Face ) |
| Civitai | Image/video models, LoRAs, embeddings | Community checkpoints and LoRAs | |
| ModelScope | Broad model repository | Direct downloads and SDK | (ModelScope ) |
| Ollama Library | Local LLM manifests | One-command pulls for Ollama | (Ollama ) |
| LM Studio Discover | Local LLM catalog in-app | Browse and download for local use | (LM Studio ) |
| Stability AI releases | SD, SDXL, SVD | Reference weights and licenses | (Hugging Face ) |
Security and licensing
- Bind to `127.0.0.1` by default.
- If you must expose a port, use a reverse proxy and auth (at minimum HTTP basic auth, preferably SSO or VPN); a Caddy sketch follows this list.
- Read model licenses (GPT-OSS: Apache-2.0, Qwen/Gemma/DeepSeek: mostly permissive) before commercial use.
- Read app licenses too. Open WebUI v0.6.6+ uses a custom BSD-3-based license with a branding clause; white-labelling or rebranding may require an enterprise license.
- Store API keys in env vars or a secrets manager.
- Treat LLM backends as you would any network service:
- Run them under non-privileged users.
- Keep them patched.
- Only install images/binaries from reputable sources.
- Be aware that local AI stacks can be abused. For example, security researchers have already shown ransomware using GPT-OSS-20B locally via the Ollama API to generate and run attack scripts. If you allow untrusted code to talk to your local LLM stack, include that in your threat model.
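If you do need remote access, a minimal reverse-proxy-with-auth sketch using Caddy (hostname and upstream port are placeholders; generate the password hash with `caddy hash-password`):

```
ai.example.com {
    # HTTP basic auth in front of the UI; Caddy provisions TLS automatically
    basicauth {
        admin <bcrypt-hash-from-caddy-hash-password>
    }
    reverse_proxy 127.0.0.1:3000
}
```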
Common pitfalls
- Pulling a model larger than your VRAM/RAM. Start small.
- Mixing GPU drivers or CUDA/ROCm versions. Keep them clean.
- Exposing services publicly with no auth.
- Assuming the frontend controls performance. It does not; the backend and model choice do.
- Forgetting that many “frontends” (Open WebUI, AnythingLLM, Gaia) now include RAG indexes and agents; back them up and secure them like any other data store.
References
- Open WebUI: Home
- Quick Start
- KoboldAI Lite
- LostRuins/lite.koboldai.net
- GitHub - KoboldAI/KoboldAI-Client: For GGUF support ...
- AnythingLLM | The all-in-one AI application for everyone
- Mintplex-Labs/anything-llm
- Get started with LM Studio | LM Studio Docs
- Jan.ai
- GPT4All – The Leading Private AI Chatbot for Local ...
- GPT4All
- ggml-org/llama.cpp: LLM inference in C/C++
- vLLM
- Text Generation Inference
- LostRuins/koboldcpp: Run GGUF models easily with a ...
- ollama/ollama: Get up and running with OpenAI gpt-oss, ...
- library
- LM Studio - Discover, download, and run local LLMs
- oobabooga/text-generation-webui: LLM UI with advanced ...
- Mozilla-Ocho/llamafile: Distribute and run LLMs with a ...
- Tabby - Opensource, self-hosted AI coding assistant
- What's Tabby
- comfyanonymous/ComfyUI: The most powerful and ...
- Stable Diffusion web UI
- Invoke
- stabilityai/stable-video-diffusion-img2vid
- The Model Hub
- ModelScope
- Model Catalog
- AMD Gaia
- ollama launch · Ollama Blog
- Claude Code - Ollama
- Using Ollama with Continue
- How to Configure Ollama with Continue
- Ollama | aider
- LiteLLM Docs
- OpenRouter Quickstart