Self-Hosted Local AI Tools

Posted Aug 19, 2025 · Modified Dec 7, 2025

You can run AI locally with two parts:

  • Backend: the engine that loads and runs models.
  • Frontend: the UI you click on. Historically, frontends just talked to a backend and had no say in GPU or VRAM use. In 2025 many ship an embedded runner or RAG engine, but performance still comes from the backend you configure.

For example, Open WebUI can run fully offline, connect to Ollama / llama.cpp / vLLM / TGI / LM Studio / other OpenAI-style APIs, and now includes multi-user channels, DMs, knowledge bases + RAG, and tool integrations. (Open WebUI)

Updated December 2025 for GPT-OSS, newer Open WebUI releases, AMD Gaia, and recent Ollama/LM Studio changes.


Quick start

  1. Pick a backend that fits your hardware and model format.
  2. Pick a frontend to chat or manage models.
  3. Or choose an all-in-one app that bundles both.
  4. Download a model from a catalog you trust.
  5. Check licenses (model + app). Bind services to localhost unless you add auth.

Windows vs Linux: what to install first

Windows

  • Discrete NVIDIA GPU (desktop/server)

    • Use Ollama, LM Studio, llama.cpp, or text-generation-webui.
    • Ollama now ships an official desktop app for Windows/macOS with GUI chat, history, and drag-and-drop files, while still exposing the same local API/CLI.
    • For high-throughput multi-model serving, run vLLM on WSL2 or a Linux host and expose an OpenAI-style endpoint.
  • Laptops / mini PCs with iGPU (AMD/Intel) or Intel Arc

    • Prefer LM Studio or llama.cpp / KoboldCpp builds that use Vulkan. LM Studio can offload layers to AMD and Intel iGPUs via Vulkan, which is significantly faster than CPU-only on integrated-graphics systems.
    • Start with 7B–13B GGUF models; 20B (e.g., gpt-oss-20b) is realistic on 16 GB-class machines with good offload.
  • Ryzen AI laptops / NPUs

    • Consider AMD Gaia: GUI + CLI that runs local LLM agents on Ryzen AI NPU + iGPU, using a RAG pipeline and an OpenAI-compatible REST API. It also runs on non-Ryzen systems (at lower performance).
  • Frontends

    • Open WebUI or LibreChat pointed at your local endpoint. Open WebUI is now closer to a self-hosted “ChatGPT + team chat” for your own models (channels, DMs, knowledge bases, tools) and uses a custom BSD-3-based license with a branding clause in v0.6.x.
    • Page-level browser extensions (e.g., Page Assist) can talk to Ollama/LM Studio APIs if you prefer in-browser chat.

Linux

  • NVIDIA (server / homelab)

    • vLLM (v1) and Text Generation Inference (TGI) 3.x are the standard high-throughput OpenAI-style servers. vLLM focuses on efficient serving; recent releases add architectural speed-ups and improved multimodal support. TGI adds multi-backend support (TensorRT-LLM, vLLM, etc.).
    • For simpler setups, Ollama or llama.cpp remain practical single-node servers.
  • AMD GPUs

    • Use ROCm builds when available (vLLM/TGI/llama.cpp), or KoboldCpp / llama.cpp with Vulkan.
    • Gaia has Linux support as well, but its sweet spot is Ryzen AI laptops with NPU + iGPU.
  • Use Docker for servers. Bind to 127.0.0.1 and put a UI (Open WebUI, LibreChat, etc.) in front.
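
As a sketch of that last point (the image tag and model ID are examples; pick a model that fits your VRAM), a vLLM container bound to localhost might look like:

```bash
# Sketch only: serve an OpenAI-style endpoint with vLLM, reachable only from this machine.
# The model ID is an example; pick one that fits your VRAM.
docker run -d --gpus all --ipc=host \
  -p 127.0.0.1:8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct
```

Any OpenAI-style frontend can then point at http://127.0.0.1:8000/v1.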

Rule of thumb: VRAM caps the model size. Smaller quantized models still beat larger high-precision models on weak hardware. Start with 7B, then move up.


Frontend

Frontends are thin clients from the user’s perspective. They connect to whatever backend you run (local engines, remote APIs, or both). Some now bundle light backends, RAG, and multi-user features, but you still point them at LLM endpoints.

| App | OS | Connects to | Notes |
| --- | --- | --- | --- |
| Open WebUI | Win, macOS, Linux | OpenAI-style endpoints, Ollama, llama.cpp, vLLM, TGI, LM Studio | Extensible default; channels + DMs, KB/RAG, tools; custom BSD-3-based license (v0.6.6+). (Open WebUI) |
| SillyTavern | Win, macOS, Linux | KoboldAI/KoboldCpp, text-gen-webui, Ollama, OpenAI-style | RP and character tools |
| LibreChat | Win, macOS, Linux | OpenAI-style endpoints | Team features; custom endpoints. (LibreChat) |
| Kobold Lite | Any browser | KoboldAI/KoboldCpp, AI Horde | Zero-install client. (lite.koboldai.net, GitHub) |
| KoboldAI Client | Win, macOS, Linux | Local or remote LLM backends | Story-writing UI. (GitHub) |
| AnythingLLM | Win, macOS, Linux | Ollama or APIs | Built-in RAG, project-style workspaces. (anythingllm.com, GitHub) |
| LM Studio (UI) | Win, macOS | Built-in local server, OpenAI-style | Catalog for GPT-OSS/Qwen3/Gemma3/DeepSeek; Vulkan iGPU offload; exposes local OpenAI API; SDKs. (LM Studio) |
| Jan | Win, macOS, Linux | Built-in local server, OpenAI-style | Offline-first desktop app, supports modern open-weight models. (Jan) |
| GPT4All Desktop | Win, macOS, Linux | Built-in local server | Private, on-device; large local model catalog. (Nomic AI, docs.gpt4all.io) |

Backend

Engines that load models and expose a local API.

| App | OS | GPU accel | VRAM (typical) | Models / Formats |
| --- | --- | --- | --- | --- |
| llama.cpp (llama-server) | Win, macOS, Linux | CUDA, Metal, HIP/ROCm, Vulkan, SYCL | 7B q4 ≈ 4 GB; 13B q4 ≈ 8 GB | GGUF (native format), OpenAI-style server. (GitHub) |
| vLLM | Linux, Win (WSL) | CUDA, ROCm | Model dependent | Transformers; high-throughput OpenAI-style server; 2025 v1 architecture improves throughput + multimodal. (vLLM Documentation) |
| Text Generation Inference (TGI) | Linux | CUDA, ROCm | Model dependent | HF production server; 3.x adds multi-backend support (TensorRT-LLM, vLLM) and mature deployment tooling. (Hugging Face) |
| KoboldCpp | Win, macOS, Linux | CUDA, ROCm, Metal, Vulkan | 7B q4 ≈ 4 GB | GGUF, Kobold API; focus on story/RP workloads. (GitHub) |
| MLX LLM | macOS (Apple Silicon) | Apple MLX | Model dependent | MLX or GGUF-converted |
| TensorRT-LLM | Linux | NVIDIA TensorRT | High for fp16 | Transformers; max-throughput NVIDIA deployment |
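
For the llama.cpp row, a rough sketch (file name, port, and offload values are examples) of serving a GGUF behind an OpenAI-style endpoint:

```bash
# Start llama.cpp's server on localhost with full GPU offload (paths and values are examples).
./llama-server -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99 -c 8192

# From another shell, hit the OpenAI-compatible endpoint.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```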

Both: frontend + backend in one

| App | Form | OS | GPU accel | VRAM (typical) | Models / Formats |
| --- | --- | --- | --- | --- | --- |
| Ollama | CLI + API + GUI | Win, macOS, Linux | CUDA, ROCm, Metal | Follows llama.cpp | GGUF, local API; official desktop app; one-command pulls via Library; optional Turbo cloud for large GPT-OSS models. (GitHub, Ollama) |
| LM Studio | Standalone UI | Win, macOS | CUDA, Metal, Vulkan | Model dependent | GGUF; catalog for GPT-OSS, Qwen3, Gemma3, DeepSeek; local OpenAI-style API; JS/Python SDKs. (LM Studio) |
| GPT4All Desktop | Standalone UI | Win, macOS, Linux | Embedded llama.cpp | Model dependent | GGUF, local API. (Nomic AI) |
| Jan | Standalone UI | Win, macOS, Linux | Embedded | Model dependent | GGUF / other formats via runners; local API. (Jan) |
| text-generation-webui | Standalone UI | Win, macOS, Linux | CUDA, CPU, AMD, Apple Silicon | Model dependent | Transformers, ExLlamaV2/V3, AutoGPTQ, AWQ, GGUF. (GitHub) |
| Llamafile | Standalone UI | Win, macOS, Linux | Via embedded llama.cpp | Follows llama.cpp | Single-file executables, local API. (GitHub) |
| Tabby (TabbyML) | Standalone UI | Win, macOS, Linux | CUDA, ROCm, Vulkan | ~8 GB for 7B int8 | Self-hosted code assistant; IDE plugins; REST API. (tabbyml.com, tabby.tabbyml.com) |
| AMD Gaia | Standalone UI | Win, Linux | Ryzen AI NPU + AMD iGPU/CPU | Model dependent | Multi-agent RAG app around local LLMs (Llama, Phi, etc.), optimized for Ryzen AI PCs; exposes OpenAI-style API and MCP. |
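
For instance, Ollama's local API (default port 11434) can be exercised straight from a terminal; the model name below is an example and must be one you have pulled:

```bash
# Pull a model, then chat with it over Ollama's native API (model name is an example).
ollama pull llama3.1
curl http://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Say hello"}],
  "stream": false
}'
# The same server also answers OpenAI-style requests under /v1/chat/completions.
```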

Image / video UIs

AppFormOSGPU accelVRAM (typical)Models / Formats
ComfyUIStandalone UIWin, macOS, LinuxCUDA, ROCm, Apple MPSSD1.5 ≈ 8 GB, SDXL ≈ 12 GBNode-graph pipelines; 2025 Node 2.0 UI and rich video flows. (GitHub )
AUTOMATIC1111 SD WebUIStandalone UIWin, Linux (macOS unofficial)CUDA, ROCm, DirectML4–6 GB workable; more for SDXLSD1.5/SDXL, many extensions. (GitHub )
InvokeAIStandalone UIWin, macOS, LinuxCUDA, AMD via Docker, Apple MPS4 GB+SD1.5, SDXL, node workflows. (Invoke AI )
FooocusStandalone UIWin, Linux, macOSCUDA, AMD, Apple MPS≥4 GB (NVIDIA)SDXL presets
Stable Video DiffusionModel + demoWin, LinuxCUDA~14–24 GB commonSVD and SVD-XT image-to-video. (Hugging Face )

Hardware sizing (plain rules)

These are still rough, but they line up with 2025 open-weight releases. As a sanity check, a q4 quant stores roughly 0.5 bytes per parameter, so a 7B model is about 3.5 GB of weights before KV cache and runtime overhead:

  • 7B q4: ~4 GB VRAM/RAM.
  • 13B q4: ~8 GB.
  • 20B (e.g., gpt-oss-20b): ~16 GB VRAM or a mix of VRAM + fast RAM.
  • 70B in heavy quant: ≥24 GB VRAM, often more.
  • Bigger context windows need more memory. Prioritize VRAM (or NPU-accessible RAM) over raw GPU cores for LLMs.

Model formats you’ll see

| Format | Use with | Notes |
| --- | --- | --- |
| GGUF | llama.cpp, Ollama, LM Studio, KoboldCpp | Quantized, CPU/GPU-friendly. Native for llama.cpp. (GitHub) |
| GPTQ | ExLlama, text-generation-webui | NVIDIA-focused, good chat speed |
| AWQ | vLLM, TGI, text-generation-webui | Activation-aware quantization |
| EXL2 | ExLlamaV2/V3 | Optimized GPTQ variant for Llama-family models |
| ONNX | Gaia, custom runtimes, some TGI/vLLM | Framework-agnostic; often used for NPU / DirectML / Ryzen AI / edge deployments via SDKs |

Modern “flagship” open-weight families like GPT-OSS-20B/120B, Qwen3, Gemma 3, and DeepSeek R-series/V-series usually ship HF safetensors plus community quantizations in GGUF/GPTQ/AWQ/EXL2.
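
Most people just grab a ready-made GGUF, but if a release only ships safetensors, llama.cpp's conversion tooling can usually produce one locally. A rough sketch (script and binary names follow the current llama.cpp repo; paths and quant type are examples):

```bash
# Convert an HF checkpoint directory to GGUF, then quantize it (paths and quant type are examples).
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```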


Local RAG building blocks

| Tool | Type | Local friendly |
| --- | --- | --- |
| Chroma | Embedded vector DB | Yes |
| Qdrant | Vector DB | Yes |
| LanceDB | Vector DB on Arrow | Yes |
| SQLite + sqlite-vec | Embedded | Yes |

Tip: keep chunks ~500–1000 tokens, store sources, and version your indexes. Many frontends (Open WebUI, AnythingLLM, Gaia) now have built-in RAG layers using one of these patterns plus an embedding model.
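
If you want a standalone vector DB rather than an embedded one, Qdrant is a one-container install; a minimal localhost-only sketch (the volume path is an example):

```bash
# Run Qdrant locally, reachable only from this machine, with persistent storage.
docker run -d --name qdrant \
  -p 127.0.0.1:6333:6333 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant
```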


Speech and media blocks

| Task | Tool | Notes |
| --- | --- | --- |
| ASR | faster-whisper | CPU or GPU. Local. |
| TTS | Piper | Small, offline. |
| Diarization | pyannote.audio | Multi-speaker audio. |

60-second installs

Windows (NVIDIA, beginner-friendly)

  1. Install Ollama (desktop app includes CLI + GUI).
  2. Open a terminal: ollama run gpt-oss:20b or ollama run llama3 to test.
  3. Install Open WebUI or LibreChat. Point it at http://localhost:11434. (GitHub, Ollama, Open WebUI, LibreChat)
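
A minimal sketch of step 3, based on Open WebUI's Docker quick start (ports and volume name are adjustable; OLLAMA_BASE_URL points the UI at the Ollama service on the host):

```bash
# Run Open WebUI on localhost and point it at the Ollama API on the host machine.
docker run -d --name open-webui \
  -p 127.0.0.1:3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

Then browse to http://localhost:3000.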

Windows (laptop / mini PC, no big GPU)

  1. Install LM Studio .
  2. Use its model browser to download a 7B–20B GGUF model (e.g., GPT-OSS-20B, Gemma 3 12B, Qwen3-Coder).
  3. In the model settings, enable GPU offload to your AMD/Intel iGPU, then enable the local API if you want to connect Open WebUI/LibreChat.
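
Once the local server is enabled (LM Studio defaults to port 1234), a quick smoke test from a terminal confirms the OpenAI-style API is up; the model name is whatever identifier LM Studio shows for the loaded model:

```bash
# List the models served by LM Studio's local API, then send a test chat request.
curl http://127.0.0.1:1234/v1/models
# Replace "openai/gpt-oss-20b" with the identifier LM Studio shows for your loaded model.
curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Hi"}]}'
```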

Windows (Ryzen AI)

  1. Install AMD Gaia using the Hybrid installer on a Ryzen AI PC.
  2. Choose a built-in agent (chat, YouTube Q&A, code) and attach your documents or repos.
  3. Optionally call Gaia via its REST API or MCP interface from tools that speak OpenAI-style APIs.

Linux (NVIDIA, server)

  1. Run vLLM or TGI via Docker to expose an OpenAI-style endpoint.
  2. Put Open WebUI or LibreChat in front for your UI. (vLLM Documentation, Hugging Face, Open WebUI, LibreChat)
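
A sketch of step 1 using TGI (the image tag and model ID are placeholders; gated models also need a Hugging Face token passed as an environment variable):

```bash
# Serve a model with TGI, bound to localhost (image tag and model ID are examples).
docker run -d --gpus all --shm-size 1g \
  -p 127.0.0.1:8080:80 \
  -v "$PWD/tgi-data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-7B-Instruct
```

Open WebUI or LibreChat can then use http://127.0.0.1:8080/v1 as an OpenAI-style base URL.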

Windows or Linux (desktop GUI)

Use LM Studio or GPT4All. Download a 7B GGUF, enable the local API, then connect your frontend if needed. (LM Studio, Nomic AI)


API interop map

Most 2025 frontends expect an OpenAI-style API; if your backend exposes one, you can usually swap it in without changing the UI.
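
In practice that means a frontend (or a plain curl) only needs the base URL changed; the ports below are common defaults, and the model name must match whatever the backend has loaded:

```bash
# Same request shape against any OpenAI-style backend; only BASE_URL changes.
# Common defaults: Ollama 11434, LM Studio 1234, vLLM 8000, llama-server 8080.
BASE_URL=http://127.0.0.1:11434/v1
curl "$BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "ping"}]}'
```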


Model catalogs

| Site | Focus | What you get |
| --- | --- | --- |
| Hugging Face Hub | LLMs (GPT-OSS, Qwen3, Gemma 3, DeepSeek, etc.), vision, audio, SD/SVD | Large model zoo with tooling (Spaces, Inference, datasets). (Hugging Face) |
| Civitai | Image/video models, LoRAs, embeddings | Community checkpoints and LoRAs |
| ModelScope | Broad model repository | Direct downloads and SDK. (ModelScope) |
| Ollama Library | Local LLM manifests | One-command pulls for Ollama. (Ollama) |
| LM Studio Discover | Local LLM catalog in-app | Browse and download for local use. (LM Studio) |
| Stability AI releases | SD, SDXL, SVD | Reference weights and licenses. (Hugging Face) |

Security and licensing

  • Bind to 127.0.0.1 by default.
  • If you must expose a port, use a reverse proxy and auth (at minimum HTTP auth, preferably SSO or VPN).
  • Read model licenses before commercial use (GPT-OSS is Apache-2.0; Qwen and DeepSeek releases are mostly Apache/MIT; Gemma ships under Google's own terms of use).
  • Read app licenses too. Open WebUI v0.6.6+ uses a custom BSD-3-based license with a branding clause; white-labelling or rebranding may require an enterprise license.
  • Store API keys in env vars or a secrets manager.
  • Treat LLM backends as you would any network service:
    • Run them under non-privileged users.
    • Keep them patched.
    • Only install images/binaries from reputable sources.
  • Be aware that local AI stacks can be abused. For example, security researchers have already shown ransomware using GPT-OSS-20B locally via the Ollama API to generate and run attack scripts. If you allow untrusted code to talk to your local LLM stack, include that in your threat model.

Common pitfalls

  • Pulling a model larger than your VRAM/RAM. Start small.
  • Mixing GPU drivers or CUDA/ROCm versions. Keep them clean.
  • Exposing services publicly with no auth.
  • Assuming the frontend controls performance. It does not; the backend and model choice do.
  • Forgetting that many “frontends” (Open WebUI, AnythingLLM, Gaia) now include RAG indexes and agents; back them up and secure them like any other data store.

References

  1. Open WebUI: Home
  2. Quick Start
  3. KoboldAI Lite
  4. LostRuins/lite.koboldai.net
  5. GitHub - KoboldAI/KoboldAI-Client: For GGUF support ...
  6. AnythingLLM | The all-in-one AI application for everyone
  7. Mintplex-Labs/anything-llm
  8. Get started with LM Studio | LM Studio Docs
  9. Jan.ai
  10. GPT4All – The Leading Private AI Chatbot for Local ...
  11. GPT4All
  12. ggml-org/llama.cpp: LLM inference in C/C++
  13. vLLM
  14. Text Generation Inference
  15. LostRuins/koboldcpp: Run GGUF models easily with a ...
  16. ollama/ollama: Get up and running with OpenAI gpt-oss, ...
  17. library
  18. LM Studio - Discover, download, and run local LLMs
  19. oobabooga/text-generation-webui: LLM UI with advanced ...
  20. Mozilla-Ocho/llamafile: Distribute and run LLMs with a ...
  21. Tabby - Opensource, self-hosted AI coding assistant
  22. What's Tabby
  23. comfyanonymous/ComfyUI: The most powerful and ...
  24. Stable Diffusion web UI
  25. Invoke
  26. stabilityai/stable-video-diffusion-img2vid
  27. The Model Hub
  28. ModelScope
  29. Model Catalog
  30. AMD Gaia