Host Your Own AI Code Assistant with Docker, Ollama and Continue!

Wolfgang's Channel · 2026-05-21 · via captions
TL;DR

You can self-host a GitHub Copilot-style code assistant using Ollama, Docker, and the Continue VS Code plugin, with no data sent to Microsoft or OpenAI. A dedicated GPU (AMD or Nvidia) with substantial VRAM is effectively required for usable performance; CPU-only setups are too slow for real-time code suggestions.

Key Concepts

Ollama
Local LLM runtime that can leverage AMD (ROCm) or Nvidia (CUDA) GPUs; runs in Docker
Continue
VS Code (and soon Neovim) plugin that provides Copilot-style inline completions and chat using a local Ollama endpoint
Open WebUI
Web-based chat interface for Ollama, self-hosted, similar UX to ChatGPT
ROCm
AMD's open compute platform required to run LLMs on AMD GPUs; not well-supported on Debian — use Ubuntu
Code models tested
CodeLlama 7B, CodeBooga 34B, StarCoder 3B

Notes

§Motivation

  • GitHub Copilot sends code (including secrets) to Microsoft servers — a privacy/security concern
  • Goal: context-aware autocomplete (e.g., auto-suggest owner, group, permissions in an Ansible task), not full AI code generation
  • Self-hosting lets you share the instance across multiple devices or users

§Hardware Tested

  • LattePanda (CPU-only): Intel i5-1340P (12-core Raptor Lake), 16 GB LPDDR5 RAM (non-upgradeable)
      • Idle: ~4.6W; under LLM load: 40–60W
      • No dedicated GPU, so inference runs on the CPU only
      • Comparable real-world alternative: mini PC (Beelink, Topton, Minisforum), ~$300–400
  • Gaming PC (GPU): Ryzen 7 5800X3D + AMD Radeon RX 7900 XTX (24 GB VRAM)
      • Cost: ~€1,500 (summer 2023)
      • Idle: ~63W; LLM load: 110–425W, average ~130W
      • 7900 XTX fully supported by ROCm

§Software Stack

  • OS: Ubuntu Server 22.04 (Debian dropped due to poor ROCm support)
  • AMD driver: amdgpu-install script with ROCm from AMD's website
  • Runtime: Docker + Docker Compose
  • Models: pulled via Open WebUI browser interface
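Once models have been pulled, one way to confirm that the stack is up is to ask the Ollama API which models it holds. The sketch below is a minimal check using only the Python standard library. The host name `localhost` and the helper names are assumptions for illustration; the `/api/tags` endpoint and port 11434 are Ollama's own.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on this machine; swap in your server's address.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(payload: dict) -> list:
    """Extract model tags from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def list_models(base_url: str = OLLAMA_URL) -> list:
    """Ask a running Ollama instance which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_models()` against a running instance should return the tags pulled through Open WebUI; an empty list suggests nothing has been pulled yet.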

§Docker Compose Setup

  • Two services: ollama (ROCm image) + open-webui
  • Open WebUI on port 8080; Ollama API on port 11434
  • GPU passthrough via device mounts: /dev/kfd and /dev/dri
  • Local directories mounted for model and settings persistence
  • For CPU-only (LattePanda): use the standard ollama image and remove the GPU device mounts
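The layout described above can be sketched as a small Python helper that renders a Compose file for either the AMD GPU or the CPU-only case. This is an illustration, not the file used in the video: the image tags (`ollama/ollama:rocm`, `ollama/ollama`, `ghcr.io/open-webui/open-webui:main`), the container paths, the `OLLAMA_BASE_URL` variable, and the host directory names are assumptions based on those projects' public defaults.

```python
# Template for a two-service Compose file; {gpu_block} is empty for CPU-only hosts.
COMPOSE_TEMPLATE = """\
services:
  ollama:
    image: {ollama_image}
    ports:
      - "11434:11434"   # Ollama API
    volumes:
      - ./ollama:/root/.ollama   # persist pulled models
{gpu_block}    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "8080:8080"   # web chat UI
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - ./open-webui:/app/backend/data   # persist settings
    depends_on:
      - ollama
    restart: unless-stopped
"""

# Device mounts that expose an AMD GPU to the container (ROCm).
GPU_BLOCK = """\
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
"""


def render_compose(amd_gpu: bool) -> str:
    """Return Compose YAML: ROCm image plus device mounts, or the plain image."""
    image = "ollama/ollama:rocm" if amd_gpu else "ollama/ollama"
    return COMPOSE_TEMPLATE.format(
        ollama_image=image,
        gpu_block=GPU_BLOCK if amd_gpu else "",
    )
```

Writing the result of `render_compose(amd_gpu=True)` to `docker-compose.yml` and running `docker compose up -d` would bring up both services. Passing `amd_gpu=False` drops the device mounts and switches to the standard image.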

§Continue Plugin Configuration

  • Install from VS Code marketplace
  • Edit JSON config: set Ollama URL + separate models for chat (larger, e.g. 34B/70B) and autocomplete (lighter, e.g. 7B or 3B)
  • Multiple chat models can be specified simultaneously
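A configuration along these lines can be generated with a few lines of Python. The key names (`models`, `tabAutocompleteModel`, `provider`, `apiBase`) follow Continue's JSON config format as I understand it, and newer plugin releases may use a different file layout, so check them against the current documentation. The model tags and the `localhost` URL are assumptions for illustration.

```python
import json


def continue_config(ollama_url: str, chat_models: list, autocomplete_model: str) -> dict:
    """Build a Continue-style config: several chat models, one autocomplete model."""

    def entry(tag: str) -> dict:
        # Every model points at the same self-hosted Ollama endpoint.
        return {"title": tag, "provider": "ollama", "model": tag, "apiBase": ollama_url}

    return {
        "models": [entry(tag) for tag in chat_models],
        "tabAutocompleteModel": entry(autocomplete_model),
    }


# Hypothetical tags: a larger model for chat, a lighter one for autocomplete.
config = continue_config(
    "http://localhost:11434",
    chat_models=["codebooga:34b", "codellama:7b"],
    autocomplete_model="codellama:7b-code",
)
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into the plugin's config file. Keeping the autocomplete model small matters more than the chat model, because it is queried on nearly every keystroke.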

§Model Performance (Gaming PC / GPU)

  • CodeBooga 34B: best suggestion quality, slightly slower autocomplete, needs ~20 GB VRAM
  • CodeLlama 7B: slightly off on file-type detection but good suggestions, faster
  • StarCoder 3B: fast but poor quality — hallucinations, malformed output
  • Power draw was ~130W average regardless of model size (7B vs 34B)
  • Both CodeBooga 34B and CodeLlama 7B gave sensible Python suggestions in limited testing

§Performance (LattePanda / CPU-only)

  • CodeBooga 34B: unusable, since it requires ~20 GB of RAM and only 16 GB is available
  • CodeLlama 7B: works but text generation is very slow
  • StarCoder 3B: marginally faster, but output quality collapsed after the first suggestion
  • Autocomplete too slow and unreliable to be practical
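The memory limits above can be approximated with a back-of-the-envelope calculation. The sketch below assumes 4-bit quantized weights and about 20% overhead for context and runtime buffers. Both figures are my assumptions rather than numbers from the video, chosen because they roughly reproduce the ~20 GB quoted for the 34B model.

```python
def estimated_memory_gb(params_billion: float,
                        bits_per_weight: int = 4,
                        overhead: float = 1.2) -> float:
    """Rough memory need in GB: weight size times an assumed overhead factor."""
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * overhead, 1)


def fits(params_billion: float, available_gb: float) -> bool:
    """True if the estimate fits in the given RAM or VRAM budget."""
    return estimated_memory_gb(params_billion) <= available_gb


for size in (3, 7, 34):
    print(f"{size}B: ~{estimated_memory_gb(size)} GB, "
          f"fits 16 GB RAM: {fits(size, 16)}, fits 24 GB VRAM: {fits(size, 24)}")
```

Under these assumptions the 34B model lands at about 20 GB, which exceeds the 16 GB board but fits the 24 GB card, while the 3B and 7B models fit on both. Fitting in memory says nothing about speed, which is where the CPU-only setup falls down.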

§Neovim Status

  • model.nvim, gen.nvim: support custom prompts/macros but not inline autocomplete
  • llm.nvim: does autocomplete, but is slow even on 3B models and gives poor output quality
  • Continue developers have a Neovim extension in progress

Actionable Takeaways

  1. Use Ubuntu (not Debian) if running AMD GPU with ROCm
  2. Use the ROCm Docker image for Ollama on AMD; standard image for CPU-only
  3. Mount /dev/kfd and /dev/dri in Docker Compose to expose AMD GPU to Ollama
  4. Set a lightweight model (7B or 3B) for autocomplete and a heavier model for chat in Continue's config
  5. Don't bother with CPU-only setups for real-time code suggestions — GPU is effectively required
  6. If you already own a gaming/workstation PC with a high-VRAM GPU, this can replace paid SaaS subscriptions

Quotes Worth Keeping

What I want from a quote-unquote AI code assistant is more intelligent and more context-aware auto-suggestions… I would have typed those anyway, but why do that if you can have the machine do it for you.

The fact that you can run a large language model… at your own house using free and open source software and consumer hardware — that's amazing. But at the same time it basically needs a high-end graphics card to work well.