Host Your Own AI Code Assistant with Docker, Ollama and Continue!
TL;DR
You can self-host a GitHub Copilot-style code assistant using Ollama, Docker, and the Continue VS Code plugin, with no data sent to Microsoft or OpenAI. A dedicated GPU (AMD or Nvidia) with substantial VRAM is effectively required for usable performance; CPU-only setups are too slow for real-time code suggestions.
Key Concepts
Ollama
Local LLM runtime that can leverage AMD (ROCm) or Nvidia (CUDA) GPUs; runs in Docker
Continue
VS Code (and soon Neovim) plugin that provides Copilot-style inline completions and chat using a local Ollama endpoint
Open WebUI
Web-based chat interface for Ollama, self-hosted, similar UX to ChatGPT
ROCm
AMD's open compute platform required to run LLMs on AMD GPUs; not well-supported on Debian — use Ubuntu
Code models tested
CodeLlama 7B, CodeBooga 34B, StarCoder 3B
Notes
§Motivation
- GitHub Copilot sends code (including secrets) to Microsoft servers — a privacy/security concern
- Goal: context-aware autocomplete (e.g., auto-suggest `owner`, `group`, `permissions` in an Ansible task), not full AI code generation
- Self-hosting lets you share the instance across multiple devices or users
§Hardware Tested
- Latte Panda (CPU-only): Intel i5-1340P (12-core Raptor Lake), 16 GB LPDDR5 RAM (non-upgradeable)
  - Idle: ~4.6W; under LLM load: 40–60W
  - No dedicated GPU, so CPU inference only
  - Comparable real-world alternative: Mini PC (Beelink, Topton, Minisforum) ~$300–400
- Gaming PC (GPU): Ryzen 7 5800X3D + AMD Radeon 7900 XTX (24 GB VRAM)
  - Cost: ~€1,500 (summer 2023)
  - Idle: ~63W; LLM load: 110–425W, average ~130W
  - 7900 XTX fully supported by ROCm
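To put the power figures above in perspective, here is a minimal sketch that converts a constant draw in watts to kWh per day. It assumes each machine runs at the quoted draw for a full 24 hours, which overstates the load figure for a machine that mostly sits idle; the function name `daily_kwh` is made up for this example.

```python
def daily_kwh(watts: float, hours: float = 24.0) -> float:
    """Energy in kWh for a constant draw of `watts` over `hours`."""
    return watts * hours / 1000.0


# Power figures quoted in the notes above, in watts.
draws = {
    "Latte Panda idle": 4.6,
    "Gaming PC idle": 63.0,
    "Gaming PC average LLM load": 130.0,
}

for label, watts in draws.items():
    # Assumption: the draw is constant for the whole day.
    print(f"{label}: {daily_kwh(watts):.2f} kWh/day")
```

Multiply the result by your local electricity price to compare the running cost against a paid subscription.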
§Software Stack
- OS: Ubuntu Server 22.04 (Debian dropped due to poor ROCm support)
- AMD driver: `amdgpu-install` script with ROCm from AMD's website
- Runtime: Docker + Docker Compose
- Models: pulled via Open WebUI browser interface
§Docker Compose Setup
- Two services: `ollama` (ROCm image) + `open-webui`
- Open WebUI on port `8080`; Ollama API on port `11434`
- GPU passthrough via device mounts: `/dev/kfd` and `/dev/dri`
- Local directories mounted for model and settings persistence
- For CPU-only (Latte Panda): use standard `ollama` image, remove GPU device mounts
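The setup described above can be sketched as a small Python script that builds the Compose project as a dictionary and prints it as JSON. This is not the original Compose file. The image tags (`ollama/ollama:rocm`, `ollama/ollama`, `ghcr.io/open-webui/open-webui:main`), the host directories `./ollama` and `./open-webui`, the `OLLAMA_BASE_URL` variable and the function name `build_compose` are all assumptions based on the projects' commonly documented defaults.

```python
import json


def build_compose(use_amd_gpu: bool = True) -> dict:
    """Return a Compose project (as a dict) for Ollama plus Open WebUI."""
    ollama = {
        # Assumed tags: the ROCm build for AMD GPUs, the default build for CPU-only.
        "image": "ollama/ollama:rocm" if use_amd_gpu else "ollama/ollama",
        "ports": ["11434:11434"],
        # Hypothetical host directory so pulled models survive container restarts.
        "volumes": ["./ollama:/root/.ollama"],
        "restart": "unless-stopped",
    }
    if use_amd_gpu:
        # Expose the AMD compute and render devices to the container.
        ollama["devices"] = ["/dev/kfd:/dev/kfd", "/dev/dri:/dev/dri"]

    open_webui = {
        "image": "ghcr.io/open-webui/open-webui:main",
        "ports": ["8080:8080"],
        # Points the web UI at the ollama service on the Compose network.
        "environment": ["OLLAMA_BASE_URL=http://ollama:11434"],
        # Hypothetical host directory for settings and chat history.
        "volumes": ["./open-webui:/app/backend/data"],
        "depends_on": ["ollama"],
        "restart": "unless-stopped",
    }
    return {"services": {"ollama": ollama, "open-webui": open_webui}}


print(json.dumps(build_compose(use_amd_gpu=True), indent=2))
```

Because JSON is also valid YAML, the printed output should be usable as a Compose file, though that workflow is untested here. Calling `build_compose(use_amd_gpu=False)` gives the CPU-only variant with the standard image and no device mounts.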
§Continue Plugin Configuration
- Install from VS Code marketplace
- Edit JSON config: set Ollama URL + separate models for chat (larger, e.g. 34B/70B) and autocomplete (lighter, e.g. 7B or 3B)
- Multiple chat models can be specified simultaneously
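As a rough illustration of the configuration described above, the sketch below generates a Continue JSON config with several chat models and one lighter autocomplete model. The key names (`models`, `tabAutocompleteModel`, `provider`, `apiBase`) follow Continue's `config.json` layout as I understand it and may differ between plugin versions. The host name `gaming-pc.lan`, the Ollama model tags and the function name `build_continue_config` are assumptions for this example.

```python
import json


def build_continue_config(ollama_url: str, chat_models: list, autocomplete_model: str) -> dict:
    """Build a Continue-style config that points every model at one Ollama endpoint."""

    def entry(model: str) -> dict:
        # One model entry; the title is what shows up in the model picker.
        return {
            "title": model,
            "provider": "ollama",
            "model": model,
            "apiBase": ollama_url,
        }

    return {
        # Larger models for chat; several can be listed side by side.
        "models": [entry(m) for m in chat_models],
        # A lighter model for inline autocomplete.
        "tabAutocompleteModel": entry(autocomplete_model),
    }


config = build_continue_config(
    "http://gaming-pc.lan:11434",  # hypothetical address of the Ollama host
    chat_models=["codebooga:34b", "codellama:7b"],  # assumed Ollama tags
    autocomplete_model="starcoder:3b",  # assumed Ollama tag
)
print(json.dumps(config, indent=2))
```

Keeping the URL in one place makes it easy to point every entry at a shared instance on another machine.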
§Model Performance (Gaming PC / GPU)
- CodeBooga 34B: best suggestion quality, slightly slower autocomplete, needs ~20 GB VRAM
- CodeLlama 7B: slightly off on file-type detection but good suggestions, faster
- StarCoder 3B: fast but poor quality — hallucinations, malformed output
- Power draw was ~130W average regardless of model size (7B vs 34B)
- Both the 34B and 7B models gave sensible Python suggestions in limited testing
§Performance (Latte Panda / CPU-only)
- CodeBooga 34B: unusable because it requires 20 GB RAM and only 16 GB is available
- CodeLlama 7B: works but text generation is very slow
- StarCoder 3B: marginally faster, but output quality collapsed after the first suggestion
- Autocomplete too slow and unreliable to be practical
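The comparisons above are qualitative. One way to put a number on "slow" is to ask Ollama for a completion and compute tokens per second from the timing fields it returns. The sketch below assumes the `/api/generate` endpoint with `stream` set to false and the `eval_count` and `eval_duration` (nanoseconds) fields in the response. The function names, URL and model tag are illustrative, and this measurement was not part of the original tests.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(base_url: str, model: str, prompt: str) -> dict:
    """Request one non-streamed completion and return the parsed JSON response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Against a running instance, `tokens_per_second(generate("http://localhost:11434", "codellama:7b", "def fizzbuzz(n):"))` would give a figure that can be compared across the CPU-only and GPU machines.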
§Neovim Status
- `model.nvim`, `gen.nvim`: support custom prompts/macros but not inline autocomplete
- `lm.nvim`: does autocomplete but slow even on 3B models; poor output quality
- Continue developers have a Neovim extension in progress
Actionable Takeaways
- Use Ubuntu (not Debian) if running AMD GPU with ROCm
- Use the ROCm Docker image for Ollama on AMD; standard image for CPU-only
- Mount `/dev/kfd` and `/dev/dri` in Docker Compose to expose AMD GPU to Ollama
- Set a lightweight model (7B or 3B) for autocomplete and a heavier model for chat in Continue's config
- Don't bother with CPU-only setups for real-time code suggestions — GPU is effectively required
- If you already own a gaming/workstation PC with a high-VRAM GPU, this can replace paid SaaS subscriptions
Quotes Worth Keeping
“What I want from a quote-unquote AI code assistant is more intelligent and more context-aware auto-suggestions… I would have typed those anyway, but why do that if you can have the machine do it for you.”
“The fact that you can run a large language model… at your own house using free and open source software and consumer hardware, that's amazing. But at the same time it basically needs a high-end graphics card to work well.”