Self-Hosted AI Coding Assistant: Complete Setup Guide

A complete guide to building a free, private, and powerful AI coding assistant that works on consumer hardware. This setup combines VSCode with agentic capabilities, local language models for code generation, and intelligent documentation lookup to minimize hallucinations.

Overview

This stack delivers a self-hosted AI coding environment that addresses four key requirements:

  1. Reasonable performance across various hardware configurations
  2. Reduced hallucination through real documentation grounding
  3. Zero subscription costs (hardware electricity only)
  4. Support for lesser-used languages and frameworks

The solution combines six core components that work together: VSCode as the editor platform, RooCode for agentic coding capabilities, LM Studio as the local model server, Devstral for code generation, snowflake-arctic-embed2 for semantic search, and docs-mcp-server to connect everything with live documentation.


Hardware Requirements

Tested Configuration:

  • CPU: AMD Ryzen 5700
  • GPU: NVIDIA RTX 3080 (10GB VRAM)
  • RAM: 48GB system memory

This configuration comfortably runs Devstral at IQ2 quantization with a 32,768 token context window. The setup should function on both lower-end hardware (with appropriate model and quantization choices) and higher-end systems (allowing for larger models or longer contexts). Your actual experience will depend on the specific models and quantizations you select.


Part 1: Editor Installation

Primary Option: VSCode with RooCode Extension

Visual Studio Code provides an extensible foundation for AI-assisted development. The RooCode extension adds agentic coding capabilities and MCP (Model Context Protocol) support, enabling sophisticated interactions with your local models.

Installation Steps:

  1. Download Visual Studio Code from code.visualstudio.com

  2. Install the RooCode extension:

    • Open VSCode
    • Navigate to Extensions (Ctrl+Shift+X)
    • Search for "RooCode" or "Cline"
    • Install: RooVeterinaryInc.roo-cline
  3. Configure initial settings when prompted by the extension

Alternative: VSCodium

For privacy-conscious users who prefer telemetry-free software, VSCodium provides the same editor without Microsoft's telemetry and branding. Download it from vscodium.com. Note that VSCodium ships with the Open VSX extension registry rather than the Microsoft marketplace, so check that the RooCode extension is available there before switching.


Part 2: Local Model Server Setup

LM Studio serves as your local inference engine, providing an OpenAI-compatible API that RooCode can consume. It handles both the language model for code generation and the embedding model for semantic search.

Download and Installation:

  1. Download LM Studio from lmstudio.ai
  2. Install following platform-specific instructions
  3. Launch the application

Loading Your Models:

The LM Studio interface requires a specific navigation path to access all features:

  • For Language Models: Use the main chat interface or "Power User" tab
  • For Embedding Models: You must use the "Developer" tab—this is not accessible from the standard chat interface and represents a common point of confusion

Starting the Server:

  1. Click "Power User" in the sidebar
  2. Load your chosen language model (Devstral recommended, see next section)
  3. If using embeddings for docs-mcp-server, load snowflake-arctic-embed2 from the Developer tab
  4. Click "Start Server" to expose the OpenAI-compatible API

Server URL: Your local server will be available at http://localhost:1234/v1

Alternative: Direct llama.cpp Implementation

For users preferring a minimal setup without GUI wrappers, llama.cpp provides direct model serving with lower resource overhead:

  • Eliminates wrapper dependencies and update management
  • Reduces background processes during development
  • Pairs well with llama-swap for automatic model loading/unloading

Implementation Note: Both LM Studio and llama.cpp produce OpenAI-compatible endpoints, meaning RooCode cannot distinguish between them. This abstraction allows you to switch implementations without modifying your editor configuration.
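Because the endpoint follows the OpenAI chat-completions schema, you can smoke-test it with a few lines of Python before wiring up RooCode. The sketch below uses only the standard library; the model identifier `devstral-small-2505` is an assumption, so substitute whatever name LM Studio reports for your loaded model:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default endpoint

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature suits code generation
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # The local server ignores the key, but OpenAI-style clients expect one
            "Authorization": "Bearer not-needed",
        },
    )

# Sending the request (requires a running LM Studio server):
#   with urllib.request.urlopen(build_chat_request("devstral-small-2505", "hi")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

If this returns a JSON response with a `choices` array, RooCode will work against the same URL.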


Part 3: Language Model Selection

Recommended: Devstral (Unsloth Fine-tune)

Devstral is specifically optimized for coding tasks with strong tool-use capabilities, making it ideal for agentic coding workflows. The devstral-small-2505@iq2_m variant fits comfortably within 10GB of VRAM while maintaining reasonable quality.

Recommended Quantization Hierarchy:

  Quantization   VRAM Usage   Quality Assessment
  Q4_K_M         ~7GB         Recommended for best quality/performance balance
  IQ4_XS         ~6.5GB       Minimum acceptable for agentic workflows
  IQ2_M          ~5GB         Functional but severely compressed (use only if necessary)

Configuration in LM Studio:

  1. Download Devstral from your preferred model repository (HuggingFace, etc.)
  2. Load the model through LM Studio
  3. Configure context length to 32,768 tokens for complex projects

Alternative Models Worth Considering

Several alternatives may suit specific use cases or hardware configurations:

Qwen3: Demonstrates strong context handling and works well at extended context windows (tested up to 52K). Some users report needing around 24GB available RAM for optimal RooCode performance with Qwen models.

DeepSeek-Coder: Provides excellent explanations alongside code generation, running nearly instantly on RTX 4090 hardware. Users report it feels more helpful than Devstral or Codestral for understanding-driven workflows.

vLLM + Qwen3-32B: For users wanting maximum capability without quantization artifacts, this combination offers speed and capabilities comparable to commercial solutions when paired with sufficient GPU resources.

Mistral Small (IQ4): Reported as functional for coding tasks by some community members, though Devstral remains the recommended starting point.


Part 4: Embedding Model Setup

The embedding model powers semantic search within docs-mcp-server, enabling accurate documentation lookup without bloating your context window with entire reference pages.

Recommended: snowflake-arctic-embed2

This compact embedding model provides excellent performance for documentation retrieval while remaining lightweight enough to run alongside your language model.

Model Name: text-embedding-snowflake-arctic-embed-l-v2.0

Loading in LM Studio:

  1. Switch to the Developer tab (not accessible from chat interface)
  2. Search for and load the embedding model
  3. The model will be ready when docs-mcp-server queries it
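Conceptually, docs-mcp-server embeds both your query and the stored documentation chunks with this model, then returns the chunks whose vectors are closest to the query. A toy sketch of that ranking step (not the server's actual code) using hand-made low-dimensional vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_chunks(query_vec: list[float], chunk_vecs: list[list[float]]) -> list[int]:
    """Return chunk indices ordered by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy example: the query vector points the same way as chunk 1,
# so chunk 1 ranks first, the diagonal chunk 2 second, the orthogonal chunk 0 last.
order = rank_chunks([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
```

Real embeddings from snowflake-arctic-embed2 are high-dimensional, but the retrieval logic is the same: only the top-ranked chunks are injected into the model's context.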

Alternative Embedding Options

Qwen3 0.6b Embedding: Recent releases suggest this smaller Qwen variant may outperform Arctic embeddings when properly configured, though real-world testing is ongoing.


Part 5: Documentation MCP Server

The docs-mcp-server component bridges your local models with official documentation, grounding responses in verified sources and dramatically reducing hallucination rates for syntax and API usage.

Installation via Docker (Recommended)

Docker provides isolated execution with easier management through tools like Portainer.

Prerequisites:

  • Docker installed and running (Docker Desktop on Windows/macOS, Docker Engine on Linux)

MCP Configuration:

Add the following to your RooCode MCP settings file (mcp_settings.json):

{
  "mcpServers": {
    "docs-mcp-server": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-p",
        "6280:6280",
        "-p",
        "6281:6281",
        "-e",
        "OPENAI_API_KEY",
        "-e",
        "OPENAI_API_BASE",
        "-e",
        "DOCS_MCP_EMBEDDING_MODEL",
        "-v",
        "docs-mcp-data:/data",
        "ghcr.io/arabold/docs-mcp-server:latest"
      ],
      "env": {
        "OPENAI_API_KEY": "ollama",
        "OPENAI_API_BASE": "http://host.docker.internal:1234/v1",
        "DOCS_MCP_EMBEDDING_MODEL": "text-embedding-snowflake-arctic-embed-l-v2.0"
      }
    }
  }
}

Installation via Node.js (Alternative)

If Docker is not available, the docs-mcp-server can run directly through Node.js:

{
  "mcpServers": {
    "docs-mcp-server": {
      "command": "npx",
      "args": [
        "@arabold/docs-mcp-server@latest"
      ],
      "env": {
        "OPENAI_API_KEY": "ollama",
        "OPENAI_API_BASE": "http://host.docker.internal:1234/v1",
        "DOCS_MCP_EMBEDDING_MODEL": "text-embedding-snowflake-arctic-embed-l-v2.0"
      }
    }
  }
}

Note: The OPENAI_API_KEY value of "ollama" is an arbitrary placeholder; LM Studio's local server does not validate API keys, but OpenAI-compatible clients require a non-empty value.
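A malformed mcp_settings.json can fail silently in some setups, so a small sanity-check script saves debugging time. This is a hedged sketch that checks only the server name and env keys used in this guide's configuration, not everything RooCode accepts:

```python
import json

# Env keys this guide's docs-mcp-server configuration relies on
REQUIRED_ENV = ("OPENAI_API_KEY", "OPENAI_API_BASE", "DOCS_MCP_EMBEDDING_MODEL")

def check_mcp_settings(text: str) -> list[str]:
    """Return a list of problems found in an mcp_settings.json document."""
    problems = []
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    servers = config.get("mcpServers", {})
    if "docs-mcp-server" not in servers:
        return ["missing 'docs-mcp-server' entry under 'mcpServers'"]
    env = servers["docs-mcp-server"].get("env", {})
    for key in REQUIRED_ENV:
        if not env.get(key):
            problems.append(f"env variable '{key}' is missing or empty")
    return problems

# Usage: print(check_mcp_settings(open(path_to_mcp_settings).read()))
```

An empty list means the keys this guide depends on are all present.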


Part 6: Adding Documentation

Once configured, teach your assistant about specific languages and frameworks by scraping their official documentation.

Commands for Documentation Import

Required Parameters:

  Parameter   Description
  url         Link to the documentation root or relevant section
  library     Name of the programming language or library
  version     Specific version string (enables accurate syntax for your version)

Optional Parameters:

{
  "maxPages": 1000,       // Maximum pages to crawl (default)
  "maxDepth": 3,          // Navigation depth limit
  "scope": "subpages",    // Crawling boundary: 'subpages', 'hostname', or 'domain'
  "followRedirects": true // Handle HTTP 3xx redirects automatically
}

Example Usage:

Please scrape the Flutter documentation using:
- URL: https://docs.flutter.dev/
- Library: flutter
- Version: 3.24.0
- MaxDepth: 4

Using Documentation During Development

When working with unfamiliar APIs or verifying syntax, instruct your assistant:

Search the docs for how to implement pagination in Flutter Riverpod v2.8.

The MCP server automatically retrieves relevant documentation and provides it as context, ensuring accurate implementation guidance without manual web searches.


Part 7: Editor Configuration Reference

RooCode Settings Location

The MCP configuration file (mcp_settings.json) is located within VSCode's settings directory:

  • Windows: %APPDATA%\Code\User\globalStorage\rooveterinaryinc.roo-cline\mcp_settings.json
  • macOS: ~/Library/Application Support/Code/User/globalStorage/rooveterinaryinc.roo-cline/mcp_settings.json
  • Linux: ~/.config/Code/User/globalStorage/rooveterinaryinc.roo-cline/mcp_settings.json

Quick Configuration Checklist

  1. ✅ VSCode installed with RooCode extension
  2. ✅ LM Studio running with Devstral loaded
  3. ✅ snowflake-arctic-embed2 loaded in LM Studio Developer tab
  4. ✅ LM Studio server started on port 1234
  5. ✅ docs-mcp-server configured and running
  6. ✅ Documentation for your primary languages imported

Alternative Editor Options

Zed Editor

Zed offers a polished, performance-focused editing experience with AI integration:

  • Platforms: macOS and Linux (Windows version in development; nightly builds available)
  • Limitation: Cannot pass editor problems as context to the AI
  • Trade-off: Superior performance but less contextual awareness

Void Editor

Void is a VSCode fork with integrated AI features:

  • Marketplace Access: Retains access to the full VSCode extension ecosystem
  • Built-in Agentic Coding: Lightweight agent implementation compared to RooCode
  • Privacy Concerns: Community-reported telemetry issues (see the project's GitHub discussions)
  • Recommendation: Worth monitoring, particularly if RooCode proves problematic

Continue and Aider

These tools offer alternative approaches to AI-assisted coding:

  • Continue: Non-agentic IDE extension focused on inline assistance
  • Aider: Command-line focused implementation; choice depends on workflow preference

Performance Considerations

Context Management

The 32K token context window works effectively when combined with docs-mcp-server for documentation lookup. The RAG (Retrieval-Augmented Generation) approach provides necessary details without consuming your context budget.

If you need more context: Qwen3 has been tested successfully at 52K tokens, though system requirements increase proportionally.
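To reason about whether a retrieved documentation chunk will still fit, a rough chars-per-token heuristic is usually enough. The sketch below assumes ~4 characters per token, a common rule of thumb for English text; real tokenizers vary by model and content:

```python
CHARS_PER_TOKEN = 4  # rough heuristic; actual tokenizers vary by model and language

def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_budget(window: int, used: int, chunk: str, reserve: int = 2048) -> bool:
    """Check whether a retrieved doc chunk fits, reserving room for the model's reply."""
    return used + estimate_tokens(chunk) + reserve <= window

# Example: with 20,000 tokens already used in a 32,768-token window,
# a ~4,000-character doc chunk (~1,000 tokens) still leaves reply headroom.
ok = fits_in_budget(32_768, 20_000, "x" * 4_000)
```

This is the budgeting intuition behind RAG: retrieving a 1,000-token chunk on demand is far cheaper than keeping an entire reference manual in context.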

Quantization Trade-offs

Lower quantizations save VRAM but impact quality:

  • IQ2_M: Borderline usable for agentic workflows; expect more errors requiring human correction
  • IQ4_XS: Minimum recommended for reliable coding tasks
  • Q4_K_M: Preferred balance of quality and resource usage

Hardware Scaling

For users with access to additional GPUs, consider adding a p104-100 (~$25 used) as a secondary card. This allows running higher-quality quantizations (IQ4 or better) without compromising your primary graphics workload.


Complete Docker Compose Reference

For users wanting all services defined in a single configuration file:

version: '3.8'

services:
  docs-mcp-server:
    image: ghcr.io/arabold/docs-mcp-server:latest
    ports:
      - "6280:6280"
      - "6281:6281"
    environment:
      - OPENAI_API_KEY=ollama
      - OPENAI_API_BASE=http://host.docker.internal:1234/v1
      - DOCS_MCP_EMBEDDING_MODEL=text-embedding-snowflake-arctic-embed-l-v2.0
    volumes:
      - docs-mcp-data:/data

volumes:
  docs-mcp-data:

Note: LM Studio itself is a desktop application and does not run in Docker; this compose file only covers the MCP server component.


Troubleshooting Common Issues

LM Studio Embedding Model Not Found

Embedding models are only accessible through the Developer tab, which is separate from the chat interface. This UI design often trips up new users—switch to Developer view to load embedding models.

Server Connection Failures

  1. Verify LM Studio server is running (look for "Server Running" indicator)
  2. Check port 1234 is not in use by another application
  3. Confirm OPENAI_API_BASE points to your local instance: http://host.docker.internal:1234/v1
  4. For Docker on Linux, host.docker.internal is not defined by default; add --add-host=host.docker.internal:host-gateway to the docker run arguments (Docker 20.10+)
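A quick way to test step 2 without extra tooling is a short TCP probe. A minimal sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# Usage: port_open("localhost", 1234) is True when the LM Studio server is up,
# False when nothing is listening on that port.
```

If this returns False for localhost:1234 while LM Studio reports the server as running, check whether LM Studio was configured to listen on a non-default port.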

Model Quality Issues with Devstral

If tool calls fail frequently:

  • Consider custom system prompts that emphasize tool usage patterns
  • Qwen3 or vLLM backends may provide more reliable tool handling for some users
  • Ensure context window isn't maxed out, which degrades model reasoning

Conclusion

This setup provides a production-ready, self-hosted alternative to commercial AI coding assistants. The combination of agentic coding through RooCode, local inference via LM Studio, and documentation grounding through docs-mcp-server creates a system that learns from your project's actual dependencies rather than training data.

The stack remains flexible—swap the embedding model for newer alternatives, replace Devstral with Qwen3 or DeepSeek-Coder as they improve, or migrate to llama.cpp when you want minimal overhead. The OpenAI-compatible API ensures your editor configuration remains stable regardless of backend changes.

Regularly update your local models and documentation cache as new versions release. Languages evolve; keeping your documentation current prevents regressions in generated code quality.
