A complete guide to building a free, private, and powerful AI coding assistant that works on consumer hardware. This setup combines VSCode with agentic capabilities, local language models for code generation, and intelligent documentation lookup to minimize hallucinations.
This stack delivers a self-hosted AI coding environment that addresses four key requirements: it is free, keeps your code private, runs on consumer hardware, and grounds its output to minimize hallucinations.
The solution combines six core components that work together: VSCode as the editor platform, RooCode for agentic coding capabilities, LM Studio as the local model server, Devstral for code generation, snowflake-arctic-embed2 for semantic search, and docs-mcp-server to connect everything with live documentation.
Tested Configuration:
This configuration comfortably runs Devstral at IQ2 quantization with a 32,768 token context window. The setup should function on both lower-end hardware (with appropriate model and quantization choices) and higher-end systems (allowing for larger models or longer contexts). Your actual experience will depend on the specific models and quantizations you select.
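When sizing models against your GPU, a useful rule of thumb is that quantized weights occupy roughly parameter count times bits-per-weight divided by eight. The sketch below is my own back-of-envelope helper, not part of the stack; real GGUF files add metadata, and the KV cache grows with context length, so treat the result as a lower bound.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough VRAM needed for the quantized weights alone, in GB.

    Excludes the KV cache and runtime overhead, which grow with context length.
    """
    return params_billion * bits_per_weight / 8

# Hypothetical examples; the bits-per-weight values are approximate:
print(round(estimate_weights_gb(7, 4.8), 1))   # 7B at ~Q4_K_M bpw → 4.2
print(round(estimate_weights_gb(13, 2.7), 1))  # 13B at ~IQ2_M bpw → 4.4
```

Leave a couple of gigabytes of headroom on top of this estimate for the context window before committing to a quantization.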
Visual Studio Code provides an extensible foundation for AI-assisted development. The RooCode extension adds agentic coding capabilities and MCP (Model Context Protocol) support, enabling sophisticated interactions with your local models.
Installation Steps:
Download Visual Studio Code from code.visualstudio.com
Install the RooCode extension:
Search the Extensions marketplace for RooVeterinaryInc.roo-cline
Configure initial settings when prompted by the extension
Alternative: VSCodium
For privacy-conscious users who prefer telemetry-free software, VSCodium (available at vscodium.com) provides identical functionality without Microsoft's data collection.
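Both VSCode and VSCodium expose the same command-line interface, so the extension can also be installed non-interactively. The helper below is just an illustration that assembles the command for inspection before you run it; substitute `code` for `codium` on stock VSCode.

```python
import subprocess

def install_extension_cmd(cli: str, extension_id: str) -> list[str]:
    """Build the editor-CLI command that installs an extension."""
    return [cli, "--install-extension", extension_id]

cmd = install_extension_cmd("codium", "RooVeterinaryInc.roo-cline")
print(" ".join(cmd))  # codium --install-extension RooVeterinaryInc.roo-cline
# To actually execute it: subprocess.run(cmd, check=True)
```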
LM Studio serves as your local inference engine, providing an OpenAI-compatible API that RooCode can consume. It handles both the language model for code generation and the embedding model for semantic search.
Download and Installation:
Get LM Studio from lmstudio.ai and run the installer for your platform.
Loading Your Models:
The LM Studio interface requires a specific navigation path to access all features:
Starting the Server:
Load snowflake-arctic-embed2 from the Developer tab.
Server URL: Your local server will be available at http://localhost:1234/v1
For users preferring a minimal setup without GUI wrappers, llama.cpp provides direct model serving with lower resource overhead:
Implementation Note: Both LM Studio and llama.cpp produce OpenAI-compatible endpoints, meaning RooCode cannot distinguish between them. This abstraction allows you to switch implementations without modifying your editor configuration.
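To see what that abstraction means in practice, the sketch below builds a raw chat-completions request against the assumed local endpoint using only the standard library. Swapping LM Studio for llama.cpp would change nothing here except, possibly, the base URL; the model name is an assumption based on the Devstral variant discussed later.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default; llama.cpp often uses :8080

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble an OpenAI-format chat-completions request (not yet sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature suits deterministic code generation
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # local servers ignore the key
        },
    )

req = build_chat_request("devstral-small-2505", "Write a Python hello world.")
print(req.full_url)  # http://localhost:1234/v1/chat/completions
```

Send it with `urllib.request.urlopen(req)` only once the server is actually running.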
Devstral is specifically optimized for coding tasks with strong tool-use capabilities, making it ideal for agentic coding workflows. The devstral-small-2505@iq2_m variant fits comfortably within 10GB of VRAM while maintaining reasonable quality.
Recommended Quantization Hierarchy:
| Quantization | VRAM Usage | Quality Assessment |
|---|---|---|
| Q4_K_M | ~7GB | Recommended for best quality/performance balance |
| IQ4_XS | ~6.5GB | Minimum acceptable for agentic workflows |
| IQ2_M | ~5GB | Functional but severely compressed (use only if necessary) |
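The table above can be encoded as a simple picker that returns the highest-quality quantization fitting a VRAM budget. The VRAM figures are the approximate ones from the table and should be treated as assumptions; measure on your own hardware before relying on them.

```python
# (name, approx. VRAM in GB, note), ordered best quality first
QUANTS = [
    ("Q4_K_M", 7.0, "best quality/performance balance"),
    ("IQ4_XS", 6.5, "minimum acceptable for agentic workflows"),
    ("IQ2_M", 5.0, "functional but severely compressed"),
]

def pick_quant(vram_budget_gb: float):
    """Return the highest-quality quantization that fits, or None."""
    for name, vram, note in QUANTS:
        if vram <= vram_budget_gb:
            return name, note
    return None

print(pick_quant(6.6))  # → ('IQ4_XS', 'minimum acceptable for agentic workflows')
```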
Configuration in LM Studio:
Several alternatives may suit specific use cases or hardware configurations:
Qwen3: Demonstrates strong context handling and works well at extended context windows (tested up to 52K). Some users report needing around 24GB available RAM for optimal RooCode performance with Qwen models.
DeepSeek-Coder: Provides excellent explanations alongside code generation, running nearly instantly on RTX 4090 hardware. Users report it feels more helpful than Devstral or Codestral for understanding-driven workflows.
VLLM + Qwen3-32B: For users wanting maximum capability without quantization artifacts, this combination offers speed and capabilities comparable to commercial solutions when paired with sufficient GPU resources.
Mistral Small (IQ4): Reported as functional for coding tasks by some community members, though Devstral remains the recommended starting point.
The embedding model powers semantic search within docs-mcp-server, enabling accurate documentation lookup without bloating your context window with entire reference pages.
This compact embedding model provides excellent performance for documentation retrieval while remaining lightweight enough to run alongside your language model.
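To illustrate what "semantic search" means here: documents and the query are mapped to vectors, and the documents closest to the query by cosine similarity are injected into the model's context. The vectors below are made up for the toy example; in the real stack, snowflake-arctic-embed2 produces them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented 3-dimensional "embeddings" standing in for real model output:
docs = {
    "pagination guide": [0.9, 0.1, 0.0],
    "installation page": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # → pagination guide
```

Real embeddings have hundreds of dimensions, but the retrieval logic is exactly this.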
Model Name: text-embedding-snowflake-arctic-embed-l-v2.0
Loading in LM Studio:
Qwen3 0.6b Embedding: Recent releases suggest this smaller Qwen variant may outperform Arctic embeddings when properly configured, though real-world testing is ongoing.
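On the wire, an embedding lookup against the local server follows the OpenAI embeddings format. The sketch below assembles such a request without sending it; the input string is invented, and the model name matches the one configured for docs-mcp-server later.

```python
import json
import urllib.request

payload = {
    "model": "text-embedding-snowflake-arctic-embed-l-v2.0",
    "input": "How do I paginate a ListView in Flutter?",
}
req = urllib.request.Request(
    "http://localhost:1234/v1/embeddings",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)  # http://localhost:1234/v1/embeddings
```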
The docs-mcp-server component bridges your local models with official documentation, grounding responses in verified sources and dramatically reducing hallucination rates for syntax and API usage.
Docker provides isolated execution with easier management through tools like Portainer.
Prerequisites: a working Docker installation (Docker Engine or Docker Desktop).
MCP Configuration:
Add the following to your RooCode MCP settings file (mcp_settings.json):
{
"mcpServers": {
"docs-mcp-server": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"-p",
"6280:6280",
"-p",
"6281:6281",
"-e",
"OPENAI_API_KEY",
"-e",
"OPENAI_API_BASE",
"-e",
"DOCS_MCP_EMBEDDING_MODEL",
"-v",
"docs-mcp-data:/data",
"ghcr.io/arabold/docs-mcp-server:latest"
],
"env": {
"OPENAI_API_KEY": "ollama",
"OPENAI_API_BASE": "http://host.docker.internal:1234/v1",
"DOCS_MCP_EMBEDDING_MODEL": "text-embedding-snowflake-arctic-embed-l-v2.0"
}
}
}
}
If Docker is not available, the docs-mcp-server can run directly through Node.js:
{
"mcpServers": {
"docs-mcp-server": {
"command": "npx",
"args": [
"@arabold/docs-mcp-server@latest"
],
"env": {
"OPENAI_API_KEY": "ollama",
"OPENAI_API_BASE": "http://localhost:1234/v1",
"DOCS_MCP_EMBEDDING_MODEL": "text-embedding-snowflake-arctic-embed-l-v2.0"
}
}
}
}
Note: The OPENAI_API_KEY value of "ollama" is a placeholder; the server authenticates via the API base URL rather than actual key validation.
Once configured, teach your assistant about specific languages and frameworks by scraping their official documentation.
Required Parameters:
| Parameter | Description |
|---|---|
| url | Link to the documentation root or relevant section |
| library | Name of the programming language or library |
| version | Specific version string (enables accurate syntax for your version) |
Optional Parameters:
{
"maxPages": 1000, // Maximum pages to crawl (default)
"maxDepth": 3, // Navigation depth limit
"scope": "subpages", // Crawling boundary: 'subpages', 'hostname', or 'domain'
"followRedirects": true // Handle HTTP 3xx redirects automatically
}
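The parameters above can be pictured as a small job description that the server validates before crawling. The helper below is a sketch of that idea, not docs-mcp-server's actual code; the defaults mirror the optional-parameter listing and should be treated as assumptions.

```python
# Defaults taken from the optional-parameter listing above (assumed values).
DEFAULTS = {"maxPages": 1000, "maxDepth": 3, "scope": "subpages", "followRedirects": True}
VALID_SCOPES = {"subpages", "hostname", "domain"}

def make_scrape_job(url: str, library: str, version: str, **options):
    """Merge required parameters with defaults and validate the scope option."""
    job = {**DEFAULTS, "url": url, "library": library, "version": version, **options}
    if job["scope"] not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}")
    return job

job = make_scrape_job("https://docs.flutter.dev/", "flutter", "3.24.0", maxDepth=4)
print(job["maxDepth"], job["maxPages"])  # → 4 1000
```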
Example Usage:
Please scrape the Flutter documentation using:
- URL: https://docs.flutter.dev/
- Library: flutter
- Version: 3.24.0
- MaxDepth: 4
When working with unfamiliar APIs or verifying syntax, instruct your assistant:
Search the docs for how to implement pagination in Flutter Riverpod v2.8.
The MCP server automatically retrieves relevant documentation and provides it as context, ensuring accurate implementation guidance without manual web searches.
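Conceptually, the retrieved snippets end up prepended to the task before the model sees it. The sketch below shows that assembly step in miniature; the snippet text and prompt wording are invented for illustration, not what RooCode literally sends.

```python
def assemble_prompt(question: str, snippets: list[str]) -> str:
    """Prepend retrieved documentation passages to the user's task."""
    context = "\n---\n".join(snippets)
    return f"Use only the documentation below.\n\n{context}\n\nTask: {question}"

snippets = ["Riverpod 2.8: paginate with AsyncNotifier keyed by page index ..."]
prompt = assemble_prompt("Implement pagination in Flutter Riverpod v2.8.", snippets)
print(prompt.splitlines()[0])  # → Use only the documentation below.
```

Because only the relevant passages are included, the rest of the context window stays free for your actual code.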
The MCP configuration file (mcp_settings.json) is located within VSCode's settings directory:
Windows: %APPDATA%\Code\User\globalStorage\rooveterinaryinc.roo-cline\mcp_settings.json
macOS: ~/Library/Application Support/Code/User/globalStorage/rooveterinaryinc.roo-cline/mcp_settings.json
Linux: ~/.config/Code/User/globalStorage/rooveterinaryinc.roo-cline/mcp_settings.json
Zed offers a polished, performance-focused editing experience with AI integration:
Void is a VSCode fork with integrated AI features:
These tools offer alternative approaches to AI-assisted coding:
The 32K token context window works effectively when combined with docs-mcp-server for documentation lookup. The RAG (Retrieval-Augmented Generation) approach provides necessary details without consuming your context budget.
If you need more context: Qwen3 has been tested successfully at 52K tokens, though system requirements increase proportionally.
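Some back-of-envelope budgeting shows why RAG snippets beat pasting whole reference pages. The token counts below are illustrative assumptions (roughly four characters per token), not measurements.

```python
CONTEXT_WINDOW = 32_768  # the context size used in this guide's configuration

def remaining_budget(used: int) -> int:
    """Tokens left for code and conversation after documentation context."""
    return CONTEXT_WINDOW - used

full_page = 12_000       # a large API reference page, fully tokenized (assumed)
rag_snippets = 3 * 400   # three retrieved passages of ~400 tokens each (assumed)

print(remaining_budget(full_page))     # → 20768
print(remaining_budget(rag_snippets))  # → 31568
```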
Lower quantizations save VRAM but impact quality: heavily compressed variants like IQ2_M remain functional yet degrade noticeably, so prefer Q4_K_M or IQ4_XS whenever memory allows.
For users with a spare PCIe slot, consider adding a P104-100 (~$25 used) as a secondary card. This allows running higher-quality quantizations (IQ4 or better) without compromising your primary graphics workload.
For users wanting all services defined in a single configuration file:
version: '3.8'
services:
docs-mcp-server:
image: ghcr.io/arabold/docs-mcp-server:latest
ports:
- "6280:6280"
- "6281:6281"
environment:
- OPENAI_API_KEY=ollama
- OPENAI_API_BASE=http://host.docker.internal:1234/v1
- DOCS_MCP_EMBEDDING_MODEL=text-embedding-snowflake-arctic-embed-l-v2.0
volumes:
- docs-mcp-data:/data
volumes:
docs-mcp-data:
Note: LM Studio itself is a desktop application and does not run in Docker; this compose file only covers the MCP server component.
Embedding models are only accessible through the Developer tab, which is separate from the chat interface. This UI design often trips up new users—switch to Developer view to load embedding models.
Verify that OPENAI_API_BASE points to your local instance: http://host.docker.internal:1234/v1.
On Linux, host.docker.internal may require extra network configuration (for example, adding --add-host=host.docker.internal:host-gateway to the Docker arguments).
If tool calls fail frequently, move to a higher-quality quantization: severely compressed models such as IQ2_M struggle with the structured tool calls that agentic workflows depend on.
This setup provides a production-ready, self-hosted alternative to commercial AI coding assistants. The combination of agentic coding through RooCode, local inference via LM Studio, and documentation grounding through docs-mcp-server creates a system that learns from your project's actual dependencies rather than training data.
The stack remains flexible—swap the embedding model for newer alternatives, replace Devstral with Qwen3 or DeepSeek-Coder as they improve, or migrate to llama.cpp when you want minimal overhead. The OpenAI-compatible API ensures your editor configuration remains stable regardless of backend changes.
Regularly update your local models and documentation cache as new versions release. Languages evolve; keeping your documentation current prevents regressions in generated code quality.