This tutorial guides you through installing and setting up llama-swap and llama.cpp for running local LLMs.
Download llama.cpp from the releases page:
Go to https://github.com/ggml-org/llama.cpp/releases/tag/b8117 and download the appropriate release for your system.
Extract the downloaded archive to your chosen directory, e.g. /mnt/AI/:
tar -xzf llama.cpp-b8117-linux.tar.gz -C /mnt/AI/
Make the binary executable (if needed):
chmod +x /mnt/AI/llama-b8117/llama-server
Verify the installation:
/mnt/AI/llama-b8117/llama-server --help
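If you prefer scripting the download, GitHub release assets follow a predictable URL pattern; a small sketch (the helper name and the example asset filename are my assumptions, so check the releases page for the asset matching your system):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the download URL for a llama.cpp release asset.
# GitHub release assets live under /releases/download/<tag>/<asset>.
release_url() {
  local tag="$1" asset="$2"
  echo "https://github.com/ggml-org/llama.cpp/releases/download/${tag}/${asset}"
}

# Asset name here is an assumption; pick the right one for your platform.
release_url b8117 llama.cpp-b8117-linux.tar.gz
# → https://github.com/ggml-org/llama.cpp/releases/download/b8117/llama.cpp-b8117-linux.tar.gz
```

From there, `curl -LO "$(release_url <tag> <asset>)"` followed by the tar command above does the same as the manual steps.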
Download llama-swap from the releases page:
Go to https://github.com/mostlygeek/llama-swap/releases and download the latest release.
Extract the downloaded archive to your chosen directory, e.g. /mnt/AI/:
tar -xzf llama-swap-linux.tar.gz -C /mnt/AI/
Make the binary executable:
chmod +x /mnt/AI/llama-swap/llama-swap
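Before wiring these paths into a config, it can save debugging time to confirm both binaries are actually executable; a quick sketch (the paths are the ones assumed throughout this tutorial):

```shell
#!/usr/bin/env bash

# Report whether a path exists and is executable.
check_exec() {
  if [ -x "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not executable: $1"
  fi
}

check_exec /bin/sh                           # prints: ok: /bin/sh
check_exec /mnt/AI/llama-b8117/llama-server  # tutorial paths; adjust to yours
check_exec /mnt/AI/llama-swap/llama-swap
```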
Create a config.yaml file in your llama-swap directory using the following example. Note that I use LM Studio to download models, and I point llama-swap at that model folder.
startPort: 9090  # first port assigned to spawned llama-server instances via ${PORT}; kept off 8080 so they don't collide with llama-swap's own listen port
models:
  "Qwen3-VL-30B-A3B-131k-Tnk-Heretic":
    ttl: 600
    cmd: |
      /mnt/AI/llama-b8117/llama-server
      -m "/mnt/AI/models/mradermacher/Qwen3-VL-30B-A3B-Thinking-Heretic-GGUF/Qwen3-VL-30B-A3B-Thinking-Heretic.Q8_0.gguf"
      --mmproj "/mnt/AI/models/mradermacher/Qwen3-VL-30B-A3B-Thinking-Heretic-GGUF/Qwen3-VL-30B-A3B-Thinking-Heretic.mmproj-f16.gguf"
      --host 0.0.0.0
      --port ${PORT}
      -ngl -1
      -t -1
      -c 131072
      --tensor-split 56,43
      --flash-attn on
      --temp 0.6
      --top-p 0.95
      --min-p 0.01
      --top-k 20
  "Qwen3-Coder-30B-A3B-256k-NoTnk":
    ttl: 600
    cmd: |
      /mnt/AI/llama-b8117/llama-server
      -m "/mnt/AI/models/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf"
      --host 0.0.0.0
      --port ${PORT}
      -ngl -1
      -c 128000
      --tensor-split 56,43
      --flash-attn auto
      --temp 0.7
      --top-p 0.8
      --top-k 20
      --repeat-penalty 1.05
  "Qwen3-Next-384E-Ablit-Instruct-262k":
    ttl: 1200
    cmd: |
      /mnt/AI/llama-b8117/llama-server
      -m "/mnt/AI/models/mradermacher/Qwen3-Next-384E-Abliterated-Instruct-i1-GGUF/Qwen3-Next-384E-Abliterated-Instruct.i1-IQ4_XS.gguf"
      --host 0.0.0.0
      --port ${PORT}
      -ngl -1
      -t -1
      -c 262144
      --tensor-split 56,43
      --flash-attn on
      --temp 0.6
      --top-p 0.95
      --min-p 0.01
      --top-k 20
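A quick way to sanity-check the file before starting anything is to list the model names it defines; this sketch (the function name is mine) assumes the quoted-key layout used above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the quoted model names (keys ending in a colon) from a llama-swap config.
list_models() {
  sed -n 's/^[[:space:]]*"\([^"]*\)":[[:space:]]*$/\1/p' "$1"
}

# Demo against a minimal config written to a temp file:
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
models:
  "Qwen3-Coder-30B-A3B-256k-NoTnk":
    ttl: 600
EOF
list_models "$cfg"   # prints: Qwen3-Coder-30B-A3B-256k-NoTnk
rm -f "$cfg"
```

Run it as `list_models /mnt/AI/llama-swap/config.yaml` once your real config is in place; the names printed are the IDs you will later see from the /v1/models endpoint.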
Important configuration note:
- The --mmproj option in the first model entry is usually required for vision models (like Qwen3-VL) to work properly. It points to the multimodal projection file (.mmproj) that enables the vision capabilities.
Create a shell script (start_llamaswap.sh) with the following content:
#!/usr/bin/env bash
# Launch llama-swap with its config file
/mnt/AI/llama-swap/llama-swap --config /mnt/AI/llama-swap/config.yaml --listen 0.0.0.0:8080 --watch-config
Make it executable:
chmod +x start_llamaswap.sh
Run the script:
./start_llamaswap.sh
Create a desktop entry file (llama-swap_Launcher.desktop) with the following content:
[Desktop Entry]
Type=Application
Name=Llama Swap
Exec=/mnt/AI/llama-swap/llama-swap --config /mnt/AI/llama-swap/config.yaml --listen 0.0.0.0:8080 --watch-config
Categories=Education;Development;AI;
Icon=
Comment=Run llama-swap server in terminal.
Path=
Terminal=true
StartupNotify=false
Make it executable:
chmod +x llama-swap_Launcher.desktop
Place it in ~/.local/share/applications/ to use it from your application menu.
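If you'd rather have llama-swap start automatically in the background, a systemd user unit is an alternative to the desktop entry. A sketch (the unit name and file location are my suggestion; the ExecStart line reuses the command from the script above), saved as ~/.config/systemd/user/llama-swap.service:

```ini
[Unit]
Description=llama-swap model proxy
After=network-online.target

[Service]
ExecStart=/mnt/AI/llama-swap/llama-swap --config /mnt/AI/llama-swap/config.yaml --listen 0.0.0.0:8080 --watch-config
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable it with `systemctl --user daemon-reload && systemctl --user enable --now llama-swap.service`.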
The server will start on http://localhost:8080/ by default. You can verify it's running by checking:
curl http://localhost:8080/v1/models
You should see a list of configured models.
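The response is an OpenAI-compatible model list; to pull out just the model IDs, something like the following works (a canned response stands in for curl here, so the sketch runs without a live server):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Canned /v1/models response in the OpenAI-compatible list format.
response='{"object":"list","data":[{"id":"Qwen3-Coder-30B-A3B-256k-NoTnk","object":"model"}]}'

# Print each model id; against a live server, pipe curl output in instead:
#   curl -s http://localhost:8080/v1/models | python3 -c '...'
echo "$response" | python3 -c '
import json, sys
for m in json.load(sys.stdin)["data"]:
    print(m["id"])
'
# prints: Qwen3-Coder-30B-A3B-256k-NoTnk
```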
Troubleshooting: if startup fails with a port conflict, change startPort in config.yaml or kill any existing processes on port 8080.