This tutorial guides you through installing and setting up llama-swap and llama.cpp for running local LLMs.
Download llama.cpp from the releases page:
Go to https://github.com/ggml-org/llama.cpp/releases/tag/b8117 and download the appropriate release for your system.
Extract the downloaded archive to your chosen directory, e.g. /mnt/AI/:
tar -xzf llama.cpp-b8117-linux.tar.gz -C /mnt/AI/
Make the binary executable (if needed):
chmod +x /mnt/AI/llama-b8117/llama-server
Verify the installation:
/mnt/AI/llama-b8117/llama-server --help
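If you prefer scripting the download, GitHub release assets follow a predictable URL pattern; a small sketch (the helper name and the example asset filename are my assumptions, so check the releases page for the asset matching your system):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the download URL for a llama.cpp release asset.
# GitHub release assets live under /releases/download/<tag>/<asset>.
release_url() {
  local tag="$1" asset="$2"
  echo "https://github.com/ggml-org/llama.cpp/releases/download/${tag}/${asset}"
}

# Asset name here is an assumption; pick the right one for your platform.
release_url b8117 llama.cpp-b8117-linux.tar.gz
# → https://github.com/ggml-org/llama.cpp/releases/download/b8117/llama.cpp-b8117-linux.tar.gz
```

From there, `curl -LO "$(release_url <tag> <asset>)"` followed by the tar command above does the same as the manual steps.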
Download llama-swap from the releases page:
Go to https://github.com/mostlygeek/llama-swap/releases and download the latest release.
Extract the downloaded archive to your chosen directory, e.g. /mnt/AI/:
tar -xzf llama-swap-linux.tar.gz -C /mnt/AI/
Make the binary executable:
chmod +x /mnt/AI/llama-swap/llama-swap
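Before wiring these paths into a config, it can save debugging time to confirm both binaries are actually executable; a quick sketch (the paths are the ones assumed throughout this tutorial):

```shell
#!/usr/bin/env bash

# Report whether a path exists and is executable.
check_exec() {
  if [ -x "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not executable: $1"
  fi
}

check_exec /bin/sh                           # prints: ok: /bin/sh
check_exec /mnt/AI/llama-b8117/llama-server  # tutorial paths; adjust to yours
check_exec /mnt/AI/llama-swap/llama-swap
```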
Create a config.yaml file in your llama-swap directory using the following example. Note that I use LM Studio to download models, and I point llama-swap at that model folder.
startPort: 9090  # first port assigned to spawned llama-server instances via ${PORT}; kept off 8080 so they don't collide with llama-swap's own listen port
models:
  "Qwen3-VL-30B-A3B-131k-Tnk-Heretic":
    ttl: 600
    cmd: |
      /mnt/AI/llama-b8117/llama-server
      -m "/mnt/AI/models/mradermacher/Qwen3-VL-30B-A3B-Thinking-Heretic-GGUF/Qwen3-VL-30B-A3B-Thinking-Heretic.Q8_0.gguf"
      --mmproj "/mnt/AI/models/mradermacher/Qwen3-VL-30B-A3B-Thinking-Heretic-GGUF/Qwen3-VL-30B-A3B-Thinking-Heretic.mmproj-f16.gguf"
      --host 0.0.0.0
      --port ${PORT}
      -ngl -1
      -t -1
      -c 131072
      --tensor-split 56,43
      --flash-attn on
      --temp 0.6
      --top-p 0.95
      --min-p 0.01
      --top-k 20
  "Qwen3-Coder-30B-A3B-256k-NoTnk":
    ttl: 600
    cmd: |
      /mnt/AI/llama-b8117/llama-server
      -m "/mnt/AI/models/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf"
      --host 0.0.0.0
      --port ${PORT}
      -ngl -1
      -c 128000
      --tensor-split 56,43
      --flash-attn auto
      --temp 0.7
      --top-p 0.8
      --top-k 20
      --repeat-penalty 1.05
  "Qwen3-Next-384E-Ablit-Instruct-262k":
    ttl: 1200
    cmd: |
      /mnt/AI/llama-b8117/llama-server
      -m "/mnt/AI/models/mradermacher/Qwen3-Next-384E-Abliterated-Instruct-i1-GGUF/Qwen3-Next-384E-Abliterated-Instruct.i1-IQ4_XS.gguf"
      --host 0.0.0.0
      --port ${PORT}
      -ngl -1
      -t -1
      -c 262144
      --tensor-split 56,43
      --flash-attn on
      --temp 0.6
      --top-p 0.95
      --min-p 0.01
      --top-k 20
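A quick way to sanity-check the file before starting anything is to list the model names it defines; this sketch (the function name is mine) assumes the quoted-key layout used above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the quoted model names (keys ending in a colon) from a llama-swap config.
list_models() {
  sed -n 's/^[[:space:]]*"\([^"]*\)":[[:space:]]*$/\1/p' "$1"
}

# Demo against a minimal config written to a temp file:
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
models:
  "Qwen3-Coder-30B-A3B-256k-NoTnk":
    ttl: 600
EOF
list_models "$cfg"   # prints: Qwen3-Coder-30B-A3B-256k-NoTnk
rm -f "$cfg"
```

Run it as `list_models /mnt/AI/llama-swap/config.yaml` once your real config is in place; the names printed are the IDs you will later see from the /v1/models endpoint.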
Important configuration note:
- The --mmproj option in the first model entry is usually required for vision models (like Qwen3-VL) to work properly. It points to the multimodal projection file (.mmproj) that enables the vision capabilities.
Create a shell script (start_llamaswap.sh) with the following content:
#!/usr/bin/env bash
# Launch llama-swap with its config file
/mnt/AI/llama-swap/llama-swap --config /mnt/AI/llama-swap/config.yaml --listen 0.0.0.0:8080 --watch-config
Make it executable:
chmod +x start_llamaswap.sh
Run the script:
./start_llamaswap.sh
Create a desktop entry file (llama-swap_Launcher.desktop) with the following content:
[Desktop Entry]
Type=Application
Name=Llama Swap
Exec=/mnt/AI/llama-swap/llama-swap --config /mnt/AI/llama-swap/config.yaml --listen 0.0.0.0:8080 --watch-config
Categories=Education;Development;AI;
Icon=
Comment=Run llama-swap server in terminal.
Path=
Terminal=true
StartupNotify=false
Make it executable:
chmod +x llama-swap_Launcher.desktop
Place it in ~/.local/share/applications/ to use it from your application menu.
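If you'd rather have llama-swap start automatically in the background, a systemd user unit is an alternative to the desktop entry. A sketch (the unit name and file location are my suggestion; the ExecStart line reuses the command from the script above), saved as ~/.config/systemd/user/llama-swap.service:

```ini
[Unit]
Description=llama-swap model proxy
After=network-online.target

[Service]
ExecStart=/mnt/AI/llama-swap/llama-swap --config /mnt/AI/llama-swap/config.yaml --listen 0.0.0.0:8080 --watch-config
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable it with `systemctl --user daemon-reload && systemctl --user enable --now llama-swap.service`.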
The server will start on http://localhost:8080/ by default. You can verify it's running by checking:
curl http://localhost:8080/v1/models
You should see a list of configured models.
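The response is an OpenAI-compatible model list; to pull out just the model IDs, something like the following works (a canned response stands in for curl here, so the sketch runs without a live server):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Canned /v1/models response in the OpenAI-compatible list format.
response='{"object":"list","data":[{"id":"Qwen3-Coder-30B-A3B-256k-NoTnk","object":"model"}]}'

# Print each model id; against a live server, pipe curl output in instead:
#   curl -s http://localhost:8080/v1/models | python3 -c '...'
echo "$response" | python3 -c '
import json, sys
for m in json.load(sys.stdin)["data"]:
    print(m["id"])
'
# prints: Qwen3-Coder-30B-A3B-256k-NoTnk
```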
Troubleshooting: if startup fails with a port conflict, change startPort in config.yaml or kill any existing processes on port 8080.