Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu
The 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each.
That is 12 GB of aggregate VRAM, but it is not a single 12 GB GPU. Treat it as two separate 6 GB pools that llama.cpp can use well when the Vulkan backend is configured correctly.
Mac Pro 6,1 D700 memory shape
          llama.cpp Vulkan backend
                      |
              split-mode: layer
                      |
       +--------------+--------------+
       |                             |
FirePro D700 0                FirePro D700 1
Tahiti / GCN 1.0              Tahiti / GCN 1.0
6 GB GDDR5                    6 GB GDDR5
The practical outcome is simple: the D700 machine can comfortably run the class of models that are annoying on a D300. Seven billion parameter Q4 models become realistic with useful context sizes. Thirteen billion parameter models are still a poor fit if you expect full GPU offload, because the Mac Pro's dual cards do not behave like one contiguous accelerator.
This guide is a D700-specific rewrite of Edward Chalupa's excellent D300 guide. The main flow is the same: Ubuntu, the amdgpu kernel driver, Mesa RADV, llama.cpp built with Vulkan, and a few settings that matter much more than they look.
Hardware target
Apple shipped three GPU tiers in the Mac Pro 6,1. The D700 is the top configuration: each card has 6 GB of GDDR5, 2048 stream processors, a 384-bit memory bus, and 264 GB/s of memory bandwidth.
| GPU | Architecture family | VRAM per card | Aggregate VRAM | Practical llama.cpp target |
|---|---|---|---|---|
| FirePro D300 | GCN 1.0 / Pitcairn-class | 2 GB | 4 GB | 3B and small 4B models |
| FirePro D500 | GCN 1.0 / Tahiti-class | 3 GB | 6 GB | 4B and some compact 7B quants |
| FirePro D700 | GCN 1.0 / Tahiti-class | 6 GB | 12 GB | 7B Q4/Q5, sometimes 8B Q4 |
The important difference is not raw TFLOPS. It is memory headroom. A 7B Q4_K_M GGUF is usually around 4.0-4.5 GB before runtime buffers and KV cache. On a D300 that is a non-starter. On a D700 pair, layer splitting gives the model enough room.
What fits
Use these as guide numbers. Exact memory depends on architecture, quantization, context size, batch settings, and llama.cpp version.
| Model class | Quant | Typical size | D700 verdict |
|---|---|---|---|
| 3B | Q8_0 | ~3.0-3.5 GB | Easy, but underuses the hardware |
| 7B | Q4_K_M | ~4.0-4.5 GB | Good default target |
| 7B | Q5_K_M | ~5.0-5.5 GB | Good with conservative context |
| 8B | Q4_K_M | ~4.5-5.0 GB | Usually workable |
| 13B | Q4_K_M | ~7.5-8.5 GB | Usually not worth it on this bus |
| 35B A3B | Q4_K_M | ~5.6-6 GB VRAM plus 21 GB system RAM | Works quite well |
The mistake is reading "12 GB VRAM" as "anything under 12 GB fits." It does not. llama.cpp can distribute layers across devices, but each card still has a 6 GB ceiling and the runtime needs additional memory for compute buffers and KV cache.
Why a 13B Q4 model is awkward
Model weights + buffers + KV cache
+----------------------------------+
| more than one D700 can hold well |
+----------------------------------+
Splitting helps with layers, but the old PCIe path and sync cost
make CPU/GPU mixed inference unattractive once full offload fails.
For this machine, optimize for models that fully offload. If the model does not fit with --n-gpu-layers 99, the fallback should usually be CPU-only, not partial offload.
The driver stack
The D700 is old GCN hardware. The old radeon kernel driver can drive displays, but it is the wrong foundation for Vulkan inference. You want this stack:
llama-server
|
| GGML Vulkan backend
v
Mesa RADV Vulkan driver
|
| userspace Vulkan implementation
v
Linux amdgpu kernel driver
|
v
Dual FirePro D700 GPUs
Mesa documents RADV as the Vulkan driver for AMD GCN/RDNA GPUs, with the caveat that GCN 1-2 hardware may need amdgpu explicitly enabled instead of radeon. Ubuntu 24.04 often does the right thing on this Mac Pro, but you should verify rather than assume.
Step 1: verify both GPUs use amdgpu
Start with PCI detection:
lspci -nnk | grep -A3 -E "VGA|Display|FirePro|AMD"
You want both D700 devices to report:
Kernel driver in use: amdgpu
If either card is bound to radeon, add the Southern Islands amdgpu flags:
sudoedit /etc/default/grub
Set or extend GRUB_CMDLINE_LINUX_DEFAULT:
radeon.si_support=0 amdgpu.si_support=1
Then update GRUB and reboot:
sudo update-grub
sudo reboot
After reboot, check again. Do not continue until both cards are on amdgpu.
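If you would rather script this check than read lspci output by eye, the same information is available in sysfs. The following is a minimal sketch, assuming the standard Linux PCI sysfs layout; the function names and the SYSFS_PCI override are illustrative, not part of any tool.

```shell
# List every PCI display-class device and the kernel driver bound to it.
# SYSFS_PCI is a hypothetical override so the function can be tried against
# a fake directory tree; on a real system leave it at the default.
gpu_drivers() {
  local root="${SYSFS_PCI:-/sys/bus/pci/devices}" dev class drv
  for dev in "$root"/*; do
    [ -r "$dev/class" ] || continue
    class=$(cat "$dev/class")
    # PCI class 0x03xxxx means "display controller".
    case "$class" in
      0x03*) ;;
      *) continue ;;
    esac
    if [ -e "$dev/driver" ]; then
      drv=$(basename "$(readlink -f "$dev/driver")")
    else
      drv="(none)"
    fi
    printf '%s %s\n' "$(basename "$dev")" "$drv"
  done
}

# Succeeds only if at least one display device exists and all are on amdgpu.
all_on_amdgpu() {
  local out
  out=$(gpu_drivers)
  [ -n "$out" ] && ! printf '%s\n' "$out" | grep -qv ' amdgpu$'
}
```

Calling `gpu_drivers` prints one line per GPU, and `all_on_amdgpu && echo ok` gives a yes/no answer that is easy to use in a provisioning script.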
Step 2: install and test Vulkan
Install the Vulkan userspace pieces and the headers llama.cpp needs during build:
sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  curl \
  git \
  glslc \
  libcurl4-openssl-dev \
  libvulkan-dev \
  mesa-vulkan-drivers \
  spirv-headers \
  vulkan-tools
Now check what Vulkan sees:
vulkaninfo --summary
For a working D700 setup you should see two RADV devices. They may be labelled as RADV TAHITI, AMD FirePro D700, or similar depending on Mesa and kernel versions.
Expected shape, not exact text:
Devices:
GPU0: RADV TAHITI / AMD FirePro D700
GPU1: RADV TAHITI / AMD FirePro D700
If vulkaninfo sees one card, fix that before building llama.cpp. llama.cpp can only use devices exposed by the Vulkan loader.
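The "two devices" check can also be automated. This is a small sketch, assuming the summary prints one deviceType line per device, as current vulkan-tools builds do; the function name is made up for illustration. It counts only discrete GPUs, so a software rasterizer such as llvmpipe does not inflate the number.

```shell
# Count discrete GPUs in `vulkaninfo --summary` output read from stdin.
# Software devices report PHYSICAL_DEVICE_TYPE_CPU and are not counted.
count_discrete_gpus() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep the 0.
  grep -c 'PHYSICAL_DEVICE_TYPE_DISCRETE_GPU' || true
}
```

On a working D700 machine, `vulkaninfo --summary | count_discrete_gpus` should print 2.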
Step 3: install llama.cpp with Vulkan
You have two good options here. Start with the prebuilt Vulkan release unless you specifically need a local patch, a known commit, or a custom compiler setup.
Option A: download the prebuilt Vulkan binary
llama.cpp publishes release builds on GitHub, including an Ubuntu x64 Vulkan package. Download the latest one from the releases page:
https://github.com/ggml-org/llama.cpp/releases
Look for:
Linux -> Ubuntu x64 (Vulkan)
On the machine itself, you can fetch the newest Ubuntu x64 Vulkan tarball with the GitHub API:
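# /opt is root-owned, so create the directory with sudo and hand it to your user
sudo mkdir -p /opt/llama.cpp
sudo chown "$USER": /opt/llama.cpp
cd /opt/llama.cpp
# Ask the GitHub API for the latest release and pick the Ubuntu x64 Vulkan asset
release_url=$(
  curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest |
    grep "browser_download_url" |
    grep "ubuntu-vulkan-x64.tar.gz" |
    cut -d '"' -f 4
)
curl -fL "$release_url" -o llama-vulkan.tar.gz
tar -xzf llama-vulkan.tar.gz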
The extracted archive contains the runnable binaries. Depending on the release layout, they may be directly under the extracted directory rather than under build/bin. Confirm where llama-server landed:
find /opt/llama.cpp -type f -name "llama-server" -print
Use that path in the systemd unit below. If it prints /opt/llama.cpp/build/bin/llama-server, the later examples can be used unchanged.
Option B: build from source
Build from source when you want a specific commit or want to prove exactly which backend options are compiled in:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_VULKAN=ON \
-DLLAMA_CURL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
Confirm the binary can see backend devices:
./build/bin/llama-server --list-devices
If your llama.cpp build is older and does not expose --list-devices, use a short llama-cli smoke test and read the startup log for ggml_vulkan.
Step 4: run for full offload
The default D700 command should be something like:
GGML_VK_VISIBLE_DEVICES=0,1 \
RADV_PERFTEST=aco,gpl \
./build/bin/llama-server \
--model /models/qwen2.5-7b-instruct-q4_k_m.gguf \
--n-gpu-layers 99 \
--split-mode layer \
--threads 2 \
--parallel 1 \
--host 0.0.0.0 \
--port 8088
One thing worth noting if you are new to llama.cpp is the --model option. If you omit it, llama-server now starts in router mode: it makes the models you have locally available, and the first time you select one in the web UI it loads that model into memory and gets it ready. A CLI harness such as Pi, however, does not know to tell the server to unload the current model when you switch to a new one, which will probably crash the server. To avoid that, add the --models-max 1 option.
The settings that look optional but are not:
| Setting | Why it matters |
|---|---|
| GGML_VK_VISIBLE_DEVICES=0,1 | Keeps both D700s visible to llama.cpp |
| --split-mode layer | Lets llama.cpp distribute transformer layers across the two GPUs |
| --threads 2 | Avoids wasting CPU on sync-heavy Vulkan submission |
| RADV_PERFTEST=aco,gpl | Uses RADV's faster shader compiler and pipeline path |
Do not blindly set --threads to the number of Xeon threads. Once all layers are on the GPUs, extra CPU threads mostly wait on Vulkan synchronization. On this machine, high thread counts can make the desktop feel broken without improving tokens per second.
Step 5: make it a service
Create a dedicated model directory and service user if you want this machine to be an always-on endpoint. Then create:
sudoedit /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp Vulkan inference server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=llama
WorkingDirectory=/opt/llama.cpp
Environment=GGML_VK_VISIBLE_DEVICES=0,1
Environment=RADV_PERFTEST=aco,gpl
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    --model /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \
    --n-gpu-layers 99 \
    --split-mode layer \
    --threads 2 \
    --parallel 1 \
    --host 0.0.0.0 \
    --port 8080
Restart=on-failure
[Install]
WantedBy=multi-user.target
Remember to omit the --model option if you want the service to run in router mode.
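The unit refers to a llama user and a /srv/models directory, and neither exists on a fresh install. Here is one possible way to create them, as a sketch rather than a required procedure. The run helper and its DRY_RUN switch are illustrative additions. The video and render groups are an assumption based on how Ubuntu normally assigns ownership of the /dev/dri nodes.

```shell
# Print commands instead of running them when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the service account and model directory the unit refers to.
setup_llama_user() {
  # video/render membership is what normally grants access to /dev/dri nodes.
  run sudo useradd --system --no-create-home \
    --shell /usr/sbin/nologin --user-group --groups video,render llama
  run sudo mkdir -p /srv/models
  run sudo chown llama:llama /srv/models
}
```

Running `DRY_RUN=1 setup_llama_user` first shows exactly what would be executed. Without the group membership, a service account can typically see no Vulkan devices at all even though your login user can.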
Enable it:
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
sudo systemctl status llama-server
Check the HTTP endpoint:
curl http://localhost:8080/health
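A single curl can fail simply because the model is still loading. The sketch below polls the same endpoint until it answers; the function name, retry count, and delay are arbitrary choices, and it relies on llama-server's documented behaviour of returning a 503 from /health while loading.

```shell
# Poll the server's /health endpoint until it answers successfully.
# Usage: wait_for_health [URL] [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  local url="${1:-http://localhost:8080/health}"
  local attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as a 503 during load.
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

With the defaults, `wait_for_health` waits up to about a minute, which is a guess at a reasonable load time for a 7B model on this hardware rather than a measured figure.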
Then confirm VRAM is actually being used on both cards:
for card in /sys/class/drm/card*/device/mem_info_vram_used; do
printf "%s: " "$card"
awk '{ printf "%.1f MiB\n", $1 / 1024 / 1024 }' "$card"
done
The exact numbers depend on the model, but both D700s should move substantially above idle after the model loads.
Cooling matters
The Mac Pro 6,1 has one thermal core and one fan. That design is elegant until both GPUs sit under sustained compute load. Install macfanctld and make the fan curve less timid:
sudo apt install -y macfanctld
sudoedit /etc/macfanctl.conf
A reasonable starting point:
fan_min: 1200
temp_avg_floor: 45
temp_avg_ceiling: 58
log_level: 1
Restart and watch the log:
sudo systemctl restart macfanctld
sudo tail -f /var/log/macfanctl.log
Under sustained inference, you want stable temperatures, not silence. The D700s have more memory headroom than the D300s, but they also put more heat into the same small chassis.
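To see whether the fan settings are doing their job, you can read GPU temperatures directly. This sketch assumes amdgpu exposes the usual hwmon temp1_input file for each card, which is worth verifying on your kernel. The DRM_ROOT override and the function name are illustrative.

```shell
# Print each GPU's temperature from the hwmon files amdgpu normally exposes.
# DRM_ROOT is a hypothetical override for trying the function on a fake tree.
gpu_temps() {
  local root="${DRM_ROOT:-/sys/class/drm}" f card
  for f in "$root"/card*/device/hwmon/hwmon*/temp1_input; do
    [ -r "$f" ] || continue
    card=${f#"$root"/}     # strip the root prefix
    card=${card%%/*}       # keep only "cardN"
    # hwmon reports millidegrees Celsius.
    awk -v card="$card" '{ printf "%s: %.1f C\n", card, $1 / 1000 }' "$f"
  done
}
```

Running `while sleep 5; do gpu_temps; done` in a second terminal during a long generation shows whether temperatures settle or keep climbing.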
Things to avoid
Flash attention
Do not assume --flash-attn helps. GCN 1.0 predates the FP16 throughput assumptions that make flash attention compelling on modern hardware. Test it if you want, but make the default "off" until benchmarks prove otherwise.
# Baseline first
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2
# Only then compare
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn 1
Partial offload
Avoid half-on-GPU, half-on-CPU configurations for models that exceed VRAM:
# Prefer this when it fits
--n-gpu-layers 99
# Prefer this when it does not fit
--n-gpu-layers 0
# Be suspicious of this on the Mac Pro 6,1
--n-gpu-layers 20
The D700 cards are connected through an old workstation design, not a modern high-bandwidth multi-GPU fabric. Once inference has to bounce across CPU and GPU layers, the bus and synchronization overhead can erase the benefit of acceleration.
Giant context windows
The D700 memory budget looks generous until you increase context. KV cache grows with context size, layer count, embedding size, and cache precision.
VRAM pressure = model weights + compute buffers + KV cache
KV cache roughly grows with:
context length x number of layers x hidden size x cache precision
Start at --ctx-size 4096. Move to 8192 only after watching VRAM on both cards during real prompts. Alternatively, omit the option and let llama.cpp decide: it will pick the largest context that fits in the VRAM left over after loading the model.
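To put numbers on the KV cache growth described above, here is a back-of-the-envelope calculation. For models with grouped-query attention, the width that matters is KV heads times head dimension rather than the full hidden size. The shape used here (28 layers, 4 KV heads, head dimension 128) is an assumption for a Qwen2.5-7B-style model, so check your model's metadata before relying on it.

```shell
# Estimate KV cache size in MiB.
# bytes = 2 (K and V) x layers x kv_heads x head_dim x context x bytes_per_value
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT [BYTES_PER_VALUE]
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=${5:-2}
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Assumed shape: 28 layers, 4 KV heads, head_dim 128, f16 cache (2 bytes).
kv_cache_mib 28 4 128 4096    # prints 224
kv_cache_mib 28 4 128 8192    # prints 448
```

Doubling the context doubles the cache, and that memory comes on top of weights and compute buffers. A model without grouped-query attention would need several times more for the same context.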
Benchmarking
Stop the service before benchmarking:
sudo systemctl stop llama-server
Confirm the cards are back near idle:
cat /sys/class/drm/card*/device/mem_info_vram_used
Then benchmark one variable at a time:
GGML_VK_VISIBLE_DEVICES=0,1 RADV_PERFTEST=aco,gpl \
./build/bin/llama-bench \
  -m /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  -t 2 \
  -p 512 \
  -n 128
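Example output from one run. This run used a 9B Q4_K_M model, not the 7B file named in the command above:
load_backend: loaded RPC backend from /home/altitudelabs/llama-b9305/libggml-rpc.so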
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/altitudelabs/llama-b9305/libggml-vulkan.so
load_backend: loaded CPU backend from /home/altitudelabs/llama-b9305/libggml-cpu-ivybridge.so
Downloading Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf ───────────────────── 100%
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.37 GiB | 9.20 B | Vulkan | 99 | 2 | pp512 | 40.11 ± 0.30 |
| qwen35 9B Q4_K - Medium | 5.37 GiB | 9.20 B | Vulkan | 99 | 2 | tg128 | 18.85 ± 0.02 |
Record:
| Run | Model | Quant | Context | Threads | Flash attention | Decode tok/s |
|---|---|---|---|---|---|---|
| 1 | Qwopus3.5-9B-Coder-MTP | Q4_K_M | 4096 | 2 | off | 18.85 |
| 2 | Qwen3.5-9B-MTP | Q4_K_XL | 4096 | 2 | off | 9.17 |
| 3 | Qwen3.5-9B-MTP | Q4_K_M | 4096 | 2 | off | 19.04 |
| 4 | Qwen2.5-Coder-7B-Instruct | Q4_K_M | 4096 | 2 | off | 21.39 |
| 5 | Qwen3.6-35B-A3B-MTP-GGUF | Q4_K_M | 4096 | 2 | off | 7.04 |
Do not compare llama-bench directly to llama-server under real API traffic. The server has slot management, sampling, tokenization, and HTTP overhead. Use bench numbers to compare configurations, not to predict production throughput.
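If you save llama-bench output to a file, the table rows can be reduced to the figures worth recording. This is a sketch that assumes the markdown layout shown in the sample output, with the model in the first column and the test name and t/s in the last two; the function name is made up. llama-bench also has an -o option for machine-readable output, which may suit you better.

```shell
# Reduce llama-bench's markdown table (on stdin) to "model,test,mean t/s".
bench_summary() {
  awk -F'|' '
    # Table rows start with "|"; data rows have digits in the t/s column.
    $0 ~ /^[|]/ && NF >= 4 && $(NF-1) ~ /[0-9]/ {
      model = $2
      test = $(NF-2)
      split($(NF-1), tps, " ")      # "18.85 ± 0.02" -> tps[1] = 18.85
      gsub(/^ +| +$/, "", model)
      gsub(/^ +| +$/, "", test)
      printf "%s,%s,%s\n", model, test, tps[1]
    }'
}
```

Used as `bench_summary < bench.log`, it skips the ggml_vulkan device lines and the header, leaving one line per test.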
The use case for these machines
The D700 Mac Pro is not a cheap alternative to an H100, and it is not a modern gaming GPU box (although it can actually run very well now that Vulkan is enabled). It is still useful, though, despite being a bit power-hungry compared to modern options:
| Use case | Fit |
|---|---|
| Local coding assistant fallback | Good with a 7B Q4/Q5 model |
| Private summarization endpoint | Good with conservative context |
| Multi-user chat service | Poor |
| 13B+ experimentation | CPU-only or use newer hardware |
| Always-on home lab inference | Good if power cost is acceptable |
The point of the D700 is not that it wins benchmarks. It is that a sunk-cost workstation can still be a reliable local inference endpoint when the model is sized correctly and the Vulkan path is configured well.
One thing worth thinking about, however, is running cost. These old machines can draw 250-300 W under full load, so if you run inference on them full time, a Codex or Claude subscription might actually be cheaper. Do the math and do what is best for you.
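That math is short enough to script. The price per kWh below is a placeholder, not a real tariff, and the function name is illustrative; substitute your own rate and duty cycle.

```shell
# Monthly electricity cost for a machine drawing a steady load.
# Usage: monthly_cost WATTS HOURS_PER_DAY PRICE_PER_KWH [DAYS]
monthly_cost() {
  awk -v w="$1" -v h="$2" -v p="$3" -v d="${4:-30}" \
    'BEGIN { printf "%.2f\n", w / 1000 * h * d * p }'
}

# 300 W around the clock at a hypothetical 0.30 per kWh:
monthly_cost 300 24 0.30    # prints 64.80
```

At eight hours a day and 250 W, the same rate gives 18.00 per month, so the comparison with a subscription depends heavily on how many hours the machine actually spends under load.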