llama.cpp, local AI, Mac Pro, FirePro D700, Ubuntu, Vulkan, AMD

Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu

A D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old GCN hardware look slower than it is.

May 26, 2026 · 12 min read

The 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each.

Mac Pro 6,1 D700 memory shape

             llama.cpp Vulkan backend
                       |
              split-mode: layer
                       |
        +--------------+--------------+
        |                             |
  FirePro D700 0                 FirePro D700 1
  Tahiti / GCN 1.0               Tahiti / GCN 1.0
  6 GB GDDR5                     6 GB GDDR5

GPU	Architecture family	VRAM per card	Aggregate VRAM	Practical llama.cpp target
FirePro D300	GCN 1.0 / Pitcairn-class	2 GB	4 GB	3B and small 4B models
FirePro D500	GCN 1.0 / Tahiti-class	3 GB	6 GB	4B and some compact 7B quants
FirePro D700	GCN 1.0 / Tahiti-class	6 GB	12 GB	7B Q4/Q5, sometimes 8B Q4

Model class	Quant	Typical size	D700 verdict
3B	Q8_0	~3.0-3.5 GB	Easy, but underuses the hardware
7B	Q4_K_M	~4.0-4.5 GB	Good default target
7B	Q5_K_M	~5.0-5.5 GB	Good with conservative context
8B	Q4_K_M	~4.5-5.0 GB	Usually workable
13B	Q4_K_M	~7.5-8.5 GB	Usually not worth it on this bus
35B A3B	Q4_K_M	~5.6-6 GB VRAM plus ~21 GB system RAM	Works quite well

Why a 13B Q4 model is awkward

  Model weights + buffers + KV cache
  +----------------------------------+
  | more than one D700 can hold well |
  +----------------------------------+

  Splitting helps with layers, but the old PCIe path and sync cost
  make CPU/GPU mixed inference unattractive once full offload fails.
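
The sizing table and the diagram above come down to one comparison: model weights plus working overhead, measured against 6 GB for one card or 12 GB for both. A rough sketch of that check follows. The function name and the 1 GB default allowance for compute buffers and KV cache are assumptions for illustration, not measured values.

```bash
# Rough VRAM fit check for a dual-D700 Mac Pro (6 GB per card, 12 GB total).
# classify_vram_fit and the 1.0 GB default overhead are illustrative assumptions.
classify_vram_fit() {
  model_gb="$1"
  overhead_gb="${2:-1.0}"
  awk -v m="$model_gb" -v o="$overhead_gb" 'BEGIN {
    need = m + o
    if (need <= 6)       print "single-card"   # fits on one D700
    else if (need <= 12) print "needs-split"   # must span both cards
    else                 print "too-large"     # exceeds aggregate VRAM
  }'
}
```

With the default allowance, `classify_vram_fit 4.5` reports `single-card` for a typical 7B Q4_K_M file, while `classify_vram_fit 8.0` reports `needs-split`, which is the awkward 13B case.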

llama-server
  |
  |  GGML Vulkan backend
  v
Mesa RADV Vulkan driver
  |
  |  userspace Vulkan implementation
  v
Linux amdgpu kernel driver
  |
  v
Dual FirePro D700 GPUs

lspci -nnk | grep -A3 -E "VGA|Display|FirePro|AMD"

Kernel driver in use: amdgpu

sudoedit /etc/default/grub

radeon.si_support=0 amdgpu.si_support=1

sudo update-grub
sudo reboot
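
After the reboot it is worth confirming that both parameters actually reached the running kernel. A minimal sketch, assuming the parameters were added to the GRUB command line; the helper name `check_si_cmdline` is hypothetical.

```bash
# Check a kernel command line for the two SI driver parameters.
# With no argument it reads the running kernel's /proc/cmdline.
check_si_cmdline() {
  cmdline="${1:-$(cat /proc/cmdline)}"
  for param in radeon.si_support=0 amdgpu.si_support=1; do
    case " $cmdline " in
      *" $param "*) ;;                          # parameter present
      *) echo "missing $param"; return 1 ;;     # report the first one not found
    esac
  done
  echo "ok"
}
```

Running `check_si_cmdline` with no argument inspects the live system; passing a string lets you test a candidate command line before editing GRUB.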

sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  curl \
  git \
  glslc \
  libcurl4-openssl-dev \
  libvulkan-dev \
  mesa-vulkan-drivers \
  spirv-headers \
  vulkan-tools

vulkaninfo --summary

Expected shape, not exact text:

Devices:
  GPU0: RADV TAHITI / AMD FirePro D700
  GPU1: RADV TAHITI / AMD FirePro D700
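
To turn that visual check into something scriptable, the summary can be filtered for Tahiti device names. This is a sketch that assumes the summary prints one `deviceName` line per device and that RADV includes `TAHITI` in the name; the helper name is hypothetical.

```bash
# Count Vulkan devices whose name mentions TAHITI in `vulkaninfo --summary` output.
# Reads the summary text on stdin so it can be exercised without a GPU.
count_tahiti_devices() {
  # grep -c exits non-zero when the count is 0, so `|| true` keeps the status clean.
  grep -i 'deviceName' | grep -ci 'tahiti' || true
}
```

On a working dual-D700 setup, `vulkaninfo --summary | count_tahiti_devices` should print 2. A software rasterizer such as llvmpipe may also be listed, but it is not counted.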

https://github.com/ggml-org/llama.cpp/releases

Linux -> Ubuntu x64 (Vulkan)

sudo mkdir -p /opt/llama.cpp
sudo chown "$USER" /opt/llama.cpp
cd /opt/llama.cpp

release_url=$(
  curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest |
    grep "browser_download_url" |
    grep "ubuntu-vulkan-x64.tar.gz" |
    cut -d '"' -f 4
)

curl -L "$release_url" -o llama-vulkan.tar.gz
tar -xzf llama-vulkan.tar.gz

find /opt/llama.cpp -type f -name "llama-server" -print
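
The download snippet above silently produces an empty or multi-line URL if the release asset names change. One way to make that failure visible is to wrap the filter in a helper that returns only the first match. The helper name is hypothetical; the asset-name pattern is the one used above.

```bash
# Extract the first Ubuntu Vulkan tarball URL from GitHub release JSON on stdin.
# first_vulkan_asset_url is a hypothetical helper, not part of llama.cpp.
first_vulkan_asset_url() {
  grep '"browser_download_url"' |
    grep 'ubuntu-vulkan-x64.tar.gz' |
    head -n 1 |
    cut -d '"' -f 4
}

# Usage sketch:
#   release_url=$(curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest | first_vulkan_asset_url)
#   [ -n "$release_url" ] || echo "no matching release asset found" >&2
```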

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build \
  -DGGML_VULKAN=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j"$(nproc)"

./build/bin/llama-server --list-devices

GGML_VK_VISIBLE_DEVICES=0,1 \
RADV_PERFTEST=aco,gpl \
./build/bin/llama-server \
  --model /models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --threads 2 \
  --parallel 1 \
  --host 0.0.0.0 \
  --port 8088

Setting	Why it matters
`GGML_VK_VISIBLE_DEVICES=0,1`	Keeps both D700s visible to llama.cpp
`--split-mode layer`	Lets llama.cpp distribute transformer layers across the two GPUs
`--threads 2`	Avoids wasting CPU on sync-heavy Vulkan submission
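`RADV_PERFTEST=aco,gpl`	Requests RADV's ACO shader compiler and graphics pipeline library path; recent Mesa releases already default to ACO, so this mostly matters on older Mesa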

sudoedit /etc/systemd/system/llama-server.service

[Unit]
Description=llama.cpp Vulkan inference server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=llama
WorkingDirectory=/opt/llama.cpp
Environment=GGML_VK_VISIBLE_DEVICES=0,1
Environment=RADV_PERFTEST=aco,gpl
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --threads 2 \
  --parallel 1 \
  --host 0.0.0.0 \
  --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
sudo systemctl status llama-server

curl http://localhost:8080/health
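
A single health request right after starting the service can fail simply because the model is still loading. A small polling loop is one way to handle that. The helper name, retry count, and delay are illustrative assumptions.

```bash
# Poll llama-server's /health endpoint until it answers successfully.
# wait_for_health, the 30 tries, and the 2 s delay are illustrative assumptions.
wait_for_health() {
  url="${1:-http://localhost:8080/health}"
  tries="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP error codes, not just connection errors.
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $tries attempts"
  return 1
}
```

Calling `wait_for_health` with no arguments checks the service port used above; pass a different URL if you run the server on another port.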

for card in /sys/class/drm/card*/device/mem_info_vram_used; do
  printf "%s: " "$card"
  awk '{ printf "%.1f MiB\n", $1 / 1024 / 1024 }' "$card"
done

sudo apt install -y macfanctld
sudoedit /etc/macfanctl.conf

fan_min: 1200
temp_avg_floor: 45
temp_avg_ceiling: 58
log_level: 1

sudo systemctl restart macfanctld
sudo tail -f /var/log/macfanctl.log
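
To see whether the fan settings are keeping up, it helps to read the GPU temperatures alongside the fan log. The sketch below converts hwmon readings, which are reported in millidegrees Celsius, into degrees. The helper name is hypothetical, and the sysfs path in the usage line is an assumption about where amdgpu exposes its sensor on this machine.

```bash
# Print temperatures from hwmon temp*_input files (millidegrees Celsius).
# print_gpu_temps is a hypothetical helper; unreadable paths are skipped.
print_gpu_temps() {
  for f in "$@"; do
    [ -r "$f" ] || continue
    awk -v p="$f" '{ printf "%s: %.1f C\n", p, $1 / 1000 }' "$f"
  done
}

# Usage sketch (path is an assumption, adjust to what your system exposes):
#   print_gpu_temps /sys/class/drm/card*/device/hwmon/hwmon*/temp1_input
```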

# Baseline first
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2

# Only then compare
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn 1

# Prefer this when it fits
--n-gpu-layers 99

# Prefer this when it does not fit
--n-gpu-layers 0

# Be suspicious of this on the Mac Pro 6,1
--n-gpu-layers 20

VRAM pressure = model weights + compute buffers + KV cache

KV cache roughly grows with:
  context length x number of layers x hidden size x cache precision
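
The growth rule above can be made concrete with a small calculator. This sketch uses the common per-token formula of 2 (keys and values) x layers x KV heads x head dimension x bytes per element. For models without grouped-query attention, KV heads x head dimension equals the hidden size, which matches the rule as stated. The helper name is hypothetical, and the example dimensions are assumed values for a 7B-class model, so read the real ones from your model's metadata.

```bash
# Estimate KV cache size in MiB.
# Formula: 2 (K and V) x layers x context x KV heads x head dim x bytes per element.
# kv_cache_mib is a hypothetical helper; bytes_per_elem defaults to 2 (f16 cache).
kv_cache_mib() {
  n_ctx="$1"; n_layers="$2"; n_kv_heads="$3"; head_dim="$4"; bytes_per_elem="${5:-2}"
  awk -v c="$n_ctx" -v l="$n_layers" -v h="$n_kv_heads" -v d="$head_dim" -v b="$bytes_per_elem" \
    'BEGIN { printf "%.1f\n", 2 * l * c * h * d * b / 1024 / 1024 }'
}
```

With 28 layers, 4 KV heads, and a head dimension of 128, `kv_cache_mib 4096 28 4 128` gives 224.0 MiB, while `kv_cache_mib 32768 28 4 128` gives 1792.0 MiB. The eightfold jump is why large context windows eat into a 6 GB card so quickly.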

sudo systemctl stop llama-server

cat /sys/class/drm/card*/device/mem_info_vram_used

GGML_VK_VISIBLE_DEVICES=0,1 RADV_PERFTEST=aco,gpl \
./build/bin/llama-bench \
  -m /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  -t 2 \
  -c 4096

load_backend: loaded RPC backend from /home/altitudelabs/llama-b9305/libggml-rpc.so
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/altitudelabs/llama-b9305/libggml-vulkan.so
load_backend: loaded CPU backend from /home/altitudelabs/llama-b9305/libggml-cpu-ivybridge.so
Downloading Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf ───────────────────── 100%
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.37 GiB |     9.20 B | Vulkan     |  99 |       2 |           pp512 |         40.11 ± 0.30 |
| qwen35 9B Q4_K - Medium        |   5.37 GiB |     9.20 B | Vulkan     |  99 |       2 |           tg128 |         18.85 ± 0.02 |
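
When comparing several runs, copying numbers out of the table by hand gets tedious. The sketch below pulls the decode throughput from llama-bench's markdown output. It assumes `test` and `t/s` are the last two columns, as in the output above; the helper name is hypothetical.

```bash
# Pull decode (tg) throughput out of llama-bench's markdown table on stdin.
# Assumes "test" and "t/s" are the last two columns of each result row.
bench_decode_tps() {
  awk -F'|' 'NF > 3 && $(NF-2) ~ /tg[0-9]+/ {
    split($(NF-1), v, " ")   # "18.85 ± 0.02" -> v[1] is the mean
    print v[1]
  }'
}
```

Piping the benchmark through it, for example `./build/bin/llama-bench ... | bench_decode_tps`, prints one mean tokens-per-second value per decode test row.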

Run	Model	Quant	Context	Threads	Flash attention	Decode tok/s
1	Qwopus3.5-9B-Coder-MTP	Q4_K_M	4096	2	off	18.85
2	Qwen3.5-9B-MTP	Q4_K_XL	4096	2	off	9.17
3	Qwen3.5-9B-MTP	Q4_K_M	4096	2	off	19.04
4	Qwen2.5-Coder-7B-Instruct	Q4_K_M	4096	2	off	21.39
5	Qwen3.6-35B-A3B-MTP-GGUF	Q4_K_M	4096	2	off	7.04

Use case	Fit
Local coding assistant fallback	Good with a 7B Q4/Q5 model
Private summarization endpoint	Good with conservative context
Multi-user chat service	Poor
13B+ experimentation	CPU-only or use newer hardware
Always-on home lab inference	Good if power cost is acceptable

Hardware target

What fits

The driver stack

Step 1: verify both GPUs use amdgpu

Step 2: install and test Vulkan

Step 3: install llama.cpp with Vulkan

Option A: download the prebuilt Vulkan binary

Option B: build from source

Step 4: run for full offload

Step 5: make it a service

Cooling matters

Things to avoid

Flash attention

Partial offload

Giant context windows

Benchmarking

The use case for these machines

References

Matthew Gribben