Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu
A D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old GCN hardware look slower than it is.
May 26, 202612 min read
Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu
The 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each.
That is 12 GB of aggregate VRAM, but it is not a single 12 GB GPU. Treat it as two separate 6 GB pools that llama.cpp can use well when the Vulkan backend is configured correctly.
The practical outcome is simple: the D700 machine can comfortably run the class of models that are annoying on a D300. Seven billion parameter Q4 models become realistic with useful context sizes. Thirteen billion parameter models are still a poor fit if you expect full GPU offload, because the Mac Pro's dual cards do not behave like one contiguous accelerator.
This guide is a D700-specific rewrite of Edward Chalupa's excellent D300 guide. The main flow is the same: Ubuntu, the amdgpu kernel driver, Mesa RADV, llama.cpp built with Vulkan, and a few settings that matter much more than they look.
Hardware target
Apple shipped three GPU tiers in the Mac Pro 6,1. The D700 is the top configuration: each card has 6 GB of GDDR5, 2048 stream processors, a 384-bit memory bus, and 264 GB/s of memory bandwidth.
GPU
Architecture family
VRAM per card
Aggregate VRAM
Practical llama.cpp target
FirePro D300
GCN 1.0 / Pitcairn-class
2 GB
4 GB
3B and small 4B models
FirePro D500
GCN 1.0 / Tahiti-class
3 GB
6 GB
4B and some compact 7B quants
FirePro D700
GCN 1.0 / Tahiti-class
6 GB
12 GB
7B Q4/Q5, sometimes 8B Q4
The important difference is not raw TFLOPS. It is memory headroom. A 7B Q4_K_M GGUF is usually around 4.0-4.5 GB before runtime buffers and KV cache. On a D300 that is a non-starter. On a D700 pair, layer splitting gives the model enough room.
What fits
Use these as planning numbers, not promises. Exact memory depends on architecture, quantization, context size, batch settings, and llama.cpp version.
Model class
Quant
Typical GGUF size
D700 verdict
3B
Q8_0
~3.0-3.5 GB
Easy, but underuses the hardware
7B
Q4_K_M
~4.0-4.5 GB
Good default target
7B
Q5_K_M
~5.0-5.5 GB
Good with conservative context
8B
Q4_K_M
~4.5-5.0 GB
Usually workable
13B
Q4_K_M
~7.5-8.5 GB
Usually not worth it on this bus
The trap is reading "12 GB VRAM" as "anything under 12 GB fits." It does not. llama.cpp can distribute layers across devices, but each card still has a 6 GB ceiling and the runtime needs additional memory for compute buffers and KV cache.
Why a 13B Q4 model is awkward
Model weights + buffers + KV cache
+----------------------------------+
| more than one D700 can hold well |
+----------------------------------+
Splitting helps with layers, but the old PCIe path and sync cost
make CPU/GPU mixed inference unattractive once full offload fails.
For this machine, optimize for models that fully offload. If the model does not fit with --n-gpu-layers 99, the fallback should usually be CPU-only, not partial offload.
The driver stack
The D700 is old GCN hardware. The old radeon kernel driver can drive displays, but it is the wrong foundation for Vulkan inference. You want this stack:
llama-server
|
| GGML Vulkan backend
v
Mesa RADV Vulkan driver
|
| userspace Vulkan implementation
v
Linux amdgpu kernel driver
|
v
Dual FirePro D700 GPUs
Mesa documents RADV as the Vulkan driver for AMD GCN/RDNA GPUs, with the caveat that GCN 1-2 hardware may need amdgpu explicitly enabled instead of radeon. Ubuntu 24.04 often does the right thing on this Mac Pro, but you should verify rather than assume.
For a working D700 setup you should see two RADV devices. They may be labelled as RADV TAHITI, AMD FirePro D700, or similar depending on Mesa and kernel versions.
If vulkaninfo sees one card, fix that before building llama.cpp. llama.cpp can only use devices exposed by the Vulkan loader.
Step 3: install llama.cpp with Vulkan
You have two good options here. Start with the prebuilt Vulkan release unless you specifically need a local patch, a known commit, or a custom compiler setup.
Option A: download the prebuilt Vulkan binary
llama.cpp publishes release builds on GitHub, including an Ubuntu x64 Vulkan package. Download the latest one from the releases page:
https://github.com/ggml-org/llama.cpp/releases
Look for:
Linux -> Ubuntu x64 (Vulkan)
On the machine itself, you can fetch the newest Ubuntu x64 Vulkan tarball with the GitHub API:
The extracted archive contains the runnable binaries. Depending on the release layout, they may be directly under the extracted directory rather than under build/bin. Confirm where llama-server landed:
find /opt/llama.cpp -type f -name "llama-server" -print
Use that path in the systemd unit below. If it prints /opt/llama.cpp/build/bin/llama-server, the later examples can be used unchanged.
Option B: build from source
Build from source when you want a specific commit or want to prove exactly which backend options are compiled in:
One thing worth noting if you are new to llama cpp is the --model option. If you omit this then it'll now start in router mode where it attempts to make available any models you have locally, when you first try to use one via the web ui it'll load it into memory and get it ready. However, if you are using a CLI harness like Pi, this doesn't know to tell the server to unload the model when you switch to a new one and will probably crash the server. To avoid that you can add the --models-max 1
The two settings that look optional but are not:
Setting
Why it matters
GGML_VK_VISIBLE_DEVICES=0,1
Keeps both D700s visible to llama.cpp
--split-mode layer
Lets llama.cpp distribute transformer layers across the two GPUs
--threads 2
Avoids wasting CPU on sync-heavy Vulkan submission
RADV_PERFTEST=aco,gpl
Uses RADV's faster shader compiler and pipeline path
Do not blindly set --threads to the number of Xeon threads. Once all layers are on the GPUs, extra CPU threads mostly wait on Vulkan synchronization. On this machine, high thread counts can make the desktop feel broken without improving tokens per second.
Step 5: make it a service
Create a dedicated model directory and service user if you want this machine to be an always-on endpoint. Then create:
Then confirm VRAM is actually being used on both cards:
for card in /sys/class/drm/card*/device/mem_info_vram_used; doprintf"%s: ""$card"
awk '{ printf "%.1f MiB\n", $1 / 1024 / 1024 }'"$card"done
The exact numbers depend on the model, but both D700s should move substantially above idle after the model loads.
Cooling matters
The Mac Pro 6,1 has one thermal core and one fan. That design is elegant until both GPUs sit under sustained compute load. Install macfanctld and make the fan curve less timid:
Under sustained inference, you want stable temperatures, not silence. The D700s have more memory headroom than the D300s, but they also put more heat into the same small chassis.
Things to avoid
Flash attention
Do not assume --flash-attn helps. GCN 1.0 predates the FP16 throughput assumptions that make flash attention compelling on modern hardware. Test it if you want, but make the default "off" until benchmarks prove otherwise.
# Baseline first
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2
# Only then compare
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn
Partial offload
Avoid half-on-GPU, half-on-CPU configurations for models that exceed VRAM:
# Prefer this when it fits
--n-gpu-layers 99
# Prefer this when it does not fit
--n-gpu-layers 0
# Be suspicious of this on the Mac Pro 6,1
--n-gpu-layers 20
The D700 cards are connected through an old workstation design, not a modern high-bandwidth multi-GPU fabric. Once inference has to bounce across CPU and GPU layers, the bus and synchronization overhead can erase the benefit of acceleration.
Giant context windows
The D700 memory budget looks generous until you increase context. KV cache grows with context size, layer count, embedding size, and cache precision.
VRAM pressure = model weights + compute buffers + KV cache
KV cache roughly grows with:
context length x number of layers x hidden size x cache precision
Start at --ctx-size 4096. Move to 8192 only after watching VRAM on both cards during real prompts. You can alternatively just remove this option and allow llama cpp to decide for you, it'll pick the maximum it can fit in what VRAM is left over from loading the model.
Do not compare llama-bench directly to llama-server under real API traffic. The server has slot management, sampling, tokenization, and HTTP overhead. Use bench numbers to compare configurations, not to see production throughput.
The use case for these machines
The D700 Mac Pro is not a cheap alternative to a H100 and it is not a modern gaming GPU box (although, it can actually run very well not Vulkan is enabled). Its still useful though, despite it being a bit power hungry compared to modern options:
Use case
Fit
Local coding assistant fallback
Good with a 7B Q4/Q5 model
Private summarization endpoint
Good with conservative context
Multi-user chat service
Poor
13B+ experimentation
CPU-only or use newer hardware
Always-on home lab inference
Good if power cost is acceptable
The point of the D700 is not that it wins benchmarks. It is that a sunk-cost workstation can still be a reliable local inference endpoint when the model is sized correctly and the Vulkan path is configured well.
One this worth thinking about however is the running costs, these old machines can suck up 250-300w under full load, so if you are doing full time inference on them it might actually be cheaper to get a Codex / Claude subscription. You do the math and do whats best for you.
Chief Technology Officer writing about AI systems, software architecture, cyber security, cryptography, and the practical realities of technology leadership.