You guys that continue to compare DGX Spark to the Mac Studios, please remember two things:
1. Virtually every model that you'd run was developed on Nvidia gear and will run on Spark.
2. Spark has fast-as-hell interconnects. The sort of interconnects that one would want to use in an actual AI DC, so you can use more than one Spark at the same time, and RDMA, and actually start to figure out how things work the way they do and why. You can do a lot with 200 Gb of interconnect.
It isn't that good for local LLM inferencing. It's not designed to be as such.
It's designed to be a local dev machine for Nvidia server products. It has the same software and hardware stack as enterprise Nvidia hardware. That's what it is designed for.
Wait for M5 series Macs for good value local inferencing. I think the M5 Pro/Max are going to be very good values.
This is insanely slow given its 200+GB/s memory bandwidth. As a comparison, I've tested GPT OSS 120B on Strix Halo and it obtains 420tps prefill and >40tps decode.
I wonder why they didn't test against the broadly available Strix Halo with 128GB of 256 GB/s memory bandwidth, 16 core full-fat Zen5 with AVX512 at $2k... it is a mystery...
Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.
Running gpt-oss-120b with a rtx 5090 and 2/3 of the experts offloaded to system RAM (less than half of the memory bandwidth of this thing), my machine gets ~4100tps prefill and ~40tps decode.
Your spreadsheet shows the spark getting ~94tps prefill and ~11tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar or the spark a touch faster.
Your system RAM is probably 1/20th the VRAM bandwidth of the 5090 (way way less than half) unless you're running a workstation board with quad or 8 channel RAM, then it's only about 1/10th or 1/5th respectively.
We actually profiled one of the models, and saw that the last GeMM, which is completely memory bound, is taking a lot of time, which reduces the token speed by a lot.
Strix Halo has the problem that prefill is incredibly slow if your context is not very small.
The only thing that might be interesting about this DGX Spark is it's prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.
I think my 2001 MBP M1 Pro is ~200GB/s memory bandwidth, but it handles qwen3:32b quite nicely, albeit maxed out at ~70W.
I somehow expected the Spark to be the 'God in a Box' moment for local AI, but it feels like they went for trying to sell multiple units instead.
I'd be more tempted by a 2nd hand 128GB M2 ultra at ~800GB/s but the prices here are still high, and I'm not sure the Spark is going to convince people to part with those, unless we see some M5 glutenous RAM boxes soon. An easy way for Apple to catch up again.
They're in a different ballback in memory bandwidth. The right comparison is the Ryzen AI Max 395 with 128GB DDR5-8000 which can be bought for around $1800 / 1750€.
$4,000 is actually extremely competitive. Even for an at-home enthusiast setup this price is not our of reach. I was expecting something far higher, that said, nVidia's MSRP is something of a pipe dream recently so we'll see when it's actually released and the availability. Curious also to see how they may scale together.
Well, that’s disappointing since the Mac Studio 128GB is $3,499. If Apple happens to launch a Mac Mini with 128GB RAM it would eat Nvidia Sparks’ lunch every day.
Only if it runs CUDA, MLX / Metal isn't comparable as ecosystem.
People that keep pushing for Apple gear tend to forget Apple has decided what industry considers industry standards, proprietary or not, aren't made available on their hardware.
Even if Metal is actually a cool API to program for.
It depends what you're doing. I can get valuable work done with the subset of Torch supported on MPS and I'm grateful for the speed and RAM of modern Mac systems. JAX support is worse but hopefully both continue to develop.
FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some samples results on my spark:
Is this the full weight model or quantized version? The GGUFs distributed on Hugging Face labeled as MXFP4 quantization have layers that are quantized to int8 (q8_0) instead of bf16 as suggested by OpenAI.
Example looking at blk.0.attn_k.weight, it's q8_0 amongst other layers:
Still, a PC with a 5090 will give in many cases a much better bang for the buck, except when limited by the slower speed of the main memory.
The greater bandwidth available when accessing the entire 128 GB memory is the only advantage of NVIDIA DGX, while a cheaper PC with discrete GPU has a faster GPU, a faster CPU and a faster local GPU memory.
Msrp, but try getting your hands on one without a bulk order and/or camping out in a tent all weekend. I have seen people in my area buying pre-biult machines as they often cost less than trying to buy an individual card.
It’s not that hard to come across MSRP 5090s these days. It took me about a week before I found one. But if you don’t want to put any effort or waiting into it, you can buy one of the overpriced OC models right now for $2500.
That memory bandwidth choked out their performance. How can you claim 1000 tflops if it's not capable of delivering it. Seems they chose to sandbag the spark in favour of the rtx pro 6000.
I guess my next one I'm looking out for is the Orange Pi AI studio pro. Should have 192gb of ram, so able to run qwen3 235b, even though it's ddr4, it's nearly double the bandwidth of the spark.
You guys that continue to compare DGX Spark to the Mac Studios, please remember two things:
1. Virtually every model that you'd run was developed on Nvidia gear and will run on Spark. 2. Spark has fast-as-hell interconnects. The sort of interconnects that one would want to use in an actual AI DC, so you can use more than one Spark at the same time, and RDMA, and actually start to figure out how things work the way they do and why. You can do a lot with 200 Gb of interconnect.
It would be very interesting to read a tutorial on case 2.
It isn't that good for local LLM inferencing. It's not designed to be as such.
It's designed to be a local dev machine for Nvidia server products. It has the same software and hardware stack as enterprise Nvidia hardware. That's what it is designed for.
Wait for M5 series Macs for good value local inferencing. I think the M5 Pro/Max are going to be very good values.
> ollama gpt-oss 120b mxfp4 1 94.67 11.66
This is insanely slow given its 200+GB/s memory bandwidth. As a comparison, I've tested GPT OSS 120B on Strix Halo and it obtains 420tps prefill and >40tps decode.
I wonder why they didn't test against the broadly available Strix Halo with 128GB of 256 GB/s memory bandwidth, 16 core full-fat Zen5 with AVX512 at $2k... it is a mystery...
Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that none of my friend has this device.
Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.
Running gpt-oss-120b with a rtx 5090 and 2/3 of the experts offloaded to system RAM (less than half of the memory bandwidth of this thing), my machine gets ~4100tps prefill and ~40tps decode.
Your spreadsheet shows the spark getting ~94tps prefill and ~11tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar or the spark a touch faster.
Your system RAM is probably 1/20th the VRAM bandwidth of the 5090 (way way less than half) unless you're running a workstation board with quad or 8 channel RAM, then it's only about 1/10th or 1/5th respectively.
We actually profiled one of the models, and saw that the last GeMM, which is completely memory bound, is taking a lot of time, which reduces the token speed by a lot.
The parent is right, the issue is on your side.
Strix Halo has the problem that prefill is incredibly slow if your context is not very small.
The only thing that might be interesting about this DGX Spark is it's prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.
There are some benches on reddit: https://old.reddit.com/r/LocalLLaMA/comments/1o6163l/dgx_spa...
tl;dr it gets absolutely smashed by Strix Halo, at half the price.
I think my 2001 MBP M1 Pro is ~200GB/s memory bandwidth, but it handles qwen3:32b quite nicely, albeit maxed out at ~70W.
I somehow expected the Spark to be the 'God in a Box' moment for local AI, but it feels like they went for trying to sell multiple units instead.
I'd be more tempted by a 2nd hand 128GB M2 ultra at ~800GB/s but the prices here are still high, and I'm not sure the Spark is going to convince people to part with those, unless we see some M5 glutenous RAM boxes soon. An easy way for Apple to catch up again.
Article doesn't seem to mention price which is $4,000 which makes it comparable to a 5090 but with 128GB of unified LPDDR5x vs the 5090's 32GB DDR7.
They're in a different ballback in memory bandwidth. The right comparison is the Ryzen AI Max 395 with 128GB DDR5-8000 which can be bought for around $1800 / 1750€.
$4,000 is actually extremely competitive. Even for an at-home enthusiast setup this price is not our of reach. I was expecting something far higher, that said, nVidia's MSRP is something of a pipe dream recently so we'll see when it's actually released and the availability. Curious also to see how they may scale together.
A warning to any home consumer throwing money at hardware for AI (fair enough if you have other use cases)...
Things are changing rapidly and there is a non insignificant chance that it'll seem like a big waste of money within 12 months.
Well, that’s disappointing since the Mac Studio 128GB is $3,499. If Apple happens to launch a Mac Mini with 128GB RAM it would eat Nvidia Sparks’ lunch every day.
Only if it runs CUDA, MLX / Metal isn't comparable as ecosystem.
People that keep pushing for Apple gear tend to forget Apple has decided what industry considers industry standards, proprietary or not, aren't made available on their hardware.
Even if Metal is actually a cool API to program for.
CUDA is equally proprietary and not an industry standard though, unless you were thinking of Vulcan/OpenCL which doesn’t bring much in this situation.
Yes it is an industry standard, there is even a technical term for it.
It is called De facto standard, which you can check in your favourite dictionary.
CUDA isn't the industry standard? What is then?
It depends what you're doing. I can get valuable work done with the subset of Torch supported on MPS and I'm grateful for the speed and RAM of modern Mac systems. JAX support is worse but hopefully both continue to develop.
Just don't try to run a NCCL
Agreed. I also wonder why they chose to test against a Mac Studio with only 64GB instead of 128GB.
Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that one of my friend has this device.
FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some samples results on my spark:
Is this the full weight model or quantized version? The GGUFs distributed on Hugging Face labeled as MXFP4 quantization have layers that are quantized to int8 (q8_0) instead of bf16 as suggested by OpenAI.
Example looking at blk.0.attn_k.weight, it's q8_0 amongst other layers:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
Example looking at the same weight on Ollama is BF16:
https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
Now this looks much more interesting! Is the top one input tokens and the second one output tokens?
So 38.54 t/s on 120B? Have you tested filling the context too?
I see! Do you know what's causing the slowdown for ollama? They should be using the same backend..
Dude, ggerganov is the creator of llama.cpp. Kind of a legend. And of course he is right, you should've used llama.cpp.
Or you can just ask the ollama people about the ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.
Was. They've been diverging.
Curious to how this compares to running on a Mac.
TTFT on a Mac is terrible and only increases as the context increases, thats why many are selling their M3 Ultra 512GB
A 5090 is $2000.
But you put in a $1500 PC (with 128 GB DRAM).
Still, a PC with a 5090 will give in many cases a much better bang for the buck, except when limited by the slower speed of the main memory.
The greater bandwidth available when accessing the entire 128 GB memory is the only advantage of NVIDIA DGX, while a cheaper PC with discrete GPU has a faster GPU, a faster CPU and a faster local GPU memory.
Msrp, but try getting your hands on one without a bulk order and/or camping out in a tent all weekend. I have seen people in my area buying pre-biult machines as they often cost less than trying to buy an individual card.
It’s not that hard to come across MSRP 5090s these days. It took me about a week before I found one. But if you don’t want to put any effort or waiting into it, you can buy one of the overpriced OC models right now for $2500.
And about 1/4 the memory bandwidth, which is what matters for inference.
More precisely, the RTX 5090 has a memory bandwidth of 1792 GB/s, while the DGX Spark only has 273 GB/s, which is about 1/6.5.
For inference, the DGX Spark does not look like a good choice, as there are cheaper alternatives with better performance.
Nvidia always short changes its own products and stunts them in some way.
No doubt that’s present here too somehow.
Gotta cut off something important so you’ll spend more on the next more expensive product.
How representative is this platform of the bigger GB200 and GB300 chips?
Could I write code that runs on Spark and effortlessly run it on a big GB300 system with no code changes?
If you mean CUDA specific then yes. The biggest benefit of these machines over the others is the CUDA ecosystem and tools like cuDF, cuGraph etc
Looks like MLX is not a supported backend in Ollama so the numbers for the Mac could be significantly higher in some cases.
It would be interesting to swap out Ollama for LM Studio and use their built-in MLX support and see the difference.
M5 Macs may be launching as early as today. Inference should see a significant boost w/ matmul acceleration.
That memory bandwidth choked out their performance. How can you claim 1000 tflops if it's not capable of delivering it. Seems they chose to sandbag the spark in favour of the rtx pro 6000.
I guess my next one I'm looking out for is the Orange Pi AI studio pro. Should have 192gb of ram, so able to run qwen3 235b, even though it's ddr4, it's nearly double the bandwidth of the spark.
[dead]
Just get 5 Mac minis.