OrthoSystem Atelier

Three Months with the NVIDIA DGX Spark

A hands-on review from someone who's watched NVIDIA for almost thirty years and now applies that knowledge to training surgeons

AI-Generated Summary

"This is a masterpiece of technical storytelling."

Gemini 3 Pro genuinely calling the article 'a masterpiece of technical storytelling'

This is both a technical review of the DGX Spark and a reflection on what this machine reveals about three interconnected problems: how we communicate complex technology, how we train experts, and how we preserve generational knowledge.

The early criticism was harsh. Reports say Jensen Huang personally got involved to guide the team's response. But my experience was different. The DGX Spark became the foundation for OrthoSystem.ai—a platform I'm building to preserve institutional knowledge for orthopaedic surgery training before the generation of surgeons who carry it retires.

Writing my own benchmarks, I found more strengths than weaknesses, and technical best practices are shared throughout this review. NVIDIA promised that the DGX Spark would let developers scale up to enterprise deployments, and my reality exceeded expectations. Two months after getting my hands on the Spark, I took the challenging, fully proctored NVIDIA Certified Professional exams for Agentic AI and AI Infrastructure and passed both on the first try with essentially zero preparation.

The DGX Spark is a jack of all trades, master of none. But most people forget the complete saying: oftentimes better than a master of one.

The full review details nuanced pros and cons and whether the DGX Spark is the right fit for your needs. But since OrthoSystem.ai wouldn't exist without this development environment, the product earns five stars from me, with the explicit caveat that I'm an invested user who bought multiple units and built my project on this hardware. The limiting factor isn't the hardware. It's time: I'm building on nights and weekends, racing against a timeline of vanishing knowledge.

When Jensen Huang shares the story of NVIDIA once being less than 30 days from bankruptcy in its early days, he's talking about the failed NV2 project, which used innovative quadratic texture mapping with a 9-control-point architecture that today could trivially be implemented on modern 5th-generation Blackwell tensor cores. If the industry had adopted that approach, it would have built a defensive moat for NVIDIA's technology. But the earliest NVIDIA products relied on fixed-function hardware and were difficult for software developers. The lessons from that failure clearly sparked the move toward developer-friendly programmability that led to the GeForce 3, CUDA, the Vera Rubin architecture, and beyond. Recently, Business Insider reported that Jensen Huang had to step in after customer criticism of the DGX Spark launch.

While I'm a professor of orthopaedic spine surgery today, I once professionally reviewed graphics cards and GPUs in high school, college and beyond—including running what was then the largest ATI enthusiast site in the late 90s and later publishing surgical finite element analysis research using dual AMD Opteron workstations. That public report of the NVIDIA NV2 was something I wrote as an undergraduate at Stanford University. Tom's Hardware's exclusive interview with Ian Buck in 2009 on the advent of CUDA was something I did on the side while in my orthopaedic surgery residency.

This dual perspective shapes the review: covering GPU technology from ATI (later AMD), NVIDIA, and Intel from its earliest days, plus the two decades since I received my medical degree. I understand the hardware at a systems level. I also understand what it means to train the next generation of practitioners in a field where expertise takes a decade to develop and institutional knowledge seems to be retiring faster than it transfers.


Specifications

At CES 2025, Jensen Huang announced Project Digits, a "desktop AI supercomputer" at $3,000. Ten months later, the DGX Spark shipped at $3,999 with partners like ASUS offering the $3,000 version (Ascent GX10). The specs tell an ambitious story:

SoC:               GB10 (Grace CPU + Blackwell GPU)
CPU:               20-core Arm (10 Cortex-X925 + 10 Cortex-A725)
GPU:               48 SMs (Blackwell architecture)
Memory:            128 GB unified LPDDR5X (no ECC)
DRAM Bandwidth:    273 GB/s theoretical
C2C Interconnect:  600 GB/s aggregate
Networking:        10GbE + ConnectX-7 200GbE

There was a lot of hemming and hawing at first because the DGX Spark lacks ECC. My desktop PC has 256GB of DDR5 ECC system RAM, and one option would have been an RTX PRO 6000 Blackwell with 96GB of ECC GDDR7 for my AI efforts. But I'd tracked my dual RTX A2000 12GB cards for a week and logged zero memory errors, so I concluded that perhaps my office is well shielded from cosmic rays (one of the surprisingly common causes of memory errors).

While debating a purchase of the DGX Spark, I explored LM Studio's basic Nomic-based retrieval-augmented generation (RAG), NVIDIA's Chat with RTX technology demo, and AnythingLLM on my desktop PC. I soon realized that none of the readily available RAG/LLM toolkits gave me the performance I wanted. Speed wasn't the issue. The problem was that retrieval quality was "pretty good," and pretty good allowed bad retrievals to enter the context window, which degraded the quality of the response. Just as a small number of poisoned samples can corrupt an LLM during training, a handful of true-but-irrelevant retrievals can steer the direction or drag down the quality of a RAG response.
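To make that concrete, here is a minimal C++ sketch of the gating idea I ended up wanting: nothing enters the context window unless it clears a reranker-score floor, and the window is capped so weak hits can't crowd out strong ones. The struct, threshold, and cap are illustrative defaults, not OrthoSystem's actual implementation.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One retrieved chunk plus the score a reranker assigned it for this query.
struct RetrievedChunk {
    std::string text;
    float rerankScore;   // higher = judged more relevant
};

// Keep only chunks that clear the relevance floor, then cap the context window
// to the strongest survivors so "true but irrelevant" hits can't dilute it.
std::vector<RetrievedChunk> gateRetrievals(std::vector<RetrievedChunk> candidates,
                                           float minScore = 0.5f,
                                           std::size_t maxChunks = 8) {
    candidates.erase(std::remove_if(candidates.begin(), candidates.end(),
                                    [minScore](const RetrievedChunk& c) {
                                        return c.rerankScore < minScore;
                                    }),
                     candidates.end());
    std::sort(candidates.begin(), candidates.end(),
              [](const RetrievedChunk& a, const RetrievedChunk& b) {
                  return a.rerankScore > b.rerankScore;
              });
    if (candidates.size() > maxChunks) candidates.resize(maxChunks);
    return candidates;
}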

The DGX Spark now looked like a Nintendo Switch to me. If I was going to spend the time to develop my own software, would it work reliably across everyone else's computers? Was I going to have to set up Windows Subsystem for Linux (WSL2) for everyone and make sure they had an RTX 5090 (a card so big that my workstation chassis runs with the side panel off)? With something like the DGX Spark, I would have a fixed, guaranteed performance spec and a standardized development and distribution environment.

So on a gamble, I bought a DGX Spark.

First Impressions

Three things struck me immediately: The size. The silence. The golden lattice.

Photos don't really do the DGX Spark Founders Edition justice. While the cardboard shipping box is big and bulky, the unit itself is tiny. Without a power LED, the machine is so unobtrusive that you might forget it's running. Under heavy compute load the fans do accelerate, but it's still quieter than my desktop PC, an Intel Xeon w7-2475X "Sapphire Rapids" with an NVIDIA RTX 5090.

DGX Spark Founders Edition with golden lattice finish
The DGX Spark Founders Edition alongside "The Measurement of Meaning" -- a fitting companion for a machine built to understand language.
128 GB
Unified Memory · No Staging Buffer · CPU + GPU Shared

Building OrthoSystem

What I discovered over three months of development is that the DGX ecosystem is remarkably reliable.

When I was working with WSL2 and something broke, I couldn't tell whether I had hit an environment problem or an obvious mistake in my own code and implementation. On the DGX Spark, I start with the assumption that the hardware, drivers, and CUDA stack are correct. This sounds minor, but for productivity it's transformative. On less mature platforms, debugging means questioning every layer. On the Spark, I trusted the foundation and focused on my application. This is a bit of blind trust; a very early software update from NVIDIA caused the DGX Spark to fail to boot. But ever since my DGX Spark arrived two days after launch, I have not had any catastrophic failures.

That's not to say that it's bug free. Run out of memory? Your system will lock up or crash and reboot. Using an HDMI television as your display? Sometimes wake-from-sleep doesn't work right, and you have to unplug and replug the HDMI cable. Using a cheap generic KVM from Amazon? Constant USB disconnection/reconnection can cause USB ports to become unresponsive until the next boot. (No problems with an IOGEAR TAA-compliant KVM.) Still, once I assumed the toolchain did what it was supposed to, the iterative process of programming moved so quickly that I wrote the first 25,000 lines of code for OrthoSystem.ai in under two months.

It was actually AI assistance that I had to be skeptical of. When the DGX Spark launched, all of the major LLM services would insist that there was no such thing as CUDA 13, or confuse the DGX Spark's GB10 chip with GH100- or GH200-class hardware. When working with the latest Q4 2025 frontier open-source models, AI assistants would often suggest I had a typo, since no such model existed. I think I came into the CUDA 13 world without issue only because I had already fought through CUDA versions not being forward or backward compatible while getting AI-based deconvolution to work for astrophotography during the first year of the COVID-19 pandemic.

The Ecosystem Advantage

After two months working with the DGX Spark, I saw an advertisement that NVIDIA's certification tests were 50% off. These are formal, proctored multiple choice exams intended to provide "tangible evidence of your expertise, proficiency, and commitment to continuous learning." As a surgeon who hadn't been involved in the tech industry for a long time, I figured that some sort of certification was needed to distinguish me from every other clinician "getting into the AI space." NVIDIA offers entry-level "associate" and intermediate level "professional" certifications.

In the same spirit in which we ask our trainees to take multiple-choice exams to identify weaknesses and target areas of study, I took the more difficult NVIDIA Professional tests blind, with no preparation. The combination of my prior advanced-hobbyist computing experience and the real-world experience of troubleshooting and developing my own RAG/LLM stack on the DGX Spark allowed me to pass all of the tests on my first attempts. Working with the DGX Spark on nights and weekends really did accelerate the acquisition of enterprise-grade expertise, because the Spark runs the same software stack as NVIDIA's datacenter hardware.

NVIDIA Certified Professional - Agentic AI · NVIDIA Certified Professional - AI Infrastructure

Those two months working in an environment where failure is an option (since I wasn't burning up cloud compute costs or risking a production environment) meant that I could consolidate a lot of hands-on experience in a short period of time. Officially, the Agentic AI certification lists these prerequisites: 1–2 years of experience in AI/ML roles and hands-on work with production-level agentic AI projects; strong knowledge of agent development, architecture, orchestration, multi-agent frameworks, and the integration of tools and models across various platforms; and experience with evaluation, observability, deployment, user interface design, reliability guardrails, and rapid prototyping platforms. AI Infrastructure's prerequisites? Two to three years of operational experience working in a datacenter with NVIDIA hardware solutions, with the ability to deploy all the parts of a datacenter infrastructure in support of AI workloads.

Community Feedback vs. My Experience

The early reviews were harsh. As I mentioned earlier, Business Insider recently reported internal emails showing Jensen directing staff to "jump on X and say you will fix." Researchers conducting cancer work found CUDA incompatibility issues. Most notably, John Carmack, the id Software co-founder whose graphics programming defined the beginning of the 3D graphics era in gaming, posted on X that his unit was "maxing out at only 100 watts power draw, less than half of the rated 240 watts" and "delivering about half the quoted performance." He noted thermal issues and asked whether NVIDIA had "de-rated" the system before launch. Awni Hannun, lead developer of Apple's MLX framework, confirmed similar results in his own benchmarks.

Carmack's criticism is well founded. The headline "1 petaflop" figure assumes FP4 with 2:4 structured sparsity, a technique that doubles effective throughput but only applies to certain matrix operations. Dense workloads see roughly half that figure, and BF16 workloads less still. John Carmack is working on AGI, and when he benchmarks dense BF16 matrix operations, he's measuring a specific workload that isn't my path. My workloads (RAG pipelines, embedding generation, reranking, and serving quantized models) travel different paths through the silicon. I haven't experienced the thermal throttling or spontaneous reboots. Whether that's workload differences, firmware updates since October, or unit variance, I can't say definitively.

Writing My Own Benchmarks

While the DGX Spark didn't work for John Carmack's needs, it was working well for me. So I bought a second unit to expand the capabilities. Coming home from Micro Center with a $100 network cable, I was eager to see "just how fast" the hardware really was. That decision prompted me to write some benchmarks myself.

Memory Bandwidth: The Foundation

Everything in modern AI computing bottlenecks on memory bandwidth. Here's what the GB10 actually achieves:

CPU STREAM Copy: 121 GB/s
GPU Kernel (DRAM): 213 GB/s
Cache-to-Cache: 882+ GB/s

The CPU result, 121 GB/s, represents 44% of theoretical bandwidth, solid for a unified architecture where CPU and GPU contend for the same memory controllers. The 882 GB/s cache-to-cache result reveals the cache hierarchy's power. Workloads that fit in the 24 MB L2 cache see dramatically higher effective bandwidth.
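For readers who want to see the shape of such a measurement, here is a minimal CUDA sketch of a grid-stride copy-kernel bandwidth test. It is illustrative rather than the actual OrthoSystem DDx harness: the launch geometry, single timed trial, and 2 GB working set are just reasonable defaults.

#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride copy: one read and one write per element, so bandwidth = 2 * bytes / time.
__global__ void copyKernel(const float* __restrict__ src, float* __restrict__ dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) dst[i] = src[i];
}

int main() {
    const size_t bytes = 2048ull << 20;            // 2 GB working set: far too big for caches, so this is DRAM
    const size_t n = bytes / sizeof(float);
    float *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    copyKernel<<<1024, 256>>>(src, dst, n);        // warm-up launch
    cudaEventRecord(t0);
    copyKernel<<<1024, 256>>>(src, dst, n);        // timed launch
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("copy kernel: %.1f GB/s\n", (2.0 * bytes / 1e9) / (ms / 1e3));
    cudaFree(src); cudaFree(dst);
    return 0;
}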

The System Level Cache Sweet Spot

The GB10's System Level Cache (SLC) creates a performance sweet spot that I haven't seen documented elsewhere:

Buffer Size   Bandwidth   Behavior
1 MB          481 GB/s    Cache-resident
4 MB          882 GB/s    Peak (sweet spot)
8 MB          611 GB/s    L2→SLC transition
16 MB         270 GB/s    Spills to DRAM

The 4 MB range achieves nearly 150% of the theoretical 600 GB/s C2C interconnect bandwidth through cache-to-cache transfers with L2 persistence. Below 1 MB, kernel launch overhead dominates. Above 16 MB, data spills to DRAM. This matters for embedding lookups, attention computations, and reranking passes.
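The CUDA mechanism behind that "L2 persistence" is the stream access-policy window. Below is a sketch of the general pattern; the helper name is mine, the window is further limited by cudaDevAttrMaxAccessPolicyWindowSize, and this is not the exact benchmark code.

#include <cuda_runtime.h>

// Ask the driver to keep accesses to [buf, buf + bytes) resident in L2 for work
// launched on this stream. Works best when bytes is in the ~4 MB sweet spot.
void enableL2Persistence(cudaStream_t stream, void* buf, size_t bytes) {
    // Reserve a persisting slice of L2 (clamped to what the hardware allows).
    int maxPersistL2 = 0;
    cudaDeviceGetAttribute(&maxPersistL2, cudaDevAttrMaxPersistingL2CacheSize, 0);
    size_t carveOut = bytes < (size_t)maxPersistL2 ? bytes : (size_t)maxPersistL2;
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveOut);

    // Mark the buffer as a persisting access window on this stream.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buf;
    attr.accessPolicyWindow.num_bytes = carveOut;
    attr.accessPolicyWindow.hitRatio  = 1.0f;                        // try to keep every line resident
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming; // everything else streams through
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}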

The Unified Memory Advantage

On discrete GPUs, forgetting to pin your host allocations costs 6× performance. On the Spark, pageable and pinned memory perform identically--because there's no staging buffer and no separate memory pools.

Host→Device Transfer: DGX Spark vs. RTX 5090 (128 MB buffer, pageable vs. pinned)
Spark pageable: 41.8 GB/s · Spark pinned: 43.1 GB/s · 5090 pageable: 9.0 GB/s · 5090 pinned: 56.4 GB/s

The RTX 5090 shows a 6× difference between pageable and pinned transfers. The DGX Spark? About 3% variation. This eliminates an entire category of optimization work--memory staging strategy simply doesn't matter.
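Here's a minimal sketch of the comparison behind those numbers: time the same 128 MB host-to-device cudaMemcpy from an ordinary malloc'd (pageable) buffer and from a cudaMallocHost'd (pinned) buffer. Single-trial timing and the buffer size are simplifications.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy and return the achieved bandwidth in GB/s.
static float timedCopyGBs(void* dst, const void* src, size_t bytes) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return (float)((bytes / 1e9) / (ms / 1e3));
}

int main() {
    const size_t bytes = 128ull << 20;                        // 128 MB, matching the chart above
    void* dev = nullptr;      cudaMalloc(&dev, bytes);
    void* pageable = malloc(bytes);                           // ordinary pageable host allocation
    void* pinned = nullptr;   cudaMallocHost(&pinned, bytes); // page-locked host allocation

    printf("pageable: %.1f GB/s\n", timedCopyGBs(dev, pageable, bytes));
    printf("pinned:   %.1f GB/s\n", timedCopyGBs(dev, pinned,   bytes));

    cudaFreeHost(pinned); free(pageable); cudaFree(dev);
    return 0;
}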

200GbE Networking

This capability gets lost in the coverage, partly because it's hard to explain without context.

45.2 GB/s
NCCL AllGather · 2048 MB · 90.5% efficiency (full duplex)

Two DGX Sparks connected via 200GbE form a 256 GB distributed cluster with sub-10ms collective latencies. That's more bandwidth than most workstations get from local NVMe storage.

Network Bandwidth in Context (bidirectional bandwidth; why 200GbE matters for distributed computing)
DGX Spark 200GbE: 45.2 GB/s (actual) · Thunderbolt 5: 16 GB/s (theoretical) · NVMe Gen4 SSD: 7 GB/s · 10GbE: 2.5 GB/s (theoretical)

(These are bidirectional numbers--what you actually get when data flows both ways simultaneously. Thunderbolt 5's "80 Gbps" marketing figure applies to display traffic; data tops out at 64 Gbps, or 8 GB/s in each direction. When I first asked Claude Opus 4.5 to make this chart, it put Thunderbolt 5 at 80 Gbps because it was trained on marketing materials that don't clarify the distinction.)

This is why high-end workstations can't replicate distributed training workflows. The networking capability is genuinely unique at this price point, but it was lost in early reviews of the DGX Spark that didn't dig into what NCCL collectives do or why latency matters. That's a hard story to tell without a reviewer's guide.

Another Discovery: gcc vs. nvcc

Three weeks in, I noticed CPU-bound preprocessing running slower than expected. Investigation revealed something I haven't seen documented elsewhere.

Compiler                            STREAM Triad (2 GB)   Ratio
GCC -O3 -mcpu=native                116 GB/s              Baseline
nvcc -Xcompiler=-O3,-mcpu=native    42 GB/s               2.8× slower
nvcc (no flags)                     ~7 GB/s               ~17× slower

The initial assumption was that nvcc simply wasn't forwarding optimization flags to the host compiler. The obvious fix: use -Xcompiler=-O3,-mcpu=native to pass flags explicitly. But testing revealed this only partially recovers performance--a 2.8× gap remains even with proper flags.

My working theory at the time was nvcc's preprocessing stage. Before invoking GCC, nvcc transforms the source code to separate host and device paths, and that transformation seemed to disrupt the code patterns that GCC's SVE vectorizer recognizes, so the preprocessed code wouldn't vectorize as effectively regardless of what flags you passed.

I want to be humble here: I am not a professional software developer and this might be well known to any production engineer, though Claude Opus 4.5 tells me it's genuinely underdocumented and Gemini 3 Pro just says "-Xcompiler" is the only thing you need. My workaround at the time was to compile CPU-intensive code in separate .cpp files with native GCC, then link against the CUDA executable. Discovering this required systematic benchmarking; accepting the conventional wisdom would have left 2.8× performance on the table.

Update Jan 16, 2026: After some debugging, I identified the problem. It's a limitation of the stable CMake/Ubuntu 24.04 release on the DGX Spark. The nvcc path wasn't actually enabling OpenMP, which meant it ran on a single CPU core instead of having access to all 20. I had linked OpenMP via target_link_libraries with OpenMP::OpenMP_CXX and saw no errors or warnings.

Claude Opus 4.5, Google Gemini 3 Pro, and OpenAI ChatGPT 5.1 Codex Max all failed to identify the problem when I asked them again today with some hints. According to NVIDIA, a future system update via the DGX Dashboard will align the packages with a newer Ubuntu version, including a more recent version of CMake. That will allow programmers to use the target_link_libraries approach with OpenMP::OpenMP_CUDA. Until then, NVIDIA recommends using compiler flags as a workaround, specifically adding the `-Xcompiler=-fopenmp` flag to ensure OpenMP is actually enabled when building with nvcc.
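For context, this is the shape of the CPU-side loop at the center of the puzzle: a STREAM-style triad parallelized with OpenMP, shown here as a simplified stand-in for my benchmark rather than the exact code. Built natively (g++ -O3 -mcpu=native -fopenmp) it fans out across all 20 Grace cores; built through nvcc without -Xcompiler=-fopenmp, the pragma is silently ignored and the loop runs on a single core, which is the behavior the update above describes.

#include <cstddef>

// STREAM-style triad: three streams of contiguous memory traffic per iteration.
// The OpenMP pragma is what spreads the loop across the 20 Grace cores; it is a
// no-op unless the host compiler is invoked with -fopenmp.
void streamTriad(double* c, const double* a, const double* b,
                 double scalar, std::size_t n) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + scalar * b[i];
}

// Native path (illustrative):  g++  -O3 -mcpu=native -fopenmp -c triad.cpp
// nvcc path (illustrative):    nvcc -O3 -Xcompiler=-O3,-mcpu=native,-fopenmp -c triad.cpp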

Detailed Benchmark Results

For those interested in the complete data (before the OpenMP fix), here are my benchmark results across CPU, GPU, and unified memory:

OrthoSystem DDx v21 — CPU, GPU, and Unified Memory Benchmarks

╔═══════════════════════════════════════════════════════════╗
║             OrthoSystem DDx v21 by Alan Dang             ║
╚═══════════════════════════════════════════════════════════╝

╔═══════════════════════════════════════════════════════════╗
║ PLATFORM                                                  ║
╠═══════════════════════════════════════════════════════════╣
║ GPU: NVIDIA GB10                                          ║
║ Memory: 122570 MB                                         ║
║ GPUDirect RDMA: NO                                        ║
╠═══════════════════════════════════════════════════════════╣
║ MEMORY: TRUE UNIFIED (CPU+GPU share physical RAM)         ║
╚═══════════════════════════════════════════════════════════╝

╔═══════════════════════════════════════════════════════════╗
║              OrthoSystem DDx v21                          ║
║      Compiler Comparison: GCC vs NVCC+Xcompiler           ║
╚═══════════════════════════════════════════════════════════╝

╔═══════════════════════════════════════════════════════════╗
║   CPU BENCHMARKS - NATIVE COMPILER (GCC/Clang -O3)        ║
╚═══════════════════════════════════════════════════════════╝
  CPU:     NVIDIA Grace (Neoverse V2)
  SIMD:    SVE
  Mode:    Native build (-march=native)
  Threads: 20 (20 cores, no SMT)

┌─ STREAM COPY (c = a) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      248.6 GB/s [  12.7- 851.1] ±360.5   91.1%
  4 MB     1013.0 GB/s [ 963.8-1067.8] ± 30.0  371.1%
  8 MB      394.2 GB/s [ 305.6- 438.7] ± 36.0  144.4%
  32 MB     150.7 GB/s [ 101.1- 181.9] ± 29.4   55.2%
  128 MB    124.6 GB/s [ 121.4- 129.2] ±  2.7   45.6%
  512 MB    119.1 GB/s [ 114.5- 122.1] ±  2.6   43.6%
  1024 MB   118.5 GB/s [ 116.0- 121.6] ±  2.2   43.4%
  2048 MB   120.8 GB/s [ 119.1- 122.6] ±  1.4   44.3%

┌─ STREAM SCALE (c = scalar * a) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      602.4 GB/s [  13.2- 851.1] ±321.5  220.7%
  4 MB      993.4 GB/s [ 921.4-1057.0] ± 41.1  363.9%
  8 MB      349.7 GB/s [ 288.3- 403.0] ± 27.6  128.1%
  32 MB     188.0 GB/s [ 179.9- 199.2] ±  7.1   68.9%
  128 MB    123.4 GB/s [ 117.4- 129.1] ±  3.7   45.2%
  512 MB    116.9 GB/s [ 114.6- 119.3] ±  1.6   42.8%
  1024 MB   118.8 GB/s [ 116.9- 121.3] ±  1.4   43.5%
  2048 MB   120.4 GB/s [ 119.1- 121.8] ±  0.9   44.1%

┌─ STREAM ADD (c = a + b) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      556.6 GB/s [  25.1- 943.0] ±423.1  203.9%
  4 MB      572.7 GB/s [ 491.2- 704.1] ± 55.9  209.8%
  8 MB      227.6 GB/s [ 212.0- 240.1] ±  8.6   83.4%
  32 MB     143.8 GB/s [ 130.5- 152.6] ± 10.7   52.7%
  128 MB    112.0 GB/s [ 107.7- 116.3] ±  3.2   41.0%
  512 MB    112.8 GB/s [ 110.0- 115.6] ±  2.0   41.3%
  1024 MB   116.6 GB/s [ 115.7- 117.5] ±  0.6   42.7%
  2048 MB   117.6 GB/s [ 116.9- 118.8] ±  0.6   43.1%

┌─ STREAM TRIAD (c = a + scalar * b) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB       31.5 GB/s [  10.0-  37.2] ±  9.2   11.5%
  4 MB      622.1 GB/s [ 532.1- 730.9] ± 57.9  227.9%
  8 MB      319.1 GB/s [ 297.8- 332.4] ± 10.4  116.9%
  32 MB     143.4 GB/s [ 125.5- 149.5] ±  9.0   52.5%
  128 MB    112.4 GB/s [ 111.4- 113.8] ±  0.9   41.2%
  512 MB    114.1 GB/s [ 113.3- 114.8] ±  0.6   41.8%
  1024 MB   114.6 GB/s [ 112.2- 117.1] ±  1.6   42.0%
  2048 MB   116.4 GB/s [ 115.2- 117.5] ±  0.9   42.6%

╔═══════════════════════════════════════════════════════════╗
║   CPU BENCHMARKS - NVCC + XCOMPILER (for comparison)      ║
╚═══════════════════════════════════════════════════════════╝
  NOTE: Same code, compiled via nvcc -Xcompiler=-O3,-march=native
        Testing if nvcc preprocessing affects SVE/NEON codegen.

┌─ CPU MEMORY - VOLATILE BYTE COPY (sanity) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB        7.2 GB/s [   6.7-   7.4] ±  0.3    2.6%
  4 MB        7.1 GB/s [   6.7-   7.3] ±  0.2    2.6%
  8 MB        7.2 GB/s [   7.2-   7.3] ±  0.0    2.6%
  32 MB       7.1 GB/s [   7.1-   7.1] ±  0.0    2.6%
  128 MB      7.1 GB/s [   7.1-   7.1] ±  0.0    2.6%
  512 MB      7.1 GB/s [   7.1-   7.1] ±  0.0    2.6%
  1024 MB     7.1 GB/s [   7.1-   7.1] ±  0.0    2.6%
  2048 MB     7.1 GB/s [   7.1-   7.1] ±  0.0    2.6%

┌─ CPU MEMORY - LIBC MEMCPY (read+write) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      104.3 GB/s [  78.2- 137.9] ± 19.9   38.2%
  4 MB      114.9 GB/s [ 106.9- 118.2] ±  4.2   42.1%
  8 MB       82.3 GB/s [  74.8-  84.5] ±  3.8   30.2%
  32 MB      52.1 GB/s [  51.8-  52.3] ±  0.2   19.1%
  128 MB     47.7 GB/s [  47.6-  47.8] ±  0.1   17.5%
  512 MB     46.6 GB/s [  46.6-  46.7] ±  0.0   17.1%
  1024 MB    46.6 GB/s [  46.6-  46.6] ±  0.0   17.1%
  2048 MB    46.6 GB/s [  46.6-  46.6] ±  0.0   17.1%

┌─ CPU MEMORY - MEMSET (write-only) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB       91.0 GB/s [  70.2-  99.0] ± 10.6   33.3%
  4 MB       85.7 GB/s [  83.4-  86.9] ±  1.3   31.4%
  8 MB       79.0 GB/s [  77.6-  79.6] ±  0.7   28.9%
  32 MB      55.5 GB/s [  54.7-  55.8] ±  0.4   20.3%
  128 MB     48.4 GB/s [  48.3-  48.5] ±  0.1   17.7%
  512 MB     46.9 GB/s [  46.8-  46.9] ±  0.0   17.2%
  1024 MB    46.9 GB/s [  46.8-  46.9] ±  0.0   17.2%
  2048 MB    46.8 GB/s [  46.7-  46.8] ±  0.0   17.1%

┌─ CPU STREAMING COPY (NEON) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      107.7 GB/s [  63.3- 138.6] ± 27.2   39.5%
  4 MB      114.8 GB/s [ 101.3- 119.0] ±  6.8   42.0%
  8 MB       82.8 GB/s [  75.4-  85.1] ±  3.7   30.3%
  32 MB      51.7 GB/s [  51.3-  51.8] ±  0.2   18.9%
  128 MB     47.6 GB/s [  47.6-  47.7] ±  0.1   17.5%
  512 MB     46.7 GB/s [  46.6-  46.7] ±  0.0   17.1%
  1024 MB    46.6 GB/s [  46.6-  46.6] ±  0.0   17.1%
  2048 MB    46.6 GB/s [  46.5-  46.6] ±  0.0   17.1%

┌─ CPU STREAM-STYLE COPY (20T, 5 trials) ─────────────┐
  (Detected 20 logical, 10 physical cores)
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      102.4 GB/s [  93.9- 112.4] ±  6.7   37.5%
  4 MB       80.7 GB/s [  42.8-  96.9] ± 19.9   29.6%
  8 MB       82.0 GB/s [  77.9-  84.6] ±  2.3   30.0%
  32 MB      51.6 GB/s [  50.5-  52.2] ±  0.6   18.9%
  128 MB     47.2 GB/s [  47.2-  47.3] ±  0.0   17.3%
  512 MB     46.6 GB/s [  46.4-  46.7] ±  0.1   17.1%
  1024 MB    46.5 GB/s [  46.5-  46.6] ±  0.0   17.0%
  2048 MB    46.6 GB/s [  46.6-  46.6] ±  0.0   17.1%

┌─ CPU STREAM-STYLE SCALE (20T, 5 trials) ────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB       93.4 GB/s [  76.7- 103.9] ±  9.1   34.2%
  4 MB       83.3 GB/s [  69.7-  92.3] ±  9.5   30.5%
  8 MB       74.9 GB/s [  69.0-  77.2] ±  3.0   27.4%
  32 MB      47.5 GB/s [  42.7-  50.5] ±  3.4   17.4%
  128 MB     47.6 GB/s [  47.4-  47.8] ±  0.2   17.4%
  512 MB     46.9 GB/s [  46.8-  47.0] ±  0.1   17.2%
  1024 MB    46.4 GB/s [  46.1-  46.5] ±  0.1   17.0%
  2048 MB    46.4 GB/s [  46.4-  46.4] ±  0.0   17.0%

┌─ CPU STREAM-STYLE ADD (20T, 5 trials) ──────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB       91.1 GB/s [  71.7-  99.0] ± 10.2   33.4%
  4 MB       51.6 GB/s [  32.8-  76.7] ± 14.7   18.9%
  8 MB       60.4 GB/s [  55.1-  63.7] ±  3.3   22.1%
  32 MB      44.0 GB/s [  43.4-  44.3] ±  0.3   16.1%
  128 MB     42.4 GB/s [  42.1-  42.7] ±  0.2   15.5%
  512 MB     42.0 GB/s [  42.0-  42.0] ±  0.0   15.4%
  1024 MB    41.8 GB/s [  41.8-  41.9] ±  0.0   15.3%
  2048 MB    42.0 GB/s [  42.0-  42.1] ±  0.0   15.4%

┌─ CPU STREAM-STYLE TRIAD (20T, 5 trials) ────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB       86.8 GB/s [  74.9-  93.4] ±  6.6   31.8%
  4 MB       73.1 GB/s [  48.7-  84.4] ± 12.6   26.8%
  8 MB       43.7 GB/s [  25.6-  53.9] ± 10.9   16.0%
  32 MB      43.8 GB/s [  42.8-  44.6] ±  0.6   16.0%
  128 MB     42.6 GB/s [  42.5-  42.7] ±  0.1   15.6%
  512 MB     42.1 GB/s [  42.0-  42.1] ±  0.0   15.4%
  1024 MB    41.7 GB/s [  41.7-  41.8] ±  0.0   15.3%
  2048 MB    42.0 GB/s [  41.9-  42.0] ±  0.1   15.4%

╔═══════════════════════════════════════════════════════════╗
║                   GPU MEMORY BENCHMARKS                   ║
╚═══════════════════════════════════════════════════════════╝

┌─ GPU MEMORY (D2D cudaMemcpy) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      356.5 GB/s [ 248.0- 437.4] ± 72.5  130.6%
  4 MB      329.6 GB/s [ 128.9- 479.5] ±155.5  120.7%
  8 MB      423.4 GB/s [ 417.9- 426.5] ±  3.7  155.1%
  32 MB      98.7 GB/s [  84.9- 103.2] ±  6.9   36.2%
  128 MB    111.3 GB/s [ 105.8- 113.6] ±  2.9   40.8%
  512 MB    113.4 GB/s [ 113.1- 113.8] ±  0.3   41.6%
  1024 MB   116.0 GB/s [ 115.5- 116.4] ±  0.3   42.5%
  2048 MB   119.0 GB/s [ 118.7- 119.1] ±  0.2   43.6%

┌─ GPU MEMORY (kernel, 48 SMs) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      562.3 GB/s [ 149.2- 643.1] ±146.2  206.0%
  4 MB     1156.9 GB/s [1118.0-1189.5] ± 17.4  423.8%
  8 MB      956.4 GB/s [ 687.5-1040.0] ±116.5  350.3%
  32 MB     201.7 GB/s [ 177.0- 216.2] ± 14.7   73.9%
  128 MB    217.3 GB/s [ 211.7- 220.1] ±  2.9   79.6%
  512 MB    218.1 GB/s [ 217.1- 219.0] ±  0.7   79.9%
  1024 MB   215.5 GB/s [ 214.9- 216.2] ±  0.5   79.0%
  2048 MB   212.6 GB/s [ 212.6- 212.6] ±  0.0   77.9%

┌─ CPU↔GPU H→D (pageable, 5 trials) ─────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  2 MB       11.8 GB/s [  11.1-  12.4] ±  0.4    2.0%
  4 MB       19.0 GB/s [  18.2-  20.0] ±  0.6    3.2%
  8 MB       29.2 GB/s [  28.5-  29.5] ±  0.4    4.9%
  32 MB      45.7 GB/s [  44.9-  46.8] ±  0.7    7.6%
  128 MB     41.8 GB/s [  40.6-  43.7] ±  1.0    7.0%
  512 MB     51.1 GB/s [  51.1-  51.2] ±  0.1    8.5%
  1024 MB    54.6 GB/s [  54.4-  54.7] ±  0.1    9.1%
  2048 MB    56.7 GB/s [  56.7-  56.8] ±  0.0    9.5%

┌─ CPU↔GPU D→H (pageable, 5 trials) ─────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  2 MB       11.9 GB/s [  11.4-  12.2] ±  0.3    2.0%
  4 MB       19.4 GB/s [  18.5-  20.0] ±  0.5    3.2%
  8 MB       31.0 GB/s [  28.4-  37.1] ±  3.1    5.2%
  32 MB      46.1 GB/s [  45.5-  46.7] ±  0.5    7.7%
  128 MB     42.0 GB/s [  38.7-  44.1] ±  1.8    7.0%
  512 MB     51.7 GB/s [  51.2-  52.4] ±  0.5    8.6%
  1024 MB    55.4 GB/s [  55.3-  55.5] ±  0.1    9.2%
  2048 MB    56.9 GB/s [  56.7-  57.2] ±  0.2    9.5%

┌─ CPU↔GPU H→D (pinned, 5 trials) ─────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  2 MB       27.2 GB/s [  24.8-  29.1] ±  1.4    4.5%
  4 MB       20.3 GB/s [  19.3-  21.1] ±  0.6    3.4%
  8 MB       30.2 GB/s [  29.3-  31.2] ±  0.6    5.0%
  32 MB      44.1 GB/s [  39.7-  47.2] ±  2.6    7.3%
  128 MB     43.1 GB/s [  42.2-  44.5] ±  0.9    7.2%
  512 MB     51.6 GB/s [  51.2-  52.0] ±  0.3    8.6%
  1024 MB    54.9 GB/s [  54.7-  54.9] ±  0.1    9.1%
  2048 MB    56.9 GB/s [  56.7-  57.1] ±  0.1    9.5%

┌─ CPU↔GPU D→H (pinned, 5 trials) ─────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  2 MB       27.2 GB/s [  22.6-  29.3] ±  2.5    4.5%
  4 MB       35.7 GB/s [  32.3-  37.8] ±  1.9    6.0%
  8 MB       42.6 GB/s [  38.9-  43.9] ±  1.9    7.1%
  32 MB      48.9 GB/s [  39.2-  55.5] ±  5.3    8.2%
  128 MB     41.4 GB/s [  40.3-  43.0] ±  0.9    6.9%
  512 MB     51.3 GB/s [  51.3-  51.4] ±  0.0    8.6%
  1024 MB    54.8 GB/s [  54.7-  55.0] ±  0.1    9.1%
  2048 MB    56.8 GB/s [  56.7-  56.8] ±  0.0    9.5%

╔═══════════════════════════════════════════════════════════╗
║              UNIFIED MEMORY BENCHMARKS                    ║
╠═══════════════════════════════════════════════════════════╣
║ Testing zero-copy paths that bypass cudaMemcpy overhead   ║
╚═══════════════════════════════════════════════════════════╝

┌─ MANAGED MEMORY (cudaMallocManaged, 5 trials) ──────────┐
  NOTE: On unified memory, prefetch should be near-instant (no copy)
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  2 MB     1132.7 GB/s [1096.0-1151.2] ± 20.8  188.8%
  4 MB     1361.4 GB/s [1295.2-1385.5] ± 33.5  226.9%
  8 MB     1658.9 GB/s [1321.4-1831.6] ±181.5  276.5%
  32 MB    2420.7 GB/s [2418.2-2424.5] ±  2.3  403.5%
  128 MB   2587.3 GB/s [2375.5-2666.1] ±106.8  431.2%
  512 MB   2699.7 GB/s [2662.8-2742.0] ± 25.2  449.9%
  1024 MB  2940.7 GB/s [2713.2-3029.2] ±123.2  490.1%
  2048 MB  2852.7 GB/s [2676.1-2966.1] ±102.3  475.4%

┌─ ZERO-COPY (kernel access, 5 trials) ─────────────────┐
  NOTE: GPU kernel reads/writes host memory directly (no cudaMemcpy)
  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  2 MB      249.5 GB/s [ 217.0- 297.1] ± 32.2   91.4%
  8 MB      326.2 GB/s [ 276.6- 360.2] ± 29.0  119.5%
  32 MB     206.4 GB/s [ 195.6- 214.8] ±  7.3   75.6%
  128 MB    214.6 GB/s [ 211.1- 217.5] ±  2.0   78.6%
  512 MB    214.0 GB/s [ 212.6- 215.1] ±  0.9   78.4%
  1024 MB   214.2 GB/s [ 213.3- 214.7] ±  0.5   78.5%
  2048 MB   214.9 GB/s [ 214.1- 215.4] ±  0.5   78.7%

┌─ C2C INTERCONNECT (cache-resident, 5 trials) ───────────┐
  GB10 Cache Hierarchy:
    GPU L2:  24 MB | CPU L3: 24 MB | SLC: 16 MB (shared)
  NOTE: Small buffers stay in caches, bypassing DRAM (~273 GB/s)
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  64 KB       2.5 GB/s [   1.9-   2.8] ±  0.3    0.4% (SLC)
  256 KB      5.4 GB/s [   3.1-   7.1] ±  1.7    0.9% (SLC)
  1 MB        9.6 GB/s [   9.3-   9.9] ±  0.2    1.6% (SLC)
  4 MB       16.1 GB/s [  13.1-  19.2] ±  2.1    2.7% (SLC)
  8 MB       18.4 GB/s [  15.2-  20.1] ±  1.9    3.1% (SLC)
  16 MB      22.5 GB/s [  20.2-  24.1] ±  1.5    3.8% (SLC)
  23 MB      24.4 GB/s [  21.6-  26.6] ±  1.9    4.1% (L2→SLC)

┌─ C2C ASYNC BIDIRECTIONAL (5 trials) ─────────────────┐
  NOTE: Simultaneous H→D and D→H to saturate full-duplex C2C
  Size      Mean      [Min   -   Max]    SD     vs.C2C @1200
  64 KB      29.5 GB/s [  29.3-  30.1] ±  0.3    2.5%
  256 KB     44.2 GB/s [  40.8-  46.1] ±  1.9    3.7%
  1 MB       51.2 GB/s [  46.8-  52.5] ±  2.2    4.3%
  4 MB       56.4 GB/s [  55.5-  56.9] ±  0.5    4.7%
  8 MB       58.1 GB/s [  58.0-  58.2] ±  0.1    4.8%
  16 MB      58.3 GB/s [  56.5-  58.9] ±  0.9    4.9%
  23 MB      58.9 GB/s [  58.9-  58.9] ±  0.0    4.9%

┌─ CACHE-TO-CACHE (L2 persistent, 5 trials) ────────────┐
  Strategy: Pin GPU data in L2, keep CPU data hot in L3
  GPU L2: 24 MB | CPU L3: 24 MB | SLC: 16 MB
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  64 KB      60.8 GB/s [  56.8-  64.3] ±  3.2   10.1%
  256 KB    227.8 GB/s [ 225.6- 229.4] ±  1.2   38.0%
  1 MB      480.5 GB/s [ 473.9- 482.5] ±  3.3   80.1%
  4 MB      881.7 GB/s [ 721.3-1010.4] ± 94.8  146.9%
  8 MB      610.7 GB/s [ 498.6- 657.9] ± 57.3  101.8%
  16 MB     270.1 GB/s [ 263.6- 273.2] ±  3.4   45.0%

┌─ C2C KERNEL ACCESS (GPU→CPU L3, 5 trials) ─────────────┐
  NOTE: GPU kernel directly reads pinned host memory
  Size      Mean      [Min   -   Max]    SD     vs.C2C @600
  64 KB      25.9 GB/s [  21.3-  29.9] ±  3.1    4.3%
  256 KB     77.5 GB/s [  73.6-  84.2] ±  3.6   12.9%
  1 MB      155.5 GB/s [  80.4- 244.7] ± 62.5   25.9%
  4 MB      196.6 GB/s [  80.0- 318.4] ± 92.9   32.8%
  8 MB      246.4 GB/s [ 102.7- 308.6] ± 75.1   41.1%
  16 MB     234.8 GB/s [ 230.9- 236.9] ±  2.0   39.1%

Multi-Node NCCL Networking

The NCCL benchmarks run across two DGX Spark nodes connected via 200GbE. The key question: does data actually flow GPU-to-GPU across the network, or does it stage through CPU memory?

The answer is subtle on unified memory. The benchmarks allocate buffers with cudaMalloc for device pointers -- on the GB10, this allocates from the shared LPDDR5X pool. NCCL then orchestrates transfers via the ConnectX-7 NIC. Because there's no separate GPU memory to "direct" around, the cudaHostAlloc path achieves nearly identical throughput to device pointers. This is the unified memory advantage in action.
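For readers unfamiliar with NCCL, this is roughly the collective being timed, sketched for two ranks; the ncclUniqueId bootstrap (rank 0 generates it and shares it with rank 1 out-of-band) and all error checking are omitted, and the function is illustrative rather than the benchmark's actual code.

#include <cuda_runtime.h>
#include <nccl.h>

// Minimal two-node AllGather: each rank contributes bytesPerRank and ends up
// with both contributions. On the GB10, cudaMalloc draws from the same unified
// LPDDR5X pool a cudaHostAlloc'd buffer would, which is why the two pointer
// types benchmark almost identically.
void allGatherOnce(ncclUniqueId id, int rank, int nRanks, size_t bytesPerRank) {
    ncclComm_t comm;
    ncclCommInitRank(&comm, nRanks, id, rank);     // one rank per Spark

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void *sendBuf = nullptr, *recvBuf = nullptr;
    cudaMalloc(&sendBuf, bytesPerRank);
    cudaMalloc(&recvBuf, bytesPerRank * nRanks);

    ncclAllGather(sendBuf, recvBuf, bytesPerRank, ncclChar, comm, stream);
    cudaStreamSynchronize(stream);                 // wait for the collective to drain

    cudaFree(sendBuf); cudaFree(recvBuf);
    cudaStreamDestroy(stream);
    ncclCommDestroy(comm);
}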

NCCL Multi-Node Benchmarks — 200GbE Performance
[NCCL] Initializing 2 ranks...
[NCCL] All ranks ready

╔═══════════════════════════════════════════════════════════╗
║                   NCCL BENCHMARKS                         ║
╠═══════════════════════════════════════════════════════════╣
║ Network: 200GbE @25.0                                     ║
╚═══════════════════════════════════════════════════════════╝

┌─ AllGather (device pointers) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.Net OK
  2 MB       28.9 GB/s [  27.0-  30.0] ±  1.1  115.4% ✓
  4 MB       32.6 GB/s [  31.8-  33.2] ±  0.6  130.3% ✓
  8 MB       34.4 GB/s [  33.7-  35.1] ±  0.6  137.8% ✓
  32 MB      34.9 GB/s [  34.5-  35.4] ±  0.3  139.8% ✓
  128 MB     40.8 GB/s [  40.1-  41.5] ±  0.5  163.2% ✓
  512 MB     44.2 GB/s [  44.0-  44.3] ±  0.1  176.7% ✓
  1024 MB    44.9 GB/s [  44.8-  44.9] ±  0.1  179.6% ✓
  2048 MB    45.2 GB/s [  45.2-  45.3] ±  0.0  180.9% ✓

┌─ AllGather (host-registered pointers) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.Net OK
  2 MB       27.0 GB/s [  25.5-  28.4] ±  1.0  107.8% ✓
  4 MB       31.2 GB/s [  30.4-  32.2] ±  0.6  124.8% ✓
  8 MB       34.1 GB/s [  32.6-  34.9] ±  0.8  136.4% ✓
  32 MB      35.4 GB/s [  35.1-  35.9] ±  0.3  141.7% ✓
  128 MB     40.8 GB/s [  40.0-  41.6] ±  0.5  163.4% ✓
  512 MB     44.0 GB/s [  43.8-  44.1] ±  0.1  175.8% ✓
  1024 MB    44.7 GB/s [  44.6-  44.8] ±  0.1  178.8% ✓
  2048 MB    45.1 GB/s [  45.0-  45.1] ±  0.0  180.2% ✓

┌─ SendRecv Ring (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.Net OK
  2 MB       30.7 GB/s [  30.1-  31.5] ±  0.5  122.8% ✓
  4 MB       34.9 GB/s [  34.0-  36.0] ±  0.8  139.6% ✓
  8 MB       38.0 GB/s [  37.4-  38.5] ±  0.4  152.0% ✓
  32 MB      37.7 GB/s [  37.3-  38.6] ±  0.5  151.0% ✓
  128 MB     40.4 GB/s [  39.1-  41.7] ±  1.0  161.6% ✓
  512 MB     41.0 GB/s [  40.8-  41.5] ±  0.2  164.1% ✓
  1024 MB    41.0 GB/s [  40.8-  41.3] ±  0.2  164.1% ✓
  2048 MB    41.3 GB/s [  41.2-  41.4] ±  0.1  165.3% ✓

┌─ All-to-All (stresses switch) (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.Net OK
  2 MB       20.9 GB/s [  18.0-  23.4] ±  1.9   83.5% ✓
  4 MB       26.6 GB/s [  24.6-  29.5] ±  1.8  106.5% ✓
  8 MB       33.0 GB/s [  31.2-  34.5] ±  1.2  131.8% ✓
  32 MB      34.5 GB/s [  30.7-  35.8] ±  1.9  137.9% ✓
  128 MB     38.5 GB/s [  36.8-  40.4] ±  1.5  154.1% ✓
  512 MB     39.1 GB/s [  38.1-  39.7] ±  0.6  156.5% ✓
  1024 MB    39.2 GB/s [  38.8-  39.5] ±  0.3  156.7% ✓
  2048 MB    39.0 GB/s [  38.9-  39.2] ±  0.1  155.8% ✓

┌─ ReduceScatter (5 trials) ─────────────────────┐
  Size      Mean      [Min   -   Max]    SD     vs.Net OK
  2 MB       24.4 GB/s [  21.2-  26.7] ±  2.1   97.5% ✓
  4 MB       26.4 GB/s [  23.0-  29.3] ±  2.4  105.5% ✓
  8 MB       30.5 GB/s [  27.6-  32.2] ±  1.6  122.0% ✓
  32 MB      33.5 GB/s [  32.6-  34.2] ±  0.6  133.9% ✓
  128 MB     37.0 GB/s [  36.6-  37.5] ±  0.4  148.1% ✓
  512 MB     42.8 GB/s [  42.5-  43.2] ±  0.3  171.3% ✓
  1024 MB    44.1 GB/s [  43.9-  44.3] ±  0.1  176.3% ✓
  2048 MB    44.8 GB/s [  44.8-  44.9] ±  0.1  179.3% ✓

The Documentation Gap Compounds

When LLMs are trained on coverage that contains errors like the Thunderbolt 5 bandwidth ambiguity, those errors propagate into the next generation of content. The Spark shipped with CUDA 13, but the LLMs available in October 2025 were trained on CUDA 12 documentation. When I asked Claude for help with Grace-specific optimizations, it sometimes provided outdated guidance or claimed that I had made a typo in my versioning. Reviewer's guides prevent this kind of error propagation by establishing ground truth at the source, but I don't believe a reviewer's guide was made available.

A Taste of the Ecosystem

The DGX Spark isn't NVIDIA's fastest hardware. It's not trying to be. What it offers is a small taste of the NVIDIA ecosystem and a desktop-scale introduction to the software stack, networking architecture, and design philosophy that powers datacenter AI.

The compute story has nuance. Yes, raw throughput is modest compared to discrete GPUs with dedicated HBM or something like an export-controlled gaming RTX 5090. But when workloads stay cache-resident and you take advantage of unified memory, it's fast enough for many single-user applications. The unified memory architecture trades peak bandwidth for simplicity and capacity. Different tradeoffs for different problems.

Cache effects tell an interesting architectural story. The RTX 5090's massive 96 MB L2 cache achieves 4.6 TB/s at 32 MB buffer sizes, spectacular numbers that dwarf even its 1.6 TB/s sustained DRAM bandwidth. The DGX Spark's 24 MB L2 hits its sweet spot earlier, at 4 MB, reaching 882 GB/s.

Normalized per streaming multiprocessor, the picture shifts. The RTX 5090 achieves 27 GB/s per SM (4,582 GB/s ÷ 170 SMs). The Spark reaches 18 GB/s per SM (882 GB/s ÷ 48 SMs)--about two-thirds the per-SM throughput. The absolute 5× difference in total bandwidth comes primarily from SM count (170 vs 48), not cache architecture inefficiency. The Spark's unified memory also means cache-resident workloads don't require PCIe staging.

The 200GbE Story

The networking capability deserves more attention than it receives, which is why I'm repeating it here. At 45+ GB/s of measured bidirectional NCCL throughput, a two-node Spark cluster achieves bandwidth comparable to a RAID 0 array of PCIe 5.0 NVMe drives. A single PCIe 4.0 NVMe drive tops out around 7 GB/s; the DGX Spark's inter-node link delivers over 6× that, across the network.

This matters because it previews where the ecosystem is heading. ConnectX-7 at 200GbE is already impressive. ConnectX-9, announced for future platforms, targets 800GbE! Sixth-generation NVLink, coming with the Rubin generation of datacenter systems, targets 3.6 TB/s aggregate. The Spark lets developers learn NCCL collective patterns and distributed training workflows at desktop scale, then transfer that knowledge directly to infrastructure orders of magnitude faster. What's missing is a desktop-sized 4-port 200 GbE switch that would let users scale their DGX Spark work up to 4 nodes.

Understanding the Carmack Critique

John Carmack's criticism deserves careful consideration, not dismissal. His software career, from Commander Keen (which I loved as a kid!) through Quake III's revolutionary BSP trees and Bezier patches to Doom 3's unified lighting, was defined by pushing hardware to absolute limits. When Carmack says he's getting half the expected compute, he's measuring what he cares about: the compute needed for the AGI research he's pursuing at Keen Technologies. For that workload profile, his criticism is valid. AGI research involves massive dense computations that won't benefit from sparsity optimizations. The Spark's thermal envelope, compact chassis, and power ceiling are all constraints that matter for Carmack's use case.

For sustained shader compute, I ran Radiance, a raymarching benchmark that stresses FP32 throughput without RT Cores or Tensor Cores. The DGX Spark scored 590 points versus 2,370 on the RTX 5090--roughly 4× slower, consistent with the SM count ratio (48 vs 170) rather than thermal throttling.

But this reveals the expectations gap. NVIDIA marketed "1 petaflop" without a technology demonstration showing how to achieve it. No reviewer's guide explaining which workloads hit that number. No reference implementation. No benchmark suite. The spec is technically accurate under specific conditions that weren't clearly communicated. People who need that performance, like John Carmack, will come away disappointed. Others, who need Rolls-Royce levels of "sufficient" power on a constrained budget, along with robust developer tools, will have a different view.

The Missing Demo

In the early era of 3D gaming, people wondered if 30 fps or 60 fps actually mattered. Features like bump mapping or src*dst texture math to create "colored lighting" effects needed to be seen to be understood. When ray tracing cores were introduced in the RTX 20 series, the Star Wars RTX demo was incredible. These demos let reviewers and developers see exactly what the hardware could do. The DGX Spark launched without equivalent demonstrations for its AI capabilities. How do you hit 1 petaflop? What workloads benefit from sparsity? How should developers structure code to maximize cache residency? These questions have answers, but the answers weren't shipped with the product.

This isn't unique to DGX. Visit NVIDIA's GeForce tech demo download page today and the most recent entries are from April 2019: Reflections RTX, Atomic Heart RTX, Justice RTX. All graphics demos, all six years old. The consumer side stopped producing downloadable capability demonstrations years ago. CES and GTC keynotes showcase what partners are building and NVIDIA open models, but that's different from "here's something that shows what YOUR hardware can do." Marbles RTX and Racer RTX exist but require installing Omniverse and creating developer accounts, a far cry from downloading Dawn.exe and watching the fairy unfold her wings. When the consumer GPU division stopped producing simple, downloadable demos, the AI and enterprise divisions had no template to follow.

NVIDIA did provide the DGX Spark Playbooks, tutorials covering multi-agent chatbots, RAG applications, ComfyUI, and other AI workflows. But examine them closely: most are platform-agnostic. Open WebUI with Ollama? Runs on a MacBook. ComfyUI for image generation? Works on AMD. VS Code setup? Generic. LLaMA Factory fine-tuning? Cross-platform. The genuinely NVIDIA-specific playbooks--CUDA-X data science with cuML, TensorRT-LLM inference, NVFP4 quantization--exist but don't showcase what makes the Spark special. There's no playbook demonstrating 200GbE hitting 45+ GB/s bidirectional and then running the same comparison over a 10 GbE cable. No tutorial on unified memory eliminating staging buffers. No walkthrough of cache-to-cache transfers reaching 882 GB/s. The playbooks teach workflows, but they don't demonstrate unique capabilities. Some read like simple extensions of a user manual.

The proof that this approach was insufficient? The customer feedback itself. Not because the hardware was bad, but because the documentation didn't bridge the gap between discrete GPU patterns and unified memory realities. The playbooks existed; the capability demonstrations didn't.

Enterprise products at consumer prices need consumer-grade onboarding, a capability NVIDIA's consumer division stopped developing years ago.

This isn't a criticism of intent. NVIDIA now spans gaming, datacenter AI, automotive, robotics, healthcare imaging, drug discovery, digital twins, climate simulation, and embodied AI. They are even in quantum computing. Who knows what else is in top-secret development? Organizational bandwidth doesn't scale linearly with market capitalization. The institutional knowledge of how to build compelling demos is gone. The people who made Dawn and Luna and the GeForce 3 chameleon have moved on to other projects or retired. Different teams with different priorities. The gap is real, but it's likely a reflection of breadth, not neglect. Today's tech demos are also harder because they aren't about graphics. OpenAI's ChatGPT was the ultimate tech demo, and look at where we are now.

Ironically, the playbooks may have served education better than intended. Some early versions shipped with bugs: dependency issues, configuration problems, the usual launch-day friction. Working through those bugs forced users like me to understand the underlying systems. Debugging is how many of us learned to program in the first place. Starting with the Multi-Agent Chatbot and RAG playbooks, then hitting walls, is what convinced me to build OrthoSystem from scratch. The frustration became motivation; the debugging became training.

A Category Problem Leads To a Marketing Challenge

What benchmark suite do you use for hardware that spans every NVIDIA product category?

The Spark attempts to be an inference platform (Llama 405B, Flux, RAG), a training system (fine-tuning, LoRA), a CUDA compute node (UMAP, simulations), a networking endpoint (bidirectional 200 GbE NCCL hitting 45 GB/s), and yes, even a gaming machine. No single reviewer has expertise across all these domains. A gaming journalist evaluates differently than an ML engineer. An HPC specialist cares about things an AI researcher ignores.

The proverb applies: Jack of all trades, master of none.

But most people only know the first half. The complete saying runs: Jack of all trades, master of none--but oftentimes better than a master of one.

For specialists with narrow requirements, purpose-built solutions may serve better. For generalists whose work crosses boundaries (the radiologist experimenting with segmentation models, the computational biologist running molecular simulations, the security researcher using AI to analyze malware), the Spark is the machine that doesn't force you to choose.

"The researcher who needs to work with sensitive data, run inference, visualize embeddings with GPU-accelerated UMAP, and scale up via NCCL just needs the DGX Spark."

These benchmarks reveal something beyond raw performance: they expose how assumptions built for one architecture don't transfer to another. The mental models that made sense for discrete GPUs—staging buffers, pinned memory optimization, PCIe bottleneck awareness—become obstacles when evaluating unified memory. This isn't a documentation problem. It's a translation problem.

A Translation Failure, Not a Documentation Failure

In the beginning of this review, I said this was a story about how we communicate complex technology, how we train experts, and how we preserve generational knowledge. The DGX Spark's launch crystallized all three.

Consider NVIDIA's official FAQ: "For performance reasons, specifically for CUDA contexts associated to the iGPU, the system memory returned by the pinned device memory allocators (e.g. cudaMalloc) cannot be coherently accessed by the CPU complex nor by I/O peripherals like PCI Express devices. Hence the GPUDirect RDMA technology is not supported, and the mechanisms for direct I/O based on that technology, for example nvidia-peermem (for DOCA-Host), dma-buf or GDRCopy, do not work."

NVIDIA Developer Forums - DGX Spark / GB10 FAQ showing GPUDirect RDMA is not supported
Source: forums.developer.nvidia.com/t/dgx-spark-gb10-faq/347344

This is technically accurate. It's also misleading. My benchmarks show 45+ GB/s GPU-to-GPU transfers over 200GbE, which is wire speed, sustained, bidirectional. The FAQ describes what doesn't work; it never explains what does.

When you read that FAQ with a discrete-GPU mental model, the brain interprets it as: "no GPUDirect RDMA means poor performance." The unified memory reality is different: cudaHostAlloc buffers already live in GPU-accessible address space, so you don't need GPUDirect because there's nothing to direct around, and there is no performance penalty moving memory from GPU 1 across the network to GPU 2 on the GB10. The NCCL benchmarks prove it: device pointers and host-registered pointers achieve nearly identical throughput because, on unified memory, the distinction barely matters.
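Here is a minimal sketch of what "nothing to direct around" looks like in code: a GPU kernel summing a buffer that was allocated with cudaHostAlloc, accessed directly with no cudaMemcpy and no staging. The kernel, sizes, and single-block launch are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread strides over the host-allocated buffer and accumulates into *out.
__global__ void sumKernel(const float* data, size_t n, float* out) {
    float local = 0.0f;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        local += data[i];                          // direct access, no device copy
    atomicAdd(out, local);
}

int main() {
    const size_t n = 1 << 20;
    float *hostBuf = nullptr, *result = nullptr;
    cudaHostAlloc((void**)&hostBuf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&result, sizeof(float), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) hostBuf[i] = 1.0f;
    *result = 0.0f;

    // The host pointers are passed to the kernel as-is: the GPU reads the same memory.
    sumKernel<<<1, 256>>>(hostBuf, n, result);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %zu)\n", (double)*result, n);

    cudaFreeHost(hostBuf); cudaFreeHost(result);
    return 0;
}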

This is a translation failure, not a documentation failure. The words are correct. The mental model required to interpret them correctly is increasingly rare.

The generation of NVIDIA engineers who built technology demonstrations in the early GeForce days, who understood that communication means building bridges between what experts know and what users need to learn, have largely moved on. The generation of computer science students trained entirely on Python and cloud abstractions never learned why memory pinning matters, so they can't recognize when it doesn't. And the generation of surgeons who learned orthopaedics through open approaches and complications are retiring, taking with them the judgment that no textbook captures.

These are the same problem. They just wear different clothes.

Personal Thoughts: Knowledge Transfer Across Generations

When I covered NVIDIA in the late 1990s and early 2000s, the company had one and then just two markets: gaming GPUs and professional visualization. Reviewer's guides were comprehensive because the surface area was manageable. We received documentation explaining architectures, not just specifications. We had technology demonstrations: G-rated tech demos designed specifically to showcase features.

The last memorable tech demo was the RTX 20 series Star Wars collaboration. That was 2018. Today's best showcases are M-rated games like Cyberpunk 2077--technically impressive but unusable for family-friendly marketing.

Meanwhile, NVIDIA's scope exploded: gaming, datacenter AI, automotive, robotics, healthcare imaging, drug discovery, digital twins, climate simulation, embodied AI, desktop AI. Organizational bandwidth doesn't scale linearly with market capitalization. Something has to give, and what gives is depth of communication in any single product category. Give people a ready-to-run version of Nemotron Nano in NVFP4 and BF16, with sample prompts that showcase the performance boost and how little (if at all) quality degrades. NVIDIA has TRT-LLM and in-house LLMs -- no new IP needed, just a better demo.

The CS Education Shift

I see related patterns in how programmers are trained today. Having students work with me on research, I've come to understand that you can earn a CS minor relying entirely on Python, never understanding compilation, memory management, or hardware architecture. The Verge reported that engineering students had trouble with files and folders because they have grown up in a world of "search." Think about companies like Athos Silicon. IRQs (interrupt requests) are something elder millennials and Gen X'ers might remember from the DOS days of configuring sound cards and avoiding conflicts. But interrupt handling has profound relevance for modern autonomous systems, and a world trained on abstracted computing won't appreciate the challenges of modern AI tools where real-time response matters.

When a sensor detects an obstacle, the system must respond within microseconds, not whenever the scheduler gets around to it. Traditional operating systems introduce latency through context switches, kernel overhead, and unpredictable scheduling. Hard real-time systems need deterministic response times. You can do a lot of work to optimize this, and it's no surprise that François Piednoël, the former Performance/Principal System Architect at Intel who spent years optimizing exactly these kinds of latency-critical paths, is the co-founder and CTO of Athos, working on this problem.

This matters directly for surgical robotics. When the anesthesiologist reports a sudden physiological change, or neurological electrophysiology monitoring shows an unexpected signal, the robotic system must respond in real-time to inputs, potentially outside the visual field of view of the surgery. That's not a Python problem or a CUDA problem. That's bare-metal interrupt handling, and the knowledge required to implement it correctly seems to be increasingly rare.

The Medical Parallel: Divergence, Not Convergence

Software developers face a knowledge transfer crisis, but they have AI coding assistants to help bridge gaps. My experience with the OpenMP bug reflected both my lack of formal CS training and the gaps in documentation. Three of the leading AI coding assistants couldn't help at the time I was building the benchmark, but now that it's documented in NVIDIA's FAQ as a result of this review, future LLMs will handle the question easily. Medicine faces the same crisis without that tooling available (yet!).

In 2003, duty hour restrictions capped resident work weeks at 80 hours, down from the 100+ that previous generations endured. The changes were necessary: tired physicians make mistakes, and residents were dying in car accidents driving home from 36-hour shifts. Patient safety improved. Burnout decreased. These are genuine wins.

But duty hour restrictions didn't reduce the amount of knowledge required to become a competent surgeon. They reduced the time available to acquire it.

And here's what makes knowledge transfer increasingly difficult: medicine doesn't converge toward simplicity. It diverges toward complexity. In the 1950s, when Osgood was developing the measurement of meaning that would provide the basis of today's LLMs, orthopaedic surgery was relatively straightforward: a handful of approaches to common problems. When total hip replacement arrived in 1962, there was one set of materials and one implant design. Today we have multiple implant choices and surgical approaches for the same anatomy: spine surgery with dozens of fusion and motion-sparing techniques, joint replacement with hundreds of implant systems, sports medicine with arthroscopic procedures that didn't exist thirty years ago. Each subspecialty already consumes a career, and each patient has "one opportunity" to get the best individualized plan. We don't always converge toward consensus. We discover nuance. Each advance creates new decision points. The residents I train must learn more than I learned, in less time, with fewer cases.

My own practice has shifted increasingly toward minimally invasive techniques, genuine wins for patients with faster recovery and less tissue trauma. But this means residents have fewer opportunities to see, learn, and perform the "big spine surgeries." That open surgical experience is precisely what you need to manage complications when minimally invasive approaches fail. The experts who trained me are retiring. Their institutional knowledge exists primarily in their heads. The subtle factors that influence surgical decisions, the complications that textbooks don't describe, the judgment that accumulates over thousands of cases. We're running out of time to extract it.

Why OrthoSystem Exists

This is the driving force behind OrthoSystem.ai.

The goal isn't to replace surgical training with AI. Nothing replaces hands-on experience in the operating room. The goal is to use AI to accelerate the book-learning part of orthopaedics: to take the lectures and PowerPoints of seasoned clinicians, capture them, and translate them into the way learners learn today. The better foundation a resident has in the core principles of anatomy, biomechanics, and clinical reasoning, the faster they learn when they're actually operating. Healing a problematic fracture involves weighing bioactive factors, responding cells, cellular scaffolds, adequate blood supply, and mechanical stability -- not unlike balancing API overheads, compute power, memory bandwidth, and cache misses when reasoning about computing performance.

A resident who deeply understands why a particular approach works for a particular patient will learn more from each case than one who's just following protocols or repeating what they saw their mentor doing. AI can help build that understanding.

The DGX Spark enables this because it runs locally, with sensitive data, such as morbidity and mortality conferences, that can't leave institutional boundaries. Running things locally also simplifies the issue of using copyrighted material for the knowledge database -- OrthoSystem is the bookshelf and librarian, and users fill the shelves with their own material, not material that I provide. Over time, patient information, proprietary techniques, and other knowledge can be added. The cloud model doesn't work here. A desktop supercomputer that processes all of this locally makes possible applications that cloud architectures cannot support.

Final Verdict: NVIDIA DGX Spark Founders Edition

I bought my first DGX Spark on October 14th and my second on October 23rd. That's the review.

The hardware handles everything OrthoSystem requires for development: embedding generation, reranking, multi-node inference, serving quantized models at interactive speeds. The software expertise and clinical expertise required to build it? Lives in my unified memory. (grin)

The limiting factor is time. Bandwidth is constrained when development happens on nights and weekends without dedicated research time. And latency isn't a luxury we have in surgical training. The mentors who trained me--the generation of surgeons who carry institutional knowledge accumulated over thousands of cases, the judgment that textbooks don't capture--have perhaps five to ten years before retirement. Capturing what they know means residents need to acquire textbook knowledge faster, so they can learn more from every case in the OR. The window is measured in years, not decades.

Conclusion

James Cameron, speaking to NASA about space exploration 20 years ago, articulated the communication challenge: "What can we do better to reach out to the public? Two things. One, tell the story better. Two, have a better story to tell."

NVIDIA has a remarkable story with the DGX Spark. My benchmarks show genuine strengths: 213 GB/s DRAM bandwidth, 882 GB/s at the cache-to-cache sweet spot, 45+ GB/s inter-node bidirectional networking, and unified memory that eliminates optimization complexity. The early criticism reflected real problems that real users experienced—and also the challenge of evaluating unified memory with mental models built for discrete GPUs.

The opportunity is building the documentation, demonstrations, and educational pathways that help users succeed. That work benefits from community contribution alongside vendor investment. Every Spark user who shares what they learn contributes to collective understanding.

After three months, I'm convinced the hardware delivers. The question is whether NVIDIA can tell the story well enough to realize its potential--for developers, for researchers, for patients, and for the continuity of knowledge that civilization requires.


Alan B.C. Dang, MD

Health Sciences Clinical Professor of Orthopaedic Surgery, UCSF. Staff Surgeon, San Francisco VA Health Center. Former GPU hardware reviewer, Thresh's FiringSquad and Tom's Hardware.

Building OrthoSystem on nights and weekends. Open to collaborations that let me do this during business hours.