// gpu cluster network validation

Your cluster passed delivery.
Does it pass production?

Peer-level NCCL and fabric validation for neoclouds, sovereign AI buyers, and enterprise AI labs. 25 years of hyperscale networking at Google, Meta, Cisco, and Juniper — applied to your cluster before your first training run.

Book a discovery call · View engagements
hyperscale background
Google
Meta
Cisco
Juniper Networks
Nortel
25 years networking
InfiniBand · RoCEv2 · NVLink
// the cost of not knowing

Every hour of degraded goodput has a dollar figure.

A 3,000-GPU cluster at $2 per GPU-hour costs $6,000 per hour to run. A cluster delivering 65% of theoretical bandwidth instead of 92% leaves 27 percentage points of fabric bandwidth unused, and every communication-bound step pays for it silently until someone measures it.
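The arithmetic above can be sketched in a few lines. This is a rough model, not a billing tool: the function names are illustrative, and `comm_fraction` (the share of each training step spent in non-overlapped communication) is an assumption that does not come from the figures on this page.

```python
def cluster_cost_per_hour(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Hourly run cost of the whole cluster in USD."""
    return gpu_count * usd_per_gpu_hour


def wasted_usd_per_hour(cost_per_hour: float, target_bw: float,
                        measured_bw: float, comm_fraction: float) -> float:
    """Estimate dollars lost per hour to slower collectives.

    Assumption (hypothetical model): `comm_fraction` of each step is
    non-overlapped communication at the target bandwidth, and that time
    scales inversely with the bandwidth actually delivered.
    """
    # Step time relative to a cluster that hits the target bandwidth.
    slow_step = (1.0 - comm_fraction) + comm_fraction * (target_bw / measured_bw)
    # Fraction of each paid hour that produces no extra work.
    return cost_per_hour * (1.0 - 1.0 / slow_step)


cost = cluster_cost_per_hour(3000, 2.0)
print(f"run cost: ${cost:,.0f}/hr")
print(f"two hours of downtime: ${2 * cost:,.0f}")
# comm_fraction=0.30 is an illustrative guess, not a measured value.
loss = wasted_usd_per_hour(cost, target_bw=0.92, measured_bw=0.65, comm_fraction=0.30)
print(f"estimated loss at 65% vs 92%: ${loss:,.0f}/hr")
```

The real loss depends on how much of the workload is communication-bound and how well it overlaps with compute, which is why it has to be measured rather than assumed.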

$6K
Per hour of cluster downtime
3,000-GPU cluster at $2 per GPU-hour. Two hours of unplanned downtime = $12K. A misconfigured-fabric event can cost days.
58.7%
Of unexpected interruptions in Llama 3.1 training were GPU issues
Meta's published data from a 54-day training run. Hardware issues are the leading cause — most caught too late.
92%
Theoretical bandwidth is the target
Common target: all_reduce_perf bus bandwidth of ≥370 GB/s per node on a 400 Gbps fabric (about 92% of the 400 GB/s line rate, assuming eight NICs per node). Most unvalidated clusters don't reach it.
72 hrs
Before thermal runaway was detected
One documented cluster failure: $28M in hardware damage from inadequate pre-production testing. A 4-hour stress test wasn't enough.
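For the 92% target above, here is a minimal sketch of where a figure near 370 GB/s can come from. It assumes eight 400 Gbps NICs per node, which is not stated in the figures themselves, and the function names are illustrative.

```python
TARGET_FRACTION = 0.92  # target from the text above


def node_line_rate_gb_s(link_gbps: float, nics_per_node: int) -> float:
    """Aggregate per-node line rate in GB/s (8 bits per byte)."""
    return link_gbps / 8.0 * nics_per_node


def bandwidth_fraction(measured_busbw_gb_s: float, link_gbps: float = 400.0,
                       nics_per_node: int = 8) -> float:
    """Measured bus bandwidth as a fraction of line rate.

    nics_per_node=8 is an assumption; adjust it to the actual node design.
    """
    return measured_busbw_gb_s / node_line_rate_gb_s(link_gbps, nics_per_node)


def meets_target(measured_busbw_gb_s: float) -> bool:
    """True when the measured value reaches the 92% target."""
    return bandwidth_fraction(measured_busbw_gb_s) >= TARGET_FRACTION


line_rate = node_line_rate_gb_s(400.0, 8)
print(f"line rate: {line_rate:.0f} GB/s")
print(f"92% of line rate: {TARGET_FRACTION * line_rate:.0f} GB/s")
print("370 GB/s meets target:", meets_target(370.0))
print("260 GB/s meets target:", meets_target(260.0))
```

Under these assumptions, 260 GB/s corresponds to the 65% case described earlier.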
// engagements

Three engagements. One validator.

Fixed scope, fixed price, named deliverables. No hourly billing, no scope creep, no junior consultants. You work directly with Larkland Cox on every engagement.

GAT · 5 days
Cluster acceptance test
$20K
Flat fee · delivery imminent
  • Node-level GPU + NVLink validation via DCGM and nvbandwidth
  • Intra-node NCCL benchmark sweep
  • Full-cluster inter-node fabric validation (IB or RoCEv2)
  • RAG-scored (red/amber/green) acceptance report with signed recommendation
  • No-access model available (log-bundle option)
GHA · annual
Health audit retainer
$40–80K
Per year · post first engagement
  • Quarterly NCCL regression + RAG health report
  • Firmware drift monitoring + watchlist alerts
  • Annual drift analysis with validated stack update
  • 48-hr incident response SLA (2 incidents/yr included)
  • Priority scheduling on new engagements
// how it works

From discovery call to signed report in two weeks.

Every engagement follows the same four steps. No discovery retainer, no requirements phase, no project kickoff theater.

STEP 01
Discovery call
30 minutes. You describe your cluster profile, workload, timeline, and any symptoms. I confirm the right engagement and send a SOW within 24 hours.
STEP 02
SOW + access
Countersign the SOW, then provide SSH access or run the pre-built log-bundle script. No complex onboarding. Engagement starts within your specified window.
STEP 03
Validation runs
I run the full toolchain — DCGM, nccl-tests, nvbandwidth, ibdiagnet, NCCL_DEBUG analysis — against your specific hardware and fabric. No templated checklists.
STEP 04
Signed report
RAG-scored report with every benchmark result, every anomaly, every remediation recommendation — and a signed acceptance recommendation your team can act on.
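As an illustration of what RAG (red/amber/green) scoring of a benchmark result can look like, here is a minimal sketch. The green threshold reuses the 92% target from earlier on this page. The amber threshold, the class, and the function names are hypothetical and are not the scoring rules used in the actual reports.

```python
from dataclasses import dataclass

GREEN_MIN = 0.92  # target from the text above
AMBER_MIN = 0.80  # hypothetical threshold, chosen only for illustration


@dataclass
class BenchmarkResult:
    name: str
    measured_gb_s: float
    theoretical_gb_s: float


def rag_score(result: BenchmarkResult) -> str:
    """Map measured/theoretical bandwidth to RED, AMBER, or GREEN."""
    ratio = result.measured_gb_s / result.theoretical_gb_s
    if ratio >= GREEN_MIN:
        return "GREEN"
    if ratio >= AMBER_MIN:
        return "AMBER"
    return "RED"


def overall_score(results: list[BenchmarkResult]) -> str:
    """The overall score is the worst individual score."""
    order = {"GREEN": 0, "AMBER": 1, "RED": 2}
    return max((rag_score(r) for r in results), key=order.__getitem__)


# Example values are made up for illustration.
results = [
    BenchmarkResult("all_reduce_perf, rack A", 370.0, 400.0),
    BenchmarkResult("all_reduce_perf, rack B", 330.0, 400.0),
    BenchmarkResult("all_reduce_perf, rack C", 260.0, 400.0),
]
for r in results:
    print(r.name, rag_score(r))
print("overall:", overall_score(results))
```

Taking the worst score as the overall result is one possible design choice: a single degraded rack can hold back a whole training job.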
// book a call

What's your cluster doing right now?

If you have a delivery window coming up, a cluster that's underperforming, or a production cluster that's never been regression-tested — send a note. I'll tell you in one call whether there's a problem and what it would take to fix it.

Or reach out on LinkedIn · Response within 24 hours