// gpu cluster network validation

Your cluster passed delivery.
Does it pass production?

Peer-level NCCL and fabric validation for neoclouds, sovereign AI buyers, and enterprise AI labs. 25 years of hyperscale networking at Google, Meta, Cisco, and Juniper — applied to your cluster before your first training run.

Book a discovery call View engagements

nccl-validation — all_reduce_perf — 512×H100 SXM5

// the cost of not knowing

Every hour of degraded goodput has a dollar figure.

A 3,000-GPU cluster at $2/hr costs $6,000 per hour to run. A cluster delivering 65% of theoretical bandwidth instead of 92% wastes 27% of every GPU-hour — silently, until someone measures it.

$6K

Per hour of cluster downtime

3,000-GPU cluster at $2/hr. Two hours of unplanned downtime = $12K. A misconfigured fabric event = days.

58.7%

Of Llama 3.1 training failures were GPU issues

Meta's published data from a 54-day training run. Hardware issues are the leading cause — most caught too late.

92%

Theoretical bandwidth is the target

Industry standard: all_reduce_perf should hit ≥370 GB/s on a 400 Gbps fabric. Most unvalidated clusters don't.

72 hrs

Before thermal runaway was detected

One documented cluster failure: $28M in hardware damage from inadequate pre-production testing. A 4-hour stress test wasn't enough.

// engagements

Three engagements. One validator.

Fixed scope, fixed price, named deliverables. No hourly billing, no scope creep, no junior consultants. You work directly with Larkland Cox on every engagement.

GAT · 5 days

Cluster acceptance test

$20K

Flat fee · delivery imminent

Node-level GPU + NVLink validation via DCGM and nvbandwidth
Intra-node NCCL benchmark sweep
Full-cluster inter-node fabric validation (IB or RoCEv2)
RAG-scored acceptance report with signed recommendation
No-access model available (log-bundle option)

GTU · 10 days

Cluster tune-up

$35K

Flat fee · underperformance confirmed

Baseline audit + NCCL_DEBUG log analysis
Root cause isolation via binary bisection methodology
Fabric remediation: PFC, ECN, NCCL env vars, GPUDirect RDMA
Pre/post goodput benchmark comparison
Documented remediation runbook with rollback paths

GHA · annual

Health audit retainer

$40–80K

Per year · post first engagement

Quarterly NCCL regression + RYG health report
Firmware drift monitoring + watchlist alerts
Annual drift analysis with validated stack update
48-hr incident response SLA (2 incidents/yr included)
Priority scheduling on new engagements

// how it works

From discovery call to signed report in two weeks.

Every engagement follows the same four steps. No discovery retainer, no requirements phase, no project kickoff theater.

STEP 01

Discovery call

30 minutes. You describe your cluster profile, workload, timeline, and any symptoms. I confirm the right engagement and send a SOW within 24 hours.

STEP 02

SOW + access

Countersign the SOW, provide SSH access or run the pre-built log-bundle script. No complex onboarding. Engagement starts within your specified window.

STEP 03

Validation runs

I run the full toolchain — DCGM, nccl-tests, nvbandwidth, ibdiagnet, NCCL_DEBUG analysis — against your specific hardware and fabric. No templated checklists.

STEP 04

Signed report

RAG-scored report with every benchmark result, every anomaly, every remediation recommendation — and a signed acceptance recommendation your team can act on.

// book a call

What's your cluster doing right now?

If you have a delivery window coming up, a cluster that's underperforming, or a production cluster that's never been regression-tested — send a note. I'll tell you in one call whether there's a problem and what it would take to fix it.

NAME

COMPANY

NOTE

Or reach out on LinkedIn · Response within 24 hours

Your cluster passed delivery. Does it pass production?

Every hour of degraded goodput has a dollar figure.

Three engagements. One validator.

From discovery call to signed report in two weeks.

What's your cluster doing right now?

Your cluster passed delivery.
Does it pass production?