Kedios Infrastructure Report

B300 Farm — Infrastructure & Supporting Systems

Date: March 30, 2026

Scope: Cooling, power distribution, storage integration, and supporting-system planning for the 32-node Kedios B300 package under the refreshed network baseline

Companion to: B300 Farm and Network Tower Architecture Specification


Overview

The compute side of the Kedios package remains a 32-node B300 farm. The infrastructure story now needs to support a network baseline aligned to the 72-node standard around that 32-node compute population. That means:

  • compute-side thermal and rack-power numbers remain valid where they are tied directly to the 32 deployed servers
  • facility allocation increases from 1.0 MW to 1.5 MW
  • the current package remains constrained by the retained 20 kW/rack rule, which caps the 34-rack package at 680 kW
  • the refreshed network stack adds SN5610, SN4700, SN2201, and a second UFM node, so exact all-network heat and power totals must be finalized through the refreshed procurement BOM rather than guessed in this summary

1. Cooling Infrastructure

1.1 Validated Compute-Side Heat Anchors

Source                            Value      Status
32 compute racks sustained        ~464 kW    Still valid
32 compute racks burst peak       ~492 kW    Still valid
Historical legacy package peak    ~530 kW    Last fully validated pre-refresh all-in package figure
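
For quick reference, the per-rack implication of those anchors works out as below. This is a minimal sketch that assumes the load is spread evenly across the 32 compute racks; the figures themselves are only the validated anchors above.

```python
# Minimal sketch: per-rack heat implied by the validated compute-side anchors,
# assuming an even spread across the 32 compute racks.
COMPUTE_RACKS = 32
SUSTAINED_KW = 464    # ~464 kW sustained (validated anchor)
BURST_KW = 492        # ~492 kW burst peak (validated anchor)
RACK_LIMIT_KW = 20    # retained per-rack hard limit

sustained_per_rack = SUSTAINED_KW / COMPUTE_RACKS   # 14.5 kW
burst_per_rack = BURST_KW / COMPUTE_RACKS           # ~15.4 kW

print(f"Sustained per rack: {sustained_per_rack:.1f} kW "
      f"({RACK_LIMIT_KW - sustained_per_rack:.1f} kW below the 20 kW limit)")
print(f"Burst per rack:     {burst_per_rack:.1f} kW "
      f"({RACK_LIMIT_KW - burst_per_rack:.1f} kW below the 20 kW limit)")
```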

1.2 Cooling Architecture

The strongest repo-backed cooling basis remains:

  • contained front-to-back air cooling
  • hot/cold aisle containment
  • 4 total CRAC/CRAH units
  • 3 active + 1 standby for N+1 coverage
  • 650+ kW installed cooling capacity

That remains the correct cooling statement for the package at this stage. It is also the wording that should feed questionnaire row E13 and related external material.
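
As a coarse cross-check, the installed cooling capacity can be set against the validated heat anchors. The sketch below treats 650 kW as total installed capacity and only restates figures from this report; how that capacity divides across the 4 CRAC/CRAH units, and therefore what the 3-active configuration actually delivers, is not stated here and should be confirmed against unit ratings.

```python
# Minimal sketch: installed cooling capacity vs. validated heat anchors.
# NOTE: 650 kW is taken as total installed capacity; the per-unit split
# (and hence exact N+1 coverage) is not specified in this report.
INSTALLED_COOLING_KW = 650
ANCHORS_KW = {
    "compute sustained": 464,
    "compute burst peak": 492,
    "historical all-in peak": 530,
}

for label, load in ANCHORS_KW.items():
    print(f"{label}: {load} kW, headroom {INSTALLED_COOLING_KW - load} kW")
```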

1.3 Cooling Rule Under The Refresh

Do not convert the new 1.5 MW allocation into a claim that the current package needs 1.5 MW of cooling today. The correct cooling interpretation is:

  1. the facility envelope is larger
  2. the current rack-constrained package is still much smaller
  3. the refreshed network stack needs a later BOM-based heat roll-up before publishing a new all-in thermal total

2. Power Distribution Infrastructure

2.1 Power Framing

Item                                   Value
Total facility power capacity          20 MW
Kedios allocation                      1.5 MW
Per-rack hard limit                    20 kW
Current package rack ceiling           680 kW
Current 32 compute racks sustained     ~464 kW
Current 32 compute racks burst peak    ~492 kW
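
The two layers in that table can be kept straight with a trivial consistency check. The sketch below only restates the framing figures; nothing here is a new power claim.

```python
# Minimal sketch: the two power layers from the framing table.
FACILITY_ALLOCATION_KW = 1500   # 1.5 MW Kedios allocation (facility envelope)
OCCUPIED_RACKS = 34             # current contracted package
RACK_LIMIT_KW = 20              # per-rack hard limit

package_ceiling_kw = OCCUPIED_RACKS * RACK_LIMIT_KW   # rack-constrained ceiling
assert package_ceiling_kw == 680

# The ceiling sits well inside the allocation; the two figures answer
# different questions and must be reported as separate layers.
print(f"Rack-constrained package ceiling:  {package_ceiling_kw} kW")
print(f"Facility allocation:               {FACILITY_ALLOCATION_KW} kW")
print(f"Unused envelope above the ceiling: {FACILITY_ALLOCATION_KW - package_ceiling_kw} kW")
```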

2.2 Required Wording Rule

Use both of these statements together in all planning and questionnaire material:

  • Facility allocation / available envelope: 1.5 MW
  • Current 34-rack package hard ceiling under existing rack assumptions: 680 kW

If those two layers are blurred together, the power story becomes internally inconsistent.

2.3 PDU and Feed Model

The existing rack-power architecture still applies:

  • dual-feed PDU-A / PDU-B model
  • one A-feed and one B-feed path per occupied rack
  • per-rack monitoring retained as the preferred operating model

Minimum current-package planning count:

Item                               Count
Occupied racks                     34
Minimum rack PDUs at 2 per rack    68
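
A minimal sketch of that count, assuming exactly one A-feed and one B-feed PDU per occupied rack (the rack identifiers below are hypothetical placeholders, not the facility's naming):

```python
# Minimal sketch: minimum PDU count under the dual-feed A/B model.
# Rack identifiers are hypothetical placeholders.
OCCUPIED_RACKS = 34

pdu_plan = {f"rack-{i:02d}": ("PDU-A", "PDU-B") for i in range(1, OCCUPIED_RACKS + 1)}
total_pdus = sum(len(feeds) for feeds in pdu_plan.values())

assert total_pdus == 68   # 34 racks x 2 feeds per rack
print(f"Occupied racks: {OCCUPIED_RACKS}, minimum rack PDUs: {total_pdus}")
```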

2.4 Network-Side Power Refresh Rule

The refreshed network stack now includes the following additional hardware families:

  • SN5610 ×6
  • SN4700 ×4
  • SN2201 ×17
  • second UFM node

Do not publish a new consolidated network-power subtotal until the refreshed BOM carries the final per-model wattage entries for those families.
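
One way to enforce that rule mechanically is to make the roll-up refuse to produce a subtotal while any per-model wattage is still missing. The sketch below is illustrative only: the unit counts come from this report, and every wattage is deliberately left unfilled until the refreshed BOM supplies it.

```python
# Minimal sketch: network-power roll-up that refuses to publish a subtotal
# until the refreshed BOM supplies a wattage for every model.
# Unit counts are from this report; all wattages are intentionally None (TBD).
network_bom = {
    "SN5610":   {"units": 6,  "watts_per_unit": None},
    "SN4700":   {"units": 4,  "watts_per_unit": None},
    "SN2201":   {"units": 17, "watts_per_unit": None},
    "UFM node": {"units": 2,  "watts_per_unit": None},
}

def network_power_subtotal_w(bom):
    missing = [model for model, entry in bom.items() if entry["watts_per_unit"] is None]
    if missing:
        raise ValueError(f"BOM wattage missing for: {', '.join(missing)} - do not publish a subtotal")
    return sum(entry["units"] * entry["watts_per_unit"] for entry in bom.values())

try:
    network_power_subtotal_w(network_bom)
except ValueError as err:
    print(err)
```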


3. Storage and Service Integration

3.1 Storage Position

High-speed storage is now treated as a topology-locked supporting layer, not merely a future possibility. The refreshed baseline expects storage integration through the SN5610 standard-aligned layer.

3.2 What Is Locked Versus What Is Still BOM-Level

Topic                                     Current position
Storage-network layer exists              Locked
Storage integration path                  Through standard-aligned SN5610 layer
Exact storage node count                  Final BOM item
Exact storage server SKU / rack-U plan    Final BOM item
Storage management path                   Through SN2201-based management layer

3.3 Management Integration

The previous single 96-port OOB switch story is withdrawn. Supporting systems should now assume:

  • server management through the SN2201-based management layer
  • border and control integration through SN4700
  • IB fabric management through dual UFM nodes

The server-side management-port rule remains unchanged:

  • X710 Port 0 = OS management
  • X710 Port 1 = BMC / IPMI

4. Rack and Zone Planning

Zone                     Current rule
Compute zone             32 fixed compute racks
Contracted package       34 occupied racks
Network/services area    N1–N6 should be treated as the logical placement envelope for the refreshed network stack
Final placement          Must be locked in the refreshed draw.io sources

The old "N1 occupied, N2 occupied, N3–N6 empty" statement should not be reused in updated materials.


5. Supporting-System Reporting Rules

These are the rules that should now drive downstream reports and questionnaire answers:

  1. Keep 32 deployed nodes and 72-node standard baseline explicitly separated.
  2. Use 650+ kW, 4 CRAC/CRAH, 3 active + 1 standby as the cooling basis unless new cooling evidence is added.
  3. Use 1.5 MW as the facility-allocation answer and 680 kW as the current rack-constrained package ceiling.
  4. Do not reuse the old 1 MW, single UFM, or single generic OOB switch wording.
  5. Do not publish a refreshed all-network power or heat subtotal until the final BOM provides the missing switch power entries.
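
A lightweight way to keep downstream material aligned with these rules is to run draft answers through a consistency check. The sketch below is illustrative; the field names are hypothetical and the checks only encode rules 1–4.

```python
# Minimal sketch: consistency checks encoding the reporting rules above.
# Field names are hypothetical; expected values come from this report.
def check_answers(answers):
    problems = []
    if answers.get("deployed_nodes") != 32 or answers.get("standard_baseline_nodes") != 72:
        problems.append("Keep 32 deployed nodes and the 72-node standard baseline explicitly separated.")
    if answers.get("cooling_basis") != "650+ kW, 4 CRAC/CRAH, 3 active + 1 standby":
        problems.append("Use the repo-backed cooling basis wording.")
    if answers.get("facility_allocation_mw") != 1.5:
        problems.append("Facility allocation answer must be 1.5 MW, not the old 1 MW.")
    if answers.get("package_ceiling_kw") != 680:
        problems.append("Current rack-constrained package ceiling must be 680 kW.")
    return problems

draft = {
    "deployed_nodes": 32,
    "standard_baseline_nodes": 72,
    "cooling_basis": "650+ kW, 4 CRAC/CRAH, 3 active + 1 standby",
    "facility_allocation_mw": 1.0,   # stale figure, flagged by the check
    "package_ceiling_kw": 680,
}
print(check_answers(draft))
```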

Glossary

NDR
Next Data Rate — InfiniBand generation at 400 Gb/s (NDR400) or 800 Gb/s (NDR800) per physical port.
NDR400
InfiniBand NDR at 400 Gb/s per port, used by the BlueField-3 DPU for side-fabric connections.
NDR800
InfiniBand NDR at 800 Gb/s per port, used by ConnectX-8 HCAs on the HGX B300 GPU-to-fabric links.
ConnectX-8
NVIDIA ConnectX-8 NDR800 InfiniBand HCA integrated on the HGX B300 tray — 8 per server, one per GPU rail.
BlueField-3
NVIDIA BF-3220 DPU — 400G NDR400 InfiniBand, provides side-fabric connectivity and in-network compute offload.
Q3400-RA
NVIDIA Quantum-X800 Q3400 Rail-Accelerated InfiniBand switch — 144 NDR ports; deployed as 8 leaf + 4 spine.
Spectrum-4
NVIDIA Spectrum-4 400GbE Ethernet switch — 51.2 Tb/s; retained as active-active side-fabric pair.
SN5610
NVIDIA Spectrum-SN5610 converged 400G Ethernet switch — 6 units in the storage/converged service plane.
SN4700
NVIDIA Spectrum-SN4700 400G Ethernet switch — 4 units for border/WAN handoff and control-plane.
SN2201
NVIDIA Spectrum-SN2201 1G/10G management switch — 17 units covering the full OOB management layer.
UFM
Unified Fabric Manager — NVIDIA IB fabric management; deployed as 2-node HA pair (production + standby).
SHARP
Scalable Hierarchical Aggregation and Reduction Protocol — in-network collective offload on Q3400-RA.
HGX B300
NVIDIA HGX Blackwell Ultra B300 — 8-GPU tray with NVLink Gen 5 at 1.8 TB/s per GPU, 14.4 TB/s aggregate across the tray.
B300 GPU
NVIDIA Blackwell Ultra B300 — 288 GB HBM3e, 1.1 kW TDP; current report basis uses ~4.5 PFLOPS FP8 dense / ~9 PFLOPS FP8 sparse and ~15 PFLOPS NVFP4 dense / ~30 PFLOPS NVFP4 sparse per GPU.
NVLink
NVIDIA direct GPU interconnect — Gen 5 on Blackwell at 1.8 TB/s per GPU, yielding 14.4 TB/s across an 8-GPU HGX B300 tray.
HBM3e
High Bandwidth Memory 3e — stacked DRAM in B300 GPUs at 288 GB per GPU, 8 TB/s peak bandwidth.
Fat-Tree
Network topology providing non-blocking bisection bandwidth; IB compute fabric is a 2-tier rail-optimised fat-tree.
Rail-Optimised
IB fabric layout: each GPU rail maps to a dedicated leaf switch, keeping AllReduce traffic rail-local.
AOC
Active Optical Cable — fibre-based cable with integrated E/O conversion, used for all NDR800 IB inter-rack links.
IPMI / BMC
Intelligent Platform Management Interface / Baseboard Management Controller — out-of-band server management.
PDU-A / PDU-B
Dual-feed power distribution: each PSU bank pairs with one PDU, giving N+5 PSU redundancy plus dual-feed facility redundancy.
CRAC / CRAH
Computer Room Air Conditioner / Air Handler — precision cooling units, N+1 target coverage in the Kedios facility.
DPU
Data Processing Unit — BlueField-3 Smart NIC providing network/storage offload and security isolation.
XA NB3I-E12
ASUS server SKU: 9U air-cooled, dual Xeon 6776P, 32 × 128 GB DDR5 (4 TB total), 10× NVMe, HGX B300 ×8, CX-8 ×8, BF-3 ×2.
Xeon 6776P
Intel Xeon 6 Granite Rapids-SP — 56-core, PCIe 5.0 host CPU in the XA NB3I-E12 server; current server power tables in this repo model ~350 W per socket.
NVFP4
NVIDIA FP4 format — current report basis uses ~15 PFLOPS dense / ~30 PFLOPS sparse per B300 GPU, reported in this repo as ~240 PFLOPS sparse per 8-GPU server.
FP8
8-bit float — current report basis uses ~4.5 PFLOPS dense / ~9 PFLOPS sparse per B300 GPU, with the report itself citing ~36 PFLOPS dense per 8-GPU server.
AllReduce
Distributed-training collective operation across all GPUs; accelerated by IB fat-tree fabric and SHARP.
Fat-Tree Bisection BW
204.8 Tb/s across the full 32-server farm — 1:1 non-blocking, no fabric oversubscription.
20 kW Rack Limit
Hard power cap per rack in the Kedios facility; servers draw ~14.5 kW sustained, leaving 5.5 kW margin.