Infrastructure & Supporting Systems
B300 Farm — Cooling · Power · Storage
Kedios B300 · 32-Node Blackwell Ultra GPU Farm · March 1, 2026
Companion to: B300 Farm and Network Tower Architecture
650+ kW
Cooling Capacity Required
~590 TB
Shared Storage Usable
~436–466 kW
Power Headroom vs 1 MW
The 32-node B300 farm generates approximately 494 kW of sustained heat load across servers, networking, and storage within its allocated 1 MW facility budget.
Three supporting infrastructure pillars — cooling, power delivery, and storage — must be designed and racked in parallel with the compute and network zones already defined.
These systems live on a separate physical layer, each with dedicated rack positions, independent power feeds, and independent management.
❄️
Cooling
N+1
4× CRAC units · CAC/HAC containment · 650+ kWth
⚡
Power
Dual-Feed
80 rack PDUs · Row ATS · 5+5 PSU per server · Row UPS optional
🗄️
Storage
~590 TB
2× NVMe-dense nodes · IB-attached · NVMe-oF / Lustre
1.1 Cluster Heat Load
| Source | Count | Per-unit | Total Heat |
| B300 servers (sustained draw at wall) | 32 | ~14.5 kW | ~464 kW |
| Network racks (IB switches + Ethernet) | — | — | ~17.6 kW |
| Storage nodes (estimated) | ~4 | ~3 kW | ~12 kW |
| Total cluster heat load | | | ~494 kW |
| Burst ceiling (+6% GPU transient on compute) | | | ~522 kW |
⚠️
Cooling capacity must be rated for the burst figure (~522 kW) with N+1 headroom — i.e., minimum 650+ kWth installed.
N+1 means: if the largest CRAC unit fails, the remaining units must still cover the full burst load.
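As a sanity check, the sketch below reproduces the sizing arithmetic from the figures in this section; the 225 kWth per-unit capacity is an assumed midpoint of the 200–250 kWth range in §1.3.

```python
# Sketch: reproduce the cooling-sizing arithmetic above. The 225 kWth
# per-unit capacity is an assumed midpoint of the 200-250 kWth range.
import math

sustained_kw = 464 + 17.6 + 12        # servers + network + storage (~494 kW)
burst_kw = 464 * 1.06 + 17.6 + 12     # +6% GPU transient on compute (~522 kW)

unit_kwth = 225                       # assumed per-unit CRAH capacity
active = math.ceil(burst_kw / unit_kwth)   # units that must carry burst alone
installed = active + 1                     # N+1: one standby unit

print(f"burst ~{burst_kw:.0f} kW -> {installed}x {unit_kwth} kWth units "
      f"({installed * unit_kwth} kWth installed, "
      f"{active * unit_kwth} kWth with the largest unit failed)")
```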
1.2 Air Cooling Architecture (Primary)
The ASUS XA NB3I-E12 is air-cooled with 15× 80 mm high-RPM rear fans + 6× 60 mm CPU fans. Standard cold-aisle / hot-aisle containment is the primary method at Phase 1 scale.
Cold-Aisle / Hot-Aisle Containment
- Cold aisles face rack fronts — conditioned supply air enters inlets
- Hot aisles face rack rears — exhaust ~55–65°C at full GPU load
- Hot-aisle containment (HAC) ceiling panels / chimney preferred
- Prevents hot-air recirculation into adjacent cold aisles
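For intuition on what containment must move, the sketch below applies the standard sensible-heat relation Q = ρ · cp · V̇ · ΔT per rack; the 30 K air temperature rise is an assumption consistent with the 18–27°C inlets and ~55–65°C exhaust figures in this section.

```python
# Sketch: approximate airflow one compute rack must move, from the
# sensible-heat relation Q = rho * cp * V * dT. The 30 K temperature
# rise is an assumption consistent with the inlet/exhaust figures above.
RHO_AIR = 1.2      # kg/m^3, near sea level at ~20 C
CP_AIR = 1005.0    # J/(kg*K)

def rack_airflow_cfm(heat_w: float, delta_t_k: float) -> float:
    """Volumetric airflow (CFM) needed to remove heat_w at delta_t_k rise."""
    m3_per_s = heat_w / (RHO_AIR * CP_AIR * delta_t_k)
    return m3_per_s * 2118.88          # m^3/s -> ft^3/min

print(f"~{rack_airflow_cfm(14_500, 30):.0f} CFM per ~14.5 kW rack")  # ~850 CFM
```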
Operating Temperature Thresholds
- Inlet air (normal operating): 18–27°C
- Inlet air (maximum ASHRAE A2): 35°C
- Relative humidity: 20–80% non-condensing
- Facility chilled water supply: ≤12°C recommended
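A minimal sketch of the per-rack inlet check these thresholds imply; rack names and temperatures here are illustrative, and a real check would pull readings from the BMS/DCIM sensor feed.

```python
# Sketch: classify per-rack inlet readings against the thresholds above.
# Rack names and temperatures are illustrative placeholders.
NORMAL_MAX_C = 27.0      # top of the 18-27 C normal operating band
ASHRAE_A2_MAX_C = 35.0   # maximum allowable inlet (ASHRAE A2)

def classify_inlet(temp_c: float) -> str:
    if temp_c > ASHRAE_A2_MAX_C:
        return "CRITICAL: above ASHRAE A2 allowable"
    if temp_c > NORMAL_MAX_C:
        return "WARNING: above recommended band"
    return "OK"

for rack, temp_c in {"C01": 24.5, "C17": 29.1}.items():
    print(rack, temp_c, classify_inlet(temp_c))
```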
1.3 CRAC / CRAH Unit Specification
| Parameter | Value |
| Total cooling required (burst ceiling) | ~522 kW |
| Recommended installed capacity | 650+ kWth (+20% headroom for N+1) |
| Minimum CRAC units | 4× (3 active + 1 standby = N+1) |
| Unit sizing (each) | 200–250 kWth chilled-water CRAH |
| Facility supply water temp | ≤12°C (≤15°C acceptable) |
| Airflow configuration | Under-floor (raised floor plenum) or overhead ducting |
| Temperature monitoring | Per-rack sensors → UFM / facility BMS integration |
ℹ️
The existing architecture document references CRAC #1 and CRAC #2. Under the revised load profile, two additional units (CRAC #3 and CRAC #4) are required to maintain N+1 coverage at full burst load. Confirm capacity with the datacenter facility operator.
1.4 Direct Liquid Cooling (Future Option — Phase 2)
At 8,800 W of pure GPU heat per server (8× B300 at 1,100 W TDP), DLC becomes attractive at scale-out beyond 32 nodes.
| Option | Description | Applicability |
| Rear-Door Heat Exchanger (RDHx) | Liquid-cooled door captures hot exhaust from existing fans | Drop-in retrofit, no server modification |
| Direct Liquid Cooling (DLC) | Coolant plates on GPU cold plates — eliminates hot-air exhaust | Requires server support — check ASUS roadmap |
| In-Row Cooling (IRC) | Dedicated cooling unit between rack rows | Good for zone isolation |
✅
At Phase 1 (32 nodes), standard CRAC + containment is sufficient. Evaluate DLC/RDHx at Phase 2 when cluster approaches 64 nodes and cooling load nears 750 kW.
2.1 Farm Power Budget
- Compute servers (~464 kW) · 46.4% of 1 MW
- Network racks (~17.6 kW) · 1.8% of 1 MW
- Storage nodes (~12 kW) · 1.2% of 1 MW
- Cooling infrastructure (~40 kW) · 4.0% of 1 MW
| Zone | Racks | Sustained Draw | Facility Allocation |
| Compute zone | 32 | ~464 kW | 640 kW (32 × 20 kW) |
| Network zone | 2 occupied | ~17.6 kW | ~40 kW |
| Storage zone | 1–2 | ~12 kW | ~20 kW |
| Cooling (CRAC fans, pumps) | — | ~40–70 kW | ~80 kW |
| Total Phase 1 | | ~534–564 kW | ~780 kW |
| Facility allocation (1 MW) | | | 1,000 kW |
| Headroom remaining | | | ~436–466 kW |
✅
The 1 MW allocation remains sufficient for Phase 1. Approximately 436–466 kW of headroom remains available for Phase 2 scale-out without requesting additional facility capacity from the datacenter.
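The budget roll-up reduces to simple arithmetic; the sketch below reproduces the totals, carrying the cooling draw as the 40–70 kW range from the table.

```python
# Sketch: Phase 1 power budget roll-up from the table above. Cooling is
# carried as a (low, high) range; other zones use their sustained values.
sustained_kw = {
    "compute": (464.0, 464.0),
    "network": (17.6, 17.6),
    "storage": (12.0, 12.0),
    "cooling": (40.0, 70.0),
}
low = sum(lo for lo, _ in sustained_kw.values())
high = sum(hi for _, hi in sustained_kw.values())
facility_kw = 1000.0
print(f"sustained: ~{low:.0f}-{high:.0f} kW")                             # ~534-564 kW
print(f"headroom: ~{facility_kw - high:.0f}-{facility_kw - low:.0f} kW")  # ~436-466 kW
```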
2.2 Per-Rack PDU Specification
Every rack — compute, network, and storage — is fitted with two independent vertical PDUs (PDU-A fed from busbar A, PDU-B from busbar B). A single feed failure never takes down an entire rack.
| Parameter | Value |
| PDU form factor | 0U vertical, rear-post mounted |
| Feed configuration | Dual-feed: PDU-A (busbar A) + PDU-B (busbar B) |
| Phase | 3-phase, 230/400 V |
| Outlets | IEC C19 (server PSUs) + IEC C13 (management/1U devices) |
| Capacity per PDU | 32A 3-phase ≈ 22 kVA (~17.7 kW at 0.8 PF) |
| Metering | Per-outlet metered recommended (enables per-server power trending) |
| PDUs per rack | 2 (one per feed) |
| Zone | Racks | PDUs/rack | Total PDUs |
| Compute | 32 | 2 | 64 |
| Network (occupied) | 2 | 2 | 4 |
| Network (reserved N3–N6) | 4 | 2 | 8 (pre-wire) |
| Storage | 2 | 2 | 4 |
| Total | 40 | | 80 PDUs |
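The per-PDU capacity line follows from the three-phase power relation S = √3 · V_line · I; a minimal sketch:

```python
# Sketch: per-PDU capacity from the 32 A / 400 V three-phase rating above,
# using S = sqrt(3) * V_line * I.
import math

def pdu_capacity_kw(amps: float, line_v: float = 400.0, pf: float = 0.8):
    kva = math.sqrt(3) * line_v * amps / 1000.0   # apparent power
    return kva, kva * pf                          # (kVA, kW at given PF)

kva, kw = pdu_capacity_kw(32)
print(f"~{kva:.1f} kVA, ~{kw:.1f} kW at 0.8 PF")  # ~22.2 kVA, ~17.7 kW
# Either PDU alone (~17.7 kW) still covers a ~14.5 kW server if the
# other feed is lost.
```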
2.3 Server PSU Redundancy (5+5 Array)
Each ASUS XA NB3I-E12 carries a 5+5 PSU array — Bank A feeds GPU 0–3 and CPUs; Bank B feeds GPU 4–7 and CPUs:
| PSU Bank | Fed from | Serves | Redundancy |
| Bank A (5× PSUs) | PDU-A (busbar A) | GPU 0–3 + CPUs | N+5 within bank |
| Bank B (5× PSUs) | PDU-B (busbar B) | GPU 4–7 + CPUs | N+5 within bank |
ℹ️
On complete PDU-A or PDU-B loss, half the server PSUs go dark — but the surviving bank keeps its GPUs operational via NVLink ring topology, allowing a reduced-scale training run or graceful checkpoint until power is restored.
2.4 UPS and Power Protection
A 10–30 second gap between mains failure and diesel generator takeover is typical. For 256 NVLink-connected B300 GPUs mid-training, an uncontrolled power loss forces a full restart from the last checkpoint. A layered protection approach is recommended:
| Level | Scope | Purpose | Capacity target |
| Facility UPS (provided) | Entire building | Covers generator transfer gap | Facility-managed |
| Row ATS (required) | Per 8–16 rack row | Instant A↔B feed switchover <4 ms — no power gap | Stateless transfer |
| Row UPS (recommended) | Per 8–16 rack row | 60–120 s bridging for graceful checkpoint on generator start | ~150–200 kW × 2 min ≈ 5–7 kWh/row |
⚠️
Row UPS is optional if the facility guarantees generator transfer in <8 seconds. Confirm SLA with the datacenter operator before deciding.
If generator transfer is 15–30 s, row UPS is required to protect training checkpoints.
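The row-UPS capacity target is simply row load × bridge time; a sketch using the 150–200 kW row load and 120 s runtime from the table, with battery derating and end-of-life margin left out:

```python
# Sketch: row UPS energy target = row load x bridge time. Derating and
# end-of-life battery margin are not included here.
def ups_kwh(row_load_kw: float, bridge_s: float) -> float:
    return row_load_kw * bridge_s / 3600.0

for load_kw in (150, 200):
    print(f"{load_kw} kW row, 120 s -> {ups_kwh(load_kw, 120):.1f} kWh")
# 150 kW -> 5.0 kWh; 200 kW -> 6.7 kWh per row
```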
Graceful Shutdown Sequence (on power event)
- ATS detects mains failure → switches to B-feed (or facility UPS) in <4 ms
- BMC IPMI broadcasts power event to all 32 servers via OOB network
- DCIM / cluster manager triggers controlled checkpoint-to-NVMe on all nodes (~30 s)
- NVSwitch and NVLink drain cleanly before power removal
- Row UPS provides the 60–90 s window for this sequence
- Generator reaches stable voltage → ATS switches back to mains feed
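A sketch of the fan-out step in this sequence, broadcasting a checkpoint trigger to all 32 nodes over the management network. Host names and the on-node command ("training-checkpoint" unit) are hypothetical placeholders; a real deployment would drive this from the site's DCIM or cluster manager.

```python
# Sketch: fan a checkpoint trigger out to all 32 nodes on a power event.
# Node names and the trigger command are hypothetical placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"c{i:02d}-mgmt" for i in range(1, 33)]        # C01..C32 mgmt hosts
CHECKPOINT_CMD = "systemctl start training-checkpoint"  # hypothetical unit

def trigger_checkpoint(node: str) -> bool:
    """Ask one node to flush a checkpoint to local NVMe."""
    try:
        # 25 s timeout keeps the fan-out inside the ~30 s window
        # budgeted for checkpoint-to-NVMe in the sequence above.
        r = subprocess.run(["ssh", node, CHECKPOINT_CMD],
                           capture_output=True, timeout=25)
        return r.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def on_power_event() -> None:
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        acked = sum(pool.map(trigger_checkpoint, NODES))
    print(f"checkpoint acknowledged on {acked}/{len(NODES)} nodes")
```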
3.1 Local NVMe (Per Server — Already Installed)
| Item | Value |
| NVMe drives per server | 10× Samsung PM9D3a U.2 (8 data + 2 OS) |
| Raw capacity per server | ~32 TB (assuming 3.2 TB per drive) |
| Usable per server (RAID 6 equiv.) | ~22–25 TB |
| Total raw across 32 servers | ~1 PB |
| Total usable across 32 servers | ~700–800 TB |
✅ Good for (local NVMe)
- OS and container images
- Checkpoint writes (fast local flush during training)
- Ephemeral scratch / temp data
- Single-node inference scratch pad
❌ Not sufficient for
- Shared dataset access across all 32 nodes simultaneously
- Pre-processed token corpus (LLM datasets routinely 10–50 TB+)
- Checkpoint aggregation from all 32 nodes to a single location
- Model weights staging at multi-node scale
3.2 Shared Storage Cluster (IB-Attached)
A shared parallel storage cluster connected to the existing IB fabric is required for multi-node distributed training workloads.
| Option | Technology | Throughput | Notes |
| NVMe-oF over IB (recommended) | RDMA + NVMe-oF | 400+ GB/s aggregate | Lowest latency, native IB RDMA integration |
| Lustre over IB | Lustre + LNET | 200–400 GB/s | Standard HPC/AI, mature tooling |
| BeeGFS over IB | BeeGFS RDMA | 100–300 GB/s | Simpler management than Lustre |
| VAST Data | NFS/S3 over RDMA | 100–500 GB/s | All-flash appliance, scale-out |
3.3 Phase 1 Storage Configuration (2 Nodes)
| Component | Specification | Qty |
| Storage node chassis | 2U NVMe-dense (e.g. Supermicro SSG-610P-ACR12H or equiv.) | 2 |
| NVMe drives per node | 24× 15.36 TB U.2 enterprise read-intensive | 48 total |
| Raw capacity per node | ~369 TB | — |
| Total raw capacity | | ~738 TB |
| Usable (erasure coding 8+2) | | ~590 TB |
| IB connectivity per node | 2× NDR800 ports → existing leaf switches | 4 IB cables total |
| Lustre MDS/MGS node | 1U, 2× NDR IB ports (metadata server) | 1 |
| Storage protocol | NVMe-oF / Lustre OSS | — |
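The usable figure follows directly from the 8+2 erasure-coding ratio; a minimal sketch:

```python
# Sketch: raw and usable capacity under the 8+2 erasure coding above.
DRIVE_TB = 15.36
DRIVES_PER_NODE = 24
STORAGE_NODES = 2

raw_tb = DRIVE_TB * DRIVES_PER_NODE * STORAGE_NODES  # 737.28 TB (~738 above)
usable_tb = raw_tb * 8 / (8 + 2)                     # 8 data : 2 parity shares
print(f"raw ~{raw_tb:.0f} TB, usable ~{usable_tb:.0f} TB")  # ~737 TB / ~590 TB
```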
| Metric | Per Node | 2-Node Cluster |
| Sequential read throughput | ~100 GB/s | ~200 GB/s |
| Sequential write throughput | ~60 GB/s | ~120 GB/s |
| IB bandwidth consumed (peak read) | 800 Gb/s = 100 GB/s | ~200 GB/s → 2 leaf ports |
✅
200 GB/s aggregate read is sufficient for most LLM pre-training workloads up to 70B parameters at 32-node scale.
For 175B+ models, scale to 4 storage nodes in Phase 2.
ℹ️
Storage nodes connect via 2× NDR IB ports each into the existing leaf layer — no additional switches required.
Leaf switches have 44% port headroom at Phase 1, comfortably absorbing 4 storage IB ports.
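For scale, a sketch of what the ~200 GB/s aggregate read rate means for streaming a shared corpus; the 30 TB corpus size is an illustrative value inside the 10–50 TB+ range quoted in §3.1.

```python
# Sketch: time to stream a shared corpus once at the aggregate read rate.
# The 30 TB corpus size is an illustrative assumption.
def full_pass_seconds(corpus_tb: float, agg_gb_per_s: float = 200.0) -> float:
    return corpus_tb * 1000.0 / agg_gb_per_s

print(f"~{full_pass_seconds(30):.0f} s per full pass over 30 TB")  # ~150 s
```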
| Zone | Rack IDs | Count | Status | Notes |
| Compute | C01 – C32 | 32 | Occupied | 32× ASUS XA NB3I-E12 B300 servers |
| Network | N1 – N2 | 2 | Occupied | IB switches + Ethernet + UFM + OOB |
| Network (reserved) | N3 – N6 | 4 | Reserved | Phase 2 expansion capacity |
| Storage | S01 | 1 | Occupied (Phase 1) | 2× storage nodes + MDS + management |
| Storage (expansion) | S02 | 1 | Reserved | Phase 2 storage scale-out |
| Total Phase 1 | | 35 | | Occupied positions |
| Total incl. reserves | | 40 | | All allocated positions |
Physical Zone Adjacency
Compute Zone · C01–C32 · 32 racks
~464 kW sustained · Cold-aisle / hot-aisle containment · CRAC #1–#4
Network Zone · N1–N6 · 6 racks
N1–N2 occupied · N3–N6 reserved · IB fabric · Ethernet · UFM · OOB
Storage Zone · S01–S02 · 2 racks
S01 occupied · S02 reserved · NVMe-oF / Lustre · IB-attached
IB AOC interconnect: compute ↔ network ↔ storage · place zones adjacent to minimise cable runs
Cooling
| Item | Spec | Qty | Notes |
| CRAC / CRAH unit | 200–250 kWth chilled-water | 4 | 3 active + 1 standby (N+1) |
| Hot-aisle containment panels | Floor-to-ceiling, per row | Per layout | Facility-integrated |
| Inlet temperature sensors | Rack-mount, 1U per rack | 40 | One per rack, feeds BMS |
| BMS integration layer | DCIM or facility BMS tie-in | — | Aggregates temp / power / humidity |
Power
| Item | Spec | Qty | Notes |
| Rack PDU (metered, dual-feed) | 0U vertical, 3-phase 32A, C19+C13 | 80 | 2 per rack × 40 racks |
| Row ATS | 3-phase auto-transfer <4 ms | 4 | 1 per row of 8–16 racks |
| Row UPS module | 150–200 kW, 120 s runtime | 4 | Optional — confirm generator SLA first |
| Main distribution panel | A+B independent feeds, per facility | Facility | Independent circuits for each zone |
Storage
| Item | Spec | Qty | Notes |
| Storage node chassis | 2U NVMe-dense, 2× NDR IB | 2 | Phase 1 |
| NVMe drives | 15.36 TB U.2 enterprise read-intensive | 48 | 24 per storage node |
| Lustre MDS/MGS node | 1U, 2× NDR IB ports | 1 | Metadata server |
| IB NDR AOC cables | 5–10 m NDR800 | 4 | Storage nodes → existing leaf switches |
| Storage rack PDUs | Same spec as compute racks | 4 | 2 per rack × 2 racks (S01, S02) |
Storage — IB & OOB Connections
| Connection | From | To | Cable | BW |
| Storage IB data | S01 nodes (4× NDR ports) | Leaf switches L0–L3 | NDR800 AOC 5–10 m | 4× 800 Gb/s = 3.2 Tb/s |
| Storage OS management | Storage node management NICs | OOB 96-port switch (N2) | Cat6A | 3× 1 GbE |
| Storage BMC | Storage node BMC ports | OOB 96-port switch (N2) | Cat6A | 3× 1 GbE |
Updated OOB Switch Port Count
| Connection type | Count |
| OS management — 32 compute servers | 32 |
| BMC / iDRAC — 32 compute servers | 32 |
| Q3400-RA switch management (12 units) | 12 |
| Spectrum-4 switch management (2 units) | 2 |
| UFM management node | 1 |
| Storage node OS management (3 nodes) | 3 |
| Storage node BMC (3 nodes) | 3 |
| Uplink(s) | 1+ |
| Total used | 86 |
| 96-port OOB switch capacity | 96 |
| Remaining spare | 10 ports |
✅
Adding storage nodes consumes 6 more OOB switch ports (86 total vs 80 without storage). The 96-port OOB switch still has 10 spare ports — no switch upgrade required.
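The tally above is a straight sum; a minimal sketch that reproduces it:

```python
# Sketch: reproduce the OOB port tally from the table above.
oob_ports = {
    "compute OS mgmt": 32, "compute BMC": 32,
    "Q3400-RA mgmt": 12, "Spectrum-4 mgmt": 2,
    "UFM node": 1, "storage OS mgmt": 3,
    "storage BMC": 3, "uplink": 1,
}
used = sum(oob_ports.values())
print(f"{used} of 96 ports used -> {96 - used} spare")  # 86 used, 10 spare
```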
| Priority | Action |
| High | Confirm facility provides N+1 CRAC (650+ kWth) and validate inlet temperature SLA with the datacenter operator |
| High | Confirm MDP has independent A+B feeds for all three zones (compute, network, storage) |
| High | Specify and procure storage nodes + NVMe drives; order 4× NDR IB AOC cables for leaf integration |
| Medium | Confirm datacenter generator transfer time — determines whether row UPS modules are required |
| Medium | Install and configure Lustre/NVMe-oF stack during server commissioning |
| Low | Evaluate DLC / RDHx readiness for Phase 2 (64-node expansion, ~750 kW cooling load) |