Kedios Infrastructure Report
B300 Farm — Infrastructure & Supporting Systems
Date: March 30, 2026
Scope: Cooling, power distribution, storage integration, and supporting-system planning for the 32-node Kedios B300 package under the refreshed network baseline
Companion to: B300 Farm and Network Tower Architecture Specification
Overview
The compute side of the Kedios package remains a 32-node B300 farm. The infrastructure story now needs to support a 72-node-standard-aligned network baseline around that 32-node compute population. That means:
- compute-side thermal and rack-power numbers remain valid where they are tied directly to the 32 deployed servers
- facility allocation increases from 1.0 MW to 1.5 MW
- the current package remains constrained by the retained 20 kW/rack rule, which caps the 34-rack package at 680 kW
- the refreshed network stack adds SN5610, SN4700, SN2201, and a second UFM node, so exact all-network heat and power totals must be finalized through the refreshed procurement BOM rather than guessed in this summary
1. Cooling Infrastructure
1.1 Validated Compute-Side Heat Anchors
| Source | Value | Status |
|---|---|---|
| 32 compute racks sustained | ~464 kW | Still valid |
| 32 compute racks burst peak | ~492 kW | Still valid |
| Historical legacy package peak | ~530 kW | Last fully validated pre-refresh all-in package figure |
1.2 Cooling Architecture
The strongest repo-backed cooling basis remains:
- contained front-to-back air cooling
- hot/cold aisle containment
- 4 total CRAC/CRAH units
- 3 active + 1 standby for N+1 coverage
- 650+ kW installed cooling capacity
That remains the correct cooling statement for the package at this stage. It is also the wording that should feed questionnaire row E13 and related external material.
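The N+1 claim above can be sanity-checked with a short sketch. One assumption is made explicit here because the text does not state it: the 650 kW installed figure is treated as the combined capacity of the 3 active units, with the 4th unit held as standby.

```python
# Sketch: verify N+1 cooling coverage against the validated burst peak.
# Assumption (not stated in the report): 650 kW is the capacity of the
# 3 active CRAC/CRAH units; the 4th unit is the standby.
ACTIVE_UNITS = 3
STANDBY_UNITS = 1
INSTALLED_ACTIVE_KW = 650          # "650+ kW installed cooling capacity"
BURST_PEAK_KW = 492                # 32 compute racks, burst peak (Section 1.1)

per_unit_kw = INSTALLED_ACTIVE_KW / ACTIVE_UNITS   # ~217 kW per unit

def covered(load_kw: float, units_online: int) -> bool:
    """True if the units currently online can absorb the load."""
    return units_online * per_unit_kw >= load_kw

# Normal operation: all three active units online.
assert covered(BURST_PEAK_KW, ACTIVE_UNITS)

# N+1 event: one active unit fails, standby takes over -> still 3 units.
assert covered(BURST_PEAK_KW, ACTIVE_UNITS - 1 + STANDBY_UNITS)
```

Under that assumption the package survives a single-unit failure with the full burst peak applied; if the 650 kW figure instead includes the standby unit, the check should be redone before it feeds row E13.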
1.3 Cooling Rule Under The Refresh
Do not convert the new 1.5 MW allocation into a claim that the current package needs 1.5 MW of cooling today. The correct cooling interpretation is:
- the facility envelope is larger
- the current rack-constrained package is still much smaller
- the refreshed network stack needs a later BOM-based heat roll-up before publishing a new all-in thermal total
2. Power Distribution Infrastructure
2.1 Power Framing
| Item | Value |
|---|---|
| Total facility power capacity | 20 MW |
| Kedios allocation | 1.5 MW |
| Per-rack hard limit | 20 kW |
| Current package rack ceiling | 680 kW |
| Current 32 compute racks sustained | ~464 kW |
| Current 32 compute racks burst peak | ~492 kW |
2.2 Required Wording Rule
Use both of these statements together in all planning and questionnaire material:
- Facility allocation / available envelope: 1.5 MW
- Current 34-rack package hard ceiling under existing rack assumptions: 680 kW
If those two layers are blurred together, the power story becomes internally inconsistent.
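The two-layer rule is easiest to keep straight when the arithmetic is written out. The following sketch uses only figures already stated in this report:

```python
# Sketch: the two power layers kept explicitly separate.
FACILITY_ALLOCATION_KW = 1500      # 1.5 MW facility envelope (layer 1)
RACK_LIMIT_KW = 20                 # hard per-rack cap
OCCUPIED_RACKS = 34

# Layer 2: the rack-constrained package ceiling.
package_ceiling_kw = OCCUPIED_RACKS * RACK_LIMIT_KW          # 680 kW

# Headroom between the layers -- this is envelope, not package demand.
envelope_headroom_kw = FACILITY_ALLOCATION_KW - package_ceiling_kw  # 820 kW

# Headroom under the package ceiling at validated burst peak.
BURST_KW = 492
ceiling_headroom_kw = package_ceiling_kw - BURST_KW          # 188 kW

assert package_ceiling_kw == 680
```

Quoting 1.5 MW as package demand, or 680 kW as the facility allocation, is exactly the blurring this section prohibits.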
2.3 PDU and Feed Model
The existing rack-power architecture still applies:
- dual-feed PDU-A / PDU-B model
- one A-feed and one B-feed path per occupied rack
- per-rack monitoring retained as the preferred operating model
Minimum current-package planning count:
| Item | Count |
|---|---|
| Occupied racks | 34 |
| Minimum rack PDUs at 2 per rack | 68 |
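The dual-feed model and the minimum PDU count follow directly from the rack count. A minimal sketch (the rack/feed naming here is illustrative, not a facility labeling scheme):

```python
# Sketch: one A-feed and one B-feed PDU per occupied rack.
OCCUPIED_RACKS = 34

pdus = [(rack, feed)
        for rack in range(1, OCCUPIED_RACKS + 1)
        for feed in ("PDU-A", "PDU-B")]

# Minimum rack-PDU count at 2 per rack, matching the table above.
assert len(pdus) == 68
```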
2.4 Network-Side Power Refresh Rule
The refreshed network stack now includes the following additional hardware families:
- SN5610 ×6
- SN4700 ×4
- SN2201 ×17
- second UFM node
Do not publish a new consolidated network-power subtotal until the refreshed BOM carries the final per-model wattage entries for those families.
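The publication rule above can be enforced mechanically rather than by convention. A sketch, assuming a simple dict-shaped BOM; the per-model wattages are deliberately `None` because the final BOM has not supplied them:

```python
# Sketch: gate any network-power subtotal on complete per-model wattage data.
# Counts mirror the refreshed stack; wattages are intentionally missing.
refreshed_stack = {
    "SN5610": {"count": 6,  "watts": None},
    "SN4700": {"count": 4,  "watts": None},
    "SN2201": {"count": 17, "watts": None},
    "UFM node (second)": {"count": 1, "watts": None},
}

def network_power_subtotal_kw(bom: dict) -> float:
    """Return the subtotal in kW, or raise instead of guessing."""
    missing = [name for name, entry in bom.items() if entry["watts"] is None]
    if missing:
        raise ValueError(f"BOM incomplete, do not publish subtotal: {missing}")
    return sum(e["count"] * e["watts"] for e in bom.values()) / 1000

try:
    network_power_subtotal_kw(refreshed_stack)
except ValueError:
    pass  # expected until the refreshed BOM lands
```

Once the BOM carries real wattage entries, the same function yields the publishable subtotal with no wording change downstream.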
3. Storage and Service Integration
3.1 Storage Position
High-speed storage is now treated as a topology-locked supporting layer, not merely a future possibility. The refreshed baseline expects storage integration through the SN5610 standard-aligned layer.
3.2 What Is Locked Versus What Is Still BOM-Level
| Topic | Current position |
|---|---|
| Storage-network layer exists | Locked |
| Storage integration path | Through standard-aligned SN5610 layer |
| Exact storage node count | Final BOM item |
| Exact storage server SKU / rack-U plan | Final BOM item |
| Storage management path | Through SN2201-based management layer |
3.3 Management Integration
The previous single 96-port OOB switch story is withdrawn. Supporting systems should now assume:
- server management through the SN2201-based management layer
- border and control integration through SN4700
- IB fabric management through dual UFM nodes
The server-side management-port rule remains unchanged:
- X710 Port 0 = OS management
- X710 Port 1 = BMC / IPMI
4. Rack and Zone Planning
| Zone | Current rule |
|---|---|
| Compute zone | 32 fixed compute racks |
| Contracted package | 34 occupied racks |
| Network/services area | N1–N6 should be treated as the logical placement envelope for the refreshed network stack |
| Final placement | Must be locked in the refreshed draw.io sources |
The old "N1 occupied, N2 occupied, N3–N6 empty" statement should not be reused in updated materials.
5. Supporting-System Reporting Rules
These are the rules that should now drive downstream reports and questionnaire answers:
- Keep 32 deployed nodes and 72-node standard baseline explicitly separated.
- Use 650+ kW, 4 CRAC/CRAH, 3 active + 1 standby as the cooling basis unless new cooling evidence is added.
- Use 1.5 MW as the facility-allocation answer and 680 kW as the current rack-constrained package ceiling.
- Do not reuse the old 1 MW, single UFM, or single generic OOB switch wording.
- Do not publish a refreshed all-network power or heat subtotal until the final BOM provides the missing switch power entries.
Glossary
- NDR
- Next Data Rate — InfiniBand generation at 400 Gb/s (NDR400) or 800 Gb/s (NDR800) per physical port.
- NDR400
- InfiniBand NDR at 400 Gb/s per port, used by the BlueField-3 DPU for side-fabric connections.
- NDR800
- InfiniBand NDR at 800 Gb/s per port, used by ConnectX-8 HCAs on the HGX B300 GPU-to-fabric links.
- ConnectX-8
- NVIDIA ConnectX-8 NDR800 InfiniBand HCA integrated on the HGX B300 tray — 8 per server, one per GPU rail.
- BlueField-3
- NVIDIA BF-3220 DPU — 400G NDR400 InfiniBand, provides side-fabric connectivity and in-network compute offload.
- Q3400-RA
- NVIDIA Quantum-X800 Q3400 Rail-Accelerated InfiniBand switch — 144 NDR ports; deployed as 8 leaf + 4 spine.
- Spectrum-4
- NVIDIA Spectrum-4 400GbE Ethernet switch — 51.2 Tb/s; retained as active-active side-fabric pair.
- SN5610
- NVIDIA Spectrum-SN5610 converged 400G Ethernet switch — 6 units in the storage/converged service plane.
- SN4700
- NVIDIA Spectrum-SN4700 400G Ethernet switch — 4 units for border/WAN handoff and control-plane.
- SN2201
- NVIDIA Spectrum-SN2201 1G/10G management switch — 17 units covering the full OOB management layer.
- UFM
- Unified Fabric Manager — NVIDIA IB fabric management; deployed as 2-node HA pair (production + standby).
- SHARP
- Scalable Hierarchical Aggregation and Reduction Protocol — in-network collective offload on Q3400-RA.
- HGX B300
- NVIDIA HGX Blackwell Ultra B300 — 8-GPU tray with NVLink Gen 5 at 1.8 TB/s per GPU, 14.4 TB/s aggregate across the tray.
- B300 GPU
- NVIDIA Blackwell Ultra B300 — 288 GB HBM3e, 1.1 kW TDP; current report basis uses ~4.5 PFLOPS FP8 dense / ~9 PFLOPS FP8 sparse and ~15 PFLOPS NVFP4 dense / ~30 PFLOPS NVFP4 sparse per GPU.
- NVLink
- NVIDIA direct GPU interconnect — Gen 5 on Blackwell at 1.8 TB/s per GPU, yielding 14.4 TB/s across an 8-GPU HGX B300 tray.
- HBM3e
- High Bandwidth Memory 3e — stacked DRAM in B300 GPUs at 288 GB per GPU, 8 TB/s peak bandwidth.
- Fat-Tree
- Network topology providing non-blocking bisection bandwidth; IB compute fabric is a 2-tier rail-optimised fat-tree.
- Rail-Optimised
- IB fabric layout: each GPU rail maps to a dedicated leaf switch, keeping AllReduce traffic rail-local.
- AOC
- Active Optical Cable — fibre-based cable with integrated E/O conversion, used for all NDR800 IB inter-rack links.
- IPMI / BMC
- Intelligent Platform Management Interface / Baseboard Management Controller — out-of-band server management.
- PDU-A / PDU-B
- Dual-feed power distribution: each PSU bank pairs with one PDU, giving N+5 PSU + dual-feed facility redundancy.
- CRAC / CRAH
- Computer Room Air Conditioner / Air Handler — precision cooling units, N+1 target coverage in the Kedios facility.
- DPU
- Data Processing Unit — BlueField-3 Smart NIC providing network/storage offload and security isolation.
- XA NB3I-E12
- ASUS server SKU: 9U air-cooled, dual Xeon 6776P, 32 × 128 GB DDR5 (4 TB total), 10× NVMe, HGX B300 ×8, CX-8 ×8, BF-3 ×2.
- Xeon 6776P
- Intel Xeon 6 Granite Rapids-SP — 56-core, PCIe 5.0 host CPU in the XA NB3I-E12 server; current server power tables in this repo model ~350 W per socket.
- NVFP4
- NVIDIA FP4 format — current report basis uses ~15 PFLOPS dense / ~30 PFLOPS sparse per B300 GPU, reported in this repo as ~240 PFLOPS sparse per 8-GPU server.
- FP8
- 8-bit float — current report basis uses ~4.5 PFLOPS dense / ~9 PFLOPS sparse per B300 GPU, with the report itself citing ~36 PFLOPS dense per 8-GPU server.
- AllReduce
- Distributed-training collective operation across all GPUs; accelerated by IB fat-tree fabric and SHARP.
- Fat-Tree Bisection BW
- 204.8 Tb/s across the full 32-server farm — 1:1 non-blocking, no fabric oversubscription.
- 20 kW Rack Limit
- Hard power cap per rack in the Kedios facility; servers draw ~14.5 kW sustained, leaving 5.5 kW margin.