Architecture Locked · February 28, 2026
Kedios B300 Full Network
& Topology Report
32-Node NVIDIA Blackwell Ultra B300 GPU Training Farm — Complete Planning, Architecture & Topology
The Kedios B300 farm is a 32-node NVIDIA Blackwell Ultra B300 GPU training cluster deployed within a 20 MW data center, consuming approximately 530 kW peak out of a 1 MW allocated budget (53%). The farm is purpose-built for large-scale distributed AI training — specifically AllReduce/AllGather collective operations — backed by a 1:1 non-blocking 800 Gb/s InfiniBand (XDR) fabric delivering 204.8 Tb/s bisection bandwidth.
Total GPUs
256
NVIDIA B300 (HGX Blackwell Ultra)
Peak Compute FP8
1,152
PFLOPS dense
Peak Compute NVFP4
7,680
PFLOPS sparse
GPU Memory
73.73
TB HBM3e (256 × 288 GB)
IB Bisection BW
204.8
Tb/s — 1:1 non-blocking
IB Switches
12
Q3400-RA (8 leaf + 4 spine)
Farm Peak Power
~530
kW (~53% of 1 MW budget)
Power Headroom
~470
kW remaining in 1 MW budget (47%)
Burst Safety Margin
4.6
kW below 20 kW/rack hard limit
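All of the headline figures above derive from a handful of per-server constants, so they can be sanity-checked mechanically. A minimal Python sketch follows; every constant comes from this report's own spec tables, and the 4.5 PFLOPS FP8-dense per-GPU value is implied by the ~36 PFLOPS per-server figure given later:

```python
# Recompute the headline farm figures from per-server constants.
GPUS_PER_SERVER, SERVERS = 8, 32
HBM_PER_GPU_GB = 288                   # HBM3e per B300
FP8_DENSE_PER_GPU_PFLOPS = 4.5         # implied by ~36 PFLOPS / 8-GPU server
CX8_PORT_GBPS = 800                    # one CX8 port per GPU
FARM_PEAK_KW, BUDGET_KW = 530, 1_000

gpus = GPUS_PER_SERVER * SERVERS                  # 256
hbm_tb = gpus * HBM_PER_GPU_GB / 1_000            # 73.73 TB
fp8_pflops = gpus * FP8_DENSE_PER_GPU_PFLOPS      # 1,152 PFLOPS dense
bisection_tbps = gpus * CX8_PORT_GBPS / 1_000     # 204.8 Tb/s (1:1 fabric)

print(f"{gpus} GPUs | {hbm_tb:.2f} TB HBM3e | {fp8_pflops:,.0f} PFLOPS FP8")
print(f"{bisection_tbps} Tb/s bisection | {FARM_PEAK_KW / BUDGET_KW:.0%} of 1 MW budget")
```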
Facility Overview
| Parameter | Value |
| Total DC power capacity | 20 MW |
| Kedios allocation | 1 MW (1,000 kW) |
| Per-cabinet hard limit | 20 kW (hard ceiling on wall draw) |
| Cooling units | CRAC #1 and CRAC #2 + A/C condensers |
| Power distribution | PDU-A and PDU-B (dual-feed per row) |
| UPS infrastructure | UPS Room #1, #2, #3 + UPS 3 & UPS 4 |
| Power control | PCU 1, PCU 2 |
Per-Server Power Audit (Component-Level)
| Component | Spec | Qty | Each | Subtotal |
| NVIDIA B300 GPU | HGX tray, TDP 1,100 W | 8 | 1,100 W | 8,800 W |
| Intel Xeon 6776P CPU | 64-core, 350 W TDP | 2 | 350 W | 700 W |
| DDR5-6400 128 GB RDIMM | Samsung M321RAJA0MB2-CCP | 32 | ~7 W | 224 W |
| Samsung PM9D3a NVMe U.2 | Gen5 1.92/3.84 TB | 10 | ~10 W | 100 W |
| ConnectX-8 NIC (on-board) | 800 Gb/s XDR per port | 8 | ~20 W | 160 W |
| BlueField-3 3220 DPU | 400 GbE | 2 | ~40 W | 80 W |
| Intel X710-AT2 mgmt NIC | Dual 10 GbE | 1 | ~12 W | 12 W |
| System fans (80×80 mm) | High-RPM | 15 | ~18 W | 270 W |
| CPU fans (60×56 mm) | — | 6 | ~7 W | 42 W |
| PCIe switch PEX89144 | — | 1 | ~12 W | 12 W |
| VRMs / motherboard / misc | — | — | — | 175 W |
| Component sum (DC side) | 10,575 W |
| PSU conversion loss (80 PLUS Titanium ~95.5%) | +~498 W |
| Computed wall draw | ~11,073 W ≈ 11.1 kW |
| Server at wall — provisioned sustained (conservative design value used for all farm-level sizing; ~1% in-rack PDU cabling loss included) | ~14,500 W ≈ 14.5 kW |
| GPU instantaneous burst overshoot (~+6% on provisioned draw) | ~15.4 kW absolute max |
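The same audit expressed as code — a minimal sketch, assuming (as labeled above) that ~14.5 kW is the provisioned design value sitting above the raw component computation:

```python
# Per-server power audit, mirroring the table above.
COMPONENTS_W = {
    "8x B300 GPU @ 1,100 W":   8 * 1_100,   # 8,800 W
    "2x Xeon 6776P @ 350 W":   2 * 350,     #   700 W
    "32x DDR5 RDIMM @ ~7 W":   32 * 7,      #   224 W
    "10x NVMe U.2 @ ~10 W":    10 * 10,     #   100 W
    "8x CX8 NIC @ ~20 W":      8 * 20,      #   160 W
    "2x BF3 DPU @ ~40 W":      2 * 40,      #    80 W
    "X710 mgmt NIC":           12,
    "15x system fans @ ~18 W": 15 * 18,     #   270 W
    "6x CPU fans @ ~7 W":      6 * 7,       #    42 W
    "PCIe switch PEX89144":    12,
    "VRMs / board / misc":     175,
}
PSU_EFFICIENCY = 0.955        # 80 PLUS Titanium
PROVISIONED_WALL_W = 14_500   # design value used farm-wide
BURST_FACTOR = 1.06           # GPU instantaneous overshoot (~+6%)
RACK_LIMIT_W = 20_000         # per-rack hard limit

dc_sum = sum(COMPONENTS_W.values())         # 10,575 W
computed_wall = dc_sum / PSU_EFFICIENCY     # ~11,073 W
burst = PROVISIONED_WALL_W * BURST_FACTOR   # ~15,370 W

assert burst < RACK_LIMIT_W                 # never trips the 20 kW breaker
print(f"DC sum {dc_sum:,} W | computed wall {computed_wall:,.0f} W")
print(f"burst {burst:,.0f} W | margin {(RACK_LIMIT_W - burst) / 1_000:.1f} kW")
```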
Rack Power vs 20 kW Hard Limit
| Item | Value |
| 20 kW hard limit per rack | 20.0 kW |
| Server sustained peak (at wall, incl. PDU) | ~14.5 kW |
| Absolute instantaneous burst | ~15.4 kW |
| Safety margin (sustained) | ~5.5 kW ✅ |
| Safety margin (burst) | ~4.6 kW ✅ |
✅ Within limit. The server fits the 20 kW hard limit in all operational modes. Even at the absolute burst peak (~15.4 kW) the margin is 4.6 kW — no risk of breaker trips.
Farm Zone Power Summary
| Zone | Rack count | Sustained peak | Burst peak | Hard limit |
| 32 compute racks (C01–C32) | 32 | ~464 kW | ~492 kW | 640 kW |
| 2 switch/fabric racks (N1–N2) | 2 | ~17.6 kW | ~17.6 kW | 40 kW |
| Cooling / UPS overhead | — | ~20 kW | ~20 kW | — |
| Total farm peak | 34 | ~501 kW | ~530 kW | 1,000 kW |
| Remaining headroom | — | ~498 kW | ~470 kW (47%) | — |
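The farm-level roll-up is simple enough to verify inline (a sketch; zone figures taken from the table above):

```python
# Farm zone power roll-up and headroom check.
SERVER_SUSTAINED_KW, SERVER_BURST_KW = 14.5, 15.4
COMPUTE_RACKS, SWITCH_ZONE_KW, OVERHEAD_KW = 32, 17.6, 20.0
BUDGET_KW = 1_000

sustained = COMPUTE_RACKS * SERVER_SUSTAINED_KW + SWITCH_ZONE_KW + OVERHEAD_KW
burst = COMPUTE_RACKS * SERVER_BURST_KW + SWITCH_ZONE_KW + OVERHEAD_KW

print(f"sustained ~{sustained:.1f} kW | burst ~{burst:.1f} kW")     # ~501.6 / ~530.4
print(f"headroom at burst ~{BUDGET_KW - burst:.0f} kW "
      f"({(BUDGET_KW - burst) / BUDGET_KW:.0%})")                   # ~470 kW (47%)
```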
Zone Allocation
| Zone | Capacity | Allocation | Spare |
| 32-rack compute zone | 32 racks | 32× ASUS XA NB3I-E12 B300 servers (C01–C32) | 0 |
| 6-rack network zone | 6 racks | N1 (IB fabric) + N2 (Eth/UFM/OOB) | 4 (N3–N6 reserved) |
Compute Zone — 32-Rack Grid (C01–C32)
- Alternating cold aisle / hot aisle layout — conditioned air from CRAC #1 and CRAC #2
- Dual-feed from PDU-A and PDU-B across all rows — N+1 power path redundancy
- Each rack: server at U1–U9; patch panel at U11; cable management U12–U13; upper 29U empty
- Each rack exits 12 cables: 8× IB AOC + 2× Ethernet AOC + 2× Cat6A → network tower
Network Tower — N1 & N2 Rack Layout
| Rack | U position | Hardware | U used | U spare |
| N1 | U1–U48 | 12× Q3400-RA IB switches (4U each = 48U) | 48U | — |
| N2 | U1–U2 | Spectrum-4 #1 (2U) | 6U | 36U |
| | U3–U4 | Spectrum-4 #2 (2U) | | |
| | U5 | UFM Appliance (1U) | | |
| | U6 | OOB 10 GbE Management Switch (1U) | | |
| N3–N6 | — | Empty — reserved for future scale-out | 0U | 4 × 42U |
Design principle: All leaf-to-spine cables (inter-switch, intra-tower) are short passive copper DAC ≤3 m — all switches co-located in N1. Long AOC runs only cross the compute-to-network inter-zone gap (5–15 m).
Compute Server — ASUS XA NB3I-E12 (HGX B300)
Form factor: 9U, air-cooled (front-to-back airflow) | Quantity: 32 units (one per rack, C01–C32)
Per-Server Hardware BOM
| Component | Model / Spec | Qty |
| GPU | NVIDIA Blackwell Ultra B300 (HGX tray), TDP 1,100 W | 8 |
| GPU memory | 288 GB HBM3e per GPU (12-high stacks) = 2.304 TB / server | — |
| CPU | Intel Xeon Platinum 6776P, 64 cores, 350 W TDP | 2 |
| System RAM | Samsung M321RAJA0MB2-CCP 128 GB DDR5-6400 RDIMM = 4 TB | 32 |
| Boot SSD | Samsung PM9D3a U.2 Gen5 NVMe 1.92 TB | 2 |
| Data SSD | Samsung PM9D3a U.2 Gen5 NVMe 3.84 TB | 8 |
| IB NIC (GPU fabric) | ConnectX-8 (CX8), soldered on HGX tray, 800 Gb/s XDR IB | 8 |
| DPU | NVIDIA BlueField-3 3220, 400 Gb/s Ethernet | 2 |
| Management NIC | Intel X710-AT2, dual-port 10 GbE RJ45 | 1 |
| TPM | TPM security module | 1 |
| NVLink interconnect | NVLink Gen 5 (HGX tray) — 1.8 TB/s per GPU, 14.4 TB/s total | — |
| PSU | 80 PLUS Titanium, 5+5 array (2N redundancy, one bank per feed) | 10 |
| UFM Agent | Software only — installed on server OS | 1 |
Per-Server Performance
| Metric | Value |
| GPU compute (FP8 dense, 8-GPU) | ~36 PFLOPS |
| GPU compute (NVFP4 sparse, 8-GPU) | ~240 PFLOPS |
| Intra-server NVLink bandwidth | 14.4 TB/s |
| GPU-to-GPU latency (intra-server) | < 100 ns |
| System memory bandwidth | ~600 GB/s |
| Total NVMe storage | 34.56 TB |
| Operating power | ~14.5 kW |
| Peak sustained wall draw | ~14.5 kW |
| Absolute burst peak | ~15.4 kW |
| Per-rack hard limit | 20 kW |
32-Server Farm Aggregates
| Metric | Value |
| Total GPUs | 256 × NVIDIA B300 |
| Total GPU memory | 73.73 TB HBM3e |
| Total system RAM | 128 TB DDR5 |
| Total NVMe storage | ~1,106 TB |
| Peak compute (FP8 dense) | ~1,152 PFLOPS |
| Peak compute (NVFP4 sparse) | ~7,680 PFLOPS |
| NVLink aggregate (farm) | 32 × 14.4 TB/s = 460.8 TB/s |
| Peak power (sustained wall) | ~464 kW |
| Absolute burst peak | ~492 kW |
| Thermal output | ~355 kW = ~1,211,000 BTU/hr |
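The BTU/hr figure is a straight unit conversion from the report's thermal estimate (sketch; 1 kW ≈ 3,412 BTU/hr):

```python
# kW-to-BTU/hr conversion used for cooling sizing.
KW_TO_BTU_HR = 3412.14

thermal_kw = 355                                   # report's sustained thermal estimate
print(f"{thermal_kw * KW_TO_BTU_HR:,.0f} BTU/hr")  # ~1,211,000 BTU/hr
```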
Network Ports Per Server
| Port type | Component | Count | Speed | Target |
| CX8 IB (GPU fabric) | HGX tray on-board (1 per GPU) | 8 | 800 Gb/s XDR | Q3400-RA leaf switches |
| BF3 Ethernet (storage) | BlueField-3 3220 add-in card | 2 | 400 GbE | Spectrum-4 (active-active) |
| X710 management | Intel X710-AT2 dual-port | 2 | 10 GbE | OOB Management switch |
| Total per server | | 12 | | |
32-Server Farm Port Totals
| Fabric | Per server | × 32 servers | Speed | Target switch |
| CX8 IB (compute) | 8 | 256 ports | 800 Gb/s XDR | Q3400-RA leaf (L0–L7) |
| BF3 Ethernet (storage) | 2 | 64 ports | 400 GbE | Spectrum-4 #1 + #2 (active-active) |
| OOB management | 2 | 64 server + 15 infra = 79 total | 10 GbE | OOB switch (96-port) |
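Port accounting as a sketch (counts from the two tables above; the 15 infrastructure drops are the 12 Q3400-RA management ports, 2 Spectrum-4, and 1 UFM Appliance; the dict keys are illustrative labels):

```python
# Farm port accounting per fabric.
SERVERS = 32
PER_SERVER = {"cx8_ib": 8, "bf3_eth": 2, "oob_mgmt": 2}

farm = {k: v * SERVERS for k, v in PER_SERVER.items()}
oob_infra = 12 + 2 + 1                    # switch mgmt + Spectrum-4 + UFM
oob_total = farm["oob_mgmt"] + oob_infra  # 79 of 96 OOB switch ports

print(farm)                 # {'cx8_ib': 256, 'bf3_eth': 64, 'oob_mgmt': 64}
print(f"OOB switch ports used: {oob_total} / 96")
```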
InfiniBand GPU Rail Assignment
Each server's 8 CX8 ports map to 8 independent GPU rails. All GPU-i ports across all 32 servers land on the same leaf switch Li, isolating GPU-to-GPU communication by rail.
Rail isolation principle: AllReduce within a rail (the dominant training pattern) traverses zero spine hops — leaf-only, sub-100 ns latency. Only cross-rail AllToAll requires the 2-hop leaf→spine→leaf path (see the mapping sketch after the diagram below).
        ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐
SPINES  │  Spine S0  │   │  Spine S1  │   │  Spine S2  │   │  Spine S3  │
(×4)    │  Q3400-RA  │   │  Q3400-RA  │   │  Q3400-RA  │   │  Q3400-RA  │
        │  144p XDR  │   │  144p XDR  │   │  144p XDR  │   │  144p XDR  │
        └──────┬─────┘   └──────┬─────┘   └──────┬─────┘   └──────┬─────┘
               │                │                │                │
          (32 uplinks per leaf × 8 leaves = 256 spine ports)
               │                │                │                │
        ┌──────┴───┐   ┌────────┴─┐         ┌────┴─────┐   ┌──────┴───┐
LEAVES  │ Leaf L0  │   │ Leaf L1  │   ...   │ Leaf L6  │   │ Leaf L7  │
(×8)    │ Rail 0   │   │ Rail 1   │         │ Rail 6   │   │ Rail 7   │
        │ Q3400-RA │   │ Q3400-RA │         │ Q3400-RA │   │ Q3400-RA │
        └────┬─────┘   └────┬─────┘         └────┬─────┘   └────┬─────┘
             │              │                    │              │
        (32 downlinks per leaf — one CX8[n] from every server)
             │              │                    │              │
         C01..C32       C01..C32             C01..C32       C01..C32
          CX8[0]         CX8[1]               CX8[6]         CX8[7]
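The wiring rule behind the diagram can be stated in a few lines of code. This is a sketch of the mapping described above, not a generated cabling plan; `leaf_for` and `path` are hypothetical helper names, not part of any NVIDIA tooling:

```python
# Rail-optimized port mapping: CX8 port i of every server lands on leaf Li,
# so a per-rail AllReduce never leaves its leaf switch.
SERVERS, RAILS, SPINES = 32, 8, 4

def leaf_for(server: int, cx8_port: int) -> str:
    """Leaf switch for a given server's CX8 port (port index == rail)."""
    assert 0 <= server < SERVERS and 0 <= cx8_port < RAILS
    return f"L{cx8_port}"

# Every member of rail 3 shares leaf L3 -> AllReduce is leaf-local.
assert {leaf_for(s, 3) for s in range(SERVERS)} == {"L3"}

def path(src_rail: int, dst_rail: int, spine: int = 0) -> list[str]:
    """Switch path between rails: same rail = leaf-only, else 2-hop via a spine."""
    if src_rail == dst_rail:
        return [f"L{src_rail}"]                    # 0 spine hops
    return [f"L{src_rail}", f"S{spine}", f"L{dst_rail}"]

print(path(0, 0))   # ['L0']             -> intra-rail, leaf-only
print(path(0, 7))   # ['L0', 'S0', 'L7'] -> cross-rail, leaf-spine-leaf
```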
NVIDIA Quantum-X800 Q3400-RA Specifications
| Parameter | Value |
| Form factor | 4U chassis |
| Port count | 144 × 800 Gb/s XDR IB (72 OSFP cages, 2 ports each) |
| Switching capacity | 115.2 Tb/s non-blocking |
| UFM management port | Dedicated OSFP 400 Gb/s (isolated from data ports) |
| SHARP version | Gen 4 — in-network AllReduce acceleration |
| Routing | Adaptive routing + telemetry congestion control |
| SerDes | 200 Gb/s |
| Protocol | InfiniBand XDR, RDMA |
Leaf Switch Design (×8 — L0 through L7)
| Parameter | Value |
| Count | 8 leaf switches (one per GPU rail) |
| Server downlinks | 32 (one CX8 port from each server on that rail) |
| Spine uplinks | 32 (8 parallel links to each of 4 spines) |
| Port utilization | 64 of 144 (44%) — room for ~32 more servers (up to ~64 total) |
| Downlink cable type | AOC 5–15 m (server → leaf, cross-zone) |
| Uplink cable type | DAC ≤3 m (leaf → spine, intra-N1) |
Spine Switch Design (×4 — S0 through S3)
| Parameter | Value |
| Count | 4 spine switches |
| Leaf-facing ports | 64 (8 parallel uplinks × 8 leaf switches) |
| Server connections | None — pure leaf-to-leaf forwarding |
| Port utilization | 64 of 144 (44%) |
| Cable type | DAC ≤3 m (all intra-N1) |
Fat-Tree Bandwidth & Latency
| Metric | Value |
| Server-side aggregate BW | 256 × 800 Gb/s = 204.8 Tb/s |
| Spine bisection BW | 204.8 Tb/s |
| Bisection ratio | 1:1 — fully non-blocking ✅ |
| Max hops server-to-server | 2 (leaf → spine → leaf) |
| IB latency server-to-server | < 200 ns |
| Intra-rail latency (same leaf) | < 100 ns |
| SHARP Gen 4 AllReduce gain | Up to ~50% fabric-traffic reduction for typical large-message training collectives |
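The 1:1 claim can be checked mechanically (sketch; port counts from the leaf/spine tables above):

```python
# Non-blocking check for the 2-tier fat-tree.
LEAVES, SPINES = 8, 4
PORT_GBPS = 800
DOWNLINKS_PER_LEAF = 32          # one CX8 per server, per rail
UPLINKS_PER_LEAF = 8 * SPINES    # 8 parallel links to each of 4 spines

server_side_tbps = LEAVES * DOWNLINKS_PER_LEAF * PORT_GBPS / 1_000  # 204.8
spine_side_tbps = LEAVES * UPLINKS_PER_LEAF * PORT_GBPS / 1_000     # 204.8

assert server_side_tbps == spine_side_tbps   # 1:1, fully non-blocking
print(f"bisection {spine_side_tbps} Tb/s | "
      f"oversubscription {server_side_tbps / spine_side_tbps:.0f}:1")
```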
Rack N1 houses all 12 × Q3400-RA switches co-located. Co-location enables all inter-switch links to use short passive DAC — no active components in the spine-to-leaf forwarding path.
Internal Wiring (Leaf → Spine)
| Link | Cable type | Length | Count |
| Each leaf → each spine (8 parallel links per pair) | Passive copper DAC | ≤ 3 m | 256 DAC cables |
| Unique leaf-spine pairs | — | — | 32 pairs (8 leaves × 4 spines) |
SHARP Gen 4 In-Network Computation
- All 12 Q3400-RA support SHARP Gen 4
- Partial gradient sums computed inside switches during AllReduce — reduces fabric BW by up to ~50% (first-order traffic sketch after this list)
- Particularly impactful for LLM pretraining (GPT/LLaMA scale) with large gradient tensors
- Orchestrated automatically by UFM Appliance — no application code changes required
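To see where "up to ~50%" comes from, a first-order traffic model helps. This is a hypothetical illustration (classic ring-AllReduce byte counts versus an idealized in-network reduction tree), not NVIDIA's published SHARP arithmetic; the 4 GB bucket size is an assumed example:

```python
# First-order AllReduce traffic comparison per GPU link.
def ring_bytes_per_gpu(size_bytes: float, n: int) -> float:
    # Classic ring AllReduce: each GPU injects 2*(n-1)/n * S onto the fabric.
    return 2 * (n - 1) / n * size_bytes

def tree_reduce_bytes_per_gpu(size_bytes: float) -> float:
    # In-network reduction: each endpoint sends S up the tree once and
    # receives the reduced result once -> ~S per direction on its link.
    return size_bytes

S = 4e9    # assumed 4 GB gradient bucket
N = 32     # one rail: 32 endpoints

ring = ring_bytes_per_gpu(S, N)          # ~7.75 GB per GPU
sharp = tree_reduce_bytes_per_gpu(S)     # ~4.00 GB per GPU
print(f"ring {ring / 1e9:.2f} GB | in-network {sharp / 1e9:.2f} GB "
      f"({1 - sharp / ring:.0%} less)")  # ~48% less at N=32
```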
Physical Layout
| U position | Hardware | Role |
| U1–U2 | NVIDIA Spectrum-4 #1 | AI Ethernet — 128× 400 GbE, 51.2 Tb/s |
| U3–U4 | NVIDIA Spectrum-4 #2 | AI Ethernet — active-active pair |
| U5 | NVIDIA UFM Appliance | IB fabric manager (hardware 1U) |
| U6 | 10 GbE OOB Switch | Out-of-band management (96+ ports) |
| U7–U42 | Empty (36U) | Patch panels, cable trays, future expansion |
Spectrum-4 AI Ethernet Fabric
| Parameter | Value |
| ASIC process | TSMC 4N |
| Switching capacity | 51.2 Tb/s per switch |
| Port configuration | 128 × 400 GbE (or 64 × 800 GbE) |
| Protocol | Ethernet + RoCE (Spectrum-X AI Ethernet) |
| BF3 ports per switch | 32 (64 total ÷ 2 switches) |
| Redundancy model | Active-active — each BF3 connects one port to each Spectrum-4 |
| Inter-switch spine link | 2× 400 GbE direct link between Sp4 #1 and #2 |
UFM Appliance
| Parameter | Value |
| Form factor | 1U hardware appliance |
| Management connection | Via OOB switch (10 GbE) — not on IB data fabric |
| Manages | 12× Q3400-RA switches + 256× CX8 endpoints |
| UFM hardware capacity | Up to 648 ports — current usage well within range |
| Scale headroom | Supports doubling cluster to 64 servers without new UFM hardware |
| UFM Agent (SW) | 32 instances — per-server OS, no extra hardware |
UFM Appliance functions: topology discovery, routing table computation, adaptive routing, SHARP orchestration, fault detection/recovery, per-port telemetry aggregation.
AOC requirement: Passive copper DAC is only reliable ≤3 m at 800 Gb/s signaling. The inter-zone gap between compute racks and the network tower is 5–15 m, requiring Active Optical Cable (AOC) on all server-to-switch links.
| Link | Cable type | Length | Count | Notes |
| Server CX8 → Q3400-RA leaf (800 Gb/s XDR) | AOC | 5–15 m | 256 | 8 per server × 32 servers |
| Server BF3 → Spectrum-4 (400 GbE) | AOC | 5–15 m | 64 | 2 per server × 32 servers |
| Server X710 → OOB switch (OS mgmt + BMC) | Cat6A copper | 5–15 m | 64 | 2 per server × 32 servers |
| Q3400-RA leaf → Q3400-RA spine (800 Gb/s XDR) | Passive DAC | ≤ 3 m | 256 | All intra-N1 |
| Spectrum-4 #1 ↔ #2 (inter-switch spine) | DAC or AOC | ≤ 3 m | 2 | Intra-N2 |
| UFM Appliance → OOB switch | Cat6A | ≤ 1 m | 1 | Intra-N2 |
| Q3400-RA UFM mgmt port → OOB switch | DAC / SFP+ | ≤ 3 m | 12 | Intra-N1/N2 |
Cable Totals
| Type | Count |
| AOC (OSFP, 800 Gb/s XDR) | 256 |
| AOC (QSFP-DD, 400 GbE) | 64 |
| Cat6A copper (10 GbE) | 65 |
| Passive DAC (intra-network-tower) | ~270 |
| Total cable assemblies | ~655 |
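The cable BOM falls straight out of the wiring tables (sketch; all counts mirror the rows above):

```python
# Cable BOM derivation.
SERVERS, LEAVES, SPINES = 32, 8, 4

aoc_osfp_800g = SERVERS * 8                   # CX8 -> leaf
aoc_400g      = SERVERS * 2                   # BF3 -> Spectrum-4
cat6a         = SERVERS * 2 + 1               # X710 pairs + UFM appliance
dac           = LEAVES * SPINES * 8 + 2 + 12  # leaf-spine + Sp4 pair + sw mgmt

total = aoc_osfp_800g + aoc_400g + cat6a + dac
print(aoc_osfp_800g, aoc_400g, cat6a, dac, total)   # 256 64 65 270 655
```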
Networking Hardware
| Component | Count | Form factor | Role |
| NVIDIA Q3400-RA — Leaf | 8 | 4U | 32 CX8 downlinks + 32 spine uplinks per switch; 1:1 non-blocking |
| NVIDIA Q3400-RA — Spine | 4 | 4U | 64 leaf-facing ports; pure any-to-any forwarding plane |
| Total Q3400-RA | 12 | — | 115.2 Tb/s each, 144 × 800 Gb/s XDR ports each |
| NVIDIA Spectrum-4 (Ethernet) | 2 | 2U | 128× 400 GbE active-active, 51.2 Tb/s each |
| 10 GbE OOB Management Switch | 1 | 1U | 96-port; 32 OS mgmt + 32 BMC/IPMI + 12 switch mgmt + 2 Spectrum-4 + 1 UFM = 79 connections |
| NVIDIA UFM Appliance | 1 | 1U | Centralized IB fabric manager |
| UFM Agent (software) | 32 | SW only | Per-server OS agent — no extra hardware |
| Grand total (hardware) | 16 units | — | 12 Q3400-RA + 2 Spectrum-4 + 1 OOB + 1 UFM Appliance |
Rack Infrastructure
| Item | Count |
| Compute racks (42U) — C01–C32 | 32 |
| Network/switch racks — N1 (48U) + N2 (42U) | 2 |
| Spare network positions — N3–N6 | 4 (reserved) |
| Total rack positions occupied | 34 |
Aggregate Bandwidth Summary
| Fabric layer | Calculation | Total bandwidth |
| IB compute — server side | 256 CX8 × 800 Gb/s | 204.8 Tb/s |
| IB compute — spine bisection | 8 leaf × 32 uplinks × 800 Gb/s | 204.8 Tb/s (1:1 non-blocking) |
| NVLink (per server) | NVLink Gen 5, 1.8 TB/s × 8 GPUs | 14.4 TB/s |
| NVLink (farm aggregate) | 32 × 14.4 TB/s | 460.8 TB/s |
| BF3 Ethernet fabric (farm) | 64 × 400 Gb/s | 25.6 Tb/s |
| Spectrum-4 switching (combined) | 2 × 51.2 Tb/s | 102.4 Tb/s |
| OOB management | 79 × 10 GbE | 790 Gb/s |
Compute & Capacity Summary
| Metric | Per server | Farm total (×32) |
| FP8 dense (GPU peak) | ~36 PFLOPS | ~1,152 PFLOPS |
| NVFP4 sparse (GPU peak) | ~240 PFLOPS | ~7,680 PFLOPS |
| GPU memory (HBM3e) | 2.304 TB | 73.73 TB |
| HBM3e bandwidth | ~64 TB/s (8 × ~8 TB/s per GPU) | ~2,048 TB/s |
| System RAM (DDR5) | 4 TB | 128 TB |
| NVMe storage | 34.56 TB | ~1,106 TB |
| IB network BW | 6.4 Tb/s (8 × 800 Gb/s) | 204.8 Tb/s bisection |
| Ethernet BW (BF3) | 800 Gb/s | 25.6 Tb/s |
Topology Diagrams
Three diagrams were produced, available in .drawio (editable) and .png (rendered) form in network-topology-diagrams/.
End-to-end connectivity block flow — shows all major components and their interconnections:
- External uplink path: Internet → Router → Firewall → Border Leaf
- Compute fabric: Spectrum-4 active-active → Q3400-RA IB Fabric (leaves + spines) → 32× B300 servers
- Management path: OOB switch → all servers (BMC + OS) + UFM Appliance → IB fabric managers
- Zone separation clearly indicated: compute zone (right) vs. network tower (center)
NVIDIA-style full IB topology — 3-row layout matching NVIDIA reference diagrams:
- Row 1 (Spines): 4× Q3400-RA S0–S3 (orange)
- Row 2 (Leaves): 8× Q3400-RA L0–L7 — each color-coded by rail (matching the 8 GPU rail colors)
- Row 3 (Servers): 32× compute nodes C01–C32 grouped in 4 sets of 8
- 256 rail-colored CX8→Leaf connections (colored by rail) + 32 orange spine-to-leaf connections
- Ethernet section: Spectrum-4 #1/#2, UFM, OOB — 64 BF3→Spectrum-4 connections
Per-port detail for a single representative server rack (C01) — C02–C32 are identical:
- 8× CX8 NDR800 AOC connections → 8 leaf switches (color-coded by rail, 2.5 px)
- 8× leaf → 4× spine links (orange, 8×4 = 32 total leaf-spine connections)
- BF3-0 → Spectrum-4 #1 | BF3-1 → Spectrum-4 #2 (purple AOC)
- X710 Port 0 + Port 1 → OOB switch (yellow dashed = management)
- UFM Appliance → server (blue dashed = UFM Agent SW) + UFM → Leaf L0 (blue dashed = IB fabric mgmt)
- Connection color legend at bottom
Design Decision Summary
| Decision | Choice | Rationale |
| Server per rack | 1 server / rack (9U, ~14.5 kW) | 20 kW/rack hard limit; 5.5 kW sustained margin, 4.6 kW burst margin |
| IB switch model | Q3400-RA (144 × 800 Gb/s XDR) | NVIDIA reference for HGX B300 training; 44% port utilization = scale-out headroom |
| IB topology | 2-tier rail-optimized fat-tree | GPU rail isolation: AllReduce = 0 spine hops; 1:1 non-blocking BW guarantee |
| Leaf count | 8 leaf switches | One per GPU rail (GPU-0…GPU-7 across 32 servers) |
| Spine count | 4 spine switches | Sets bisection = server BW (1:1 non-blocking guarantee) |
| Total Q3400-RA | 12 units | 8 leaf + 4 spine; NVIDIA validated configuration for 32-node B300 clusters |
| Max hops | 2 hops | Critical for NCCL; keeps server-to-server latency < 200 ns |
| Dual BF3 / server | Active-active → 2× Spectrum-4 | Full 800 Gb/s Ethernet per server; single-switch failure resilience |
| Spectrum-4 count | 2 (active-active) | Each BF3 connects to both; per-server BW = 800 Gb/s with redundancy |
| UFM architecture | 1 appliance + 32 SW agents | UFM is centralized by design; 648-port capacity covers current + scale-out |
| Network tower isolation | Dedicated racks N1–N2 | Predictable compute rack power envelopes; independent power feeds and cooling for switches |
| Cross-zone cables | AOC 5–15 m | DAC only works ≤3 m at 800 Gb/s; inter-zone runs are 5–15 m |
| Intra-tower cables | DAC ≤3 m | All switches co-located; cheaper, lower latency than AOC for short runs |
| SHARP Gen 4 | Enabled (via Q3400-RA) | Up to 50% AllReduce BW reduction at 32-node scale for LLM gradient sync |
| Air cooling | Front-to-back (per ASUS XA design) | CRAC #1 & #2 support hot/cold aisle containment; no liquid cooling required |
| Spare network positions | N3–N6 reserved | Future scale-out: additional spines/Spectrum-4 when cluster exceeds 32 servers |
Risk Assessment & Mitigation
| Severity | Risk | Mitigation |
| HIGH | Leaf switch failure | UFM reroutes within seconds; training checkpointing limits job loss; cluster continues at 7/8 rail capacity |
| HIGH | Thermal runaway / CRAC failure | CRAC #1 + #2 redundancy; BMC thermal throttling at GPU junction limit; B300 has independent per-chip thermal shutdown |
| MED | GPU burst overshoot trips breaker | 4.6 kW burst margin at 20 kW limit; PDU per-outlet current monitoring; BMC IPMI power capping |
| MED | Single AOC cable failure | CX8 port fails over via UFM adaptive routing; per-GPU rail isolation — 1 cable = 1 rail affected, not all GPUs |
| MED | UFM Appliance failure | IB fabric continues with stale routing tables; hot-standby UFM VM on management server recommended for production |
| LOW | Spine switch failure | 4 spines provide redundant paths; adaptive routing redistributes across the remaining 3 spines automatically |
| LOW | Spectrum-4 failure | Active-active pair; BF3 bonding fails over to surviving switch; storage BW halved but no outage |
| LOW | PDU-A failure | Dual-feed PDU-A/PDU-B; 5+5 PSU pairs each bank to one feed; single PDU failure does not crash a server |
| PLAN | Scale beyond 32 servers | Q3400-RA at 44% utilization supports ~32 more servers per leaf (up to ~64 total); 4 spare N3–N6 rack positions available |