
How Networks Cope with AI Demands: Scale Up, Scale Out, and Scale Across

Bob Metcalfe · January 7, 2026 · 22 min

Understanding AI Networking

  • 💡 The episode clarifies the difference between AI for networks (using AI to manage existing networks) and networking for AI (building networks to handle modern AI workloads).
  • 🧠 Praful Lalchandani, VP of Product Management at HPE Networking, explains that large-scale AI is a distributed computing problem spanning thousands of GPUs, which makes the network a critical enabler of AI infrastructure.

Evolution of Ethernet

  • 📜 Ethernet, invented by Bob Metcalfe and David Boggs in 1973, initially used coaxial copper cables and a protocol called Carrier Sense Multiple Access with Collision Detection (CSMA/CD).
  • 🚀 Although Ethernet was originally designed for best-effort delivery over a shared medium, modern Ethernet runs over switched, point-to-point links and, with significant investment, is being upgraded to deliver the congestion-free, low-latency performance that AI workloads require.
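
On the original shared coaxial cable, CSMA/CD resolved collisions with truncated binary exponential backoff. The sketch below models only that backoff choice, using the classic IEEE 802.3 half-duplex parameters (the doubling stops after 10 collisions, the frame is dropped after 16 attempts, one slot time is 512 bit times); the function names are hypothetical, and carrier sensing, jamming, and real timing are not modeled.

```python
import random

# Classic shared-medium Ethernet parameters, assumed here for illustration.
BACKOFF_LIMIT = 10    # the backoff range stops doubling after 10 collisions
ATTEMPT_LIMIT = 16    # the frame is discarded after 16 failed attempts
SLOT_TIME_BITS = 512  # one slot time, measured in bit times


def backoff_slots(collisions: int, rng: random.Random) -> int:
    """Return how many slot times to wait after the n-th collision on a frame.

    Truncated binary exponential backoff: choose k uniformly from
    [0, 2**min(n, BACKOFF_LIMIT) - 1].
    """
    if collisions < 1:
        raise ValueError("backoff only applies after at least one collision")
    if collisions >= ATTEMPT_LIMIT:
        raise RuntimeError("excessive collisions: frame is discarded")
    exponent = min(collisions, BACKOFF_LIMIT)
    return rng.randint(0, 2**exponent - 1)  # randint includes both endpoints


def backoff_bit_times(collisions: int, rng: random.Random) -> int:
    """Convert the chosen backoff into bit times (k * slot time)."""
    return backoff_slots(collisions, rng) * SLOT_TIME_BITS


if __name__ == "__main__":
    rng = random.Random(1973)  # fixed seed so runs are repeatable
    for n in (1, 2, 3, 10, 15):
        print(f"after collision {n}: wait {backoff_slots(n, rng)} slots")
```

Because the waiting range doubles with each successive collision, stations that keep colliding spread their retries out quickly, which is how the shared medium recovered from contention without a central arbiter.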

AI Network Scaling Architectures

  • 🧩 Scale-up networking refers to the interconnect among the GPUs within a single rack, which lets them be treated as a single, unified memory address space.
  • 🌐 Scale-out networking connects multiple racks of GPUs so that a training job can span thousands of them, often over a leaf-spine fabric.
  • 🌍 Scale-across networking links multiple data centers, often 10-20 kilometers apart, to work around the power constraints of any single site and to run training jobs on geographically dispersed GPUs.
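
To see how a two-tier leaf-spine fabric reaches thousands of GPUs, here is a back-of-the-envelope sizing sketch. It assumes a non-blocking (1:1) design in which every switch has the same port count (radix), each leaf splits its ports evenly between GPU-facing downlinks and spine-facing uplinks, and each GPU uses one fabric port; the function name and radix values are illustrative, not figures from the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafSpineSize:
    spines: int
    leaves: int
    gpu_ports: int


def max_two_tier_fabric(radix: int) -> LeafSpineSize:
    """Largest non-blocking two-tier leaf-spine built from radix-port switches.

    Each leaf uses radix/2 ports down (to GPUs) and radix/2 ports up
    (one to every spine); each spine port connects to a different leaf.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    spines = radix // 2                # one uplink from every leaf to every spine
    leaves = radix                     # each spine has one port per leaf
    gpu_ports = leaves * (radix // 2)  # downlinks per leaf times number of leaves
    return LeafSpineSize(spines, leaves, gpu_ports)


if __name__ == "__main__":
    for radix in (32, 64, 128):
        print(radix, max_two_tier_fabric(radix))
```

With 64-port switches, for example, this gives 32 spines, 64 leaves, and 2,048 GPU-facing ports; going beyond that generally means higher-radix switches or a third switching tier.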

Addressing Network Challenges

  • ⚡ Bandwidth needs for AI are "insane": per-GPU link speeds are rapidly rising from 400 Gb/s to 800 Gb/s, far beyond the 25 Gb/s typical of general-purpose data centers.
  • 🚦 Unlike standard best-effort Ethernet, AI networks must be lossless and congestion-free so that expensive GPUs do not sit idle waiting for data; this is achieved with technologies that monitor traffic and route it in real time.
  • 🔌 Copper cables work for short runs (2-3 meters) within a rack, but fiber optics are essential for the longer distances in scale-out and scale-across architectures, particularly for inter-data-center links, which may also require encryption.
  • 💧 Liquid cooling is increasingly being adopted for networking equipment in AI data centers, not just for GPUs, both to improve power efficiency and to take advantage of the liquid cooling already installed in GPU racks.
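
The bandwidth and distance figures above can be combined into a rough link model. This sketch assumes a copper reach of 3 m (the upper end of the 2-3 m quoted in the episode), a fiber propagation delay of about 5 µs per kilometer (light travels at roughly two-thirds of its vacuum speed in glass), and a hypothetical 1 GB payload; the helper names are illustrative, and real deployments depend on the specific cables, optics, and protocols in use.

```python
# Rough, illustrative link model; all constants are assumptions.
COPPER_MAX_METERS = 3.0  # upper end of the 2-3 m copper reach
FIBER_US_PER_KM = 5.0    # approximate one-way propagation delay in fiber


def pick_medium(distance_m: float) -> str:
    """Choose copper for short in-rack runs and fiber for anything longer."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return "copper" if distance_m <= COPPER_MAX_METERS else "fiber"


def serialization_ms(payload_bytes: float, gbps: float) -> float:
    """Milliseconds to clock payload_bytes onto a link running at gbps Gb/s."""
    return payload_bytes * 8 / (gbps * 1e9) * 1e3


def fiber_delay_us(distance_km: float) -> float:
    """One-way propagation delay, in microseconds, over distance_km of fiber."""
    return distance_km * FIBER_US_PER_KM


if __name__ == "__main__":
    payload = 1e9  # hypothetical 1 GB transfer
    for rate in (25, 400, 800):
        print(f"{rate:>3} Gb/s: {serialization_ms(payload, rate):.1f} ms")
    print(pick_medium(2.0), pick_medium(50.0))
    for km in (10, 20):
        print(f"{km} km of fiber: ~{fiber_delay_us(km):.0f} us one way")
```

At 800 Gb/s the same transfer takes one thirty-second of the time it takes at 25 Gb/s (about 10 ms versus 320 ms), while 10-20 km of fiber adds roughly 50-100 µs of one-way delay before any switching or queuing.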

Future Outlook for AI Networking

  • 📈 Networks will continue to adapt rapidly as GPUs evolve and demand more speed, making AI networking a constantly moving space.
  • 🌌 Consumer home networks won't see terabit speeds, but advances in GPUs and networking are already benefiting consumers indirectly.
  • 🔬 The ability to fabricate chips at ever-smaller dimensions suggests that the "sky is the limit" for data transfer rates, at least for the foreseeable future.

What’s Discussed

  • Artificial Intelligence (AI)
  • Networking for AI
  • Ethernet
  • GPUs (Graphics Processing Units)
  • Distributed Computing
  • Scale Up Networking
  • Scale Out Networking
  • Scale Across Networking
  • Data Centers
  • Bandwidth
  • Congestion Management
  • Fiber Optics
  • Liquid Cooling
  • Chip Fabrication