- AWS says it has finally made flat networking practical at hyperscale
- Its Resilient Network Graphs (RNG) architecture relies on three key innovations: ShuffleBox, Spraypoint and a new fiber connector
- Flat network performance improvements and power savings could pay off for any large-scale network operator
AWS’ networking team has spent the last two decades chasing a white whale of sorts. Thanks to a few strokes of luck and a lot of math, it’s finally wrangled the beast. AWS has not only designed but also deployed a flat network at (hyper)scale, and telcos should take note.
“It’s definitely the most exciting thing I’ve ever done in my career,” AWS VP of Network Engineering Matt Rehder told Fierce. Here’s why.
Traditional networks are hierarchical, with multiple layers of routers and switches. Think of a company organizational structure, but you, your boss and your boss’ manager are routers. From a routing perspective, the path from point A to point B is straightforward.
But because traffic has to flow along predetermined paths and has to wait in line to speak to the manager, this can create chokepoints and single points of failure in the network. It also requires more active network components – using more energy – and leaves network bandwidth unused.
Flat networks eliminate these issues by having all the routers exist together on the same layer and communicate with one another at random. Picture a diagram of dots arranged in a circle with lines drawn to connect them all. While this is great for smaller networks, you can see how things can quickly get messy as you add more dots – erm, routers.
This is the problem AWS – and other hyperscalers – have been up against.
“The reason it had remained theory until now is actually randomly wiring those things up at a full data center scale is really hard. You can’t just have spaghetti cables running all over the place. It’s not operable,” Rehder said. “The other piece is actually figuring out how to program the forwarding table on each switch to get to any destination is also hard.”
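To make the wiring problem Rehder describes concrete, here is a minimal Python sketch of a flat network in which every router gets the same number of links to randomly chosen peers. It is an illustration only, not AWS' design: the function names, the router count and the link count are all assumptions made for the example.

```python
import random
from collections import deque


def random_regular_graph(n, d, seed=0):
    """Wire n routers so each has exactly d links to randomly chosen peers.

    Hypothetical helper for illustration; not an AWS algorithm.
    """
    if (n * d) % 2 or d >= n:
        raise ValueError("need n * d even and d < n")
    rng = random.Random(seed)
    while True:  # retry until a shuffle yields no self-links or duplicate links
        stubs = [router for router in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        links = set()
        valid = True
        for a, b in zip(stubs[::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            if a == b or edge in links:
                valid = False
                break
            links.add(edge)
        if valid:
            adj = {router: set() for router in range(n)}
            for a, b in links:
                adj[a].add(b)
                adj[b].add(a)
            return adj


def hop_counts(adj, src):
    """Breadth-first search: hops from src to every router it can reach."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for peer in adj[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


# Assumed sizes: 16 routers with 4 links each.
adj = random_regular_graph(16, 4)
print("worst-case hops from router 0:", max(hop_counts(adj, 0).values()))
```

Even at 16 routers the sketch produces 32 links with no regular pattern, which hints at why cabling such a graph by hand across a full data center is impractical.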
AWS’ answer to the flat network conundrum
To solve these problems, AWS came up with three key innovations: ShuffleBox, Spraypoint and new connectors. These are the foundation of its new Resilient Network Graphs (RNG) architecture.
ShuffleBox is an AWS invention that enables random routing without the need for spaghetti cabling. Rehder said the ShuffleBox “mimics the structure of a traditional Clos network” to normalize the number of cables you need to connect to it. When multiple ShuffleBoxes are connected, you get the random routing effect needed for the flat network to work.
Spraypoint, meanwhile, ensures that a source router sprays its data across all of its neighbor routers, after which the traffic is pointed along the best path to its destination. While it may sound kind of like the packet spraying approach OpenAI is using in its Multipath Reliable Connection architecture, Rehder said Spraypoint operates at the network layer (Layer 3) as opposed to packet spraying, which works at the transport layer (Layer 4).
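AWS has not published how Spraypoint works internally, so the following Python sketch is only one reading of the description above: the first hop is chosen at random among the source's neighbors (the "spray"), and every later hop goes to whichever neighbor is closest to the destination (the "point"). The six-router topology, the function names and the tie-breaking rule are all assumptions.

```python
import random
from collections import deque

# Hypothetical six-router flat network; each router has three links.
LINKS = {
    0: {1, 3, 5},
    1: {0, 2, 4},
    2: {1, 3, 5},
    3: {0, 2, 4},
    4: {1, 3, 5},
    5: {0, 2, 4},
}


def hops_to(dst, links):
    """Breadth-first search outward from dst: hop count from each router to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def spray_then_point(src, dst, links, rng):
    """Return one path: a random first hop, then shortest-path forwarding.

    Assumes the network is connected, so every router can reach dst.
    """
    if src == dst:
        return [src]
    dist = hops_to(dst, links)
    # Spray: pick any neighbor of the source, regardless of distance.
    path = [src, rng.choice(sorted(links[src]))]
    # Point: each later hop moves strictly closer to the destination.
    while path[-1] != dst:
        path.append(min(links[path[-1]], key=lambda peer: (dist[peer], peer)))
    return path


rng = random.Random(0)
for _ in range(3):
    print(spray_then_point(0, 2, LINKS, rng))
```

In this toy topology, traffic from router 0 to router 2 can leave through any of three neighbors and still arrive in two hops, which is the kind of load spreading and path redundancy a flat network is meant to provide.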
Finally, there are the connectors. While ShuffleBox helped AWS cut down on cable connections, Rehder said there are still a ton of fiber connections that need to be made. Plugging these all in can be a tedious and even onerous task because each plug can be uniquely sensitive. AWS wanted fewer, easier connections to speed deployments and make life easier for technicians. So, it built bigger plugs with locking mechanisms on them to reduce the number of connections that need to be made and ensure each plug is properly aligned.
Put it all together and you have RNG.
Rehder said RNG has already been rolled out in one data center in Spain and one in Germany, with “many more” deployments coming before the end of this year. He added RNG will be “the default network for us going forward,” but noted it won’t rip and replace the old networks running at its existing data centers until router refresh cycles kick in.
Why it matters
According to AWS, RNG performs up to a third faster than traditional networks and uses up to 40% less power. It also has better resiliency thanks to the redundant pathways that are now available between routers.
That bit about power reduction is key. “The network uses a meaningful amount of power in a data center,” Rehder said. “So by reducing that bit by 40%, it means more racks of servers in every building.”
While it’s taking a moment to celebrate, AWS’ work isn’t done. Rehder said it already has a roadmap of improvements it wants to make to RNG over the next two years or so, things that will make it easier to deploy, reduce costs and slash power consumption even further.
It’s worth noting that AWS isn’t the only one working on this front. Microsoft in November revealed its new Fairwater AI data center campus in Wisconsin uses a flat network architecture that can connect hundreds of thousands of GPUs, though this appears to be a two-tier Ethernet-based backend network. And last month, Google touted its Virgo Network, a flat, two-layer multi-planar architecture built to serve its AI hypercomputer and AI data centers. Again, though, this also appears to be a two-tier network rather than the single layer AWS has built.
“Anyone that’s building any sort of network at scale, this is a better way to do it,” Rehder concluded. “This would definitely benefit telcos.”