As chief architect and principal investigator for the Aurora supercomputer at Argonne National Laboratory in Illinois, Olivier Franza plays a leading role in bringing one of the most ambitious scientific instruments – not to mention the world’s largest GPU cluster – into existence.
Aurora is among the most anticipated and highly visible projects Intel has been a part of in recent memory – a bold bet on Intel’s entire system portfolio. The machine is expected to be the first supercomputer with a peak performance reaching 2 exaflops, or 2×10¹⁸ floating point operations per second.
That puts a bit of pressure on Franza, a 22-year Intel veteran who joined the Aurora project as system hardware architect in 2016, oversaw the pivot to a GPU-based machine and became chief architect in 2021.
“The chief architect is responsible for defining the overall system architecture of the supercomputer, according to the customer’s high-level requirements,” Franza explains. “There are fundamental ones like general performance metrics and power envelope, but also inherent features like RAS – reliability, availability, serviceability – that are essential to building a scalable system.”
His responsibilities also encompass the details of the system topology from a node to a rack to the complete system, including its networking fabric and storage components.
A Roadmap Pivot Opens Opportunity to Shape Future Products
When initial planning began for Aurora, a U.S. Department of Energy-sponsored system, the design consisted of a collection of Intel technologies. However, changes to Intel’s product roadmap, notably the end of the Xeon Phi and Omni-Path product families, required a restart. As Intel made plans to build data center GPUs, Franza became enmeshed in discussions on the design of the Intel® Data Center GPU Max Series (code-named Ponte Vecchio).
In this way, Aurora isn’t just a one-off system. Rather, it helped inform the Intel-wide strategy and product portfolio to address scale and performance at the highest level.
“We infused all the Aurora system-level requirements down to the components’ level,” Franza says.
The architecture and concept for the Intel® Xeon® CPU Max Series with high bandwidth memory, for instance, grew out of features of the Intel Xeon Phi platform, the first product to integrate an innovative memory architecture for high bandwidth and high capacity on package.
Additionally, the need for high performance drove further advances across all subsystems, from the compute blade’s thermo-mechanical solution to its dense physical integration, to storage.
“Intel ended up architecting a completely new storage concept, DAOS (distributed asynchronous object storage),” Franza says. It’s an open source software ecosystem to enable high-speed storage on traditional hardware. “Aurora will be among the first systems to use it, and by far the largest.”
From Designing Components to Bolting Together Thousands of Systems
The Aurora project drove system-level thinking and broad collaboration across various business units inside Intel, as well as with scientists at Argonne and engineers at Hewlett Packard Enterprise, the project’s other main partner.
“Getting the whole team to align and deliver a machine like Aurora is, for many of us, a once-in-a-lifetime experience,” Franza says.
Although engineers installed the final blade in June, the project continues to keep Franza up at night as the system passes through the stages of testing, stabilization and validation at scale.
He provides guidance to a large team working on system bring-up, validation, stabilization, optimization and enablement of full-system performance workloads. Most notable is the High Performance Linpack (HPL) benchmark, which determines the top systems in the world as ranked by the twice-yearly Top500 list.
Each morning, Franza joins the daily standup meeting to scrutinize nightly runs on every single node and makes a game plan for the next day’s work and beyond. Each afternoon, a daily closeout meeting summarizes progress and hurdles. The work never stops; the machine always runs.
“We have a step-by-step approach to methodically validate and stabilize at scale,” he explains. “You start with the blade, then move to the rack, then multiple racks, and you scale from there.”
Aurora is made up of 10,624 compute blades, boasting 63,744 Intel Max Series GPUs – more GPUs than any other system in the world – and 21,248 Intel Xeon Max CPUs across 166 racks.
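The figures above imply a regular layout, which a few lines of Python can check. This is only a rough sketch: it assumes GPUs and CPUs are spread evenly across compute blades and blades evenly across racks, which the article does not state outright, and the helper name `per_unit` is purely illustrative.

```python
# Totals quoted in the article.
COMPUTE_BLADES = 10_624
GPUS = 63_744
CPUS = 21_248
RACKS = 166


def per_unit(total: int, units: int) -> int:
    """Divide total by units, raising if the split is not even.

    Hypothetical helper: an even split is an assumption, not a stated fact.
    """
    quotient, remainder = divmod(total, units)
    if remainder:
        raise ValueError(f"{total} does not divide evenly across {units}")
    return quotient


gpus_per_blade = per_unit(GPUS, COMPUTE_BLADES)
cpus_per_blade = per_unit(CPUS, COMPUTE_BLADES)
blades_per_rack = per_unit(COMPUTE_BLADES, RACKS)

print(f"GPUs per blade:  {gpus_per_blade}")
print(f"CPUs per blade:  {cpus_per_blade}")
print(f"Blades per rack: {blades_per_rack}")
```

Under that even-split assumption, the totals work out to six GPUs and two CPUs per blade, with 64 blades in each rack.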
“It’s the size of four tennis courts, which sounds like a lot, right?” he says. “But it’s only when you actually go see it that you just realize the sheer magnitude of the project.”
Franza must ensure the vast system is stable, functional and performing. It’s a daunting task, but the end is within reach.
“Walking through the aisles, with all the lights on, and feeling that the machine is running is impressive and obviously extremely rewarding,” he says. “It’s a very tangible achievement that speaks for itself.”
A ‘Once-in-a-Lifetime’ Effort, a Science-Shaping Supercomputer
What keeps him going, through engineering hurdles and unexpected roadblocks, is the opportunity to build “an extraordinary machine” that will power impactful research. He cites Aurora’s enormous potential for cancer research as an area where the project will benefit us all.
“I think that’s something that is going to make us very proud,” he says.
Not only will Aurora work on solving some of the most complex scientific and engineering problems in the world, it will also be an ideal platform for running generative AI and applying it to research. “It will enable one of the biggest large language models planned to date, the 1 trillion parameter Aurora GenAI project, enhancing, enabling and easing the lives of scientists,” Franza says.
But it’s the teamwork and camaraderie he enjoys more than anything else.
“It’s an extended effort, and it requires a lot of perseverance,” he says. “The core team has maintained a marathon mentality where it’s not over until it’s over. We needed the kind of people that can effectively focus for a long time on something immensely challenging. And in the end, the accomplishment is something that very few can say they have achieved.”
Source: cyberpogo.com