- At OCP Summit 2022, we’re announcing Grand Teton, our next-generation platform for AI at scale that we’ll contribute to the OCP community.
- We’re also sharing new innovations designed to help data centers as they advance to support new AI technologies:
- A new, more efficient version of Open Rack.
- Our Air-Assisted Liquid Cooling (AALC) design.
- Grand Canyon, our new HDD storage system.
- You can view AR/VR models of our latest hardware designs at: https://metainfrahardware.com
Empowering Open, the theme of this year’s Open Compute Project (OCP) Global Summit, has always been at the heart of Meta’s design philosophy. Open-source hardware and software is, and will always be, a pivotal tool to help the industry solve problems at large scale.
Today, some of the greatest challenges our industry is facing at scale are around AI. How do we continue to facilitate and run the models that drive the experiences behind today’s innovative products and services? And what will it take to enable the AI behind the innovative products and services of the future? As we move into the next computing platform, the metaverse, the need for new open innovations to power AI becomes even clearer.
As a founding member of the OCP community, Meta has always embraced open collaboration. Our history of designing and contributing next-generation AI systems dates back to 2016, when we first announced Big Sur. That work continues today and is always evolving as we develop better ways to serve AI workloads.
After 10 years of building world-class data centers and distributed compute systems, we’ve come a long way from creating hardware independent of the software stack. Our AI and machine learning (ML) models are becoming increasingly powerful and sophisticated, and they need more high-performance infrastructure to match. Deep learning recommendation models (DLRMs), for example, have on the order of tens of trillions of parameters and can require a zettaflop of compute to train.
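To make those scales concrete, here is a rough back-of-envelope sketch. All of the sizing assumptions (parameter count, precision, sustained cluster throughput) are illustrative, not Meta’s actual figures:

```python
# Back-of-envelope sizing for a very large DLRM.
# Every number below is an illustrative assumption.

params = 10e12            # ~10 trillion parameters (assumed)
bytes_per_param = 2       # fp16/bf16 weight storage (assumed)
train_flops = 1e21        # ~1 zettaflop of total training compute

# Memory footprint of the weights alone, before optimizer state.
weight_tb = params * bytes_per_param / 1e12
print(f"weights alone: {weight_tb:.0f} TB")

# Wall-clock training time at an assumed sustained cluster throughput.
sustained_flops = 1e15    # 1 PFLOP/s sustained (assumed)
days = train_flops / sustained_flops / 86400
print(f"training time at 1 PFLOP/s sustained: {days:.1f} days")
```

Even under these conservative assumptions, the weights alone exceed any single accelerator’s memory, which is why platforms like Grand Teton emphasize host-to-GPU and fabric bandwidth as much as raw compute.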
At this year’s OCP Summit, we’re sharing our journey as we continue to enhance our data centers to meet Meta’s, and the industry’s, large-scale AI needs. From new platforms for training and running AI models, to power and rack innovations that help our data centers handle AI more efficiently, to new developments with PyTorch, our signature machine learning framework, we’re releasing open innovations to help solve industry-wide challenges and push AI into the future.
Grand Teton: AI platform
We’re announcing Grand Teton, our next-generation, GPU-based hardware platform and the follow-up to our Zion-EX platform. Grand Teton delivers multiple performance improvements over its predecessor, Zion, such as 4x the host-to-GPU bandwidth, 2x the compute and data network bandwidth, and 2x the power envelope. Grand Teton also has an integrated chassis, in contrast to Zion-EX, which comprises multiple independent subsystems.
As AI models become increasingly sophisticated, so will their associated workloads. Grand Teton has been designed with greater compute capacity to better support memory-bandwidth-bound workloads at Meta, such as our open source DLRMs. Grand Teton’s expanded operational compute power envelope also optimizes it for compute-bound workloads, such as content understanding.
The previous-generation Zion platform consists of three boxes: a CPU head node, a switch sync system, and a GPU system, and it requires external cabling to connect everything. Grand Teton integrates these into a single chassis with fully integrated power, control, compute, and fabric interfaces for better overall performance, signal integrity, and thermal performance.
This high level of integration dramatically simplifies the deployment of Grand Teton, allowing it to be introduced into data center fleets faster and with fewer potential points of failure, while providing rapid scale with increased reliability.
Rack and power innovations
Open Rack v3
The latest edition of our Open Rack hardware is here to provide a common rack and power architecture for the entire industry. To bridge the gap between current and future data center needs, Open Rack v3 (ORV3) is designed with flexibility in mind, with a frame and power infrastructure capable of supporting a wide range of use cases, including support for Grand Teton.
ORV3’s power shelf isn’t bolted to the busbar. Instead, the power shelf installs anywhere in the rack, which enables flexible rack configurations. Multiple shelves can be installed on a single busbar to support 30kW racks, while 48VDC output will support the higher power transmission needs of future AI accelerators.
It also features an improved battery backup unit, upping the capacity to four minutes, compared with the previous model’s 90 seconds, with a power capacity of 15kW per shelf. Like the power shelf, this backup unit installs anywhere in the rack for customization and delivers 30kW when installed as a pair.
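The backup figures above imply a few simple derived quantities. This sketch works them out from the numbers quoted in the text; only the generation-over-generation comparison uses an assumed load model:

```python
# Derived ORV3 battery-backup figures, from the numbers in the text.

shelf_kw = 15.0           # backup power capacity per shelf (from text)
backup_minutes = 4.0      # new backup duration (from text)
prev_minutes = 1.5        # previous generation: 90 seconds (from text)

# Two backup units installed as a pair deliver 30kW.
pair_kw = 2 * shelf_kw

# Stored deliverable energy per shelf at full rated load.
energy_kwh = shelf_kw * backup_minutes / 60.0
print(f"deliverable energy per shelf: {energy_kwh:.1f} kWh")

# Runtime improvement over the previous 90-second unit,
# assuming both are discharged at their rated load.
improvement = backup_minutes / prev_minutes
print(f"runtime improvement: {improvement:.2f}x")
```

Four minutes of ride-through at rack-scale power gives upstream systems meaningfully more time to react to a utility event than the previous 90-second window.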
Meta chose to develop almost every component of the ORV3 design through OCP from the start. While an ecosystem-led design can mean a lengthier design process than a traditional in-house design, the end product is a holistic infrastructure solution that can be deployed at scale with improved flexibility, full supplier interoperability, and a diverse supplier ecosystem.
You can join our efforts at: https://www.opencompute.org/initiatives/rack-and-power
Machine learning cooling advancements vs. cooling limits
With higher socket power comes increasingly complex thermal management overhead. The ORV3 ecosystem has been designed to accommodate several different forms of liquid cooling strategies, including air-assisted liquid cooling and facility water cooling. The ORV3 ecosystem also includes an optional blind mate liquid cooling interface design, providing dripless connections between the IT gear and the liquid manifold, which allows for easier servicing and installation of the IT gear.
In 2020, we formed a new OCP focus group, the ORV3 Blind Mate Interfaces Group, with other industry experts, suppliers, solution providers, and partners. Together we’re developing interface specifications and solutions, such as rack interfaces and structural enhancements to support liquid cooling, blind mate quick (liquid) connectors, blind mate manifolds, hose and tubing requirements, blind mate IT gear design guidelines, and various white papers on best practices.
You might be asking yourself: Why is Meta so focused on all these areas? The power trend increases we’re seeing, and the need for liquid cooling advances, are forcing us to think differently about all elements of our platform, rack and power, and data center design. The chart below shows projections of increased high-bandwidth memory (HBM) and training module power growth over several years, as well as how those trends will require different cooling technologies over time and the limits associated with those technologies.

You can join our efforts at: https://www.opencompute.org/initiatives/cooling-environments
Grand Canyon: Next-gen storage for AI infrastructure
Supporting ever-advancing AI models also means having the best storage solutions for our AI infrastructure. Grand Canyon is Meta’s next-generation storage platform, featuring improved hardware security and future upgrades of key commodities. The platform is designed to support higher-density HDDs without performance degradation and with improved power utilization.
Launching the PyTorch Foundation
Since 2016, when we first partnered with the AI research community to create PyTorch, it has grown to become one of the leading platforms for AI research and production applications. In September of this year, we announced the next step in PyTorch’s journey to accelerate innovation in AI: PyTorch is moving under the Linux Foundation umbrella as a new, independent PyTorch Foundation.
While Meta will continue to invest in PyTorch, and use it as our primary framework for AI research and production applications, the PyTorch Foundation will act as a responsible steward. It will support PyTorch through conferences, training courses, and other initiatives. The foundation’s mission is to foster and sustain an ecosystem of vendor-neutral projects with PyTorch that will help drive industry-wide adoption of AI tooling.
We remain fully committed to PyTorch. And we believe this approach is the fastest and best way to build and deploy new systems that will not only address real-world needs but also help researchers answer fundamental questions about the nature of intelligence.
The future of AI infrastructure
At Meta, we’re all-in on AI. But the future of AI won’t come from us alone. It will come from collaboration: the sharing of ideas and technologies through organizations like OCP. We’re eager to keep working together to build new tools and technologies to drive the future of AI. And we hope that you’ll all join us in these efforts. Whether it’s developing new approaches to AI today or radically rethinking hardware design and software for the future, we’re excited to see what the industry has in store next.