  • Artificial Intelligence

IBM And PyTorch Change One Line Of Code To Massively Improve AI Model Training

  • May 8, 2023
  • liwaiwai.com

IBM Research and PyTorch worked on a way to make checkpointing for AI training considerably more cost-effective.

We’re in the middle of an AI boom, powered by foundation models. Many of these models are huge, and their training datasets are larger still: terabytes of data can go into training a model with billions of parameters. These models can be used for all sorts of purposes, from generating music to uncovering new molecules and automating massive enterprise processes.

IBM is working on building trustworthy AI foundation models for enterprises that run seamlessly on public and private clouds. We are building a cloud-native, open-source stack for the future of AI, powered by PyTorch, to help make building AI systems simpler. During model training, a data scientist periodically writes checkpoints to a system’s permanent storage for fault tolerance and to help recover from failures. To reduce wasted GPU time, checkpoints must be written quickly; this is usually done by writing to a high-speed shared file system like NFS. Perhaps unsurprisingly, as the model increases in size, so does the checkpoint data. For example, a model with 11 billion parameters (at 32-bit precision) needs 130 GB of storage for a single checkpoint that includes model weights and the optimizer state.
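
The 130 GB figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes an Adam-style optimizer that keeps two extra 32-bit values per parameter, which the article does not state, and the function name `checkpoint_size_gb` is made up for this example.

```python
def checkpoint_size_gb(num_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Estimate checkpoint size in GB (10^9 bytes).

    Assumption: the checkpoint stores one weight plus
    `optimizer_states_per_param` optimizer values for every parameter,
    all at the same precision (4 bytes = 32-bit).
    """
    values_per_param = 1 + optimizer_states_per_param
    return num_params * bytes_per_value * values_per_param / 1e9


# 11 billion parameters at 32-bit precision
size = checkpoint_size_gb(11_000_000_000)
print(f"{size:.0f} GB")  # prints "132 GB"
```

Under these assumptions the estimate lands at about 132 GB, in line with the figure quoted above; a different optimizer or mixed precision would change the result.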

We wanted to see if we could use a cheaper type of storage without sacrificing GPU time. Simply switching from a shared file system to object storage resulted in unexplained errors (such as NCCL timeouts and silent failures). The IBM and PyTorch teams worked together to fix distributed checkpointing within PyTorch so that it supports object stores with S3 APIs, like IBM Cloud Object Storage. This required examining the differences between traditional file system APIs and the APIs of S3FS, a file system layer backed by an object store.

The current distributed checkpointing implementation assumes a Portable Operating System Interface (POSIX) compliant file system that guarantees strong read-after-write consistency. In PyTorch FSDP (Fully Sharded Data Parallel), each GPU is responsible for a shard of the model and optimizer state and writes that shard’s tensors to a specific directory designated for that GPU. In a shared file system, when an orchestrator (the rank 0 GPU) creates a directory, it is visible to all the other GPUs immediately. However, in an eventually consistent file system layer like S3FS on top of an object store, this assumption fails: a directory created by rank 0 may not yet be visible to another GPU when that GPU tries to write into it.

Because these consistency problems manifest as non-reproducible errors, the teams had to comb through many log files and test many hypotheses to identify the root cause. In the end, the fix was to let each GPU create its directory if the directory does not exist in that GPU’s local view. Since the writes are non-conflicting, S3FS eventual consistency semantics suffice. With this change, PyTorch distributed checkpointing can write to object stores through S3FS APIs.
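
The idea behind the fix can be sketched in a few lines of Python. This is a minimal illustration, not the actual PyTorch change: the helper `write_shard`, the `rank_<n>` directory layout, and the `shard.bin` file name are all hypothetical. The point is that each rank creates its own directory idempotently instead of relying on rank 0 having created it already.

```python
import os
import tempfile


def write_shard(checkpoint_root, rank, payload):
    """Write one rank's shard, creating that rank's directory if needed.

    Hypothetical helper for illustration only; not PyTorch's API.
    """
    rank_dir = os.path.join(checkpoint_root, f"rank_{rank}")
    # exist_ok=True makes the call safe whether or not the directory is
    # already visible to this process, so no rank depends on another
    # rank's directory creation having propagated.
    os.makedirs(rank_dir, exist_ok=True)
    shard_path = os.path.join(rank_dir, "shard.bin")
    with open(shard_path, "wb") as f:
        f.write(payload)
    return shard_path


# Local demonstration: four "ranks" write to disjoint paths.
with tempfile.TemporaryDirectory() as root:
    paths = [write_shard(root, rank, bytes([rank]) * 4) for rank in range(4)]
    print(all(os.path.isfile(p) for p in paths))
```

Because every rank writes only under its own directory, the writes never conflict, which is why eventual consistency is sufficient here.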

An alternative approach for writing checkpoints would have been to gather all the weights and optimizer state into a single node’s RAM. That can result in crashes during the all-gather phase, drastically wasting GPU time and serving as a major blocker. Our measurements showed that for an 11 billion-parameter model, distributed checkpointing cuts checkpointing time from more than an hour to just minutes. Also, object storage is one of the most affordable storage options available, so by making the switch, the researchers could save on training costs.

“We’ve been using a race car when a commuter car would work,” Raghu Ganti, a principal researcher at IBM Research and one of the project leads, said of the type of storage the team had been using. “You can do sophisticated model training with the affordability of object storage.”

Combining this milestone with the team’s earlier work to use inexpensive networking equipment and improve memory scheduling, IBM and PyTorch are moving closer to a point where training and running AI models is quicker, more cost-effective, and able to scale to even larger models, all on IBM’s hybrid cloud platform.

Originally published at IBM Blog

Source: Cyberpogo