Liwaiwai Liwaiwai
  • /
  • Artificial Intelligence
  • Machine Learning
  • Robotics
  • Engineering
    • Architecture
    • Design
    • Software
    • Hybrid Cloud
    • Data
  • About
  • /
  • Artificial Intelligence
  • Machine Learning
  • Robotics
  • Engineering
    • Architecture
    • Design
    • Software
    • Hybrid Cloud
    • Data
  • About
Liwaiwai Liwaiwai
  • /
  • Artificial Intelligence
  • Machine Learning
  • Robotics
  • Engineering
    • Architecture
    • Design
    • Software
    • Hybrid Cloud
    • Data
  • About
  • Artificial Intelligence

IBM And PyTorch Change One Line Of Code To Massively Improve AI Model Training

  • May 8, 2023
  • liwaiwai.com

IBM Research and PyTorch worked on a way to make checkpointing for AI training considerably more cost-effective.

We’re in the middle of an AI boom, powered by foundation models. Many of these models are huge in size, with training datasets that are even larger. They can have terabytes of data that are used for training models that have billions of parameters. These models can be used for all sorts of purposes, from generating music, to uncovering new molecules and automating massive enterprise processes.


Partner with liwaiwai.com
for your next big idea.
Let us know here.


cyberpogo

IBM is working on building trustworthy AI foundation models for enterprises that run seamlessly on public and private clouds. We are building a cloud-native, open-source stack for the future of AI, which is powered by PyTorch to help make building AI systems simpler. During model training, a data scientist periodically writes checkpoints to a system’s permanent storage for fault tolerance and help recover from failures. To reduce wasted GPU time, writing checkpoints must be done quickly, usually this is done by writing to a high-speed shared file system like NFS. Perhaps unsurprisingly, as the model increases in size, so does the checkpoint data size. For example, a model with 11 billion parameters (at 32-bit precision) needs 130 GB of storage for a single checkpoint that includes model weights and the optimizer state.

We wanted to see if we could use a cheaper type of storage that would not sacrifice GPU time. A simple switch from using a shared file system to object storage resulted in unexplainable errors (such as NCCL timeouts and silent failures). IBM and PyTorch teams worked together to fix the distributed checkpointing within PyTorch to support object storages that have S3 APIs, like IBM Cloud Object Storage. This required examining the differences between traditional file system APIs and S3FS (object store backed) file system APIs.

Read More  How CI&T Accelerated Development By 11% With AI From Tabnine And Google Cloud

The current distributed checkpointing implementation assumes a Portable Operating System Interface (POSIX) compliant file system that guarantees strong read-after-write consistency. In PyTorch FSDP (Fully Sharded Data Parallel), each GPU is responsible for a shard of the model and optimizer state and writes tensors of a shard to a specific directory designated for that GPU. In a shared file system, when an orchestrator (rank0 GPU) creates a directory, it is visible to all the other GPUs immediately. However, in an eventually consistent file system layer like S3FS on top of an object store, this assumption fails.

Given these consistency problems manifest as non-reproducible errors, the teams had to comb through many log files and come up with many hypotheses to identify the root cause. At the end, the fix was to enable each GPU to create a directory if it does not exist in its local view. Since the writes are non-conflicting, S3FS eventual consistency semantics suffice. With this change, distributed checkpointing of PyTorch can support writing to object storages with S3FS APIs.

An alternative approach for writing checkpoints would have been to gather all the weights and optimizer state to a single node’s RAM, which can result in crashes during the all-gather phase, drastically wasting GPU time and serving as a major blocker. Our measurements showed that for an 11 billion-parameter model, it cuts down checkpointing time from more than an hour to just minutes. Also, object storage is one of the most affordable file storage systems there is, so by making the switch, the researchers could save on training costs.

Read More  Five Reasons Why Business Automation Initiatives Fail And How To Avoid Them

“We’ve been using a race car when a commuter car would work,” Raghu Ganti, a principal researcher at IBM Research and one of the project leads, said of the type of storage the team had been using. “You can do sophisticated model training with the affordability of object storage.”

Combining this milestone with the team’s earlier work to use inexpensive networking equipment and improve memory scheduling, IBM and PyTorch are getting further down the path to where training and running AI models is quicker, more cost-effective, and able to scale to even larger models — all on IBM’s hybrid cloud platform.

Originally publish at IBM Blog

Source: Cyberpogo


Our humans need coffee too! Your support is highly appreciated, thank you!

liwaiwai.com

Related Topics
  • AI
  • Artificial Intelligence
  • IBM
  • Python
  • PyTorch
You May Also Like
View Post
  • Artificial Intelligence
  • Software Engineering

When The Rubber Duck Talks Back

  • June 1, 2023
View Post
  • Artificial Intelligence
  • Technology

Helping Robots Handle Fluids

  • June 1, 2023
View Post
  • Artificial Intelligence

Introducing 100K Context Windows

  • May 30, 2023
View Post
  • Architecture
  • Artificial Intelligence
  • Design

Sandvik unveils the Impossible Statue – an AI-enabled collaboration between Michelangelo, Rodin, Kollwitz, Kotaro, Savage and Sandvik

  • May 30, 2023
View Post
  • Artificial Intelligence
  • Technology

Capgemini And Google Cloud Expand Long-Standing Partnership To Create First-Of-Its Kind Generative AI Center Of Excellence To Accelerate Client Value

  • May 30, 2023
View Post
  • Artificial Intelligence

How Auditoria.AI Is Building AI-Powered Smart Assistants For Finance Teams

  • May 29, 2023
View Post
  • Artificial Intelligence
  • Technology

AI Coming To The PC At Scale

  • May 27, 2023
View Post
  • Artificial Intelligence
  • Platforms

Build Next-Generation, AI-Powered Applications On Microsoft Azure

  • May 26, 2023
Stay Connected!
LATEST
  • 1
    When The Rubber Duck Talks Back
    • June 1, 2023
  • 2
    Helping Robots Handle Fluids
    • June 1, 2023
  • 3
    Introducing 100K Context Windows
    • May 30, 2023
  • 4
    Sandvik unveils the Impossible Statue – an AI-enabled collaboration between Michelangelo, Rodin, Kollwitz, Kotaro, Savage and Sandvik
    • May 30, 2023
  • 5
    Capgemini And Google Cloud Expand Long-Standing Partnership To Create First-Of-Its Kind Generative AI Center Of Excellence To Accelerate Client Value
    • May 30, 2023
  • 6
    Effective Management Of Data Sources In Machine Learning
    • May 29, 2023
  • 7
    How Auditoria.AI Is Building AI-Powered Smart Assistants For Finance Teams
    • May 29, 2023
  • 8
    G7 2023: The Real Threat To The World Order Is Hypocrisy.
    • May 28, 2023
  • 9
    AI Coming To The PC At Scale
    • May 27, 2023
  • 10
    Build Next-Generation, AI-Powered Applications On Microsoft Azure
    • May 26, 2023

about
About
Hello World!

We are liwaiwai.com. Created by programmers for programmers.

Our site aims to provide materials, guides, programming how-tos, and resources relating to artificial intelligence, machine learning and the likes.

We would like to hear from you.

If you have any questions, enquiries or would like to sponsor content, kindly reach out to us at:

[email protected]

Live long & prosper!
Most Popular
  • 1
    Combining Generative AI With IBM Watson, Mitsui Chemicals Starts Verifying New Application Discovery For Agility And Accuracy
    • May 25, 2023
  • 2
    Wipro Expands Google Cloud Partnership To Advance Enterprise Adoption Of Generative AI
    • May 23, 2023
  • 3
    Google Cloud Launches AI-Powered Solutions To Safely Accelerate Drug Discovery And Precision Medicine
    • May 16, 2023
  • 4
    Huawei And Partners Announce Yucatan Wildlife Conservation Findings
    • May 18, 2023
  • 5
    Cloudflare’s R2 Is The Infrastructure Powering Leading AI Companies
    • May 16, 2023
  • /
  • Artificial Intelligence
  • Explore
  • About
  • Contact Us

Input your search keywords and press Enter.