
Reading And Storing Data For Custom Model Training On Vertex AI

  • January 27, 2023
  • relay

Before you can train ML models in the cloud, you need to get your data to the cloud.

But when it comes to storing data on Google Cloud there are a lot of different options. Not to mention the different ways you can read in data when designing input pipelines for custom models. Should you use the Cloud Storage API? Copy data directly to the machine where your training job is running? Use the data I/O library of your preferred ML framework?

To make things a little easier for you, we’ve outlined some recommendations for reading data in your custom training jobs on Vertex AI. Whether your use case requires structured or unstructured data, these tips will help you to build more efficient input pipelines with Vertex AI.

Unstructured Data

Cloud Storage FUSE

If you have unstructured data, such as images, the best place to start is by uploading your data to a Cloud Storage bucket. Instead of using gsutil to copy all of the data over to the machine where your custom training job will run, or calling the Cloud Storage APIs directly or from a client library, you can leverage Cloud Storage FUSE.

Using the Cloud Storage FUSE tool, training jobs on Vertex AI can access data on Cloud Storage as files in the local file system. When you start a custom training job, the job sees a directory /gcs, which contains all your Cloud Storage buckets as subdirectories. This happens automatically without any extra work on your part.

Not only does this make it easy to access your data, but it also provides high throughput for sequential reads of large files.


For example, if your data is a collection of JPEG files in a Cloud Storage bucket called training-images, you can access this data in your training code with the path /gcs/training-images:

import tensorflow as tf

DATA_DIR = '/gcs/training-images'
# The first parameter is named `directory`; labels are inferred from subdirectory names.
dataset = tf.keras.utils.image_dataset_from_directory(DATA_DIR)

And if you’re a PyTorch user, your code might look something like this:

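import torch
from torchvision import datasets, transforms

DATA_DIR = '/gcs/training-images'
# Resize and convert PIL images to tensors so the default collate function can batch them.
# The 224x224 size is an illustrative choice, not a requirement.
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)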
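
Because every bucket shows up as a subdirectory of /gcs, a gs:// URI can be turned into a local path by swapping the prefix. The following is a minimal sketch of that mapping; the helper name gcs_uri_to_fuse_path is hypothetical and not part of any Google Cloud library.

```python
from pathlib import PurePosixPath

GCS_PREFIX = "gs://"
FUSE_ROOT = PurePosixPath("/gcs")  # directory the training job sees


def gcs_uri_to_fuse_path(uri: str) -> str:
    """Translate gs://bucket/object into /gcs/bucket/object."""
    if not uri.startswith(GCS_PREFIX):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    remainder = uri[len(GCS_PREFIX):].strip("/")
    if not remainder:
        raise ValueError("URI is missing a bucket name")
    return str(FUSE_ROOT / remainder)


print(gcs_uri_to_fuse_path("gs://training-images/cats/001.jpg"))
# prints /gcs/training-images/cats/001.jpg
```

A helper like this can be handy when the same job configuration stores gs:// URIs but the training code expects file paths.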

Mount an NFS Share

While Cloud Storage FUSE is easy to use and will work for most cases, if you need particularly high throughput you can consider mounting a Network File System (NFS) share for custom training. This allows your jobs to access remote files as if they were local, with high throughput and low latency.

Before you begin, there are two steps you’ll need to take:

  1. First, create an NFS share in a Virtual Private Cloud (VPC). Your share must be accessible without authentication.
  2. Then, follow the instructions in Set up VPC Network Peering to peer Vertex AI with the VPC that hosts your NFS share.

Once you have the NFS share and VPC peering set up, you are ready to use NFS with your custom training jobs on Vertex AI.

When you create your custom training job, you’ll need to specify the nfsMounts and network fields. You can do this in a config.yaml file:

network: projects/PROJECT_NUMBER/global/networks/default
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
    replicaCount: 1
    containerSpec:
      imageUri: 'gcr.io/PROJECT_ID/nfs-demo:latest'
    nfsMounts:
      - server: 10.76.0.10
        path: /fileshare
        mountPoint: my_mount

And then pass in the config when submitting the job:

gcloud ai custom-jobs create \
 --region={LOCATION} \
 --display-name={JOB_NAME} \
 --config=config.yaml
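
Inside the training container, the share can then be read like any local directory. The sketch below assumes the share is mounted under /mnt/nfs/ followed by the mountPoint value from the config (so /mnt/nfs/my_mount here); check that path against your own job before relying on it. The helper names nfs_dir and list_files are hypothetical.

```python
import os
from typing import List

NFS_ROOT = "/mnt/nfs"  # assumption: parent directory of all NFS mount points


def nfs_dir(mount_point: str, root: str = NFS_ROOT) -> str:
    """Build the local directory for a mountPoint value from config.yaml."""
    return os.path.join(root, mount_point)


def list_files(directory: str) -> List[str]:
    """Return the sorted paths of all regular files under directory."""
    found = []
    for current, _dirs, files in os.walk(directory):
        for name in files:
            found.append(os.path.join(current, name))
    return sorted(found)


if __name__ == "__main__":
    data_dir = nfs_dir("my_mount")
    if os.path.isdir(data_dir):
        for path in list_files(data_dir):
            print(path)
    else:
        print(f"{data_dir} is not mounted in this environment")
```

The resulting file list can be passed to the same framework loaders used in the Cloud Storage FUSE examples, since both approaches expose plain file paths.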

Structured Data

Multiple options exist when you want to train a machine learning model on structured data. Most of the time, you’ll use BigQuery for storing the training data. When you can’t use BigQuery, for example, if you want to use the TFRecord format, you can follow the instructions described in the unstructured section above.


In the second part of this blog, we’ll discuss the best options for reading training data from BigQuery. Other options exist, but we’ll focus on the ones that work well and are easiest to get started with.

Structured data with BigQuery

TensorFlow and BigQuery

If your data already sits in BigQuery, that’s a great start. If you’re a TensorFlow user, you can use the BigQuery Connector to read training data. The BigQuery connector relies on the BigQuery Storage API, which provides fast access to BigQuery’s managed storage using an RPC-based protocol.

The BigQuery connector mostly follows the BigQuery Storage API flow, but hides the complexity associated with decoding serialized data rows into Tensors. You need to follow these steps:

  1. Create a BigQueryClient client.
  2. Use the BigQueryClient to create a BigQueryReadSession object corresponding to a read session. A read session divides the contents of a BigQuery table into one or more streams for reading the data.
  3. Call parallel_read_rows on the BigQueryReadSession object to read from multiple BigQuery streams in parallel.

If you’re using TensorFlow, your code might look something like this:

from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

# create BigQueryClient
client = BigQueryClient()

# create BigQueryReadSession
# the first argument is the parent (billing) project, in the form "projects/<project id>"
read_session = client.read_session("projects/" + PROJECT_ID,
                                   PROJECT_ID,
                                   TABLE_ID,
                                   DATASET_ID,
                                   selected_fields=[],   # names of the columns to read
                                   output_types=[],      # one dtype per selected field
                                   default_values=[],
                                   # set the DataFormat
                                   data_format=BigQueryClient.DataFormat.AVRO)

# call parallel_read_rows
dataset = read_session.parallel_read_rows()

BigQuery alternatives

If you’re not using TensorFlow, there are alternatives you can look at. Here are two, depending on whether you are a PyTorch or XGBoost user.

PyTorch and BigQuery

If you’re a PyTorch user, there are multiple options for reading data from BigQuery. We recommend you create an iterable-style DataPipe using the torchdata.datapipes.iter.IterDataPipe class. When creating a DataPipe you can leverage the BigQuery Storage Read API for reading your training data.
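
As a rough illustration of the reading side, the sketch below streams rows from a BigQuery table through the Storage Read API using the google-cloud-bigquery-storage client. The function names open_read_session and iter_bigquery_rows are hypothetical, the single-stream setting is an arbitrary choice, and Avro decoding assumes the fastavro package is installed. The Google Cloud import is deferred so the generator can be exercised with a stand-in client.

```python
from typing import Any, Dict, Iterator, List, Tuple


def open_read_session(project_id: str, dataset_id: str, table_id: str,
                      columns: List[str]) -> Tuple[Any, Any]:
    """Create a Storage Read API client and a read session for one table."""
    # Imported here so the rest of the module loads without the dependency.
    from google.cloud import bigquery_storage
    from google.cloud.bigquery_storage import types

    client = bigquery_storage.BigQueryReadClient()
    requested = types.ReadSession(
        table=f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}",
        data_format=types.DataFormat.AVRO,
        read_options=types.ReadSession.TableReadOptions(selected_fields=columns),
    )
    session = client.create_read_session(
        parent=f"projects/{project_id}",
        read_session=requested,
        max_stream_count=1,  # illustrative; more streams allow parallel readers
    )
    return client, session


def iter_bigquery_rows(client: Any, session: Any) -> Iterator[Dict[str, Any]]:
    """Yield each row of every stream in the session as a plain dict."""
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        for row in reader.rows(session):
            yield dict(row)
```

In a PyTorch pipeline, a generator like iter_bigquery_rows would typically be called from the __iter__ method of an iterable-style DataPipe or a torch.utils.data.IterableDataset, with each row converted to tensors before it is yielded.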


XGBoost and BigQuery

When using XGBoost with Vertex AI, you can scale your Python code over BigQuery data using Dask and NVIDIA RAPIDS. Dask offers integration with XGBoost. It’s possible to extend Dask with RAPIDS, a suite of open-source libraries and APIs to execute GPU-accelerated pipelines directly on BigQuery storage. The code for Dask would look something like this:

import dask_bigquery
 
# read data from BigQuery 
dask_df = dask_bigquery.read_gbq(
   project_id="your_project_id",
   dataset_id="your_dataset",
   table_id="your_table",
)
 
# inspect dataframe 
dask_df.head()

Alternatively, BigQuery has support for boosted tree models through BigQuery ML. This way you don’t have to take your data out of BigQuery.
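
To make the BigQuery ML route concrete, here is a minimal sketch that builds a CREATE MODEL statement for a boosted tree classifier and, optionally, runs it with the google-cloud-bigquery client. The model, table, and label names are placeholders, and the helper names boosted_tree_sql and train_in_bigquery are hypothetical.

```python
def boosted_tree_sql(model: str, source_table: str, label: str) -> str:
    """Build a BigQuery ML statement that trains a boosted tree classifier."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'BOOSTED_TREE_CLASSIFIER',\n"
        f"         input_label_cols = ['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def train_in_bigquery(sql: str) -> None:
    """Run the statement in BigQuery and wait for training to finish."""
    # Imported here so the SQL builder works without the dependency.
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(sql).result()


# Placeholder names; replace with your own dataset, table, and label column.
print(boosted_tree_sql("your_dataset.your_model", "your_dataset.your_table", "label"))
```

For a regression target, the model_type option would be BOOSTED_TREE_REGRESSOR instead.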

All in one overview

What is next

Efficient data pipelines are a key piece of effective ML experimentation and iteration. In this blog we looked at several recommendations for reading structured and unstructured data in your custom training jobs. If you’re looking to get started on Vertex AI, check out this introductory video series or run through this codelab. Now it’s time to train some ML models of your own!

By: Nikita Namjoshi (Developer Advocate) and Erwin Huizenga (Developer Advocate, Google Cloud)
Source: Google Cloud Blog
