How Do I Speed Up My TensorFlow Transformer Models?


Transformer models have gained much attention in recent years and are responsible for many of the advances in Natural Language Processing (NLP), often replacing Recurrent Neural Networks for use cases like machine translation, text summarization, and document classification. For organizations, deploying transformer models in production and performing inference can be challenging: inference can be expensive, and the implementation can be complex. Recently we announced the public preview of a new runtime that optimizes serving TensorFlow (TF) models on the Vertex AI Prediction service. We are happy to announce that the optimized TensorFlow runtime is now generally available (GA). The optimized TensorFlow runtime generally delivers faster predictions and better throughput than open-source-based pre-built TensorFlow Serving containers.

In this post, you learn how to deploy a fine-tuned T5x base model to the Vertex AI Prediction service using the optimized TensorFlow runtime and then evaluate the model's performance. For more details on how to use the runtime, see the optimized TensorFlow runtime page in the Vertex AI user guide.


In this example, you use the T5x base model. T5x is a new and improved implementation of the T5 codebase (which was based on Mesh TensorFlow) in JAX and Flax. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, in which each task is converted into a text-to-text format. The model can then be fine-tuned for specific tasks it was not trained on. In this example, you use a JAX-based model that's fine-tuned for English-to-German translation.
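Concretely, the text-to-text format means every task, translation included, is expressed as plain input text with a task prefix, and the model's output is also plain text. A minimal illustration (the prefix strings follow the T5 convention; the helper function is hypothetical):

```python
def make_t5_input(task_prefix: str, text: str) -> str:
    """Format a request in T5's text-to-text style: a task
    prefix followed by the input text."""
    return f"{task_prefix}: {text}"

# Translation and summarization share the same text-in, text-out interface.
translation = make_t5_input("translate English to German", "this is good")
summary = make_t5_input("summarize", "state authorities dispatched crews ...")

print(translation)  # translate English to German: this is good
```

Because every task looks the same to the model, the fine-tuned translation model in this post is served like any other text-in, text-out model.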


The T5x pre-trained model can be exported as a TensorFlow SavedModel and deployed to the Vertex AI Prediction service using the optimized TensorFlow runtime. To do this, you can use the export script.

Deploying T5x on Vertex AI Prediction using the optimized TensorFlow runtime

To follow along, you can run these steps in a Vertex AI Workbench notebook or Colab. In this example, there are two exported models: one exported using float32 weights and one using bfloat16. bfloat16 is a native format for Google Cloud TPUs that you can also use with the NVIDIA A100, which the optimized TF runtime supports. You must use float32 with NVIDIA T4 or V100 GPUs; the NVIDIA T4 doesn't support bfloat16. If you use float32, you can enable model compression to run models at lower precision.
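The precision trade-off behind this choice is easy to see at the bit level: bfloat16 keeps float32's 8 exponent bits (so the same dynamic range) but only 7 mantissa bits instead of 23. A minimal sketch using only the standard library, approximating bfloat16 by truncating the low 16 bits of the float32 encoding (real hardware typically rounds to nearest rather than truncating):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 by truncating the low 16 bits of the
    float32 bit pattern: keeps the sign bit, all 8 exponent bits,
    and the top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159265))  # 3.140625: only ~2-3 decimal digits survive
print(to_bfloat16(1e38))        # large values keep float32's dynamic range
```

Powers of two and small integers survive exactly, which is why many model weights lose little accuracy in bfloat16 while halving memory and bandwidth.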

# Set your optimized TF runtime image (check the optimized TensorFlow
# runtime documentation for the current list of available image URIs)
IMAGE_URI = "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest"

# Set the compute and GPU type
DEPLOY_COMPUTE_T4 = "n1-standard-8"
DEPLOY_GPU_T4 = "NVIDIA_TESLA_T4"

Next, upload and deploy the T5x base model with float32 weights and no optimizations on Vertex AI.

# Upload the T5x model to Vertex AI (MODEL_ARTIFACTS_URI is the GCS
# path to the exported SavedModel)
T5x_base_float32 = aiplatform.Model.upload(
    display_name="t5x_base_float32",
    artifact_uri=MODEL_ARTIFACTS_URI,
    serving_container_image_uri=IMAGE_URI,
)

# Deploy the T5x model to Vertex AI
T5x_base_float32_t4_endpoint = T5x_base_float32.deploy(
    machine_type=DEPLOY_COMPUTE_T4,
    accelerator_type="NVIDIA_TESLA_T4", accelerator_count=1,
    traffic_split={"0": 100},
)

You can deploy the T5x model with different weight settings or model optimization flags. You can enable the following features to further optimize serving TensorFlow models:

  1. Model precompilation
  2. Optimization that affects precision
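Both features are enabled by passing flags to the serving container when you upload the model. A small sketch of assembling those flags (the flag names here are the ones this example assumes; check the optimized TensorFlow runtime documentation for the authoritative list):

```python
def build_runtime_args(precompilation: bool = False,
                       compression: bool = False) -> list[str]:
    """Assemble optimized TF runtime flags to pass as
    serving_container_args when uploading a model."""
    args = []
    if precompilation:
        args.append("--allow_precompilation")  # precompile the model
    if compression:
        args.append("--allow_compression")     # precision-affecting optimization
    return args

print(build_runtime_args(precompilation=True, compression=True))
# ['--allow_precompilation', '--allow_compression']
```

Because compression can affect numeric precision, it's worth validating model quality on a held-out set after enabling it.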

In the notebook example, you can find different configurations. For the best performance when you deploy the T5x model with bfloat16 weights on an NVIDIA A100, use the --allow_precompilation argument to effectively utilize the bfloat16 logic.

DEPLOY_COMPUTE_A100 = "a2-highgpu-1g"

# Upload the model to Vertex AI, enabling precompilation via a serving
# container argument (MODEL_ARTIFACTS_BFLOAT16_URI is the GCS path to
# the bfloat16 export)
T5x_base_bfloat16_precompiled = aiplatform.Model.upload(
    display_name="t5x_base_bfloat16_precompiled",
    artifact_uri=MODEL_ARTIFACTS_BFLOAT16_URI,
    serving_container_image_uri=IMAGE_URI,
    serving_container_args=["--allow_precompilation"],
)

# Deploy the model
T5x_base_bfloat16_precompiled_a100_endpoint = T5x_base_bfloat16_precompiled.deploy(
    machine_type=DEPLOY_COMPUTE_A100,
    accelerator_type="NVIDIA_TESLA_A100", accelerator_count=1,
    traffic_split={"0": 100},
)

After the models are deployed, you’re ready to send requests to the endpoints.

instances = [{"text_batch": "translate English to German: this is good"}]

# Get predictions
response = T5x_base_float32_t4_endpoint.predict(instances=instances)
print(response.predictions)

Benchmarking the T5x base model deployed using the optimized TF runtime on Vertex AI

To evaluate the benefits of using the optimized TensorFlow runtime with Vertex AI, we benchmarked the T5x model deployed on Vertex AI using the MLPerf Inference load generator (loadgen) for Vertex AI Prediction. MLPerf Inference is a benchmark suite for measuring how fast systems can run models in various deployment scenarios. The main goal of the benchmark is to measure model latency under different loads and to identify the maximum throughput the model can handle. The benchmark code is included in the notebook and is reproducible.
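The essence of such a benchmark — collect per-request latencies, then report percentiles and throughput — can be sketched in plain Python, with synthetic latencies standing in for timed endpoint.predict() calls:

```python
def percentile(latencies_ms: list[float], p: float) -> float:
    """Return the p-th percentile latency (nearest-rank method)."""
    ordered = sorted(latencies_ms)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# Synthetic per-request latencies in milliseconds (stand-ins for the
# measured durations of real endpoint.predict() calls).
latencies = [12.0, 15.0, 11.0, 14.0, 90.0, 13.0, 12.5, 13.5, 14.5, 12.2]

total_s = sum(latencies) / 1000.0
print(f"throughput: {len(latencies) / total_s:.1f} req/s")
print(f"p50: {percentile(latencies, 50)} ms, p99: {percentile(latencies, 99)} ms")
```

Tail percentiles (p95, p99) matter most for serving: a single slow request (the 90 ms outlier above) barely moves the median but dominates p99, which is why loadgen reports the full latency distribution rather than an average.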


Below you can see two charts that visualize the benchmark results for throughput and latency. The T5x model is deployed on Vertex AI Prediction using n1-standard-16 compute, NVIDIA T4 and A100 GPU instances, the optimized TensorFlow runtime, and TF nightly GPU containers.

Chart one shows the performance results for T5x deployed on Vertex AI using the NVIDIA T4 GPU. For this deployment we used the T5x model with float32 weights, because T4 doesn’t support bfloat16. The first bar shows the performance without any optimizations, the second bar shows the performance with precompilation enabled, and the third bar shows the results with precompilation and compression enabled.

Chart 1: T5x base model on T4

Chart two shows the performance results for T5x deployed on Vertex AI using the NVIDIA A100 GPU. The first bar shows T5x performance results without optimizations. The second bar shows the performance results with precompilation enabled using the bfloat16 format.

Chart 2: T5x base model on A100

Use of the optimized TensorFlow runtime resulted in significantly lower latency and higher throughput for the T5x base model. Because the optimized TensorFlow runtime moves most of the computations to the GPU, you can use machines with less CPU power. In the table below you can see the overall improvements for throughput and latency.

What’s next?

To learn more about the optimized TensorFlow runtime and Vertex AI, take a look at the additional examples on BERT and Criteo, or check out the optimized TensorFlow runtime documentation. You'll see how easy it is to deploy your optimized model on Vertex AI.


By: Erwin Huizenga (Developer Advocate Machine Learning) and Aleksey Vlasenko (Software Engineer)
Originally published at Google Cloud Blog

Source: Cyberpogo

