Today, NVIDIA released TensorRT 8.2, with optimizations for billion-parameter NLU models, including T5 and GPT-2 for translation and text generation. These optimizations make it possible to run NLU applications in real time.
TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for AI applications. TensorRT is used across several industries including healthcare, automotive, manufacturing, internet/telecom services, financial services, and energy.
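For context, here is a minimal sketch of the standalone TensorRT Python workflow, building an FP16 engine from an ONNX model; the file paths and workspace size are illustrative assumptions, not part of the announcement:

```python
import tensorrt as trt

# Build a TensorRT engine from an ONNX model (path is a placeholder).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 optimizations
config.max_workspace_size = 1 << 30   # 1 GiB scratch space for optimization tactics

# Serialize the optimized engine for later deployment with the TensorRT runtime.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```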
PyTorch and TensorFlow are the most popular deep learning frameworks, with millions of users. The new TensorRT framework integrations provide a simple API in PyTorch and TensorFlow with powerful FP16 and INT8 optimizations that accelerate inference by up to 6x.
Highlights include:
- TensorRT 8.2: Optimizations for T5 and GPT-2 deliver real-time translation and summarization with 21x faster performance compared to CPUs.
- TensorRT 8.2: Simple Python API for developers using Windows.
- Torch-TensorRT: Integration for PyTorch delivers up to 6x faster performance compared to in-framework inference on GPUs with just one line of code (see the first sketch below).
- TensorFlow-TensorRT: Integration of TensorFlow with TensorRT delivers up to 6x faster performance compared to in-framework inference on GPUs with one line of code (see the second sketch below).
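As a rough illustration of the Torch-TensorRT workflow, the sketch below compiles a model with FP16 enabled; the ResNet-50 model, input shape, and precision settings are placeholder assumptions:

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Any TorchScript-compatible model works; ResNet-50 is just an example.
model = models.resnet50(pretrained=True).eval().cuda()

# The single compile call: partitions the graph and replaces supported
# subgraphs with TensorRT engines, here with FP16 kernels enabled.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    enabled_precisions={torch.half},
)

x = torch.randn(1, 3, 224, 224, device="cuda").half()
with torch.no_grad():
    out = trt_model(x)
```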
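Similarly, a minimal TensorFlow-TensorRT sketch, assuming a TensorFlow 2.x SavedModel already on disk (the directory names are placeholders):

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Paths are placeholders; start from any TensorFlow 2.x SavedModel.
conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16
)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="resnet50_saved_model",
    conversion_params=conversion_params,
)
converter.convert()  # replace supported subgraphs with TensorRT ops
converter.save("resnet50_saved_model_trt")

# The converted SavedModel loads and runs like any other:
# model = tf.saved_model.load("resnet50_saved_model_trt")
```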
Resources
- Torch-TensorRT is available today in the PyTorch Container from the NGC catalog.
- TensorFlow-TensorRT is available today in the TensorFlow Container from the NGC catalog.
- TensorRT is freely available to members of the NVIDIA Developer Program.
- Learn more on the TensorRT product page.
Learn more
- GTC Session A31336: Accelerate Deep Learning Inference in Production with TensorRT
- GTC Session A31107: Accelerate PyTorch Inference with TensorRT
- Up to 6x Faster Inference in PyTorch on GPUs with Torch-TensorRT
- Getting Started: Torch-TensorRT and TensorFlow-TensorRT
- Real-Time Inference for T5 and GPT-2 with TensorRT
- Notebook: T5 Translation with TensorRT
- Notebook: GPT-2 Text Generation with TensorRT