InfiniBand – NVIDIA Technical Blog

InfiniBand – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-05-30T17:25:03Z http://www.open-lab.net/blog/feed/ Ian Pegler <![CDATA[Advancing Ansys Workloads with NVIDIA Grace and NVIDIA Grace Hopper]]> http://www.open-lab.net/blog/?p=92496 2024-12-12T19:38:41Z 2024-11-21T17:30:00Z

Accelerated computing is enabling giant leaps in performance and energy efficiency compared to traditional CPU computing. Delivering these advancements requires...]]>



Accelerated computing is enabling giant leaps in performance and energy efficiency compared to traditional CPU computing. Delivering these advancements requires full-stack innovation at data-center scale, spanning chips, systems, networking, software, and algorithms. Choosing the right architecture for the right workload with the best energy efficiency is critical to maximizing the performance and...

]]> 0 Sukru Burc Eryilmaz <![CDATA[NVIDIA Blackwell Doubles LLM Training Performance in MLPerf Training v4.1]]> http://www.open-lab.net/blog/?p=91807 2024-11-14T17:10:37Z 2024-11-13T16:00:00Z

As models grow larger and are trained on more data, they become more capable, making them more useful. To train these models quickly, more performance,...]]>



As models grow larger and are trained on more data, they become more capable, making them more useful. To train these models quickly, more performance, delivered at data center scale, is required. The NVIDIA Blackwell platform, launched at GTC 2024 and now in full production, integrates seven types of chips: GPU, CPU, DPU, NVLink Switch chip, InfiniBand Switch, and Ethernet Switch.

]]> 0 Scot Schultz <![CDATA[Advancing Performance with NVIDIA SHARP In-Network Computing]]> http://www.open-lab.net/blog/?p=90863 2024-10-31T18:36:43Z 2024-10-25T20:39:38Z

AI and scientific computing applications are great examples of distributed computing problems. The problems are too large and the computations too intensive to...]]>


Picture of servers in a data center.

AI and scientific computing applications are great examples of distributed computing problems. The problems are too large and the computations too intensive to run on a single machine. These computations are broken down into parallel tasks that are distributed across thousands of compute engines, such as CPUs and GPUs. To achieve scalable performance, the system relies on dividing workloads...

]]> 0 Itay Ozery <![CDATA[Powering Next-Generation AI Networking with NVIDIA SuperNICs]]> http://www.open-lab.net/blog/?p=90176 2024-11-01T14:27:00Z 2024-10-15T16:30:00Z

In the era of generative AI, accelerated networking is essential to build high-performance computing fabrics for massively distributed AI workloads. NVIDIA...]]>


Decorative image of SuperNICs on a black background.

In the era of generative AI, accelerated networking is essential to build high-performance computing fabrics for massively distributed AI workloads. NVIDIA continues to lead in this space, offering state-of-the-art Ethernet and InfiniBand solutions that maximize the performance and efficiency of AI factories and cloud data centers. At the core of these solutions are NVIDIA SuperNICs, a new...

]]> 0 Akhil Langer <![CDATA[Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0]]> http://www.open-lab.net/blog/?p=88550 2024-09-19T19:34:01Z 2024-09-06T20:30:09Z

NVSHMEM is a parallel programming interface that provides efficient and scalable communication for NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on...]]>



NVSHMEM is a parallel programming interface that provides efficient and scalable communication for NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on OpenSHMEM, NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams.
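The global address space described above can be pictured with a small toy model. The following is only a sketch in plain Python, not NVSHMEM code: the class `ToySymmetricHeap`, its `put` and `get` methods, and the `ring_shift` helper are hypothetical names introduced here for illustration. It assumes each processing element (PE) owns a same-sized buffer that any other PE can write to or read from with a one-sided operation.

```python
class ToySymmetricHeap:
    """Toy model (hypothetical, not the NVSHMEM API): every PE owns a
    same-sized buffer, and any PE may address any other PE's buffer."""

    def __init__(self, n_pes, size):
        self.n_pes = n_pes
        self.size = size
        self.mem = [[0] * size for _ in range(n_pes)]

    def put(self, target_pe, offset, values):
        """One-sided write of `values` into target_pe's buffer."""
        if offset < 0 or offset + len(values) > self.size:
            raise IndexError("put outside the symmetric buffer")
        self.mem[target_pe][offset:offset + len(values)] = values

    def get(self, source_pe, offset, count):
        """One-sided read of `count` elements from source_pe's buffer."""
        if offset < 0 or offset + count > self.size:
            raise IndexError("get outside the symmetric buffer")
        return self.mem[source_pe][offset:offset + count]


def ring_shift(n_pes):
    """Each PE writes its own id into its right neighbor's buffer."""
    heap = ToySymmetricHeap(n_pes, size=1)
    for pe in range(n_pes):
        heap.put((pe + 1) % n_pes, 0, [pe])
    # Every PE now reads the value its left neighbor deposited locally.
    return [heap.get(pe, 0, 1)[0] for pe in range(n_pes)]


if __name__ == "__main__":
    print(ring_shift(4))
```

In this model the receiving PE takes no action during the transfer, which is the property that distinguishes one-sided communication from matched send/receive pairs.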

]]> 0 Taylor Allison <![CDATA[Simplifying Network Operations for AI with NVIDIA Quantum InfiniBand]]> http://www.open-lab.net/blog/?p=76977 2024-02-08T18:51:59Z 2024-01-23T18:00:00Z

A common technological misconception is that performance and complexity are directly linked. That is, the highest-performance implementation is also the most...]]>


Photo of a person standing at a computer terminal in a data center.

A common technological misconception is that performance and complexity are directly linked. That is, the highest-performance implementation is also the most challenging to implement and manage. When considering data center networking, however, this is not the case. InfiniBand is a protocol that sounds daunting and exotic in comparison to Ethernet, but because it is built from the ground up...

]]> 0 Chris Porter <![CDATA[Energy Efficiency in High-Performance Computing: Balancing Speed and Sustainability]]> http://www.open-lab.net/blog/?p=73103 2024-10-09T20:01:06Z 2023-11-14T16:00:00Z

The world of computing is on the precipice of a seismic shift. The demand for computing power, particularly in high-performance computing (HPC), is...]]>


A windmill and solar panel illustration.

The world of computing is on the precipice of a seismic shift. The demand for computing power, particularly in high-performance computing (HPC), is growing year over year, which in turn means so too is energy consumption. However, the underlying issue is, of course, that energy is a resource with limitations. So, the world is faced with the question of how we can best shift our computational...

]]> 0 Ashraf Eassa <![CDATA[Setting New Records at Data Center Scale Using NVIDIA H100 GPUs and NVIDIA Quantum-2 InfiniBand]]> http://www.open-lab.net/blog/?p=72467 2023-11-24T18:36:30Z 2023-11-08T17:00:00Z

Generative AI is rapidly transforming computing, unlocking new use cases and turbocharging existing ones. Large language models (LLMs), such as OpenAI's GPT...]]>



Generative AI is rapidly transforming computing, unlocking new use cases and turbocharging existing ones. Large language models (LLMs), such as OpenAI's GPT models and Meta's Llama 2, skillfully perform a variety of tasks on text-based content. These tasks include summarization, translation, classification, and generation of new content such as computer code, marketing copy, poetry, and much more.

]]> 0 Brian Sparks <![CDATA[Networking for Data Centers and the Era of AI]]> http://www.open-lab.net/blog/?p=71474 2023-11-02T18:14:42Z 2023-10-12T16:30:00Z

Traditional cloud data centers have served as the bedrock of computing infrastructure for over a decade, catering to a diverse range of users and applications....]]>



Traditional cloud data centers have served as the bedrock of computing infrastructure for over a decade, catering to a diverse range of users and applications. However, data centers have evolved in recent years to keep up with advancements in technology and the surging demand for AI-driven computing. This post explores the pivotal role that networking plays in shaping the future of data centers...

]]> 0 Ashraf Eassa <![CDATA[New MLPerf Inference Network Division Showcases NVIDIA InfiniBand and GPUDirect RDMA Capabilities]]> http://www.open-lab.net/blog/?p=67021 2023-07-27T18:54:26Z 2023-07-06T16:00:00Z

In MLPerf Inference v3.0, NVIDIA made its first submissions to the newly introduced Network division, which is now part of the MLPerf Inference Datacenter...]]>


Image of InfiniBand with decorative images in front.

In MLPerf Inference v3.0, NVIDIA made its first submissions to the newly introduced Network division, which is now part of the MLPerf Inference Datacenter suite. The Network division is designed to simulate a real data center setup and strives to include the effect of networking, including both hardware and software, in end-to-end inference performance. In the Network division...

]]> 0 Amit Katz <![CDATA[Navigating Generative AI for Network Admins]]> http://www.open-lab.net/blog/?p=63314 2023-06-01T19:08:41Z 2023-05-25T16:00:00Z

We all know that AI is changing the world. For network admins, AI can improve day-to-day operations in some amazing ways: Automation of repetitive tasks: This...]]>



We all know that AI is changing the world. For network admins, AI can improve day-to-day operations in some amazing ways: However, AI is no replacement for the know-how of an experienced network admin. AI is meant to augment your capabilities, like a virtual assistant. So, AI may become your best friend, but generative AI is also a new data center workload that brings a new paradigm...

]]> 0 Jiao Dong <![CDATA[Efficiently Scale LLM Training Across a Large GPU Cluster with Alpa and Ray]]> http://www.open-lab.net/blog/?p=64352 2023-07-05T19:21:22Z 2023-05-15T21:23:48Z

Recent years have seen a proliferation of large language models (LLMs) that extend beyond traditional language tasks to generative AI. This includes models like...]]>


LLM graphic

Recent years have seen a proliferation of large language models (LLMs) that extend beyond traditional language tasks to generative AI. This includes models like ChatGPT and Stable Diffusion. As this generative AI focus continues to grow, there is a rising need for a modern machine learning (ML) infrastructure that makes scalability accessible to the everyday practitioner.

]]> 0 Ashraf Eassa <![CDATA[Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI]]> http://www.open-lab.net/blog/?p=62958 2023-07-05T19:23:50Z 2023-04-05T19:10:55Z

The most exciting computing applications currently rely on training and running inference on complex AI models, often in demanding, real-time deployment...]]>



The most exciting computing applications currently rely on training and running inference on complex AI models, often in demanding, real-time deployment scenarios. High-performance, accelerated AI platforms are needed to meet the demands of these applications and deliver the best user experiences. New AI models are constantly being invented to enable new capabilities...

]]> 0 Rama Darbha <![CDATA[Optimizing Your Data Center Network]]> http://www.open-lab.net/blog/?p=48521 2022-06-02T17:16:18Z 2022-05-24T22:03:41Z

Data centers can be optimized by updating key network architectures in two ways: through networking technologies or operational efficiency in NetDevOps. In this...]]>



Data centers can be optimized by updating key network architectures in two ways: through networking technologies or operational efficiency in NetDevOps. In this post, we identify and evaluate technologies that you can apply to your network architecture to optimize your network. We address five updates that you should consider for improving your data center: VXLAN is an overlay...

]]> 0 Chaitrali Joshi <![CDATA[Announcing NVIDIA Nsight Systems 2021.5]]> http://www.open-lab.net/blog/?p=40362 2024-08-28T17:46:36Z 2021-11-10T15:00:00Z

The latest update to NVIDIA Nsight Systems, a performance analysis tool, is now available for download. Designed to help you tune and scale software across...]]>



The latest update to NVIDIA Nsight Systems, a performance analysis tool, is now available for download. Designed to help you tune and scale software across CPUs and GPUs, this release introduces several improvements aimed at enhancing the profiling experience. Nsight Systems is part of the powerful debugging and profiling NVIDIA Nsight Tools Suite. You can start with Nsight Systems for an overall...

]]> 0 Scot Schultz <![CDATA[Accelerating Cloud-Native Supercomputing with Magnum IO]]> http://www.open-lab.net/blog/?p=40232 2023-03-22T01:16:54Z 2021-11-09T09:30:00Z

Supercomputers are significant investments. However, they are extremely valuable tools for researchers and scientists. To effectively and securely share the...]]>



Supercomputers are significant investments. However, they are extremely valuable tools for researchers and scientists. To effectively and securely share the computational might of these data centers, NVIDIA introduced the Cloud-Native Supercomputing architecture. It combines bare metal performance, multitenancy, and performance isolation for supercomputing. Magnum IO, the I/

]]> 2 David Slama <![CDATA[Managing Data Centers Securely and Intelligently with NVIDIA UFM Cyber-AI]]> http://www.open-lab.net/blog/?p=33858 2022-08-21T23:52:02Z 2021-06-28T07:01:00Z

Today's data centers host many users and a wide variety of applications. They have even become the key element of competitive advantage for research,...]]>



Today's data centers host many users and a wide variety of applications. They have even become the key element of competitive advantage for research, technology, and global industries. With the increased complexity of scientific computing, data center operational costs also continue to rise. In addition to the operational disruption of security threats, keeping a data center intact and running...

]]> 0 CJ Newburn <![CDATA[Accelerating IO in the Modern Data Center: Computing and IO Management]]> http://www.open-lab.net/blog/?p=23756 2023-07-11T23:17:10Z 2021-02-06T01:29:32Z

This is the third post in the Accelerating IO series, which has the goal of describing the architecture, components, and benefits of Magnum IO, the IO subsystem...]]>



This is the third post in the Accelerating IO series, which has the goal of describing the architecture, components, and benefits of Magnum IO, the IO subsystem of the modern data center. The first post in this series introduced the Magnum IO architecture; positioned it in the broader context of CUDA, CUDA-X, and vertical application domains; and listed the four major components of the...

]]> 0 Sylvain Jeaugey <![CDATA[Scaling Deep Learning Training with NCCL]]> http://www.open-lab.net/blog/?p=12093 2022-08-21T23:39:08Z 2018-09-26T17:30:03Z

NVIDIA Collective Communications Library (NCCL) provides optimized implementation of inter-GPU communication operations, such as allreduce and variants....]]>



NVIDIA Collective Communications Library (NCCL) provides optimized implementation of inter-GPU communication operations, such as allreduce and variants. Developers using deep learning frameworks can rely on NCCL's highly optimized, MPI-compatible, and topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. NCCL is optimized for high bandwidth and...
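To make the allreduce operation mentioned above concrete, here is a minimal sketch that simulates a sum allreduce over a ring of ranks in plain Python. It is an illustration under stated assumptions, not NCCL code and not necessarily the algorithm NCCL selects for a given topology: the function name `ring_allreduce` and the list-of-lists representation of per-rank buffers are hypothetical choices made for this example. The ring variant shown runs a reduce-scatter phase followed by an allgather phase.

```python
def ring_allreduce(buffers):
    """Simulate a sum allreduce across len(buffers) ranks arranged in a ring.

    `buffers[r]` is rank r's input list. Returns one output list per rank,
    each holding the element-wise sum of all inputs. Illustrative only.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all ranks must contribute equally sized buffers")

    data = [list(b) for b in buffers]  # work on copies of the inputs
    # Split each buffer into n contiguous chunks (some may be empty).
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n to
    # its right neighbor, which accumulates it into its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            sends.append((c, data[r][lo:hi]))  # snapshot before any update
        for r in range(n):
            c, payload = sends[r]
            lo, _ = bounds[c]
            dst = (r + 1) % n
            for i, value in enumerate(payload):
                data[dst][lo + i] += value
    # Rank r now holds the fully reduced chunk (r + 1) % n.

    # Phase 2, allgather: in step s, rank r forwards chunk (r + 1 - s) % n,
    # which is already fully reduced, and the neighbor overwrites its copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            sends.append((c, data[r][lo:hi]))
        for r in range(n):
            c, payload = sends[r]
            lo, hi = bounds[c]
            data[(r + 1) % n][lo:hi] = payload
    return data


if __name__ == "__main__":
    inputs = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, result in enumerate(ring_allreduce(inputs)):
        print(rank, result)
```

In this scheme each rank exchanges only one chunk per step with its neighbor, so the per-rank traffic stays roughly constant as the number of ranks grows, which is one reason ring-style schedules are a common choice for bandwidth-bound reductions.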

]]> 1