AI factories rely on more than just compute fabrics. While the East-West network connecting the GPUs is critical to AI application performance, the storage fabric, which connects high-speed storage arrays, is equally important. Storage performance plays a key role across several stages of the AI lifecycle, including training checkpointing, inference techniques such as retrieval-augmented generation…
The growth of AI is driving exponential growth in computing power and a doubling of networking speeds every few years. Less well known is that it's also putting new demands on storage. Training new models typically requires high-bandwidth networked access to petabytes of data, while inference with the latest types of retrieval-augmented generation (RAG) requires low-latency access to…
As the demand for sophisticated AI capabilities escalates, VAST Data introduces the VAST Data Platform, now enhanced with NVIDIA BlueField DPUs. This innovation is tailored to meet the stringent demands of AI-driven data centers and to optimize AI workloads and data management. This post presents how BlueField DPUs give VAST a significant boost in both performance and efficiency to…
As AI becomes integral to organizational innovation and competitive advantage, the need for efficient and scalable infrastructure is more critical than ever. A partnership between NVIDIA and DDN Storage is setting new standards in this area. By integrating NVIDIA BlueField DPUs into DDN EXAScaler and DDN Infinia and using them innovatively, DDN Storage is transforming data-centric workloads.
Identity-based attacks are on the rise, with phishing remaining the most common and second-most expensive attack vector. Some attackers are using AI to craft more convincing phishing messages and deploying bots to get around automated defenses designed to spot suspicious behavior. At the same time, a continued increase in enterprise applications introduces challenges for IT teams who must…
As data generation continues to increase, linear performance scaling has become an absolute requirement for scale-out storage. Storage networks are like car roadway systems: if the road is not built for speed, the potential speed of a car does not matter. Even a Ferrari is slow on an unpaved dirt road full of obstacles. Scale-out storage performance can be hindered by the Ethernet fabric…
At COMPUTEX 2023, NVIDIA announced the NVIDIA DGX GH200, which marks another breakthrough in GPU-accelerated computing to power the most demanding giant AI workloads. In addition to describing critical aspects of the NVIDIA DGX GH200 architecture, this post discusses how NVIDIA Base Command enables rapid deployment, accelerates the onboarding of users, and simplifies system management.
Announced in March 2023, NVIDIA DOCA 2.0, the newest release of the NVIDIA SDK for BlueField DPUs, is now available. Together, NVIDIA DOCA and BlueField DPUs accelerate the development of applications that deliver breakthrough networking, security, and storage performance with a comprehensive, open development platform. NVIDIA DOCA 2.0 includes newly added support for the BlueField-3 Data…
There are many benefits of GPUs in scaling AI, ranging from faster model training to GPU-accelerated fraud detection. When planning AI models and deployed apps, scalability challenges, especially performance and storage, must be accounted for. Regardless of the use case, AI solutions have four elements in common. Of these elements, data storage is often the most neglected during…
Supercomputers are used to model and simulate the most complex processes in scientific computing, often for insight into new discoveries that otherwise would be impractical or impossible to demonstrate physically. The NVIDIA BlueField data processing unit (DPU) is transforming high-performance computing (HPC) resources into more efficient systems, while accelerating problem solving across a…
Artificial intelligence (AI) is becoming pervasive in the enterprise. Speech recognition, recommenders, and fraud detection are just a few applications among hundreds being driven by AI and deep learning (DL). To support these AI applications, businesses look toward optimizing AI servers and performance networks. Unfortunately, storage infrastructure requirements are often overlooked in the…
NVIDIA Zero Touch RoCE (ZTR) enables data centers to seamlessly deploy RDMA over Converged Ethernet (RoCE) without requiring any special switch configuration. Until recently, ZTR was optimal for only small to medium-sized data centers. Meanwhile, large-scale deployments have traditionally relied on Explicit Congestion Notification (ECN) to enable RoCE network transport…
This post is an update to an older post. Deep learning models require training with vast amounts of data to achieve accurate results. Raw data usually cannot be fed directly into a neural network for reasons such as differing storage formats, compression, varying data formats and sizes, and a limited amount of high-quality data. Addressing these issues requires extensive data preparation…
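The preparation the excerpt describes typically means decoding, resizing to a uniform shape, normalizing, and batching before samples reach the network. A minimal NumPy sketch of the normalize-and-batch step, assuming already-decoded images of varying sizes (this is an illustration only, not NVIDIA DALI's actual API; DALI runs these stages as an accelerated pipeline, often on the GPU):

```python
import numpy as np

def center_crop(img, size):
    """Crop an HxWxC array to size x size around its center."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def prepare_batch(images, size=224):
    """Crop to a common shape, scale to [0, 1], and stack into one NCHW batch.

    `images` is a list of uint8 HxWxC arrays of varying sizes, standing in
    for decoded JPEGs; a real pipeline also handles decoding, augmentation,
    and shuffling.
    """
    batch = np.stack([center_crop(img, size) for img in images])
    batch = batch.astype(np.float32) / 255.0   # normalize pixel range
    return batch.transpose(0, 3, 1, 2)         # NHWC -> NCHW for the framework

# Three "decoded images" of different sizes become one uniform batch.
imgs = [np.random.randint(0, 256, (s, s, 3), dtype=np.uint8) for s in (256, 300, 240)]
batch = prepare_batch(imgs, size=224)
print(batch.shape)  # (3, 3, 224, 224)
```

The point of the sketch is that every sample must end up with identical shape and dtype before batching, which is exactly the work that cannot be skipped when feeding a neural network.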
Does the switch matter? The network fabric is key to the performance of modern data centers. There are many requirements for data center switches, but the most basic is to provide equal amounts of bandwidth to all clients so that resources are shared evenly. Without fair networking, all workloads experience unpredictable performance due to throughput deterioration, delay…
DOCA is a software framework for developing applications on BlueField DPUs. By using DOCA, you can offload infrastructure workloads from the host CPU and accelerate them with the BlueField DPU. This enables an infrastructure that is software-defined yet hardware accelerated, maximizing both performance and flexibility in the data center. NVIDIA first introduced DOCA in October 2020.
Cloud technologies are increasingly taking over the worldwide IT infrastructure market. With offerings that include elastic compute, storage, and networking, cloud service providers (CSPs) allow customers to rapidly scale their IT infrastructure up and down without having to build and manage it on their own. The increasing demand for differentiated and cost-effective cloud products and services is…
As AI and HPC datasets continue to increase in size, the time spent loading data for a given application begins to place a strain on the total application's performance. When considering end-to-end application performance, fast GPUs are increasingly starved by slow I/O. I/O, the process of loading data from storage to GPUs for processing, has historically been controlled by the CPU.
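When I/O is CPU-mediated, a classic software mitigation is to overlap the next read with computation on the current chunk, so the processor is never idle waiting on storage. A minimal thread-based prefetch sketch (illustrative only, with hypothetical `load_chunk`/`compute` stand-ins; technologies like GPUDirect Storage go further by moving data directly from storage to GPU memory, bypassing the CPU bounce buffer):

```python
import concurrent.futures as cf

def load_chunk(i):
    """Stand-in for a slow storage read (the CPU-mediated I/O path)."""
    return list(range(i * 4, i * 4 + 4))

def compute(chunk):
    """Stand-in for GPU work performed on a loaded chunk."""
    return sum(chunk)

def pipeline(n_chunks):
    """Prefetch the next chunk on a worker thread while processing the current one."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, 0)       # kick off the first read
        for i in range(n_chunks):
            chunk = pending.result()               # wait for the in-flight read
            if i + 1 < n_chunks:
                pending = pool.submit(load_chunk, i + 1)  # issue the next read early
            results.append(compute(chunk))         # compute overlaps the pending I/O
    return results

print(pipeline(3))  # [6, 22, 38]
```

Overlap hides I/O latency but cannot raise I/O bandwidth; when the reads themselves are the bottleneck, the data path itself has to change, which is the motivation the excerpt builds toward.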
Editor's Note: This post has been updated. Here is the revised post. Training deep learning models with vast amounts of data is necessary to achieve accurate results. Data in the wild, or even prepared datasets, is usually not in a form that can be directly fed into a neural network. This is where NVIDIA DALI data preprocessing comes into play. There are various reasons for that…
When production systems are not delivering expected levels of performance, root-causing the issue can be a challenging and time-consuming task, especially in today's complex environments, where the workload comprises many software components and libraries and relies on virtually all of the underlying hardware subsystems (CPU, memory, disk I/O, network I/O) to deliver maximum throughput.