AI factories rely on more than just compute fabrics. While the East-West network connecting the GPUs is critical to AI application performance, the storage fabric, which connects the high-speed storage arrays, is equally important. Storage performance plays a key role across several stages of the AI lifecycle, including training checkpointing, inference techniques such as retrieval-augmented generation…
Migrating between major versions of software can present several challenges to infrastructure management teams. These challenges can prevent users from adopting newer versions, so they miss out on newer, more powerful features. Effective planning and thorough testing are essential to overcoming these challenges and ensuring a smooth transition between Cumulus Linux 3.7.x and 4.x.
Data center automation dates back to the early days of the mainframe, with operational efficiency topping the list of its benefits. Over the years, technologies have changed both inside and outside the data center. As a result, tools and approaches have evolved as well. The NVIDIA NVUE Collection and Ansible aim to simplify your network automation journey by providing a comprehensive list of…
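As a rough sketch of how that workflow might look (assuming the collection is published on Ansible Galaxy as nvidia.nvue; the playbook and inventory names below are hypothetical, not taken from the post), the collection can be installed and the switches driven from a playbook:

# Install the NVIDIA NVUE collection and run a playbook against the fabric.
# "deploy-fabric.yml" and "inventory/switches" are placeholder names.
ansible-galaxy collection install nvidia.nvue
ansible-playbook -i inventory/switches deploy-fabric.yml --check   # dry run
ansible-playbook -i inventory/switches deploy-fabric.yml           # apply

Running with --check first lets you preview the proposed changes before anything is pushed to the switches.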
For HPC clusters purpose-built for AI training, such as the NVIDIA DGX BasePOD and NVIDIA DGX SuperPOD, fine-tuning is critical to maximizing overall cluster performance. This includes tuning the management fabric (based on Ethernet), the storage fabric (Ethernet or InfiniBand), and the compute fabric (Ethernet or InfiniBand).
Large language models (LLMs) and AI applications such as ChatGPT and DALL-E have recently seen rapid growth. Thanks to GPUs, CPUs, DPUs, high-speed storage, and AI-optimized software innovations, AI is now widely accessible. You can even deploy AI in the cloud or on-premises. Yet AI applications can be very taxing on the network, and this growth is burdening CPU and GPU servers…
With evolving and ever-growing data centers, the days of simple networks that remained mostly unchanged are gone. Back then, when a configuration change was needed, it was simple for the network administrator to make the changes device by device, line by line. As data centers evolve from physical on-premises deployments to digitized cloud infrastructures, traditional networks have evolved too.
AI has seamlessly integrated into our lives and changed us in ways we couldn't even imagine just a few years ago. In the past, the perception of AI was something futuristic and complex. Only giant corporations used AI on their supercomputers with HPC technologies to forecast weather and make breakthrough discoveries in healthcare and science. Today, thanks to GPUs, CPUs, high-speed storage…
Modern data centers can run thousands of services and applications. When an issue occurs, as a network administrator, you are guilty by default. You have to prove your innocence on a daily basis, as it is easy to blame the network. It is an unfair world. Correlating application performance issues to the network is hard to do. You can start by checking basic connectivity using simple pings or…
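A first pass at those basic checks might look like the following, run from a Linux host or Cumulus Linux switch; the address is a placeholder, and the last command assumes NVIDIA NetQ is deployed in the environment:

# Quick reachability and path checks (10.1.1.20 is a placeholder address).
ping -c 5 10.1.1.20        # ICMP reachability and packet loss
traceroute 10.1.1.20       # hop-by-hop path through the fabric
mtr --report 10.1.1.20     # per-hop latency and loss summary
# With NetQ deployed, fabric-wide protocol validation goes beyond pings:
netq check bgp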
In the old days of 10 Mbps Ethernet, long before Time-Sensitive Networking became a thing, state-of-the-art shared networks basically required that packets would collide. For the primitive technology of the time, this was eminently practical: computationally preferable to any solution that would require carefully managed access to the medium. After mangling each other's data…
When network engineers engage with networking gear for the first time, they do it through a command-line interface (CLI). While the CLI is still widely used, network scale has reached new highs, making the CLI inefficient for managing and configuring an entire data center. Naturally, as the software industry has progressed toward automation, networking is no exception. Network vendors have all provided…
NVIDIA NetQ is a highly scalable, modern networking operations tool providing actionable visibility for the NVIDIA Spectrum Ethernet platform. It combines advanced telemetry with a user interface, making it easier to troubleshoot and automate network workflows while reducing maintenance and downtime. We have recently released NetQ 4.2.0, which includes: … For more information about…
Data center organizations are looking for more efficient, modern network architectures that can be managed, monitored, and deployed in a scalable manner. Emerging DevOps and NetDevOps operational models are bringing the agile development models of continuous integration and continuous delivery (CI/CD) to data center infrastructure. The Cumulus Linux operating system was built from the…
Looking to try open networking for free? Try NVIDIA Cumulus VX, a free virtual appliance that provides all the features of NVIDIA Cumulus Linux. You can preview and test Cumulus Linux in your own environment, at your own pace, without organizational or economic barriers. You can also produce sandbox environments for prototype assessment, preproduction rollouts, and script development.
Cumulus Linux 4.4 is the first release with the NVIDIA User Experience (NVUE), a brand-new CLI for Cumulus Linux. Being excited about a new networking CLI sounds a bit like being excited about your new 56k modem. What makes NVUE special isn't just that it's a new CLI; it's the principles it was built on that make it unique. At its core, NVUE has created a full object model of Cumulus Linux…
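For a flavor of what working against that object model looks like at the command line, here is a minimal, illustrative NVUE session; the hostname, interface, and address are placeholders, not taken from the post:

# Stage changes in the NVUE object model, review them, then apply.
nv set system hostname leaf01
nv set interface swp1 ip address 192.168.10.1/24   # placeholder interface and address
nv config diff     # show staged changes against the applied configuration
nv config apply    # commit the staged changes to the running system

Because every change is staged against the object model first, the diff-then-apply step gives you a transactional review before anything touches the live configuration.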