Deep learning recommender systems often use large embedding tables that can be difficult to fit in GPU memory. This post shows you how to combine the model parallel and data parallel training paradigms to overcome this memory limitation and train large deep learning recommender systems more quickly. I share the steps that my team took to efficiently train a 113 billion-parameter…
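A minimal sketch of the hybrid idea described above, assuming PyTorch rather than the exact setup used in the post: each rank keeps a shard of the large embedding table (model parallel) while the small dense MLP is replicated and synchronized with DistributedDataParallel (data parallel). The sizes and module names are illustrative only.

```python
# Illustrative hybrid model-parallel / data-parallel setup (launch with torchrun).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")

NUM_EMBEDDINGS = 10_000_000  # total embedding rows, sharded across ranks (illustrative)
EMBED_DIM = 128

# Model parallel: each rank owns only its slice of the embedding table,
# so the full table never has to fit on a single GPU.
rows_per_rank = NUM_EMBEDDINGS // world_size
local_table = torch.nn.Embedding(rows_per_rank, EMBED_DIM).to(device)
# In a full implementation, an all-to-all exchange would route each batch's
# indices to the rank that owns the corresponding rows.

# Data parallel: the dense MLP is small, so replicate it on every rank and
# let DDP average its gradients during the backward pass.
mlp = torch.nn.Sequential(
    torch.nn.Linear(EMBED_DIM, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
).to(device)
mlp = DDP(mlp, device_ids=[rank])
```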
NVIDIA Merlin is an open beta application framework and ecosystem that enables the end-to-end development of recommender systems, from data preprocessing to model training and inference, all accelerated on NVIDIA GPUs. We announced Merlin in a previous post and have been continuously making updates to the open beta. In this post, we detail the new features added to the open beta NVIDIA Merlin…
The MLPerf consortium's mission is to "build fair and useful benchmarks" that provide an unbiased training and inference performance reference for ML hardware, software, and services. MLPerf Training v0.7 is the third instantiation for training and continues to evolve to stay on the cutting edge. This round consists of eight different workloads that cover a broad diversity of use cases…
MLPerf is an industry-wide AI consortium that has developed a suite of performance benchmarks covering a range of leading AI workloads that are widely in use today. The latest MLPerf v0.7 training submission includes vision, language, recommenders, and reinforcement learning. NVIDIA submitted MLPerf v0.7 training results for all eight tests and the NVIDIA platform set records in all…
Recommender systems help people find what they're looking for among an exponentially growing number of options. They are a critical component for driving user engagement on many online platforms. With the rapid growth in scale of industry datasets, deep learning (DL) recommender models, which capitalize on large amounts of training data, have started to show advantages over traditional…