Sharath Sreenivas – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
Feed: http://www.open-lab.net/blog/feed/ (last updated 2024-12-18)

Data-Efficient Knowledge Distillation for Supervised Fine-Tuning with NVIDIA NeMo-Aligner
Sharath Sreenivas | Published 2024-12-18 | http://www.open-lab.net/blog/?p=94082

Knowledge distillation is an approach for transferring the knowledge of a much larger teacher model to a smaller student model, ideally yielding a compact, easily deployable student with comparable accuracy to the teacher. Knowledge distillation has gained popularity in pretraining settings, but there are fewer resources available for performing knowledge distillation during supervised fine-tuning…
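To make the idea concrete, here is a minimal sketch of a distillation loss for supervised fine-tuning: a temperature-scaled KL term that pulls the student toward the teacher's token distribution, blended with the standard cross-entropy on ground-truth labels. The function and parameter names are illustrative and are not the NeMo-Aligner API.

```python
# Minimal sketch of a fine-tuning distillation loss (illustrative names,
# not the NeMo-Aligner API).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft KL term (teacher -> student) with the hard-label CE term."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy on the ground-truth tokens.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1.0 - alpha) * hard
```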

Source

Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy
Sharath Sreenivas | Published 2024-10-08, updated 2024-10-17 | http://www.open-lab.net/blog/?p=87739

This post was originally published August 21, 2024, but has been revised with current data. Recently, NVIDIA and Mistral AI unveiled Mistral NeMo 12B, a state-of-the-art large language model (LLM). Mistral NeMo 12B consistently outperforms similarly sized models on a wide range of benchmarks. We announced Mistral-NeMo-Minitron 8B, one of the most advanced open-access models in…
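As a usage sketch only: if the checkpoint is fetched from Hugging Face Hub, loading it with transformers might look like the following. The repository ID below is an assumption; check the official model card for the exact name and license terms.

```python
# Hypothetical usage sketch: the repo ID is an assumption, not confirmed here.
# Requires the `transformers` and `accelerate` packages.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("The fastest way to shrink an LLM is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```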

Source

How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model
Sharath Sreenivas | Published 2024-08-14, updated 2024-08-22 | http://www.open-lab.net/blog/?p=87164

Large language models (LLMs) are now a dominant force in natural language processing and understanding, thanks to their effectiveness and versatility. LLMs such as Llama 3.1 405B and NVIDIA Nemotron-4 340B excel in many challenging tasks, including coding, reasoning, and math. They are, however, resource-intensive to deploy. As a result, there is a parallel industry trend toward developing small language…
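To illustrate the prune-then-distill recipe, the sketch below performs simple importance-based width pruning of one feed-forward block, scoring hidden neurons by weight norm and rebuilding the layers at the smaller width. The scoring heuristic is an illustrative assumption, not the Minitron or NeMo pruning implementation; a distillation phase would then recover accuracy.

```python
# Illustrative width pruning for one MLP block (fc1 -> activation -> fc2).
import torch
import torch.nn as nn

def prune_mlp_width(fc1: nn.Linear, fc2: nn.Linear, keep: int):
    """Keep the `keep` hidden neurons with the largest incoming weight norm
    (a simple importance proxy) and rebuild both Linear layers."""
    importance = fc1.weight.norm(dim=1)              # one score per hidden neuron
    idx = importance.topk(keep).indices.sort().values

    new_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[idx])        # keep selected rows of fc1
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[idx])
        new_fc2.weight.copy_(fc2.weight[:, idx])     # keep matching columns of fc2
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```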

Source

Pretraining BERT with Layer-wise Adaptive Learning Rates
Sharath Sreenivas | Published 2019-12-05, updated 2022-08-21 | http://www.open-lab.net/blog/?p=15981

Training with larger batches is a straightforward way to scale training of deep neural networks to larger numbers of accelerators and reduce the training time. However, as the batch size increases, numerical instability can appear in the training process. The purpose of this post is to provide an overview of one class of solutions to this problem: layer-wise adaptive optimizers, such as LARS, LARC…
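The core of these optimizers is a layer-wise trust ratio: each layer's learning rate is scaled by the ratio of its weight norm to its gradient norm, so layers with small weights and large gradients take smaller steps. The sketch below illustrates LARC's clipping variant; the function name and default trust coefficient are illustrative assumptions, not the NVIDIA Apex implementation.

```python
# Illustrative LARC-style trust ratio; in practice this is recomputed for every
# parameter on every optimizer step, inside the optimizer itself.
import torch

def larc_lr(param: torch.Tensor, global_lr: float,
            trust_coefficient: float = 0.02, eps: float = 1e-8) -> float:
    """Layer-wise learning rate: trust * ||w|| / ||grad w||, capped at global_lr."""
    w_norm = param.detach().norm()
    g_norm = param.grad.detach().norm()
    if w_norm == 0 or g_norm == 0:
        return global_lr                      # degenerate layer: use the global rate
    local_lr = trust_coefficient * w_norm / (g_norm + eps)
    return min(float(local_lr), global_lr)    # LARC "clipping" mode
```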

Source
