Brian Pharris – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
Feed: http://www.open-lab.net/blog/feed/ (last updated 2024-11-29T21:06:37Z)

3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot
By Brian Pharris | Published 2024-11-01T22:00:36Z | Updated 2024-11-14T17:10:52Z
http://www.open-lab.net/blog/?p=91412

Deploying generative AI workloads in production environments where user numbers can fluctuate from hundreds to hundreds of thousands – and where input sequence lengths differ with each request – poses unique challenges. To achieve low latency inference in these environments, multi-GPU setups are a must – irrespective of the GPU generation or its memory capacity. To enhance inference performance in…
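
For a concrete picture of the collective being optimized, below is a minimal AllReduce sketch in plain PyTorch/NCCL. It shows the generic collective, not the MultiShot implementation itself; the tensor size is arbitrary and the launch command assumes torchrun.

```python
# Minimal AllReduce sketch with PyTorch/NCCL -- the generic collective that
# TensorRT-LLM MultiShot accelerates (MultiShot itself is not shown here).
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()

    # Each GPU holds a partial result (e.g., one shard of a layer's output);
    # AllReduce leaves the elementwise sum on every GPU.
    partial = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)

    # Every rank now holds sum(range(world_size)) in each element.
    print(f"rank {rank}: {partial[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
```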

Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance
By Brian Pharris | Published 2024-09-26T21:44:00Z | Updated 2024-11-29T21:06:06Z
http://www.open-lab.net/blog/?p=88938

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
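
TTFT is easy to measure from the client side. A minimal sketch, assuming an OpenAI-compatible streaming endpoint (common for LLM inference servers); the base URL and model name are placeholders:

```python
# Measure time to first token (TTFT) against a streaming chat endpoint.
import time

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain NVLink in one sentence."}],
    stream=True,
)

for chunk in stream:
    # TTFT is the elapsed time until the first content token arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f} s")
        break
```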

Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch
By Brian Pharris | Published 2024-09-05T18:30:00Z | Updated 2024-11-29T21:06:37Z
http://www.open-lab.net/blog/?p=88127

As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends both on the ability of the combined GPUs to process requests as “one mighty GPU” with ultra-fast GPU-to-GPU communication and advanced software able to take full…
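
The Medusa technique named in the title attaches extra decoding heads that propose several draft tokens, which the base model then verifies in a single pass. A toy sketch of the greedy accept step, as a simplification rather than the TensorRT-LLM implementation:

```python
# Toy greedy acceptance for Medusa-style speculative decoding: keep draft
# tokens up to the first disagreement with the base model's own predictions.
def accept_draft(draft_tokens: list[int], verified_tokens: list[int]) -> list[int]:
    """Keep the longest verified prefix plus the base model's correction.

    A full implementation also appends the base model's next token when
    every draft token matches; that case is omitted here for brevity.
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verified_tokens):
        if draft == verified:
            accepted.append(draft)     # draft agrees with the base model
        else:
            accepted.append(verified)  # mismatch: emit the base model's token
            break                      # and discard the rest of the draft
    return accepted


# Example: 3 of 4 draft tokens verified, so 4 tokens land in one step.
print(accept_draft([11, 42, 7, 99], [11, 42, 7, 13]))  # -> [11, 42, 7, 13]
```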

Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA
By Brian Pharris | Published 2022-09-08T18:10:00Z | Updated 2023-07-05T19:26:31Z
http://www.open-lab.net/blog/?p=54638

Today’s AI-powered applications are enabling richer experiences, fueled by both larger and more complex AI models as well as the application of many models in a pipeline. To meet the increasing demands of AI-infused applications, an AI platform must not only deliver high performance but also be versatile enough to deliver that performance across a diverse range of AI models.

Nv-Wavenet: Better Speech Synthesis Using GPU-Enabled WaveNet Inference
By Brian Pharris | Published 2018-04-23T18:43:32Z | Updated 2022-10-10T18:51:16Z
http://www.open-lab.net/blog/?p=10169

WaveNets represent an exciting new neural network architecture used to generate raw audio waveforms, including the ability to synthesize very high quality speech. These networks have proven challenging to deploy on CPUs, as generating speech in real-time or better requires substantial computation in tight timeframes. Fortunately, GPUs offer the tremendous parallel compute capability needed to make…
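
A quick back-of-the-envelope makes those tight timeframes concrete. The sample rate below is an assumed, illustrative figure; the key point is that the samples are strictly serial, so each full forward pass must fit inside the per-sample budget:

```python
# Per-sample latency budget for autoregressive audio generation. Each sample
# depends on the previous one, so raw throughput cannot hide latency: one
# complete forward pass must finish inside this budget.
sample_rate_hz = 24_000                  # assumed synthesis rate
budget_us = 1_000_000 / sample_rate_hz   # microseconds per sample
print(f"Budget per sample at {sample_rate_hz} Hz: {budget_us:.1f} us")  # ~41.7 us
print(f"Serial forward passes per second of audio: {sample_rate_hz}")
```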
