Rob Smallshire once said, “You can write faster code in C++, but write code faster in Python.” Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs. More recently, libraries such as CuPy and PyTorch have allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries…
In automatic speech recognition (ASR), one widely used method combines traditional machine learning with deep learning. In ASR flows of this type, audio features are first extracted from the raw audio and then passed into an acoustic model: a neural network trained on transcribed data to predict phoneme probabilities from the features. A phoneme is a single…
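The sketch below only illustrates that flow, not the pipeline from the post: a stand-in acoustic model (a single linear layer plus softmax, with assumed feature and phoneme counts) maps one frame of audio features to a probability distribution over phonemes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int kNumFeatures = 40;  // assumed number of audio features per frame
constexpr int kNumPhonemes = 48;  // assumed size of the phoneme inventory

// Stand-in acoustic model: one linear layer plus softmax. A real acoustic
// model is a deep network trained on transcribed speech.
std::vector<float> phoneme_probabilities(const std::vector<float>& features,
                                         const std::vector<float>& weights,
                                         const std::vector<float>& bias) {
    std::vector<float> logits(kNumPhonemes, 0.0f);
    for (int p = 0; p < kNumPhonemes; ++p) {
        for (int f = 0; f < kNumFeatures; ++f)
            logits[p] += weights[p * kNumFeatures + f] * features[f];
        logits[p] += bias[p];
    }
    // Softmax converts the per-phoneme scores into a probability distribution.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float& l : logits) { l = std::exp(l - max_logit); sum += l; }
    for (float& l : logits) l /= sum;
    return logits;
}

int main() {
    std::vector<float> frame(kNumFeatures, 0.1f);  // one frame of extracted features
    std::vector<float> weights(kNumPhonemes * kNumFeatures, 0.01f);
    std::vector<float> bias(kNumPhonemes, 0.0f);
    std::vector<float> probs = phoneme_probabilities(frame, weights, bias);
    std::printf("P(phoneme 0 | frame) = %f\n", probs[0]);
    return 0;
}
```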
Recently, NVIDIA achieved GPU-accelerated speech-to-text inference with exciting performance results; the post announcing that work described the general process of the Kaldi ASR pipeline and indicated which of its elements the team accelerated, namely implementing the decoder on the GPU and taking advantage of Tensor Cores in the acoustic model.
Think of a sentence and repeat it aloud three times. If someone recorded this speech and performed a point-by-point comparison, they would find that no single utterance exactly matched the others. Similar to different resolutions, angles, and lighting conditions in imagery, human speech varies with respect to timing, pitch, amplitude, and even how base units of speech – phonemes and morphemes…
We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we mean not just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule…
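As a rough illustration of that one-CPU-thread-per-GPU pattern (the specific bug and rule are left to the post itself), the sketch below launches one std::thread per device, and each thread calls cudaSetDevice before doing any CUDA work, since the current device is per-thread state.

```cpp
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void worker(int device, int n) {
    // Select this thread's GPU before any allocations or kernel launches.
    cudaSetDevice(device);

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    std::printf("device %d done\n", device);
}

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // One CPU thread per GPU, each driving its own device.
    std::vector<std::thread> threads;
    for (int d = 0; d < device_count; ++d)
        threads.emplace_back(worker, d, 1 << 20);
    for (auto& t : threads) t.join();
    return 0;
}
```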
Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but that 6-year-old document bears updating. In this post I will show you some features of the Kepler GPU architecture that make reductions even faster: the shuffle (SHFL) instruction and fast device memory…
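As a small sketch of a shuffle-based reduction (not the post’s implementation), the kernel below sums an array using warp-level shuffles; it uses the modern __shfl_down_sync intrinsic, whereas the Kepler-era post predates the _sync variants.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__inline__ __device__ float warp_reduce_sum(float val) {
    // Each step halves the number of active values; after five steps,
    // lane 0 holds the sum of all 32 lanes in the warp.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduce_sum(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warp_reduce_sum(val);
    // Lane 0 of each warp contributes its partial sum to the global result.
    if ((threadIdx.x & 31) == 0) atomicAdd(out, val);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_in(n, 1.0f);  // all ones, so the expected sum is n

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    reduce_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("sum = %.0f (expected %d)\n", result, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```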