Rob Smallshire once said, “You can write faster code in C++, but write code faster in Python.” Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs. More recently, libraries such as CuPy and PyTorch have allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries…
In automatic speech recognition (ASR), one widely used method combines traditional machine learning with deep learning. In ASR flows of this type, audio features are first extracted from the raw audio and then passed into an acoustic model: a neural network trained on transcribed data to predict phoneme probabilities from the features. A phoneme is a single…
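The sketch below only illustrates that flow, not the pipeline from the post: a stand-in acoustic model (a single linear layer plus softmax, with assumed feature and phoneme counts) maps one frame of audio features to a probability distribution over phonemes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int kNumFeatures = 40;  // assumed number of audio features per frame
constexpr int kNumPhonemes = 48;  // assumed size of the phoneme inventory

// Stand-in acoustic model: one linear layer plus softmax. A real acoustic
// model is a deep network trained on transcribed speech.
std::vector<float> phoneme_probabilities(const std::vector<float>& features,
                                         const std::vector<float>& weights,
                                         const std::vector<float>& bias) {
    std::vector<float> logits(kNumPhonemes, 0.0f);
    for (int p = 0; p < kNumPhonemes; ++p) {
        for (int f = 0; f < kNumFeatures; ++f)
            logits[p] += weights[p * kNumFeatures + f] * features[f];
        logits[p] += bias[p];
    }
    // Softmax converts the per-phoneme scores into a probability distribution.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float& l : logits) { l = std::exp(l - max_logit); sum += l; }
    for (float& l : logits) l /= sum;
    return logits;
}

int main() {
    std::vector<float> frame(kNumFeatures, 0.1f);  // one frame of extracted features
    std::vector<float> weights(kNumPhonemes * kNumFeatures, 0.01f);
    std::vector<float> bias(kNumPhonemes, 0.0f);
    std::vector<float> probs = phoneme_probabilities(frame, weights, bias);
    std::printf("P(phoneme 0 | frame) = %f\n", probs[0]);
    return 0;
}
```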
Recently, NVIDIA achieved GPU-accelerated speech-to-text inference with exciting performance results; the post announcing that work described the general process of the Kaldi ASR pipeline and indicated which of its elements the team accelerated, namely implementing the decoder on the GPU and taking advantage of Tensor Cores in the acoustic model.
Think of a sentence and repeat it aloud three times. If someone recorded this speech and performed a point-by-point comparison, they would find that no single utterance exactly matched the others. Similar to different resolutions, angles, and lighting conditions in imagery, human speech varies with respect to timing, pitch, amplitude, and even how base units of speech – phonemes and morphemes…
We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we mean not just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule…
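As a rough illustration of that one-CPU-thread-per-GPU pattern (the specific bug and rule are left to the post itself), the sketch below launches one std::thread per device, and each thread calls cudaSetDevice before doing any CUDA work, since the current device is per-thread state.

```cpp
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void worker(int device, int n) {
    // Select this thread's GPU before any allocations or kernel launches.
    cudaSetDevice(device);

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    std::printf("device %d done\n", device);
}

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // One CPU thread per GPU, each driving its own device.
    std::vector<std::thread> threads;
    for (int d = 0; d < device_count; ++d)
        threads.emplace_back(worker, d, 1 << 20);
    for (auto& t : threads) t.join();
    return 0;
}
```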
Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but that 6-year-old document bears updating. In this post I will show you some features of the Kepler GPU architecture that make reductions even faster: the shuffle (SHFL) instruction and fast device memory…
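As a small sketch of a shuffle-based reduction (not the post’s implementation), the kernel below sums an array using warp-level shuffles; it uses the modern __shfl_down_sync intrinsic, whereas the Kepler-era post predates the _sync variants.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__inline__ __device__ float warp_reduce_sum(float val) {
    // Each step halves the number of active values; after five steps,
    // lane 0 holds the sum of all 32 lanes in the warp.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduce_sum(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warp_reduce_sum(val);
    // Lane 0 of each warp contributes its partial sum to the global result.
    if ((threadIdx.x & 31) == 0) atomicAdd(out, val);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_in(n, 1.0f);  // all ones, so the expected sum is n

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    reduce_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("sum = %.0f (expected %d)\n", result, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```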