NVIDIA NeMo is an end-to-end platform for the development of multimodal generative AI models at scale anywhere: on any cloud and on-premises. The NeMo team just released Canary, a multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization. Canary also provides bidirectional translation between English and the three other supported…
NVIDIA NeMo, an end-to-end platform for developing multimodal generative AI models at scale anywhere (on any cloud and on-premises), recently released Parakeet-TDT. This new addition to the NeMo ASR Parakeet model family boasts better accuracy and 64% greater speed than the previously best model, Parakeet-RNNT-1.1B. This post explains Parakeet-TDT and how to use it to generate highly accurate…
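To picture the workflow such a post walks through, here is a minimal sketch of loading a pretrained NeMo ASR checkpoint and transcribing a file; the model identifier "nvidia/parakeet-tdt-1.1b" and the audio file name are illustrative assumptions, not taken verbatim from the post.

```python
# Minimal sketch: transcribe an audio file with a pretrained NeMo ASR model.
# The checkpoint name and audio path are illustrative assumptions.
import nemo.collections.asr as nemo_asr

# Load a pretrained checkpoint by name and run batch transcription.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-1.1b")
transcripts = asr_model.transcribe(["sample_audio.wav"])
print(transcripts[0])
```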
NVIDIA NeMo, an end-to-end platform for the development of multimodal generative AI models at scale anywhere (on any cloud and on-premises), released the Parakeet family of automatic speech recognition (ASR) models. These state-of-the-art ASR models, developed in collaboration with Suno.ai, transcribe spoken English with exceptional accuracy. This post details Parakeet ASR models that are…
Speech and translation AI models developed at NVIDIA are pushing the boundaries of performance and innovation. The NVIDIA Parakeet automatic speech recognition (ASR) family of models and the NVIDIA Canary multilingual, multitask ASR and translation model currently top the Hugging Face Open ASR Leaderboard. In addition, a multilingual P-Flow-based text-to-speech (TTS) model won the LIMMITS '24…
Breaking barriers in speech recognition, NVIDIA NeMo proudly presents pretrained models tailored for Dutch and Persian, languages often overlooked in the AI landscape. These models leverage the recently introduced FastConformer architecture and were trained simultaneously with CTC and transducer objectives to maximize each model's accuracy. Automatic speech recognition (ASR) is a…
At the core of understanding people correctly and having natural conversations is automatic speech recognition (ASR). To build customer-led voice assistants and automate customer service interactions over the phone, companies must solve the unique challenge of gaining a caller's trust through qualities such as understanding, empathy, and clarity. Telephony-bound voice is inherently challenging…
Generative AI technologies are revolutionizing how games are produced and played. Game developers are exploring how these technologies can accelerate their content pipelines and provide new gameplay experiences previously thought impossible. One area of focus, digital avatars, will have a transformative impact on how gamers will interact with non-playable characters (NPCs). Historically…
Meetings are the lifeblood of an organization. They foster collaboration and informed decision-making. They eliminate silos through brainstorming and problem-solving. And they further strategic goals and planning. Yet, leading meetings that accomplish these goals, especially those involving cross-functional teams and external participants, can be challenging. A unique blend of people…
The integration of speech and translation AI into our daily lives is rapidly reshaping our interactions, from virtual assistants to call centers and augmented reality experiences. Speech AI Day provided valuable insights into the latest advancements in speech AI, showcasing how this technology addresses real-world challenges. In this first of three Speech AI Day sessions…
Learn how to build and deploy production-quality conversational AI apps with real-time transcription and NLP.
From start-ups to large enterprises, businesses use cloud marketplaces to find the new solutions needed to quickly transform their businesses. Cloud marketplaces are online storefronts where customers can purchase software and services with flexible billing models, including pay-as-you-go, subscriptions, and privately negotiated offers. Businesses further benefit from committed spending at…
Audio can include a wide range of sounds, from human speech to non-speech sounds like barking dogs and sirens. When designing accessible applications for people with hearing difficulties, the application should be able to recognize sounds and understand speech. Such technology would help deaf or hard-of-hearing individuals with visualizing speech, like human conversations and non-speech…
Voice-enabled technology is becoming ubiquitous. But many are being left behind by an anglocentric and demographically biased algorithmic world. Mozilla Common Voice (MCV) and NVIDIA are collaborating to change that by partnering on a public crowdsourced multilingual speech corpus, now the largest of its kind in the world, and open-source pretrained models. It is now easier than ever before to…
According to Gartner, "Nearly half of digital workers struggle to find the data they need to do their jobs, and close to one-third have made a wrong business decision due to lack of information awareness."1 To address this challenge, more and more enterprises are deploying AI in customer service, as it helps to provide more efficient and information-based personalized services.
The telecom sector is transforming how communication happens. Striving to provide reliable, uninterrupted service, businesses are tackling the challenge of delivering an optimal customer experience. This optimal customer experience is something many long-time customers of large telecom service providers do not have. Take Jack, for example. His call was on hold for 10 minutes…
The telecommunication industry has seen a proliferation of AI-powered technologies in recent years, with speech recognition and translation leading the charge. Multilingual AI virtual assistants, digital humans, chatbots, agent assists, and audio transcription are technologies that are revolutionizing the telco industry. Businesses are implementing AI in call centers to address incoming requests…
This hands-on workshop guides you through the process of voice-enabling your product, from familiarizing yourself with NVIDIA Riva to assessing the costs and resources required for your project.
Learn about the latest tools, trends, and technologies for building and deploying conversational AI.
Explore the latest advances in accurate and customizable automatic speech recognition, multi-language translation, and text-to-speech.
Over 55% of the global population uses social media, easily sharing online content with just one click. While connecting with others and consuming entertaining content, you can also spot harmful narratives posing real-life threats. That's why Ammar Haris, VP of Engineering at Pendulum, wants his company's AI to help clients gain deeper insight into the harmful content being generated…
Have you ever tried to fine-tune a speech recognition system on your accent only to find that, while it recognizes your voice well, it fails to detect words spoken by others? This is common in speech recognition systems that have been trained on hundreds of thousands of hours of speech. In large-scale automatic speech recognition (ASR), a system may perform well in many but not all scenarios.
Multilingual automatic speech recognition (ASR) models have gained significant interest because of their ability to transcribe speech in more than one language. This is fueled by growing multilingual communities as well as by the need to reduce complexity: you only need one model to handle multiple languages. This post explains how to use pretrained multilingual NeMo ASR models from the…
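As a rough sketch of how pretrained NeMo ASR checkpoints are typically discovered and loaded (the multilingual model name below is a placeholder assumption, not the specific model the post covers, and the listing API is based on my understanding of NeMo):

```python
# Sketch: enumerate NeMo's published ASR checkpoints, then load one by name.
# The chosen model name is a placeholder; substitute the multilingual model from the post.
import nemo.collections.asr as nemo_asr

# List the pretrained models NeMo knows how to download.
for info in nemo_asr.models.ASRModel.list_available_models():
    print(info.pretrained_model_name)

# Load a specific checkpoint (placeholder name) and transcribe audio in a supported language.
model = nemo_asr.models.ASRModel.from_pretrained(model_name="<multilingual_asr_model_name>")
print(model.transcribe(["clip_in_spanish.wav"]))
```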
Learn to build an engaging and intelligent virtual assistant with NVIDIA AI workflows powered by NVIDIA Riva in this free hands-on lab from NVIDIA LaunchPad.
Speech AI applications, from call centers to virtual assistants, rely heavily on automatic speech recognition (ASR) and text-to-speech (TTS). ASR can process the audio signal and transcribe the audio to text. Speech synthesis or TTS can generate high-quality, natural-sounding audio from the text in real time. The challenge of Speech AI is to achieve high accuracy and meet the latency requirements…
From taking your order and serving you food in a restaurant to playing poker with you, service robots are becoming increasingly prevalent. Globally, you can find these service robots at hospitals, airports, and retail stores. According to Gartner, by 2030, 80% of humans will engage with smart robots daily, due to smart robot advancements in intelligence, social interactions…
Speech is one of the primary means to communicate with an AI-powered application. From virtual assistants to digital avatars, voice-based interfaces are changing how we typically interact with smart devices. Deep learning techniques for speech recognition and speech synthesis are helping improve the user experience; think human-like responses and natural-sounding tones. If you plan to…
As the global service economy grows, companies rely increasingly on contact centers to drive better customer experiences, increase customer satisfaction, and lower costs with increased efficiencies. Customer demand has increased far more rapidly than contact center employment ever could. Combined with the high agent churn rate, customer demand creates a need for more automated real-time customer…
Virtual agents or voice-enabled assistants have been around for quite some time. But in the last decade, their usefulness and popularity have exploded with the use of AI. According to Gartner, virtual assistants will automate up to 75% of tasks for call center agents by 2025, up from 30% in 2021. This translates to a better experience for both contact center agents and customers.
This post was updated in March 2023. Speech AI is used in a variety of applications, including contact centers' agent assists for empowering human agents, voice interfaces for intelligent virtual assistants (IVAs), and live captioning in video conferencing. To support these features, speech AI technology includes automatic speech recognition…
Real-time natural language understanding will transform how we interact with intelligent machines and applications.
Speech AI is the ability of intelligent systems to communicate with users using a voice-based interface, which has become ubiquitous in everyday life. People regularly interact with smart home devices, in-car assistants, and phones through speech. Speech interface quality has improved leaps and bounds in recent years, making these interfaces a much more pleasant, practical, and natural experience than just a…
Join experts from Google, Meta, NVIDIA, and more at the first annual NVIDIA Speech AI Summit. Register now!
Learn how to build, train, customize, and deploy a GPU-accelerated automatic speech recognition service with NVIDIA Riva in this self-paced course.
Speech recognition technology is growing in popularity for voice assistants and robotics, for solving real-world problems through assisted healthcare or education, and more. This is helping democratize access to speech AI worldwide. As labeled datasets for unique, emerging languages become more widely available, developers can build AI applications readily, accurately, and affordably to enhance…
Build better GPU-accelerated Speech AI applications with the latest NVIDIA Riva updates, including enterprise support.
When examining an intricate speech AI robotic system, it's easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, "Any sufficiently advanced technology is indistinguishable from magic." From accepting natural-language commands to safely interacting in real time with its environment and the humans around it, today's speech AI robotics systems can perform tasks to…
Text normalization (TN) converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech (TTS). TN ensures that TTS can handle all input texts without skipping unknown symbols. For example, "$123" is converted to "one hundred and twenty-three dollars." Inverse text normalization (ITN) is a part of the automatic speech recognition (ASR)…
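To make the TN/ITN round trip concrete, here is a minimal sketch using the nemo_text_processing package; the class names, arguments, and exact output strings reflect my understanding of that API and should be treated as assumptions.

```python
# Sketch: written-to-spoken (TN) and spoken-to-written (ITN) conversion.
# API names and outputs are assumptions based on the nemo_text_processing package.
from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

tn = Normalizer(input_case="cased", lang="en")
itn = InverseNormalizer(lang="en")

# TN: prepare written text for TTS.
print(tn.normalize("$123"))  # expected to verbalize the amount in words
# ITN: format raw ASR output for display.
print(itn.inverse_normalize("one hundred twenty three dollars", verbose=False))  # expected "$123"
```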
Speaker diarization is the process of segmenting audio recordings by speaker labels and aims to answer the question "Who spoke when?" This makes it distinct from speech recognition: before you perform speaker diarization, you know what is spoken but you don't know who spoke it. Therefore, speaker diarization is an essential feature for a speech recognition…
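To illustrate the "who spoke when" output, here is a small library-free sketch of the kind of segment structure a diarization pipeline produces and how it can be combined with word timestamps from ASR; all timestamps, labels, and words are invented placeholders.

```python
# Sketch: diarization output as (start, end, speaker) segments, merged with ASR word times.
# All values below are invented placeholders for illustration.
diarization = [
    (0.0, 4.2, "speaker_0"),
    (4.2, 9.8, "speaker_1"),
]
asr_words = [(0.5, "hello"), (1.1, "team"), (4.6, "thanks"), (5.0, "everyone")]

def speaker_at(t, segments):
    """Return the speaker label whose segment contains time t."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return "unknown"

for t, word in asr_words:
    print(f"{speaker_at(t, diarization)}: {word}")
```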
Automatic speech recognition (ASR) research generally focuses on high-resource languages such as English, which is supported by hundreds of thousands of hours of speech. Recent literature has renewed focus on more complex languages, such as Japanese. Like other Asian languages, Japanese has a vast base character set (upwards of 3,000 unique characters are used in common vernacular)…
Loss functions for training automatic speech recognition (ASR) models are not set in stone. The older rules of loss functions are not necessarily optimal. Consider connectionist temporal classification (CTC) and see how changing some of its rules enables you to reduce the GPU memory required for training and inference of CTC-based models, and more. For more information about the…
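For reference, a plain PyTorch CTC training step looks like the sketch below; this is the standard formulation, not the memory-reduced variant the post goes on to describe, and the batch, time, and vocabulary sizes are arbitrary.

```python
# Sketch: standard CTC loss on dummy acoustic-model outputs (not the memory-optimized variant).
import torch
import torch.nn as nn

batch, time_steps, vocab = 4, 100, 29  # arbitrary sizes; index 0 is the blank symbol
# (T, N, C) log-probabilities, as nn.CTCLoss expects.
log_probs = torch.randn(time_steps, batch, vocab, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, vocab, (batch, 20), dtype=torch.long)       # dummy label sequences
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in a real loop, gradients flow back into the acoustic model
```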
Learn about the latest tools, trends, and technologies for building and deploying conversational AI.
Successfully deploying an automatic speech recognition (ASR) application can be a frustrating experience. For example, it is difficult for an ASR system to correctly identify words while maintaining low latency, considering the many different dialects and pronunciations that exist.
Automatic speech recognition (ASR) is becoming part of everyday life, from interacting with digital assistants to dictating text messages. ASR research continues to progress, thanks to recent advances. This post first introduces common ASR applications and then features two startups exploring unique applications of ASR as a core product capability.
Over the past decade, AI-powered speech recognition systems have slowly become part of our everyday lives, from voice search to virtual assistants in contact centers, cars, hospitals, and restaurants. These speech recognition developments are made possible by deep learning advancements.
Speech AI can assist human agents in contact centers, power virtual assistants and digital avatars, generate live captioning in video conferencing, and much more. Under the hood, these voice-based technologies orchestrate a network of automatic speech recognition (ASR) and text-to-speech (TTS) pipelines to deliver intelligent, real-time responses.
Speech AI is the technology that makes it possible to communicate with computer systems using your voice. Commanding an in-car assistant or handling a smart home device? An AI-enabled voice interface helps you interact with devices without having to type or tap on a screen.
MLPerf benchmarks are developed by a consortium of AI leaders across industry, academia, and research labs, with the aim of providing standardized, fair, and useful measures of deep learning performance. MLPerf Training focuses on measuring the time to train a range of commonly used neural networks across a set of tasks. Lower training times are important to speed time to deployment…
Artificial intelligence (AI) has transformed synthesized speech from monotone robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers. It has never been so easy for organizations to use customized state-of-the-art speech AI technology for their specific industries and domains. Speech AI is being used to power virtual…
Major updates to Riva, an SDK for building speech AI applications, and a paid Riva Enterprise offering were announced at NVIDIA GTC 2022 last week. Several key updates to NeMo, a framework for training large language models, were also announced. Riva offers world-class accuracy for real-time automatic speech recognition (ASR) and text-to-speech (TTS) skills across multiple…
This month, NVIDIA released world-class speech-to-text models for Spanish, German, and Russian in Riva, powering enterprises to deploy speech AI applications globally. In addition, enterprises can now create expressive speech interfaces using Riva's customizable text-to-speech pipeline. NVIDIA Riva is a GPU-accelerated speech AI SDK for developing real-time applications like live captioning…
At NVIDIA GTC this November, new software tools were announced that help developers build real-time speech applications, optimize inference for a variety of use cases, optimize open-source interoperability for recommender systems, and more. Watch the keynote from CEO Jensen Huang to learn about the latest NVIDIA breakthroughs. Today, NVIDIA unveiled a new version of NVIDIA Riva with a…
In the past several months, many of us have grown accustomed to seeing our doctors over a video call. It's certainly convenient, but after the call ends, those important pieces of advice from your doctor start to slip away. What was that new medication I needed to take? Were there any side effects to watch out for? Conversational AI can help in building an application to transcribe speech as…
Virtual assistants have become part of our daily lives. We ask virtual assistants almost anything that we wonder about. In addition to providing convenience to our daily lives, virtual assistants are of tremendous help when it comes to enterprise applications. For example, we use online virtual agents to help navigate complex technical issues…
This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Domain-Specific Audio Transcriptions Using NVIDIA Riva. For part 2, see Speech Recognition: Customizing Models to Your Domain Using Transfer Learning. NVIDIA Riva is an AI speech SDK for developing real-time applications like transcription, virtual assistants…
This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Transcriptions Using NVIDIA Riva. For part 3, see Speech Recognition: Deploying Models to Production. Creating a new AI deep learning model from scratch is an extremely time- and resource-intensive process. A common solution to this problem is to employ…
This post is part of a series about generating accurate speech transcription. For part 2, see Speech Recognition: Customizing Models to Your Domain Using Transfer Learning. For part 3, see Speech Recognition: Deploying Models to Production. Every day, millions of audio minutes are produced across several industries such as telecommunications, finance, and Unified Communications as a Service…
Researchers from around the world working on speech applications are gathering this month for INTERSPEECH, a conference focused on the latest research and technologies in speech processing. NVIDIA researchers will present papers on groundbreaking research in speech recognition and speech synthesis. Conversational AI research is fueling innovations in speech processing that help computers…
NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). The primary objective of NeMo is to help researchers from industry and academia reuse prior work (code and pretrained models) and make it easier to create new conversational AI models. NeMo is an open-source project…
Today, NVIDIA released the Riva 1.0 Beta, which includes an end-to-end workflow for building and deploying real-time conversational AI apps, such as transcription, virtual assistants, and chatbots. Riva is an accelerated SDK for multimodal conversational AI services that delivers real-time performance on NVIDIA GPUs. This release of Riva includes new pretrained models for conversational AI and…
This is an updated version of Neural Modules for Fast Development of Speech and Language Models. This post upgrades the NeMo diagram with PyTorch and PyTorch Lightning support and updates the tutorial with the new code base. As a researcher building state-of-the-art speech and language models, you must be able to quickly experiment with novel network architectures.
Hospitals today are seeking to overhaul their existing digital infrastructure to improve their internal processes, deliver better patient care, and reduce operational expenses. Such a transition is required if hospitals are to cope with the needs of a burgeoning human population, accumulation of medical patient data, and a pandemic. The goal is not only to digitize existing infrastructure but…
Seamlessly deploying AI services at scale in production is as critical as creating the most accurate AI model. Conversational AI services, for example, need multiple models handling functions of automatic speech recognition (ASR), natural language understanding (NLU), and text-to-speech (TTS) to complete the application pipeline. To provide real-time conversation to users…
In automatic speech recognition (ASR), one widely used method combines traditional machine learning with deep learning. In ASR flows of this type, audio features are first extracted from the raw audio. Features are then passed into an acoustic model. The acoustic model is a neural net trained on transcribed data to extract phoneme probabilities from the features. A phoneme is a single…
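The flow from raw audio to per-frame phoneme probabilities can be sketched as follows; the tiny acoustic model and the 40-phoneme inventory are placeholders for illustration, not a production ASR system.

```python
# Sketch: feature extraction followed by an acoustic model emitting phoneme probabilities.
# The toy model and phoneme count are illustrative stand-ins.
import torch
import torchaudio

waveform = torch.randn(1, 16000)  # stand-in for one second of 16 kHz audio
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
features = mel(waveform)                      # (channel, n_mels, frames)
frames = features.squeeze(0).transpose(0, 1)  # (frames, n_mels)

num_phonemes = 40  # placeholder phoneme inventory size
acoustic_model = torch.nn.Sequential(
    torch.nn.Linear(80, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, num_phonemes),
)
phoneme_probs = acoustic_model(frames).softmax(dim=-1)  # per-frame phoneme probabilities
print(phoneme_probs.shape)  # (frames, num_phonemes)
```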
Speech processing is compute-intensive and requires a powerful and flexible platform to power modern conversational AI applications. It seemed natural to combine the de facto standard platform for automatic speech recognition (ASR), the Kaldi Speech Recognition Toolkit, with the power and flexibility of NVIDIA GPUs. Kaldi adopted GPU acceleration for training workloads early on.
The artificial production of human speech, also known as speech synthesis, has always been a fascinating field for researchers, including our AI team at Axel Springer SE. For a long time, people have worked on creating text-to-speech (TTS) systems that reach human level. Following the field's transition to deep learning with the introduction of Google WaveNet in 2016, it has almost reached this…
In the past few years, voice-based interaction has become a feature of many industrial products. Voice platforms like Amazon Alexa, Google Home, Xiaomi Xiaz, Yandex Alice, and other in-home voice assistants provide easy-to-install, smart home technologies to even the least technologically savvy consumers. The fast adoption and rising performance of voice platforms drive interest in smart…
Natural language processing (NLP) is one of the most challenging tasks for AI because it needs to understand context, phonics, and accent to convert human speech into text. Building this AI workflow starts with training a model that can understand and process spoken language to text. BERT is one of the best models for this task. Instead of starting from scratch to build state-of-the-art…
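As a generic illustration of starting from a pretrained BERT checkpoint rather than training from scratch, here is a sketch using the Hugging Face Transformers library; this is a stand-in for whichever toolkit the post actually uses, and the two-label setup is an assumption.

```python
# Sketch: reuse a pretrained BERT encoder for a downstream classification task.
# Hugging Face Transformers is used here for illustration; num_labels is an assumption.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Book me a flight to Denver tomorrow.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities; fine-tuning on labeled data comes next
```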
OpenAI researchers recently released a paper describing the development of GPT-3, a state-of-the-art language model made up of 175 billion parameters. For comparison, the previous version, GPT-2, was made up of 1.5 billion parameters. The largest Transformer-based language model was released by Microsoft earlier this month and is made up of 17 billion parameters. "GPT-3 achieves strong…
Imagine an AI program that can understand language better than humans can. Imagine building your own personal Siri or Google Search for a customized domain or application. Google BERT (Bidirectional Encoder Representations from Transformers) provides a game-changing twist to the field of natural language processing (NLP). BERT runs on supercomputers powered by NVIDIA GPUs to train its…
Conversational AI is the technology that allows us to communicate with machines as we do with other people. With the advent of sophisticated deep learning models, human-machine communication has risen to unprecedented levels. However, these models are compute intensive, and hence require optimized code for flawless interaction. In this post…
In simple terms, conversational AI is the use of natural language to communicate with machines. Deep learning applications in conversational AI are growing every day, from voice assistants and chatbots to question answering systems that enable customer self-service. The range of industries adopting conversational AI in their solutions is wide, with diverse domains extending from finance to…
As computers and other personal devices have become increasingly prevalent, interest in conversational AI has grown due to its multitude of potential applications in a variety of situations. Each conversational AI framework comprises several basic modules, such as automatic speech recognition (ASR), and the models for these need to be lightweight in order to be effectively deployed on…
Training with larger batches is a straightforward way to scale training of deep neural networks to larger numbers of accelerators and reduce the training time. However, as the batch size increases, numerical instability can appear in the training process. The purpose of this post is to provide an overview of one class of solutions to this problem: layer-wise adaptive optimizers, such as LARS, LARC…
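The core idea behind these layer-wise adaptive optimizers can be sketched with the LARS trust ratio; the coefficient values below are illustrative defaults rather than prescriptions, and momentum is omitted for brevity.

```python
# Sketch: LARS-style layer-wise trust ratio that rescales each layer's update.
# trust_coefficient and weight_decay are illustrative values.
import torch

def lars_update(param: torch.Tensor, grad: torch.Tensor, base_lr: float,
                trust_coefficient: float = 0.001, weight_decay: float = 1e-4) -> torch.Tensor:
    """Return the update to add to `param` for one LARS step (momentum omitted)."""
    w_norm = param.norm()
    g_norm = grad.norm()
    if w_norm > 0 and g_norm > 0:
        # Layer-wise rate: layers with small gradients relative to their weights take larger steps.
        local_lr = trust_coefficient * w_norm / (g_norm + weight_decay * w_norm)
    else:
        local_lr = torch.tensor(1.0)
    return -base_lr * local_lr * (grad + weight_decay * param)
```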
Recently, NVIDIA achieved GPU-accelerated speech-to-text inference with exciting performance results. That post described the general process of the Kaldi ASR pipeline and indicated which of its elements the team accelerated, that is, implementing the decoder on the GPU and taking advantage of Tensor Cores in the acoustic model.
This post has been updated with Announcing NVIDIA NeMo: Fast Development of Speech and Language Models. The new version adds sections on pretrained models in NGC and fine-tuning models on custom datasets, upgrades the NeMo diagram with the text-to-speech collection, and replaces the AN4 dataset in the example with the LibriSpeech dataset. As a researcher building state-of-the…
NVIDIA DGX SuperPOD trains BERT-Large in just 47 minutes and trains GPT-2 8B, the largest Transformer-based network ever at 8.3 billion parameters. Conversational AI is an essential building block of human interactions with intelligent machines and applications, from robots and cars to home assistants and mobile apps. Getting computers to understand human languages, with all their nuances…
Large scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought about exciting leaps in state-of-the-art accuracy for many natural language understanding (NLU) tasks. Since its release in October 2018, BERT1 (Bidirectional Encoder Representations from Transformers) remains one of the most popular language models and still delivers state-of-the-art accuracy at the time of writing2.
Think of a sentence and repeat it aloud three times. If someone recorded this speech and performed a point-by-point comparison, they would find that no single utterance exactly matched the others. Similar to different resolutions, angles, and lighting conditions in imagery, human speech varies with respect to timing, pitch, amplitude, and even how base units of speech, phonemes and morphemes…
Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is far away from the microphone. At Baidu we are working to enable truly ubiquitous, natural speech interfaces. In order to achieve this, we must improve the accuracy of speech recognition, especially in these challenging environments.
]]>