With the recent advancements in generative AI and vision foundation models, VLMs present a new wave of visual computing in which models are capable of highly sophisticated perception and deep contextual understanding. These intelligent solutions offer a promising means of enhancing semantic comprehension in XR settings. By integrating VLMs, developers can significantly improve how XR…
NVIDIA has consistently developed automatic speech recognition (ASR) models that set the benchmark in the industry. Earlier versions of NVIDIA Riva, a collection of GPU-accelerated speech and translation AI microservices for ASR, TTS, and NMT, support English-Spanish and English-Japanese code-switching ASR models based on the Conformer architecture, along with a model supporting multiple…
Building a multimodal retrieval-augmented generation (RAG) system is challenging. The difficulty comes from capturing and indexing information across multiple modalities, including text, images, tables, audio, video, and more. In our previous post, An Easy Introduction to Multimodal Retrieval-Augmented Generation, we discussed how to tackle text and images. This post extends this conversation…
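As a rough illustration of what a shared index across modalities can look like (not the specific pipeline from the post), the following sketch embeds text chunks and images into one vector space with a CLIP checkpoint via the sentence-transformers package; the file paths and example strings are hypothetical.

```python
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style model that maps both text and images into one embedding space.
model = SentenceTransformer("clip-ViT-B-32")

# Hypothetical corpus: a text chunk and a chart image extracted from a document.
text_chunks = ["Q3 revenue grew 12% quarter over quarter."]
images = [Image.open("charts/q3_revenue.png")]  # hypothetical path

corpus_embeddings = torch.cat([
    model.encode(text_chunks, convert_to_tensor=True),
    model.encode(images, convert_to_tensor=True),
])

# Rank all items (text and image) against a user query by cosine similarity.
query_embedding = model.encode("How did revenue change last quarter?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)
print(scores)
```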
When interfacing with generative AI applications, users have multiple communication options: text, voice, or digital avatars. Traditional chatbot or copilot applications have text interfaces where users type in queries and receive text-based responses. For hands-free communication, speech AI technologies like automatic speech recognition (ASR) and text-to-speech (TTS) facilitate…
Providing customers with quality service remains a top priority for businesses across industries, from answering questions and troubleshooting issues to facilitating online orders. As businesses scale operations and expand offerings globally to compete, the demand for seamless customer service grows exponentially. Searching knowledge base articles or navigating complex phone trees can be a…
NVIDIA NIM, part of NVIDIA AI Enterprise, provides containers to self-host GPU-accelerated inferencing microservices for pretrained and customized AI models across clouds, data centers, and workstations. NIM microservices for speech and translation are now available. The new speech and translation microservices leverage NVIDIA Riva and provide automatic speech recognition (ASR)…
Speech and translation AI models developed at NVIDIA are pushing the boundaries of performance and innovation. The NVIDIA Parakeet automatic speech recognition (ASR) family of models and the NVIDIA Canary multilingual, multitask ASR and translation model currently top the Hugging Face Open ASR Leaderboard. In addition, a multilingual P-Flow-based text-to-speech (TTS) model won the LIMMITS '24…
Generative AI has the potential to transform every industry. Human workers are already using large language models (LLMs) to explain, reason about, and solve difficult cognitive tasks. Retrieval-augmented generation (RAG) connects LLMs to data, expanding the usefulness of LLMs by giving them access to up-to-date and accurate information. Many enterprises have already started to explore how…
At the core of understanding people correctly and having natural conversations is automatic speech recognition (ASR). To make customer-led voice assistants and automate customer service interactions over the phone, companies must solve the unique challenge of gaining a caller's trust through qualities such as understanding, empathy, and clarity. Telephony-bound voice is inherently challenging…
Convai is a versatile developer platform for designing characters with advanced multimodal perception abilities. These characters are designed to integrate seamlessly into both the virtual and real worlds. Whether you're a creator, game designer, or developer, Convai enables you to quickly modify a non-playable character (NPC), from backstory and knowledge to voice and personality.
NVIDIA today unveiled major upgrades to the NVIDIA Avatar Cloud Engine (ACE) suite of technologies, bringing enhanced realism and accessibility to AI-powered avatars and digital humans. These latest animation and speech capabilities enable more natural conversations and emotional expressions. Developers can now easily implement and scale intelligent avatars across applications using new…
Meetings are the lifeblood of an organization. They foster collaboration and informed decision-making. They eliminate silos through brainstorming and problem-solving. And they further strategic goals and planning. Yet leading meetings that accomplish these goals, especially those involving cross-functional teams and external participants, can be challenging. A unique blend of people…
The integration of speech and translation AI into our daily lives is rapidly reshaping our interactions, from virtual assistants to call centers and augmented reality experiences. Speech AI Day provided valuable insights into the latest advancements in speech AI, showcasing how this technology addresses real-world challenges. In this first of three Speech AI Day sessions…
Learn how to build and deploy production-quality conversational AI apps with real-time transcription and NLP.
From start-ups to large enterprises, businesses use cloud marketplaces to find the new solutions needed to quickly transform their businesses. Cloud marketplaces are online storefronts where customers can purchase software and services with flexible billing models, including pay-as-you-go, subscriptions, and privately negotiated offers. Businesses further benefit from committed spending at…
Audio can include a wide range of sounds, from human speech to non-speech sounds like barking dogs and sirens. Applications designed to be accessible to people with hearing difficulties should be able to both recognize sounds and understand speech. Such technology would help deaf or hard-of-hearing individuals with visualizing speech, like human conversations and non-speech…
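One way to surface non-speech sounds to users is to run audio through an off-the-shelf audio-classification model. The sketch below is illustrative only; it assumes the Hugging Face Transformers pipeline with an AudioSet-trained checkpoint, the audio file is hypothetical, and this is not necessarily the approach taken in the post.

```python
# Illustrative sketch: tagging non-speech sounds in an audio clip with an
# off-the-shelf audio-classification model (an AudioSet-trained Audio
# Spectrogram Transformer from the Hugging Face Hub).
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# "doorbell.wav" is a hypothetical local file.
for prediction in classifier("doorbell.wav", top_k=3):
    print(f"{prediction['label']}: {prediction['score']:.2f}")
```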
Voice-enabled technology is becoming ubiquitous. But many are being left behind by an anglocentric and demographically biased algorithmic world. Mozilla Common Voice (MCV) and NVIDIA are collaborating to change that by partnering on a public crowdsourced multilingual speech corpus, now the largest of its kind in the world, and open-source pretrained models. It is now easier than ever before to…
According to Gartner, "Nearly half of digital workers struggle to find the data they need to do their jobs, and close to one-third have made a wrong business decision due to lack of information awareness."1 To address this challenge, more and more enterprises are deploying AI in customer service, as it helps provide more efficient, information-based, personalized services.
The telecom sector is transforming how communication happens. Striving to provide reliable, uninterrupted service, businesses are tackling the challenge of delivering an optimal customer experience, something many long-time customers of large telecom service providers do not have. Take Jack, for example. His call was on hold for 10 minutes…
Generative AI technologies are revolutionizing how games are conceived, produced, and played. Game developers are exploring how these technologies impact 2D and 3D content-creation pipelines during production. Part of the excitement comes from the ability to create gaming experiences at runtime that would have been impossible using earlier solutions. The creation of non-playable characters…
Join Infosys, NVIDIA, and Quantiphi on June 7 to learn how to use speech and translation AI to improve agent-assist solutions in multiple languages.
Agent-assist technology uses AI and ML to provide facts and make real-time suggestions that help human agents across retail, telecom, and other industries conduct conversations with customers.
The telecommunication industry has seen a proliferation of AI-powered technologies in recent years, with speech recognition and translation leading the charge. Multilingual AI virtual assistants, digital humans, chatbots, agent assists, and audio transcription are revolutionizing the telco industry. Businesses are implementing AI in call centers to address incoming requests…
Join Infosys, Quantiphi, Talkmap, and NVIDIA on May 31 for a live webinar to learn how telecommunications companies are using AI to improve operational efficiency and enhance customer engagement.
When interacting with a virtual assistant, you give a command and receive a verbal response. The technology powering this generated voice response is known as text-to-speech (TTS). TTS applications are highly useful as they enable greater content accessibility for those who use assistive devices. With the latest TTS techniques, you can generate a synthetic voice from only a few minutes of…
This hands-on workshop guides you through the process of voice-enabling your product, from familiarizing yourself with NVIDIA Riva to assessing the costs and resources required for your project.
Project Mellon is a lightweight Python package capable of harnessing the heavyweight power of speech AI (NVIDIA Riva) and large language models (LLMs, via the NVIDIA NeMo service) to simplify user interactions in immersive environments. NVIDIA announced at NVIDIA GTC 2023 that developers can start testing Project Mellon to explore creating hands-free extended reality (XR) experiences controlled by…
At NVIDIA GTC 2023, NVIDIA showed how AI workflows can be leveraged to help you accelerate the development of AI solutions that address a range of use cases. AI workflows are cloud-native, packaged reference examples showing how NVIDIA AI frameworks can be used to efficiently build AI solutions such as intelligent virtual assistants, digital fingerprinting for cybersecurity…
Learn about advancements in video conferencing that have transformed how we communicate.
Learn about the latest tools, trends, and technologies for building and deploying conversational AI.
Explore the latest advances in accurate and customizable automatic speech recognition, multi-language translation, and text-to-speech.
Over 55% of the global population uses social media, easily sharing online content with just one click. While connecting with others and consuming entertaining content, you can also spot harmful narratives posing real-life threats. That's why Ammar Haris, VP of Engineering at Pendulum, wants his company's AI to help clients gain deeper insight into the harmful content being generated…
Learn to build an engaging and intelligent virtual assistant with NVIDIA AI workflows powered by NVIDIA Riva in this free hands-on lab from NVIDIA LaunchPad.
Speech AI applications, from call centers to virtual assistants, rely heavily on automatic speech recognition (ASR) and text-to-speech (TTS). ASR processes an audio signal and transcribes the audio to text. Speech synthesis, or TTS, generates high-quality, natural-sounding audio from text in real time. The challenge of speech AI is to achieve high accuracy and meet the latency requirements…
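To make the ASR half of such a pipeline concrete, here is a minimal client-side sketch assuming the nvidia-riva-client Python package and a Riva server running at localhost:50051; class and parameter names follow that package but may differ between Riva releases, and the audio file is hypothetical.

```python
import riva.client

# Connect to a (hypothetical) locally running Riva server.
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

# Offline (batch) recognition configuration; field names may vary by Riva version.
config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# "meeting.wav" is a hypothetical 16-bit PCM WAV recording.
with open("meeting.wav", "rb") as fh:
    audio_bytes = fh.read()

response = asr_service.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```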
Join this webinar on January 25 and learn how to build a voice-enabled intelligent virtual assistant to improve customer experiences at contact centers.
From taking your order and serving you food in a restaurant to playing poker with you, service robots are becoming increasingly prevalent. Globally, you can find these service robots at hospitals, airports, and retail stores. According to Gartner, by 2030, 80% of humans will engage with smart robots daily, due to smart robot advancements in intelligence, social interactions…
As the global service economy grows, companies rely increasingly on contact centers to drive better customer experiences, increase customer satisfaction, and lower costs with increased efficiencies. Customer demand has increased far more rapidly than contact center employment ever could. Combined with the high agent churn rate, customer demand creates a need for more automated real-time customer…
This post was updated in March 2023. Speech AI is used in a variety of applications, including contact centers' agent assists for empowering human agents, voice interfaces for intelligent virtual assistants (IVAs), and live captioning in video conferencing. To support these features, speech AI technology includes automatic speech recognition…
Speech AI is the ability of intelligent systems to communicate with users through a voice-based interface, which has become ubiquitous in everyday life. People regularly interact with smart home devices, in-car assistants, and phones through speech. Speech interface quality has improved by leaps and bounds in recent years, making these interfaces a much more pleasant, practical, and natural experience than just a…
Learn how to build, train, customize, and deploy a GPU-accelerated automatic speech recognition service with NVIDIA Riva in this self-paced course.
Build better GPU-accelerated speech AI applications with the latest NVIDIA Riva updates, including enterprise support.
When examining an intricate speech AI robotic system, it's easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, "Any sufficiently advanced technology is indistinguishable from magic." From accepting natural-language commands to safely interacting in real time with its environment and the humans around it, today's speech AI robotics systems can perform tasks to…
At GTC 2022, NVIDIA introduced enhancements to AI frameworks for building real-time speech AI applications, designing high-performing recommenders at scale, applying AI to cybersecurity challenges, creating AI-powered medical devices, and more. The showcased real-world, end-to-end AI frameworks highlighted the customers and partners leading the way in their industries and domains.
Successfully deploying an automatic speech recognition (ASR) application can be a frustrating experience. For example, it is difficult for an ASR system to correctly identify words while maintaining low latency, considering the many different dialects and pronunciations that exist.
Speech AI can assist human agents in contact centers, power virtual assistants and digital avatars, generate live captioning in video conferencing, and much more. Under the hood, these voice-based technologies orchestrate a network of automatic speech recognition (ASR) and text-to-speech (TTS) pipelines to deliver intelligent, real-time responses.
Learn how NVIDIA Inception member Minerva CQ is using NVIDIA Riva to deliver faster, personalized experiences within a global EV charging and electric mobility company.
Major updates to Riva, an SDK for building speech AI applications, and a paid Riva Enterprise offering were announced at NVIDIA GTC 2022 last week. Several key updates to NeMo, a framework for training large language models, were also announced. Riva offers world-class accuracy for real-time automatic speech recognition (ASR) and text-to-speech (TTS) skills across multiple…
Join us at GTC, March 21-24, to explore the latest technology and research across AI, computer vision, data science, robotics, and more! With over 900 options to choose from, our NVIDIA experts put together some can't-miss sessions to help get you started: How to Design Collaborative AR and VR Worlds in Omniverse, with Omer Shapira, Senior Engineer, Omniverse…
Join us at GTC, March 21-24, to explore the latest technology and research across AI, computer vision, data science, robotics, and more! With over 900 options to choose from, our NVIDIA experts put together some can't-miss sessions to help get you started: Creating the Future: Creating the World's Largest Synthetic Object Recognition Dataset for Industry…
Our weekly roundup covers the most recent software updates, learning resources, events, and notable news. Software releases: The redesigned nvCOMP 2.2.0 interface provides a single nvcompManagerBase object that can do compression and decompression. Users can now decompress nvcomp-compressed files without knowing how they were compressed. The interface also can…
This past year, NVIDIA announced several major breakthroughs in conversational AI for building and deploying automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications. To get developers started with some quick examples in a cloud GPU-accelerated environment, NVIDIA Deep Learning Institute (DLI) is offering three fast, free, self-paced courses.
This month, NVIDIA released world-class speech-to-text models for Spanish, German, and Russian in Riva, powering enterprises to deploy speech AI applications globally. In addition, enterprises can now create expressive speech interfaces using Riva's customizable text-to-speech pipeline. NVIDIA Riva is a GPU-accelerated speech AI SDK for developing real-time applications like live captioning…
At NVIDIA GTC this November, new software tools were announced that help developers build real-time speech applications, optimize inference for a variety of use cases, optimize open-source interoperability for recommender systems, and more. Watch the keynote from CEO Jensen Huang to learn about the latest NVIDIA breakthroughs. Today, NVIDIA unveiled a new version of NVIDIA Riva with a…
NVIDIA recently unveiled new breakthroughs in NVIDIA Riva for speech AI and NVIDIA NeMo for large-scale language modeling. Riva is a GPU-accelerated speech AI SDK for enterprises to generate expressive, human-like speech for their brand and virtual assistants. NeMo is an accelerated training framework for speech and NLU that now has the capability to develop large-scale language models…
Conversational AI is a set of technologies enabling human-like interactions between humans and devices based on the most natural interfaces for us: speech and natural language. Systems based on conversational AI can understand commands by recognizing speech and text, translating on the fly between different languages…
In the past several months, many of us have grown accustomed to seeing our doctors over a video call. It's certainly convenient, but after the call ends, those important pieces of advice from your doctor start to slip away. What was that new medication I needed to take? Were there any side effects to watch out for? Conversational AI can help build an application to transcribe speech as…
Virtual assistants have become part of our daily lives. We ask virtual assistants almost anything that we wonder about. In addition to providing convenience in our daily lives, virtual assistants are of tremendous help when it comes to enterprise applications. For example, we use online virtual agents to help navigate complex technical issues…
There is a high chance that you have asked your smart speaker a question like, "How tall is Mount Everest?" If you did, it probably said, "Mount Everest is 29,032 feet above sea level." Have you ever wondered how it found an answer for you? Question answering (QA) is loosely defined as a system consisting of information retrieval (IR)…
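As a toy example of the reading-comprehension stage that typically follows retrieval, the sketch below uses the Hugging Face Transformers question-answering pipeline with its default model to pull an answer span out of a passage; it is an illustration, not the smart-speaker stack described above.

```python
# The "reader" stage of a QA system: given a passage that a retrieval step has
# already found, an extractive model pulls out the answer span.
from transformers import pipeline

qa = pipeline("question-answering")

context = (
    "Mount Everest is Earth's highest mountain, "
    "with its summit at 29,032 feet above sea level."
)
result = qa(question="How tall is Mount Everest?", context=context)

print(result["answer"], result["score"])
```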
This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Domain-Specific Audio Transcriptions Using NVIDIA Riva. For part 2, see Speech Recognition: Customizing Models to Your Domain Using Transfer Learning. NVIDIA Riva is an AI speech SDK for developing real-time applications like transcription, virtual assistants…
This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Transcriptions Using NVIDIA Riva. For part 3, see Speech Recognition: Deploying Models to Production. Creating a new AI deep learning model from scratch is an extremely time- and resource-intensive process. A common solution to this problem is to employ…
This post is part of a series about generating accurate speech transcription. For part 2, see Speech Recognition: Customizing Models to Your Domain Using Transfer Learning. For part 3, see Speech Recognition: Deploying Models to Production. Every day, millions of audio minutes are produced across industries such as telecommunications, finance, and Unified Communications as a Service…
Soon, the industrial internet will have hundreds of billions of connected industrial assets continuously operating at computer speed. This will result in large amounts of data from shop-floor machines and sensors. Analyzing operations data to predict operational anomalies, machine failures, and product quality, while improving factory floor operations with industrial AI, could yield productivity…
The audio and video quality of real-time communication applications such as virtual collaboration and content creation applications is the true gauge of users' real-time communication experience. They rely heavily on network bandwidth and user equipment quality. Narrow network bandwidth and low-quality equipment produce unstable and noisy audio and video outputs. This problem is often…
As the world continues to evolve and become more digital, conversational AI is increasingly used as a means for automation. This technology has been shown to improve customer experience and efficiency across various industries and applications. The NVIDIA Deep Learning Institute is hosting a workshop on how to build a conversational AI service using the NVIDIA Riva framework.
Deep learning is proving to be a powerful tool when it comes to high-quality synthetic speech development and customization. A Toronto-based startup and NVIDIA Inception member, Resemble AI is upping the stakes with a new generative voice tool able to create high-quality synthetic AI voices. The technology can generate cross-lingual, natural-sounding voices in over 50 of the most…
NVIDIA and Mozilla are proud to announce the latest release of the Common Voice dataset, with over 13,000 hours of crowd-sourced speech data and another 16 languages added to the corpus. Common Voice is the world's largest open data voice dataset, designed to democratize voice technology. It is used by researchers, academics, and developers around the world.
Today, NVIDIA announced new pretrained models and the general availability of TAO Toolkit 3.0, a core component of the NVIDIA Train, Adapt, and Optimize (TAO) platform-guided workflow for creating AI. The new release includes a variety of highly accurate and performant pretrained models in computer vision and conversational AI, as well as a set of powerful productivity features that boost AI…
NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). The primary objective of NeMo is to help researchers from industry and academia reuse prior work (code and pretrained models) and make it easier to create new conversational AI models. NeMo is an open-source project…
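A minimal sketch of that reuse, assuming the NeMo ASR collection is installed (nemo_toolkit[asr]): the pretrained model name is one of NeMo's published English checkpoints, the audio file is hypothetical, and the transcribe() signature has shifted between NeMo releases, so treat this as illustrative.

```python
import nemo.collections.asr as nemo_asr

# Load one of NeMo's published pretrained English CTC models.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# "sample.wav" is a hypothetical 16 kHz mono recording.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```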
NVIDIA recently released NVIDIA Riva with world-class speech recognition capability for enterprises to generate highly accurate transcriptions, and NVIDIA NeMo 1.0, which includes new state-of-the-art speech and language models for democratizing and accelerating conversational AI research. NVIDIA Riva world-class speech recognition is an out-of-the-box speech service that can be easily…
At GTC 2021, NVIDIA announced new software tools to help developers build optimized conversational AI, recommender, and video solutions. Watch the keynote from CEO Jensen Huang for insights on all of the latest GPU technologies. Today NVIDIA announced major conversational AI capabilities in NVIDIA Riva that will help enterprises build engaging and accurate applications for their…
Conversational AI is opening new ways for enterprises to interact with customers in every industry using applications like real-time transcription, translation, chatbots, and virtual assistants. Building domain-specific interactive applications requires state-of-the-art models, optimizations for real-time performance, and tools to adapt those models with your data. This week at GTC…
Many of you may not recognize my company, Ribbon Communications. We are best known for building and securing large telecom networks for communication service providers (also known as phone companies). However, there's a good chance that in the next day or two, you'll place a call that traverses a piece of our gear somewhere in the world. In addition to service providers…
The NVIDIA NGC catalog is a hub for GPU-optimized deep learning, machine learning, and high-performance computing (HPC) applications. With highly performant software containers, pretrained models, industry-specific SDKs, and Helm charts, the content available in the catalog helps you simplify and accelerate your end-to-end workflows. The NVIDIA NGC team works closely with our internal and…
Today, NVIDIA released the Riva 1.0 Beta, which includes an end-to-end workflow for building and deploying real-time conversational AI apps, such as transcription, virtual assistants, and chatbots. Riva is an accelerated SDK for multimodal conversational AI services that delivers real-time performance on NVIDIA GPUs. This release of Riva includes new pretrained models for conversational AI and…
At GTC 2020, NVIDIA announced updates to 80 SDKs, including tools to help you build AI-powered video streaming solutions, conversational AI, recommendation systems, and more. Today, we announced NVIDIA Maxine, a cloud-native video streaming AI platform for services such as video conferencing. It includes state-of-the-art AI models and optimized pipelines that can run several…
This is an updated version of Neural Modules for Fast Development of Speech and Language Models. This post upgrades the NeMo diagram with PyTorch and PyTorch Lightning support and updates the tutorial with the new code base. As a researcher building state-of-the-art speech and language models, you must be able to quickly experiment with novel network architectures.
NVIDIA Riva is an application framework that provides several pipelines for accomplishing conversational AI tasks. Generating high-quality, natural-sounding speech from text with low latency, also known as text-to-speech (TTS), can be one of the most computationally challenging of those tasks. In this post, we focus on optimizations made to a TTS pipeline in Riva, as shown in Figure 1.
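From the client's point of view, the TTS pipeline boils down to sending text and receiving synthesized audio. The sketch below assumes the nvidia-riva-client Python package and a Riva server at localhost:50051; the voice name is a placeholder, and method and parameter names may vary by Riva version.

```python
import wave

import riva.client

# Connect to a (hypothetical) locally running Riva server.
auth = riva.client.Auth(uri="localhost:50051")
tts_service = riva.client.SpeechSynthesisService(auth)

response = tts_service.synthesize(
    text="Welcome to the speech AI demo.",
    voice_name="English-US.Female-1",  # placeholder; use a voice from your deployment
    language_code="en-US",
    sample_rate_hz=44100,
)

# response.audio holds raw 16-bit PCM samples; wrap them in a WAV container.
with wave.open("welcome.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)      # 16-bit samples
    out.setframerate(44100)
    out.writeframes(response.audio)
```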
At GTC 2020, NVIDIA announced and shipped a range of new AI SDKs, enabling developers to support the new Ampere architecture. For the first time, developers have the tools to build end-to-end deep learning-based pipelines for conversational AI and recommendation systems. Today, NVIDIA announced Riva, a fully accelerated application framework for building multimodal conversational AI services.
]]>