
    NVIDIA NeMo Retriever Delivers Accurate Multimodal PDF Data Extraction 15x Faster

    Enterprises are generating and storing more multimodal data than ever before, yet traditional retrieval systems remain largely text-focused. While they can surface insights from written content, they aren’t extracting critical information embedded in tables, charts, and infographics—often the most information-dense elements of a document. 

    Without a multimodal retrieval system, retrieval-augmented generation (RAG) users risk missing key insights hidden in these complex data formats, creating a significant blind spot in enterprise knowledge retrieval. Enter the NVIDIA AI Blueprint for RAG.

    In this post, we’ll explore the latest advancements in the AI Blueprint for RAG and dive deep into the core technology under the hood—NVIDIA NeMo Retriever. Discover the latest benchmarks and see how NVIDIA partners are using this blueprint to efficiently extract, index, and query multimodal data and build agentic AI platforms.

    Inside the blueprint: Fast data extraction and accurate retrieval

    The AI Blueprint for RAG is a GPU-accelerated reference example that enables developers to build scalable, context-aware retrieval pipelines tailored to enterprise data. Linking LLMs with an organization’s existing knowledge base improves both accuracy and throughput, which are critical for modern generative AI applications. This section dives into the key technology driving efficient and scalable data extraction, optimized retrieval performance, and advanced enterprise capabilities.

    Multimodal data extraction at scale

    Instead of stopping at text alone, the blueprint can ingest and extract a variety of data types such as charts, tables, and infographics. These diverse modalities are handled through NVIDIA NIM—state-of-the-art models optimized on NVIDIA GPUs—enabling organizations to capture insights from a wide range of enterprise documents. 

    Benchmarks of the new NeMo Retriever extraction, embedding, and reranking microservices, built with NIM, show a 15x throughput increase in multimodal data extraction. This speeds up the end-to-end retrieval workflow and enables businesses to continuously pull from the most up-to-date information for real-time decision-making (Figure 1).

    Figure 1. NeMo Retriever extraction microservices extract more pages per second than an OSS alternative, resulting in 15x improved throughput

    Requirements: Pages per second, evaluated on a publicly available dataset of PDFs consisting of text, charts, and tables. NIM On includes the NeMo Retriever microservices nv-yolox-structured-image-v1, nemoretriever-page-elements-v1, nemoretriever-graphic-elements-v1, nemoretriever-table-structure-v1, PaddleOCR, and nv-llama3.2-embedqa-1b-v2, compared to NIM Off, an OSS alternative; HW: 1x NVIDIA H100

    To further enhance retrieval performance, the blueprint incorporates NeMo Retriever parse, an advanced VLM-based OCR inference microservice for text and table extraction. This microservice leverages a purpose-built autoregressive VLM to understand and preserve the semantic structure of text and tables, optimizing content for downstream retrieval. Designed for document transcription from images, the NIM microservice extracts text in reading order using Commercial RADIO (C-RADIO) for visual feature extraction and mBART for text generation. 

    Additionally, it identifies bounding boxes for text regions, classifies page artifacts (headers, paragraphs, and captions, for example) and outputs the structured text in markdown format. This approach retains both spatial layout and semantic structure, making transcriptions more organized and context-aware, ultimately augmenting retrieval capabilities. 
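The exact output schema of the parse microservice isn't shown in this post, so the element shape below is a hypothetical stand-in. This sketch only illustrates the idea described above: classified page elements with bounding boxes can be flattened into reading-order markdown so downstream retrieval sees structured, context-aware text.

```python
# Illustrative only: the real NeMo Retriever parse output schema is not
# published in this post; the "elements" dicts below are a hypothetical shape.
def elements_to_markdown(elements):
    """Flatten classified page elements into reading-order markdown."""
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    lines = []
    for el in ordered:
        if el["type"] == "header":
            lines.append(f"## {el['text']}")
        elif el["type"] == "caption":
            lines.append(f"*{el['text']}*")
        else:  # paragraphs and other body text
            lines.append(el["text"])
    return "\n\n".join(lines)

page = [
    {"type": "paragraph", "text": "Quarterly revenue grew 12%.", "bbox": [0, 200, 600, 240]},
    {"type": "header", "text": "Q3 Results", "bbox": [0, 100, 600, 140]},
    {"type": "caption", "text": "Figure 1: Revenue by region", "bbox": [0, 400, 600, 420]},
]
md = elements_to_markdown(page)
```

Preserving headers and captions as distinct markdown artifacts keeps the spatial and semantic structure available to chunking and embedding steps.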

    The blueprint also leverages world-class NeMo Retriever embedding and reranking microservices, delivering 3x better embedding and 1.6x better reranking throughput over standard FP16 acceleration (Figure 2). This improvement enables developers to process larger datasets more efficiently and build more responsive AI-powered search and retrieval systems.

    For example, a customer support chatbot can quickly surface the most accurate troubleshooting guides from millions of support documents, delivering precise answers in real time, reducing customer wait times, and improving resolution efficiency.

    Figure 2. The NeMo Retriever embedding NIM generates more embeddings per second than an OSS alternative embedding model, resulting in 3x improved embedding throughput; the NeMo Retriever reranking NIM generates more rerankings per second than an OSS alternative reranking model, resulting in 1.6x improved reranking throughput

    Requirements: The chart on the left: 1x H100 SXM; passage token length: 512; batch size: 64; concurrent client requests: 5; NIM Off is an OSS alternative (FP16), while NIM On includes the NeMo Retriever embedding NIM (FP8). The chart on the right: 1x H100 SXM; passage token length: 512; batch size: 40; concurrent client requests: 5; NIM Off is an OSS alternative (FP16), while NIM On includes the NeMo Retriever reranking NIM (FP8)
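NeMo Retriever microservices expose OpenAI-compatible APIs, so an embeddings request follows the standard shape. As a minimal sketch: the `input_type` and `truncate` fields follow NVIDIA's retrieval NIM convention (marking text as a query versus an indexed passage), and the model name is taken from this post's benchmark list — verify both against your deployed NIM's documentation.

```python
import json

def build_embedding_request(texts, input_type="passage"):
    """Build an OpenAI-compatible embeddings payload for an embedding NIM.

    input_type distinguishes search-time queries ("query") from index-time
    documents ("passage"); this field is an NVIDIA retrieval extension.
    """
    return {
        "model": "nv-llama3.2-embedqa-1b-v2",  # model name as listed in this post
        "input": texts,
        "input_type": input_type,
        "truncate": "END",  # truncate over-long inputs rather than erroring
    }

payload = build_embedding_request(["GPU-accelerated hybrid search"], input_type="query")
body = json.dumps(payload)  # POST this body to the NIM's /v1/embeddings endpoint
```

Using `"passage"` when indexing and `"query"` when searching matters for retrieval-tuned embedding models, which encode the two asymmetrically.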

    Faster and more accurate retrieval 

    Once the data is extracted, it needs to be indexed and stored efficiently for fast retrieval. The AI Blueprint for RAG accelerates this process using NVIDIA cuVS for creating scalable indexes, enabling large datasets to be indexed quickly and with minimal latency. The blueprint further optimizes retrieval performance by employing a hybrid search strategy that combines traditional keyword-based (sparse) search with nearest neighbor (dense) vector search. This hybrid approach ensures precise, high-speed information retrieval, no matter the data type.
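The post doesn't specify how the blueprint merges the sparse and dense result lists; reciprocal rank fusion (RRF) is one common technique for combining them, sketched here as an illustration rather than the blueprint's actual method:

```python
def reciprocal_rank_fusion(sparse_ranked, dense_ranked, k=60):
    """Fuse keyword (sparse) and vector (dense) rankings with RRF.

    Each input is a list of doc IDs ordered best-first. The constant k
    dampens the influence of any single ranker's top positions (60 is the
    commonly used default).
    """
    scores = {}
    for ranking in (sparse_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["doc_a", "doc_b", "doc_c"]  # e.g., BM25 keyword results
dense = ["doc_b", "doc_d", "doc_a"]   # e.g., vector-search results
fused = reciprocal_rank_fusion(sparse, dense)
```

Documents ranked well by both searches (here `doc_b`) float to the top, while documents found by only one ranker are still retained rather than dropped.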

    In addition, NeMo Retriever enhances storage efficiency with its dynamic length and long context support, reducing storage requirements by 35x. This not only lowers operational costs but also preserves retrieval speed, even with large volumes of data. By leveraging GPUs to accelerate indexing, developers can experience up to 7x better indexing throughput, which leads to improved scalability, real-time retrieval, and more responsive AI applications (Figure 3).

    Figure 3. Leveraging GPUs to accelerate indexing results in 7x improved indexing throughput

    CPU indexing HW – fifth-generation Intel Xeon (192vCPU); GPU indexing HW – 8xL4; Embedding (nv-embedqa-e5-v5); segment size – 240K vectors (1024 Dim, fp32); Indexing – CAGRA (GPU), HNSW (CPU); Target Recall – 98%

    The blueprint also delivers greater accuracy, reducing incorrect answers by 50% with NeMo Retriever multimodal extraction microservices (Figure 4). This means developers can build more reliable systems that provide consistent, relevant results in real time, even as data scales.

    Figure 4. Evaluated on a publicly available dataset of PDFs consisting of text, charts, tables, and infographics, the NeMo Retriever extraction microservices resulted in higher multimodal retrieval recall@5 accuracy, with 50% fewer incorrect answers compared to an OSS alternative

    Recall@5. NeMo Retriever extraction (NIM On): nemoretriever-page-elements-v2, nemoretriever-table-structure-v1, nemoretriever-graphic-elements-v1, paddle-ocr compared to open-source alternative (NIM Off): HW – 1x H100

    Recognizing that each enterprise has its own unique data, proprietary terminology, and domain knowledge, the blueprint provides a path for customization. With NVIDIA NeMo microservices, developers can build a data flywheel to fine-tune models to meet specific business needs. This custom fine-tuning creates a feedback loop, improving accuracy for domain-specific queries and ensuring the retrieval system is tailored to unique enterprise requirements.

    Advanced enterprise capabilities

    The AI Blueprint for RAG is not just about speed and scalability. It also delivers key features for enterprises that need to manage complex workflows and support global operations.

    For organizations that cater to a diverse, global audience, the blueprint supports multilingual and cross-lingual retrieval using NeMo Retriever microservices, making it easier to serve customers in different regions and languages. 

    A critical aspect of modern AI systems is the ability to maintain context over time. The blueprint also supports multiturn interactions and preserves context across multiple sessions, offering a seamless conversational experience. This capability is crucial for creating intelligent virtual assistants and chatbots that interact naturally with users. 
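The multiturn capability described above can be approximated in a few lines. This is a minimal sketch of carrying conversation context into the retrieval query; the blueprint's own session handling (including persistence across sessions) is more sophisticated, and the class below is purely illustrative.

```python
from collections import deque

class ConversationMemory:
    """Keep the last few turns and fold them into the next retrieval query."""

    def __init__(self, max_turns=3):
        # Bounded history: old turns fall off automatically.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user, assistant):
        self.turns.append((user, assistant))

    def contextual_query(self, new_question):
        # Prepend recent user turns so follow-ups like "How much memory do
        # they have?" resolve against what "they" referred to earlier.
        history = " ".join(user for user, _ in self.turns)
        return f"{history} {new_question}".strip()

memory = ConversationMemory()
memory.add_turn("What GPUs support FP8?", "Hopper-generation GPUs such as H100.")
query = memory.contextual_query("How much memory do they have?")
```

Production systems often go further and use an LLM to rewrite the follow-up into a standalone question, but even naive history concatenation keeps retrieval grounded across turns.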

    Monitoring and observability are now built into the blueprint, as well as telemetry tools to help enterprises track usage, detect issues, and optimize performance—all crucial for enterprise-grade deployments. It offers features such as reflection to boost RAG accuracy and guardrails to align conversations with responsible AI guidelines through NVIDIA NeMo Guardrails microservices—all important capabilities in today’s regulatory environment.

    Finally, the blueprint integrates easily with OpenAI-compatible APIs, which simplifies the integration process for existing teams familiar with LLM-based workflows. Its decomposable architecture enables developers to adopt only the components they need while adding new features or customizing existing ones as necessary. NVIDIA also packages a sample user interface to demonstrate how the system can be implemented in a real-world setting, further accelerating time to value.
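Because the blueprint exposes OpenAI-compatible APIs, a standard chat-completions payload works against it unchanged. In this sketch the host, port, and model name are placeholders, not values from the blueprint itself:

```python
import json

def build_rag_chat_request(question, model="meta/llama-3.1-8b-instruct"):
    """Build a standard OpenAI-compatible chat-completions payload.

    The model identifier here is a placeholder; substitute whatever LLM
    your blueprint deployment serves.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,  # low temperature suits grounded, retrieval-backed answers
        "stream": False,
    }

request = build_rag_chat_request("Summarize Q3 revenue from our filings.")
body = json.dumps(request)
# POST body to the deployment's /v1/chat/completions endpoint, e.g. with any
# OpenAI-compatible client pointed at the blueprint's base URL.
```

Teams already using OpenAI client libraries only need to change the base URL, which is the integration simplification the blueprint is aiming for.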

    By offering these advanced capabilities, customers can build their own enterprise-grade RAG pipelines with industry-leading performance, accuracy, and cost efficiency. 

    Revolutionizing enterprises and data platforms with RAG  

    Leading NVIDIA partners including Accenture, Cohesity, DataStax, DDN, Dell, Deloitte, HPE, IBM, NetApp, Nutanix, Pure Storage, SAP, Siemens, Teradata, VAST Data, VMware, and WEKA are already adopting the AI Blueprint for RAG and NeMo Retriever microservices to securely connect custom models to diverse and large data sources, enabling their systems and customers to access richer, more relevant information.

    • Accenture has integrated NeMo Retriever into the AI Refinery (AIR) platform, enhancing the efficiency of marketing teams in campaign creation and management. This integration reduced campaign development time from days to minutes while providing users with a scalable platform that ensures low latency and a short learning curve for seamless adoption.
    • DataStax has integrated NVIDIA NIM for high-performance inferencing, NeMo for model customization, and NeMo Retriever for multimodal data extraction and high-accuracy information retrieval. This supports extracting data from unstructured files such as PDFs and generating embeddings in the Astra DB vector store. Using NeMo Retriever capabilities integrated directly into the DataStax platform and Astra DB, Wikimedia added semantic search capabilities to Wikipedia in just three days, a 90% reduction in previous work time and 10x faster than their previous GPU-based solution.
    • DDN Infinia is revolutionizing AI-powered data intelligence with seamless, one-button deployment of a highly efficient question-answering RAG pipeline. By integrating NeMo Retriever, DDN Infinia has enabled a DDN customer in the automotive industry to automate question answering 20x faster than traditional cloud-based embedding services. This breakthrough accelerates vector embedding generation and indexing while reducing service costs by up to 80%, delivering unmatched efficiency. The result is a significant improvement in TCO and operational performance, making AI-powered decision-making more accessible and cost-effective.
    • Deloitte leverages NeMo Retriever extraction and embedding microservices, enabling users to ingest and transform diverse unstructured documents into a searchable, high-value knowledge base. They have seen up to a 35% faster document processing time and up to 8x improvement in average query response time.
    • Cohesity integrated NeMo Retriever into their Cohesity Gaia solution, enabling a large manufacturing customer to tap into their extensive repository of research data—thousands of research papers in PDF format—and quickly find relevant answers within minutes. This has proven incredibly valuable, significantly accelerating their pace of research and discovery by saving the time previously spent searching for the right information.
    • VAST has seamlessly integrated NVIDIA LLM and NeMo Retriever embedding and reranking NIM microservices into its unified data platform, enhancing retrieval accuracy and model inference. This integration powers the VAST InsightEngine, optimizing AI deployments, improving response relevance, and unlocking the full potential of generative AI applications. With the VAST InsightEngine, the National Hockey League can unlock over 550,000 hours of historical game footage. This collaboration supports sponsorship analysis, helps video producers quickly create broadcast clips, and enhances personalized fan content.
    • WEKA WARRP integrated NeMo Retriever, NVIDIA Triton, and NVIDIA TensorRT, to optimize its RAG architecture, accelerate multimodal data extraction (text, audio, images), enhance retrieval accuracy, and enable dynamic data management at scale. With this integration, WEKA can handle hundreds of millions of concurrent agents for enterprise-scale agentic swarm workloads.

    Get started future-proofing your enterprise with RAG powered by NVIDIA NeMo Retriever 

    The AI landscape is evolving rapidly. Enterprises that fail to adopt intelligent retrieval risk falling behind. The NVIDIA AI Blueprint for RAG isn’t just an incremental update—it’s a fundamental shift toward scalable, multimodal, and high-performance retrieval that future-proofs enterprise AI strategies. It can be used as is, or combined with other NVIDIA Blueprints, such as the blueprints for digital humans or AI assistants, enabling organizations to build even more sophisticated solutions.  

    Explore NeMo Retriever microservices on the API catalog to develop enterprise-ready information retrieval systems that generate context-aware responses from large collections of multimodal data. NeMo Retriever microservices are also now available on AWS SageMaker, Google Cloud GKE, and Azure Marketplace.

    Ready for enterprise deployment? Request a 90-day free trial for NVIDIA AI Enterprise and unlock the next era of production-ready AI-driven retrieval. 

