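Evaluating Medical RAG with NVIDIA AI Endpoints and Ragas

In the rapidly evolving field of medicine, integrating cutting-edge technologies is crucial for enhancing patient care and advancing research. One such innovation is retrieval-augmented generation (RAG), which is transforming how medical information is processed and used.

RAG combines the capabilities of large language models (LLMs) with external knowledge retrieval, addressing critical limitations such as outdated information and the generation of inaccurate data, known as hallucinations. By retrieving up-to-date, relevant information from structured databases, scientific literature, and patient records, RAG provides a more accurate and contextually aware foundation for medical applications. This hybrid approach improves the accuracy and reliability of generated outputs and enhances interpretability, making it a valuable tool in areas such as drug discovery and clinical trial screening.

As we continue to explore the potential of RAG in medicine, it is essential to evaluate its performance rigorously, considering both the retrieval and generation components, to ensure high standards of accuracy and relevance in medical applications. Medical RAG systems have unique demands and requirements, which calls for comprehensive evaluation frameworks that can address them robustly.

In this post, I show you how to address medical evaluation challenges using LangChain NVIDIA AI endpoints and Ragas. You use the MACCROBAT dataset, a dataset of detailed patient medical reports from PubMed Central, enriched with meticulously annotated information.

Challenges of medical RAG

One primary challenge is scalability. With medical data growing at a CAGR of more than 35%, RAG systems must efficiently process and retrieve relevant information without compromising speed or accuracy. This is crucial in real-time applications, where timely access to information can directly affect patient care.

The specialized language and knowledge required for medical applications can differ greatly from other domains, such as legal or financial. This limits a system's generality and requires domain-specific adaptation.

Another critical challenge is the lack of medical RAG benchmarks and the inadequacy of general evaluation metrics for this domain. Without benchmarks, synthetic test and ground-truth data must be generated from medical texts and health records.

Traditional metrics such as BLEU or ROUGE focus on text similarity and do not adequately capture the nuanced performance of RAG systems. They often fail to reflect the factual accuracy and contextual relevance of the generated content, both of which are critical in medical applications.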
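To make this limitation concrete, here is a minimal sketch that uses a simplified unigram-overlap F1 score in the spirit of ROUGE-1. It is not the official ROUGE implementation, and the `unigram_f1` helper and the example sentences are illustrative assumptions rather than data from MACCROBAT.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style F1: overlap of lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration only
reference = "the patient should receive 5 mg of warfarin daily"
# Factually wrong (tenfold dose) but almost identical wording
wrong_dose = "the patient should receive 50 mg of warfarin daily"
# Factually consistent with the reference but worded differently
paraphrase = "give warfarin once a day at a dose of 5 mg"

wrong_score = unigram_f1(reference, wrong_dose)
paraphrase_score = unigram_f1(reference, paraphrase)
print(f"wrong dose: {wrong_score:.2f}, correct paraphrase: {paraphrase_score:.2f}")
```

In this toy case the answer with the wrong dose scores higher than the correct paraphrase, which is the kind of failure that motivates metrics based on factual consistency.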

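Finally, evaluating RAG systems also involves assessing the retrieval and generation components independently, as well as together. The retrieval component must be evaluated on its ability to fetch relevant and current information from vast, dynamic knowledge bases. This includes measuring precision, recall, and relevance, while also considering the temporal aspects of information.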
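As a rough sketch of the retrieval side, precision and recall at a cutoff k can be computed from a ranked list of retrieved document IDs and a set of IDs judged relevant. The function name and the document IDs below are hypothetical and only illustrate the arithmetic.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@k and recall@k for one query, given ranked retrieved IDs."""
    top_k = list(retrieved)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: the retriever returned four reports, three reports are truly relevant
retrieved_ids = ["report_12", "report_07", "report_33", "report_02"]
relevant_ids = {"report_07", "report_02", "report_41"}

p_at_3, r_at_3 = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={p_at_3:.2f}, recall@3={r_at_3:.2f}")
```

These set-based scores say nothing about how current a document is, so the temporal aspect mentioned above would need an additional check, for example on document dates.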

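The generation component, powered by an LLM, must be evaluated for the faithfulness and accuracy of the content it produces, ensuring that it stays consistent with both the retrieved data and the original query.

Overall, these challenges highlight the need for comprehensive evaluation frameworks that address the unique demands of medical RAG systems, ensuring that they provide accurate, reliable, and contextually appropriate information.

What is Ragas?

Ragas (retrieval-augmented generation assessment) is a popular, open-source, automated evaluation framework designed to evaluate RAG pipelines.

The Ragas framework provides tools and metrics for assessing the performance of these pipelines, focusing on aspects such as context relevancy, context recall, faithfulness, and answer relevancy. It employs LLM-as-a-judge for reference-free evaluation, which minimizes the need for human-annotated data and provides human-like feedback, making the evaluation process more efficient and cost-effective.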
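As a rough illustration of what a metric such as faithfulness measures, Ragas describes it as the fraction of claims in the answer that are supported by the retrieved context. The sketch below shows only that final ratio. In Ragas, the claims and their verdicts come from LLM calls, whereas the verdicts here are hard-coded, made-up values, and returning 0.0 for an empty list is a choice made for this sketch.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # sketch-only convention for an answer with no claims
    return sum(claim_supported) / len(claim_supported)


# Made-up verdicts: two of three claims in an answer are backed by the context
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
print(f"faithfulness: {score:.2f}")
```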

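Strategies for evaluating RAG

A typical strategy for robust RAG evaluation follows this process:

Generate a set of synthetic triplets (question-answer-context) based on the documents in the vector store.
Run precision and recall evaluation metrics for each sample question by running it through the RAG and comparing the response and context to the ground truth.
Filter out low-quality synthetic samples.
Run the sample queries on the actual RAG and evaluate using the metrics, with the synthetic context and response as ground truth.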
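The filtering step can be sketched as follows. The post does not say how low-quality samples are identified, so the heuristics here (a minimum question length, a non-empty ground truth that is not the string "nan", and at least one non-empty context) and the `is_usable` helper are assumptions for illustration, and the sample contents are placeholders.

```python
from typing import Any, Dict, List


def is_usable(sample: Dict[str, Any], min_question_words: int = 4) -> bool:
    """Return True if a synthetic (question, ground_truth, contexts) triplet looks usable."""
    question = str(sample.get("question") or "").strip()
    ground_truth = str(sample.get("ground_truth") or "").strip()
    contexts = sample.get("contexts") or []
    if len(question.split()) < min_question_words:
        return False  # too short to be a meaningful question
    if not ground_truth or ground_truth.lower() == "nan":
        return False  # no usable ground-truth answer was generated
    return any(str(c).strip() for c in contexts)  # needs at least one non-empty context


# Placeholder samples; the answers and contexts are not real medical content
samples: List[Dict[str, Any]] = [
    {"question": "What are typical BP measurements in the case of congestive heart failure?",
     "ground_truth": "Placeholder answer.", "contexts": ["Placeholder context."]},
    {"question": "BP?", "ground_truth": "Placeholder answer.", "contexts": ["Placeholder context."]},
    {"question": "What can scans reveal in these patients?", "ground_truth": "nan",
     "contexts": ["Placeholder context."]},
]

usable = [s for s in samples if is_usable(s)]
print(f"kept {len(usable)} of {len(samples)} samples")
```

In practice, you might replace these heuristics with an LLM-based critic or a manual spot check.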

*Figure 1. Evaluation component flow for RAG and search systems.* The diagram shows a question, such as "What are typical BP measurements in the case of congestive heart failure?" The system asks whether the retrieved context is relevant to the question and then whether the retrieved context contains information relevant to the question. The response might start with something like, "While normal blood pressure is generally considered below 120/80 mmHg, heart failure patients often require careful management within a target range…." The system then asks if the response is accurate and relevant to the question.

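To make the best use of this tutorial, you need basic knowledge of LLM inference pipelines.

Setup

To get started, create a free account with the NVIDIA API Catalog and follow these steps:

Select any model.
Choose Python, then Get API Key.
Save the generated key as NVIDIA_API_KEY.

From there, you should have access to the endpoints.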
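A minimal sketch for checking that the key is available before you call any endpoint is shown below. The `read_nvidia_api_key` helper is hypothetical, and the `nvapi-` prefix check is an assumption based on how API Catalog keys have looked at the time of writing, so relax it if your key differs.

```python
import os
from typing import Mapping


def read_nvidia_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the key stored in NVIDIA_API_KEY, or raise if it looks wrong."""
    key = env.get("NVIDIA_API_KEY", "")
    # Assumption: API Catalog keys start with "nvapi-"
    if not key.startswith("nvapi-"):
        raise ValueError("NVIDIA_API_KEY is missing or does not start with 'nvapi-'")
    return key
```

For example, after running `export NVIDIA_API_KEY=...` in your shell, calling `read_nvidia_api_key()` with no arguments either returns the key or fails early with a clear message.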

Now, install LangChain, NVIDIA AI endpoints, and Ragas:

pip install langchain
pip install langchain_nvidia_ai_endpoints
pip install ragas

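Download the medical dataset

Next, download the Kaggle MACCROBAT dataset. You can either download the dataset directly from Kaggle (requires a Kaggle API token) or use the Hugging Face version, /MACCROBAT_biomedical_ner.

For this post, you use the full text of the medical reports and ignore the NER annotations: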

from langchain_community.document_loaders import HuggingFaceDatasetLoader
from datasets import load_dataset
 
dataset_name = "singh-aditya/MACCROBAT_biomedical_ner"
page_content_column = "full_text"
 
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
dataset = loader.load()

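Generate synthetic data

One of the key challenges in RAG evaluation is generating synthetic data. This data is required for robust evaluation, because you want to test the RAG system on questions relevant to the data in the vector database.

A key benefit of this approach is that it enables wide testing without requiring costly human-annotated data. A set of LLMs (generator, critic, embedding) is used to generate representative synthetic data based on the relevant data. Ragas uses OpenAI by default, so you override that to use NVIDIA AI endpoints instead.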

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
 
critic_llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
generator_llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")
embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", truncate="END")
 
generator = TestsetGenerator.from_langchain(
     generator_llm,
     critic_llm,
     embeddings,
     chunk_size=512
)
 
# generate testset
testset = generator.generate_with_langchain_docs(dataset,  test_size=10, is_async=False, raise_exceptions=False, distributions={simple: 1.0})

*Figure 2. Evaluation pipeline for medical RAG.* The diagram shows an evaluation pipeline for medical RAG, consisting of input from the EHR database; synthetic questions, answers, and contexts; and output and metrics.

Deploy the code on a vector store based on medical reports from the MACCROBAT dataset. This generates a list of sample questions based on the actual documents in the vector store.

["What are typical BP measurements in the case of congestive heart failure?",
"What can scans reveal in patients with severe acute pain in the periumbilical region?",
"Is surgical intervention required for the treatment of a metachronous solitary liver metastasis?",
"What are the most effective procedures for detecting gastric cancer?"]

In addition, each question is associated with a retrieved context and a generated ground-truth answer, which you can later use to independently evaluate and grade the retrieval and generation components of the medical RAG.

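Evaluate the input data

You can now use the synthetic data as input data for evaluation. Populate the input data with the generated questions (question) and responses (ground_truth), as well as the actual retrieved contexts (contexts) and their respective responses (answer) from your medical RAG system.

In this code example, you evaluate generation-specific metrics (answer_relevancy, faithfulness).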

from datasets import Dataset, DatasetDict
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# `queries` is the list of generated questions; `query_rag` is your own helper
# that calls the medical RAG endpoint and returns its JSON response

# Answer relevancy and faithfulness ignore ground truth, so fill it with empty values
ground_truth = [''] * len(queries)
answers = []
contexts = []

# Run the queries against the search endpoint and collect contexts and answers
for query in queries:
    json_data = query_rag(query)

    response = json_data['results'][0]['answer']
    answers.append(response)

    # Ragas expects a list of context strings for each question
    seq_str = []
    seq_str.append(json_data['results'][0]['retrieved_document_context'])
    contexts.append(seq_str)

# Store all data in a Hugging Face dataset for Ragas
data = {
    "question": queries,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truth
}
dataset = DatasetDict()
dataset['eval'] = Dataset.from_dict(data)

# Override the default OpenAI LLM and embeddings with NVIDIA AI endpoints
nvidia_llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
nvidia_embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", truncate="END")

result = evaluate(
    dataset["eval"],
    metrics=[answer_relevancy, faithfulness],
    llm=nvidia_llm,
    embeddings=nvidia_embeddings,
    raise_exceptions=False,
    is_async=False,
)
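Before handing the data to Ragas, a quick shape check can catch mismatched columns early. The sketch below is a hypothetical helper, not part of Ragas. It assumes the four columns used in this post (question, answer, contexts, ground_truth) and that each entry in contexts is a list of strings.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_eval_data(data: Dict[str, list]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    lengths = {col: len(data[col]) for col in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")
    for i, ctx in enumerate(data["contexts"]):
        # Each question needs its own list of context strings
        if not isinstance(ctx, list) or not all(isinstance(s, str) for s in ctx):
            problems.append(f"contexts[{i}] must be a list of strings")
    return problems


# Placeholder rows for illustration only
example = {
    "question": ["What are the most effective procedures for detecting gastric cancer?"],
    "answer": ["Placeholder answer."],
    "contexts": [["Placeholder context."]],
    "ground_truth": [""],
}
print(validate_eval_data(example))
```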

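Apply to semantic search

You can further modify the system to evaluate semantic search based on keywords, as opposed to question-answer pairs. In this case, you extract the keyphrases from Ragas and ignore the generated test set of question-answer data. This is often useful in medical systems where a full RAG pipeline is not yet deployed.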

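# Hypothetical wrapper name; the original snippet shows only the function body
def generate_keyphrase_queries(generator, doc):
    # Generating a test set makes Ragas extract keyphrases into its docstore
    testset = generator.generate_with_langchain_docs([doc], test_size=10, is_async=False, raise_exceptions=False, distributions={simple: 1.0})
    # Collect the keyphrases of every docstore node to use as search queries
    queries = []
    for node in generator.docstore.nodes:
        queries += node.keyphrases
    return queries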

You can now feed the queries, as opposed to the questions, into any medical semantic search system for evaluation:

["lesion", "intraperitoneal fluid", "RF treatment", "palpitations", "thoracoscopic lung biopsy", "preoperative chemoradiotherapy", "haemoglobin level", "needle biopsy specimen", "hypotension", "tachycardia", "abdominal radiograph", "pneumatic dilatation balloon", "computed tomographic (CT) scan", "tumor cells", "radiologic examinations", "S-100 protein", "ultrastructural analysis", "Birbeck granules", "diastolic congestive heart failure (CHF)", "Brachial blood pressure", "ventricular endomyocardial biopsy", "myocarditis", "infiltrative cardiomyopathies", "stenosis", "diastolic dysfunction", "autoimmune hepatitis"]
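Because keyphrases are collected from many document nodes, the same phrase can appear more than once or with inconsistent spacing and casing. A minimal cleanup sketch, using a hypothetical `normalize_keyphrases` helper, could look like this:

```python
from typing import Iterable, List


def normalize_keyphrases(phrases: Iterable[str]) -> List[str]:
    """Collapse whitespace and drop empty or case-insensitive duplicate phrases, keeping order."""
    seen = set()
    result = []
    for phrase in phrases:
        cleaned = " ".join(phrase.split())
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


raw = ["lesion", "Lesion ", "", "tumor  cells", "S-100 protein", "tumor cells"]
queries = normalize_keyphrases(raw)
print(queries)
```

The first spelling of each phrase is kept, so "lesion" wins over "Lesion" in this example.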

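Customize semantic search

As mentioned earlier, default evaluation metrics are not always sufficient for medical systems and often must be customized to support domain-specific challenges.

To this end, you can create custom metrics in Ragas. This requires creating a custom prompt. In this example, you create a custom prompt to measure retrieval precision for semantic search queries: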

from ragas.llms.prompt import Prompt

RETRIEVAL_PRECISION = Prompt(
    name="retrieval_precision",
    instruction="""if a user put this query into a search engine, is this result relevant enough that it could be in the first page of results? Answers should STRICTLY be either '1' or '0'. Answer '0' if the provided summary does not contain enough information to answer the question and answer '1' if the provided summary can answer the question.""",
    input_keys=["question", "context"],
    output_key="answer",
    output_type="json",
)

Next, build a new class that inherits from MetricWithLLM and override the _ascore function to compute the score based on the prompt response:

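import typing as t
from dataclasses import dataclass, field

from langchain_core.callbacks import Callbacks
from ragas.metrics.base import EvaluationMode, MetricWithLLM

@dataclass
class RetrievalPrecision(MetricWithLLM):

    name: str = "retrieval_precision"  # type: ignore
    evaluation_mode: EvaluationMode = EvaluationMode.qc  # type: ignore
    context_relevancy_prompt: Prompt = field(default_factory=lambda: RETRIEVAL_PRECISION)

    async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
        # Assumption: the original snippet omits how `response` is obtained. These three
        # lines follow the pattern of the built-in Ragas 0.1.x metrics; adjust for your version
        prompt = self.context_relevancy_prompt.format(question=row["question"], context="\n".join(row["contexts"]))
        result = await self.llm.generate(prompt, callbacks=callbacks, is_async=is_async)
        response = result.generations[0][0].text.strip()
        score = response[0] if response else ""  # first character is the result, '0' or '1'
        if score.isnumeric():
            return int(score)
        else:
            return 0

retrieval_precision = RetrievalPrecision()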

Now your new custom metric is defined as retrieval_precision, and you can use it in the standard Ragas evaluation pipeline:

nvidia_llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
nvidia_embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", truncate="END")
 
score = evaluate(dataset["eval"], metrics=[retrieval_precision], llm=nvidia_llm, embeddings=nvidia_embeddings, raise_exceptions=False, is_async=False)

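Refine with structured output

RAG and LLM evaluation frameworks employ LLM-as-a-judge techniques, which often require long and complicated prompts. As you saw in the earlier custom metric prompt example, this also requires parsing and post-processing of the LLM response.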
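To illustrate the kind of post-processing involved, here is a minimal sketch of a tolerant parser for a judge response that should contain '0' or '1'. The `parse_binary_score` helper is hypothetical and not part of Ragas. It looks for a standalone 0 or 1 anywhere in the text and falls back to 0, mirroring the conservative default in the custom metric.

```python
import re


def parse_binary_score(response: str) -> int:
    """Extract a standalone '0' or '1' from a free-form LLM response; default to 0."""
    # \b keeps multi-digit numbers such as "10" from matching
    match = re.search(r"\b[01]\b", response)
    return int(match.group()) if match else 0


for text in ["1", '{"answer": "0"}', "Answer: 1.", "not sure", "10"]:
    print(repr(text), "->", parse_binary_score(text))
```

Even this small parser has to make guesses about formats the model might produce, which is the fragility that structured output is meant to remove.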

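You can refine this process and make it more robust by using the structured output feature of LangChain NVIDIA AI endpoints. Modifying the earlier prompt yields a simplified pipeline: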

import enum
 
class Choices(enum.Enum):
    Y = "Y"
    N = "N"
 
structured_llm = nvidia_llm.with_structured_output(Choices)
 
structured_llm.invoke("if a user put this query into a search engine, is this result relevant enough that it could be in the first page of results? Answer 'N' if the provided summary does not contain enough information to answer the question and answer 'Y' if the provided summary can answer the question.")
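Once each judgment arrives as an enum member instead of free text, aggregating results needs no string parsing. The sketch below assumes that every structured call returns a `Choices` member, which you should confirm for the model you use. It redefines `Choices` only to stay self-contained, and `mean_precision` is a hypothetical helper with made-up judgments.

```python
import enum
from typing import Iterable


class Choices(enum.Enum):
    Y = "Y"
    N = "N"


def mean_precision(judgments: Iterable[Choices]) -> float:
    """Average of 1 for each 'Y' judgment and 0 for each 'N' judgment."""
    scores = [1 if judgment is Choices.Y else 0 for judgment in judgments]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up judgments for four query-result pairs
judgments = [Choices.Y, Choices.N, Choices.Y, Choices.Y]
overall = mean_precision(judgments)
print(f"retrieval precision: {overall:.2f}")
```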

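Conclusion

RAG has emerged as a powerful approach that combines the strengths of LLMs and dense vector representations. By using dense vector representations, RAG models can scale efficiently, making them well suited for large-scale enterprise applications such as multilingual customer service chatbots and code generation agents.

As LLMs continue to evolve, RAG will play an increasingly important role in driving innovation and delivering high-quality, intelligent systems in medicine.

When evaluating a medical RAG system, it's essential to consider several key factors: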

The system should provide accurate, relevant, and up-to-date information while remaining faithful to the retrieved context.
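It must demonstrate robustness in handling specialized medical terminology and concepts, as well as noisy or imperfect inputs.
Proper evaluation involves using appropriate metrics for both the retrieval and generation components, benchmarking against specialized medical datasets, and considering cost-effectiveness.
Incorporating feedback from healthcare professionals and conducting ongoing evaluations are crucial to ensuring the system's practical utility and relevance in clinical settings.

The pipeline described in this post addresses all these points and can be further refined to include additional metrics and features.

For more information about a reference evaluation tool using Ragas, see the evaluation example on the NVIDIA/GenerativeAIExamples GitHub repo.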
