Prevent LLM Hallucinations with the Cleanlab Trustworthy Language Model in NVIDIA NeMo Guardrails

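As more enterprises integrate Large Language Models (LLMs) into their applications, they face a critical challenge: LLMs can generate plausible but incorrect responses, known as hallucinations. AI guardrails, the safeguarding mechanisms enforced in AI models and applications, are a popular technique for ensuring the reliability of AI applications.

This post shows how to build safer, hallucination-free AI applications using the Cleanlab Trustworthy Language Model (TLM) with NVIDIA NeMo Guardrails.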

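NVIDIA NeMo Guardrails overview

NVIDIA NeMo Guardrails is a scalable platform for defining, orchestrating, and enforcing AI guardrails or policies in AI agents and other generative AI applications. It includes a customizable and extensible set of rails for content safety, jailbreak detection, conversational topic control, and more. NeMo Guardrails provides a unified framework for integrating and orchestrating diverse AI guardrails, including NeMo Guardrails NIM microservices as well as third-party and open community guardrails.

For example, NeMo Guardrails provides safety checks on both input and output text through LLM self-checking, as well as through NVIDIA's Llama 3.1 NemoGuard Content Safety NIM and Meta's Llama Guard. These checks audit all text against defined policies and flag policy violations in real time. NeMo Guardrails also integrates third-party guardrails, such as ActiveFence ActiveScore, giving developers a comprehensive and flexible safety toolkit in which different checks can be combined to meet unique application requirements.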

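Cleanlab Trustworthy Language Model overview

The NeMo Guardrails framework offers native support for guardrails based on trustworthiness scores powered by the Cleanlab Trustworthy Language Model (TLM). TLM scores the trustworthiness of any LLM response using state-of-the-art uncertainty estimation techniques. To learn more, see Overcoming Hallucinations with the Trustworthy Language Model.

TLM automates real-time validation of LLM outputs across a range of enterprise use cases, including:

Customer support systems that intelligently escalate responses between AI and human agents
AI assistants powered by retrieval-augmented generation (RAG) that automatically flag untrustworthy responses
Automated LLM systems that classify information, route information, or execute tool calls more reliably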

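How to integrate the trustworthiness guardrail with NeMo Guardrails

This section walks through integrating the trustworthiness guardrail with NeMo Guardrails using a customer support AI assistant, along with several real-world example scenarios.

Application: Customer support AI assistant

To demonstrate how the guardrail integrates with NeMo Guardrails, we built a customer support AI assistant for an e-commerce company. The assistant is designed to support customer inquiries about shipping, product returns, and refunds, using the company's policy documents as context.

The following is an excerpt from the policy document used as context for the AI assistant. The scenarios presented later refer to specific parts of this policy.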

# ACME Inc. Customer Service Policy
 
 
    ## Table of Contents
    1. Free Shipping Policy
    2. Free Returns Policy
    3. Fraud Detection Guidelines
    4. Customer Interaction Tone
 
 
    ## 1. Free Shipping Policy
 
 
    ### 1.1 Eligibility Criteria
    - Free shipping is available on all orders over $50 within the continental United States.
    - For orders under $50, a flat rate shipping fee of $5.99 will be applied.
    - Free shipping is not available for expedited shipping methods 
      (e.g., overnight or 2-day shipping).
 
 
    ### 1.2 Exclusions
    - Free shipping does not apply to orders shipped to Alaska, Hawaii, 
      or international destinations.
    - Oversized or heavy items may incur additional shipping charges, which will be  
      clearly communicated to the customer before purchase.
 
 
    ### 1.3 Handling Customer Inquiries
    - If a customer inquires about free shipping eligibility, verify the order total and  
      shipping destination.
    - Inform customers of ways to qualify for free shipping  
      (e.g., adding items to reach the $50 threshold).
    - For orders just below the threshold, you may offer a one-time courtesy free shipping  
      if it's the customer's first purchase or if they have a history of large orders.
 
 
    ## 2. Free Returns Policy
 
 
    ### 2.1 Eligibility Criteria
    - Free returns are available for all items within 30 days of the delivery date.
    - Items must be unused, unworn, and in their original packaging with all tags attached.
    - Free returns are limited to standard shipping methods within the 
      continental United States.
 
 
    ### 2.2 Exclusions
    - Final sale items, as marked on the product page, are not eligible for free returns.
    - Customized or personalized items are not eligible for free returns unless there is a manufacturing defect.
    - Undergarments, swimwear, and earrings are not eligible for free returns 
      due to hygiene reasons.
 
 
    ### 2.3 Process for Handling Returns
    1. Verify the order date and ensure it falls within the 30-day return window.
    2. Ask the customer about the reason for the return and document it in the system.
    3. Provide the customer with a prepaid return label if they qualify for free returns.
    4. Inform the customer of the expected refund processing time 
       (5-7 business days after receiving the return).
 
 
    ### 2.4 Exceptions
    - For items damaged during shipping or with manufacturing defects, offer an 
      immediate replacement or refund without requiring a return.
    - For returns outside the 30-day window, use discretion based on the customer's history
      and the reason for the late return. You may offer store credit as a compromise.
 
 
    ## 3. Fraud Detection Guidelines
    ...

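Consider the following scenario. A customer of this e-commerce company wants to return a pair of sneakers and asks about the company's return policy. The AI assistant scans the return policy and responds, "You can return them within 30 days, and we'll issue a refund after inspecting the shoes for wear."

This response may sound reasonable, but what if the return policy explicitly excludes shoes that show signs of wear? Without proper guardrails, subtle misreadings and omissions can quickly lead to customer dissatisfaction.

A guardrail designed to ensure the trustworthiness of model responses addresses this risk. Whenever the AI assistant generates a response, TLM evaluates it to ensure that it aligns with the policy and the user's query.

If TLM detects misalignment, or the response is otherwise untrustworthy, the AI assistant does not return the inaccurate response and instead returns a fallback response such as "I'm sorry, I am unable to help with this request."

When the guardrail triggers, such a system does not have to default to a fallback response; it can instead escalate the conversation to a human support agent rather than return an untrustworthy AI response.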
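The check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TLM or NeMo Guardrails API: `build_prompt`, `guard_response`, `GuardedReply`, and the `score_fn` callable are hypothetical names, the prompt format is invented for the example, fixed-score lambdas stand in for a real trustworthiness scorer, and the 0.7 threshold is borrowed from the Colang flow in the Implementation section.

```python
from dataclasses import dataclass
from typing import Callable

# Fallback text quoted from the article; all other names are illustrative.
FALLBACK = "I'm sorry, I am unable to help with this request."


@dataclass
class GuardedReply:
    text: str     # what the user actually sees
    passed: bool  # True when the LLM response cleared the guardrail
    score: float  # trustworthiness score, assumed to lie in [0, 1]


def build_prompt(policy: str, query: str) -> str:
    """Combine policy context and user query (hypothetical prompt format)."""
    return f"Policy:\n{policy}\n\nCustomer question:\n{query}"


def guard_response(
    prompt: str,
    response: str,
    score_fn: Callable[[str, str], float],
    threshold: float = 0.7,
) -> GuardedReply:
    """Score the (prompt, response) pair and swap in the fallback if it is low."""
    score = score_fn(prompt, response)
    if score < threshold:
        return GuardedReply(FALLBACK, False, score)
    return GuardedReply(response, True, score)


# Fixed-score lambdas stand in for a real scorer (assumption for the demo).
prompt = build_prompt(
    "Free returns are available for all items within 30 days of delivery.",
    "Can I return my sneakers?",
)
low = guard_response(prompt, "We refund after inspecting for wear.", lambda p, r: 0.4)
high = guard_response(prompt, "Yes, within 30 days of delivery.", lambda p, r: 0.9)
print(low.passed, low.text)
print(high.passed, high.text)
```

Keeping the scorer behind a plain callable means the same routing logic can be exercised in tests with fixed scores before a real scoring service is wired in.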

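Evaluation on real-world scenarios

This section explores how the trustworthiness guardrail handles common customer support scenarios. Each scenario below demonstrates a different aspect of hallucination detection and response validation, showing how the guardrail handles different degrees of alignment between LLM responses and the policy guidelines.

*Figure 1. The customer support AI assistant handling three unique support scenarios*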

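Scenario 1: Refund policy inquiry

Refund policies are critical to customer support, and the assistant must respond with accurate information to maintain customer trust.

User query: Can I get a refund if I'm disappointed with my purchase, even if it isn't defective?

LLM response: You may be eligible for a refund if your item is unused and unworn. If the item is a final sale item, it may not be eligible for a refund.

Trustworthiness score: 0.92. The response earns a high trustworthiness score by adhering closely to the policy document.

Guardrail check result: Pass. The assistant forwards the validated LLM response to the user.

This scenario shows how the assistant behaves when handling a query that maps directly to the policy.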

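Scenario 2: Product return inquiry

Product return policies often vary by product category, with specific rules for items such as jewelry, electronics, or fragile items. The assistant must accurately distinguish valid policy interpretations from potentially misleading extrapolations.

User query: Can I return a pair of nose rings if they don't fit?

LLM response: Unfortunately, under our free returns policy, undergarments and jewelry, including nose rings, are not eligible for returns.

Trustworthiness score: 0.58. The guardrail's low trustworthiness score reveals an important detail: the policy addresses earrings but makes no mention of nose rings. The LLM's attempt to generalize triggered the trustworthiness check.

Guardrail check result: Fail. The assistant returns the fallback response to the user.

Here, the guardrail correctly identifies this as a case requiring human intervention, preventing a potentially incorrect policy interpretation from spreading.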

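Scenario 3: General information request

Requests for customer support contact information are common when customers want more direct communication or a faster resolution. If the policy lacks this information, the assistant must not generate false contact details. Instead, it should redirect the customer to where they can find the information.

User query: Is there a way to reach customer service by phone instead of chat?

LLM response: I'm unable to provide a phone number or direct contact information, but I can help with any questions you have.

Trustworthiness score: 0.45. Although the LLM gave a seemingly appropriate response (albeit one not fully aligned with the policy's tone guidelines), the guardrail assigns a low trustworthiness score because the context lacks contact information and the LLM's response is uncertain.

Guardrail check result: Fail. The assistant returns the fallback response to the user.

In this last scenario, the guardrail does more than check for fabricated information. It verifies whether LLM responses, even cautious ones, are grounded in the policy document.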
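To make the three outcomes concrete, the sketch below replays the reported scores against a cutoff. It assumes the 0.7 threshold used in the Colang flow in the Implementation section; the helper names `check` and `outcomes` are hypothetical, and the scores are simply the ones quoted in the scenarios, not fresh TLM output.

```python
THRESHOLD = 0.7  # cutoff taken from the Colang flow in the Implementation section

# (scenario name, trustworthiness score reported in the article)
SCENARIOS = [
    ("Refund policy inquiry", 0.92),
    ("Product return inquiry", 0.58),
    ("General information request", 0.45),
]


def check(score: float, threshold: float = THRESHOLD) -> str:
    """Mirror the flow's comparison: scores below the threshold fail."""
    return "Fail" if score < threshold else "Pass"


def outcomes(threshold: float = THRESHOLD) -> dict:
    """Map each scenario name to its guardrail result at the given cutoff."""
    return {name: check(score, threshold) for name, score in SCENARIOS}


for name, result in outcomes().items():
    print(f"{name}: {result}")
# Refund policy inquiry: Pass
# Product return inquiry: Fail
# General information request: Fail

# Lowering the cutoff to 0.5 would let the nose-ring answer through.
print(outcomes(0.5)["Product return inquiry"])  # Pass
```

The second call shows why the cutoff is an application-level decision: a more permissive threshold would have delivered the incorrect jewelry interpretation from Scenario 2 to the customer.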

Implementation

The core component of this AI application is the Colang definition in NeMo Guardrails that obtains Cleanlab's trustworthiness score:

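flow cleanlab trustworthiness
  # Score the current <prompt, response> pair with Cleanlab TLM
  $result = await CallCleanlabApiAction
  # Below the threshold: send the fallback message and stop processing
  if $result.trustworthiness_score < 0.7
    bot response untrustworthy
    abort

# The flow name must match the call above
flow bot response untrustworthy
  bot say "I'm sorry, I am unable to help with this request. I'll connect you with another agent who can help..."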

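This configuration does the following:

Calls Cleanlab's TLM API to obtain a trustworthiness score for the <prompt, response> pair.
Compares the score against the specified threshold; depending on the result, it either sends the LLM response to the user or hands the conversation off to a human agent.

Note that the action triggered for untrustworthy responses can be customized to the application's requirements, ranging from a simple fallback message to complex agentic triggers.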
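One way to picture that customization is a pluggable handler that runs only when a response falls below the threshold. The following Python sketch is an illustration, not part of NeMo Guardrails or Cleanlab: `deliver`, `send_fallback`, `escalate_to_human`, and the in-memory `human_queue` are hypothetical names, and the queue merely stands in for whatever ticketing or live-agent system an application uses.

```python
from typing import Callable, Dict, List

THRESHOLD = 0.7  # same cutoff as the Colang flow

# Hypothetical stand-in for a ticketing system or live-agent queue.
human_queue: List[Dict[str, object]] = []

# A handler receives (query, draft response, score) and returns the user-facing text.
Handler = Callable[[str, str, float], str]


def send_fallback(query: str, response: str, score: float) -> str:
    """Simplest option: replace the draft with a fixed apology."""
    return "I'm sorry, I am unable to help with this request."


def escalate_to_human(query: str, response: str, score: float) -> str:
    """Richer option: record the case for a human agent, then tell the user."""
    human_queue.append({"query": query, "draft": response, "score": score})
    return "I'll connect you with another agent who can help..."


def deliver(
    query: str,
    response: str,
    score: float,
    on_untrustworthy: Handler = send_fallback,
) -> str:
    """Return the LLM response if trusted, otherwise defer to the handler."""
    if score < THRESHOLD:
        return on_untrustworthy(query, response, score)
    return response


query = "Can I return a pair of nose rings if they don't fit?"
draft = "Unfortunately, jewelry, including nose rings, is not eligible for returns."
print(deliver(query, draft, 0.58))                     # fixed fallback message
print(deliver(query, draft, 0.58, escalate_to_human))  # hand-off message
print(len(human_queue))                                # one case queued for a human
```

Because the handler is just a function, swapping a fallback message for an escalation (or any other trigger) does not require changing the threshold check itself.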

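Summary

NVIDIA NeMo Guardrails provides powerful controls for safe and reliable LLM applications such as customer support assistants. With the Cleanlab Trustworthy Language Model, developers can add an extra safeguard against hallucinations and untrustworthy responses when building LLM-based applications.

Cleanlab is a member of the NVIDIA Inception program, which is designed to help startups accelerate technical innovation and business growth at every stage.

To learn more, see the following resources:

Code for running the CSA application
NeMo Guardrails and TLM integration documentation
Cleanlab documentation for tutorials and benchmarks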
