圖像和視頻理解的視覺語言模型提示工程實踐指南

視覺語言模型 (VLMs) 正在以極快的速度發展。2020 年，首批 VLMs 通過使用視覺編碼器將視覺理解引入大語言模型 (LLMs) ，徹底改變了生成式 AI 格局。這些初始 VLMs 能力有限，只能理解文本和單張圖像輸入。

幾年后，VLM 現在能夠理解多圖像和視頻輸入，以執行高級視覺語言任務，例如視覺問答 (VQA)、字幕、搜索和摘要。

A graph showing different techniques for improving VLM accuracy as data and compute scales up, including prompt engineering, prompt learning, PEFT, and fine-tuning.

圖 1、NVIDIA NeMo 中的模型自定義工具套件

通過調整提示和模型權重，可以提高特定用例的 VLM 準確性。可以使用 PEFT 等高效技術微調模型權重，但仍需要足夠的數據和計算。

但是， prompt engineering 和上下文學習也可用于提高輸出質量，并且與在運行時調整模型的文本輸入一樣簡單。

本文將為您介紹 VLM 的演變，以及如何有效地提示 VLM 以用于了解單張圖像、多張圖像和視頻的用例。如需了解有關 VLM 和視覺 AI 智能體的更多信息，請注冊參加即將舉辦的 Vision for All：Unlocking Video Analytics with AI Agents 在線研討會。

單圖像理解

從單張圖像來看，VLM 能夠識別、描述、分類和推理圖像中的內容。一些 VLM 具有高級功能，例如為某些對象提供邊界框坐標，以及翻譯圖像中的文本。這最適合需要對圖像進行分類、添加字幕或以文本進行描述的用例。

單圖像功能還可用于檢測直播中發生的基本事件。這可以通過以下方式實現：從直播中一次采樣一幀，并將其提供給 VLM，同時提供描述要查找的內容的提示。當要檢測的事件類型較大、清晰可見且移動緩慢時，此方法的效果尤為出色。

系統還可以提示 VLM 以 JSON 等結構化格式輸出，以便輕松解析響應并將其發送至數據庫或通知服務。

例如，我們為 Llama-3.2-90b-vision-instruct 提供了以下提示和圖像（圖 2）。

Input image showing a drone footage with two firetrucks and a fire with smoke. This image is used as input to the VLM model.

圖 2、將圖像輸入 VLM 以進行問答

提示：是否有消防車？是否有火災？有消防人員嗎？輸出每個問題的答案，以 JSON 格式。

VLM 響應：

[ { "question": "Is there a fire truck?", "answer": "True" }, 
  { "question": "Is there a fire?", "answer": "True" }, 
  { "question": "Are there firefighters?", "answer": "True" } 
]

對于每個問題，VLM 能夠準確回答并以結構化格式輸出，以便用于下游任務。

雖然此方法允許 VLM 通過一次采樣一幀來操作視頻輸入，但 VLM 本身無法理解視頻的完整上下文，將其局限于簡單的用例。一次對一幀進行采樣無法為其提供檢測隨時間變化的動作或趨勢所需的時間理解。這可以通過多圖像和視頻 VLM 得到改善。

多圖像理解

許多 VLM 可以接受多張圖像作為輸入，允許模型比較、對比并從輸入圖像中學習。

VLM 通常在各種用例中都具有良好的零樣本擊能力。但是，在嘗試從 VLM 獲取有關特定領域用例的更精確答案時，VLM 可能會遇到困難，因為它們缺乏準確響應用戶提示所需的完整上下文。

為了更好地理解這一點，我們使用 VLM 來估計產品的庫存水平。這是一個重要的零售用例，因為它可以在商店中的產品運行不足時自動監控和提醒員工，以避免因缺貨而損失的收入。

我們向 Cosmos Nemotron 34B 提供了一張表格 (圖 3) 的單一圖像，該表格應與多種零食產品一起充分庫存，并要求其估算庫存水平。

An image showing a table partially stocked with various snack products. Several boxes and shelves on the table are empty.

圖 3. 用于零售庫存水平檢測的單圖像 VLM 輸入

提示：按 0 – 100 的比例估算零食表的庫存量。

VLM 響應：

The snack table appears to be well-stocked, with a variety of snacks and treats. The shelves are filled with an assortment of items, including chips, cookies, and other snacks. The table is also neatly organized, with each item placed in a designated spot. Based on the visual, I would rate the stock level at around 80-90, indicating that the table is well-stocked and ready to cater to a crowd.

該模型將其評為 80–90% 的庫存，即使肉眼很清楚庫存水平很低。從模型的角度來看，它不知道在查看單個圖像之前表格的樣子。

通過使用 Cosmos Nemotron 34B 多圖像功能，我們提供了另一張 100%庫存水平的零食表圖像 (圖 4)，為模型提供了額外的背景信息，以便生成更準確的估計。

Two images show a snack table at 100% stock level and the other at a 25% stock level.?

圖 4. 為 VLM 提供情境學習輸入，用于零售庫存水平檢測

提示：首先比較兩張圖像的庫存水平。然后，按 0 – 100% 的比例對庫存量的每張圖像生成估計值。

VLM 響應：

The first image shows a well-stocked display with a variety of snacks, while the second image shows a more sparse display with only a few items. Based on the visual comparison, I estimate that the first image has a stock level of around 80-90%, while the second image has a stock level of around 20-30%.

現在，通過提供額外的圖像作為參考，VLM 可將第二張圖像的庫存量估算為 20–30%，準確度明顯更高。

多圖像輸入還可用于更復雜的多模態 RAG 工作流，其中會根據用戶的輸入將示例圖像動態添加到提示中。多圖像 VLM 還可以通過從視頻中采樣一小部分幀并將其提供給 VLM 來用于基本的視頻理解。

但是，許多接受多圖像輸入的 VLM 的上下文窗口有限（10–20 幀），并且可能未基于視頻數據進行訓練。為了真正理解視頻輸入，我們建議使用基于視頻特定數據集訓練的 VLM 和長上下文窗口。

視頻理解

最新的 VLM 具有完整的視頻理解功能，具有大上下文長度、幀壓縮和視頻訓練數據。這使得 VLM 能夠提取更長的視頻，并提高時間理解能力：能夠理解一段時間內的行動和趨勢。

第一個示例展示了如何使用直播中的單幀來檢測火災，但由于缺乏時間背景，相同的方法無法告訴您火災是在增大還是變小。

具有長上下文和視頻理解的 VLM 會隨著時間的推移處理多個幀，以了解火勢是否在增長。以下是 Cosmos Nemotron 模型系列的一些示例。

順序視覺理解

具有順序理解的 VLMs 依賴于按序列處理幀，使其能夠捕捉動作和事件隨時間推移的進展。與多圖像模型相比，它們在處理有限數量的幀時沒有顯式時間連接，而順序視頻模型則通過按順序分析幀來整合時間動力學。

這在您想要在視頻中捕捉事件因果關系的情況下非常有用，因為它使模型能夠跟蹤動作在一段時間內的進展情況，并建立因果關系。

接下來的幾個示例基于以下視頻，展示了一名工人在倉庫中丟棄箱子的過程。我們將討論一些提示和響應。

Warehouse video where a worker enters the scene and accidentally drops a box. You can also see multiple aisles and shelves in the background.

圖 5、將倉庫視頻輸入到 VLM 以進行場景理解

提示：本視頻中發生了什么？

VLM 響應：

The video shows a warehouse worker in a yellow safety vest and hard hat performing various tasks. The worker is seen walking through the warehouse, carrying boxes, and interacting with the shelving units. The warehouse has green metal shelving with labels like "C", "D", "E", and "F" on the aisles.
 
The worker appears to be in the process of organizing or retrieving items from the shelves. At one point, they're seen bending down to pick up a box from the floor. The worker's movements suggest they're actively engaged in their duties, possibly sorting, stocking, or preparing items for shipment.
 
The warehouse environment is well-lit, with fluorescent lighting illuminating the space. The worker's safety gear, including the reflective vest and hard hat, indicates adherence to safety protocols in this industrial setting.
 
Overall, the video captures a typical day in the life of a warehouse worker, showcasing the physical nature of the job and the organized layout of the storage facility.

該模型提供了詳細的環境細節 (例如通道標簽和照明條件)，但無法正確識別工人掉落的箱子。現在，嘗試直接問這個問題。Prompt：工作人員是否丟棄了任何箱子？

VLM 響應：

Yes, the worker did drop a box. In the video, we can see a box lying on the floor in front of the worker. This suggests that the worker has accidentally dropped the box while handling it.

現在，VLM 可以推斷出工作者可能丟棄了箱子。這表明，除非您指示 VLM 查找特定內容，否則通用提示可能無法正確捕捉所有細微差別。稍微修改一下提示。

提示：詳細介紹此倉庫視頻的視覺和敘事元素。突出顯示所有異常或危險（如果存在）。

VLM 響應：

The video captures a warehouse with tall metal shelving units labeled with letters 'C', 'D', 'E', and 'F'. The shelves are stocked with various items, including boxes and packages. The floor is concrete, and the lighting is artificial, typical of an indoor industrial setting. A worker wearing a high-visibility vest, a hard hat, and dark pants is seen walking down the aisle between the shelves. The worker is carrying a cardboard box and then drops the box on the floor.

現在，該模型可以捕捉場景的所有重要細節以及重要事件，例如工人丟箱。如果您想稍后在視頻中執行 Q&A，詳細的回答非常有用。具有視頻理解功能的 VLM 還能夠解決定向問題，因為它們可以理解隨時間推移的動作。

提示：工作者是從哪個方向進入場景的？

VLM 響應：

The worker entered the scene from the right side.

但是，這些模型通常缺乏精確的時間定位，這可能會限制其確定特定事件何時發生的能力。

提示：工人是什么時候丟棄盒子的？

VLM 響應：

The worker dropped the box in the second image.

時間定位

結合 Localization Interpretable Temporal Attention (LITA) 或類似時間定位技術的 VLM 通過顯式學習關鍵事件發生的時間和地點來提高視頻理解能力。這些模型通過引入專注于視頻關鍵片段的時間注意力機制，隨著時間的推移更有效地調整視覺和文本模式來實現這一目標。這有助于回答視頻中的“何時”問題。

提示： 工人是什么時候丟棄盒子的？

VLM 響應：

0:04-0:06 The worker dropped the box between 4s and 6s. During this time, the worker is seen dropping the box on the ground.

VLM 的最佳提示格式取決于模型的架構和訓練期間使用的描述對的性質。不同的訓練數據集會影響 VLM 解釋提示的方式。

結束語

本文介紹了 VLM 如何從僅支持單圖像輸入發展為能夠對長視頻輸入進行復雜的時間推理。要開始使用 VLM，請訪問 build.nvidia.com 并嘗試本博文中顯示的一些提示。有關技術問題，請參閱 Visual AI Agent 論壇。

VLM 可用于構建各種視覺 AI 智能體。無需 GPU，即可在 /NVIDIA/metropolis-nim-workflows GitHub 存儲庫上探索視覺 AI 工作流示例之一。要構建視頻分析 AI 智能體，請試用 NVIDIA AI Blueprint 中的視頻搜索和摘要藍圖。

如需了解有關 VLM 和視覺 AI 智能體的更多信息，請注冊參加即將舉辦的 Vision for All：Unlocking Video Analytics with AI Agents 網絡研討會。

有關 LLM 提示的更多信息，請參閱《 An Introduction to Large Language Models: Prompt Engineering and P-Tuning 》。

圖像和視頻理解的視覺語言模型提示工程實踐指南

單圖像理解

多圖像理解

視頻理解

順序視覺理解

時間定位

結束語

相關資源

標簽

關于作者

圖像和視頻理解的視覺語言模型提示工程實踐指南

單圖像理解

多圖像理解

視頻理解

順序視覺理解

時間定位

結束語

相關資源

標簽

關于作者

相關文章

Llama 3.2 加速部署從邊緣到云端實現提速

相關文章

聚焦：個人 AI 借助 NVIDIA Riva 為小企業主帶來 AI 接待員

使用 NVIDIA AI Blueprint 構建實時多模態 XR 應用以進行視頻搜索和摘要

使用 NetworkX、Jaccard Similarity 和 cuGraph 預測您下一部最喜歡的電影

使用 GPU 在 Apache Spark 上加速 JSON 處理

構建生成式 AI OpenUSD 應用，呈現準確品牌的營銷視覺效果