
    Latest Multimodal Addition to Microsoft Phi SLMs Trained on NVIDIA GPUs


    Large language models (LLMs) have permeated every industry and changed the potential of technology. However, due to their massive size, they are not practical given the resource constraints that many companies face.

    The rise of small language models (SLMs) bridges quality and cost by creating models with a smaller resource footprint. SLMs are a subset of language models that tend to focus on specific domains and are built with simpler neural architectures. And as models evolve to mimic how humans perceive the world around them, they must accept multiple modalities of data.

    Microsoft has announced a new generation of open SLMs in the Phi family, with two new additions:

    • Phi-4-mini 
    • Phi-4-multimodal

    Phi-4-multimodal is the first multimodal model to join the family that accepts text, audio, and image data inputs. 

    These models are small enough for on-device deployment. This release builds on top of the December 2024 research-only release of the Phi-4 14B parameter SLM and enables commercial use for the two new smaller models. 

    The new models are available on the Azure AI Foundry, Microsoft’s cloud AI platform for designing, customizing, and managing AI applications and agents.

    You can test out each member of the Phi family through the NVIDIA API Catalog, which is the first sandbox environment to support each modality and tool-calling for Phi-4-multimodal. Use the preview NIM microservice to integrate the model into your applications today.
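
    As a minimal sketch of that integration, the preview NIM exposes an OpenAI-compatible API, so a basic text call from Python can look like the following. The endpoint URL and model identifier below are assumptions; copy the exact values from the sample code that the API Catalog generates for you.

        # Minimal sketch: text-only chat call to the preview NIM.
        # Endpoint and model name are assumptions; the API Catalog's generated
        # sample code is the authoritative reference.
        from openai import OpenAI

        client = OpenAI(
            base_url="https://integrate.api.nvidia.com/v1",  # assumed API Catalog endpoint
            api_key="$NVIDIA_API_KEY",                       # personal key from build.nvidia.com
        )

        completion = client.chat.completions.create(
            model="microsoft/phi-4-multimodal-instruct",     # assumed model identifier
            messages=[{"role": "user", "content": "Explain what a small language model is in two sentences."}],
            temperature=0.2,
            max_tokens=256,
        )
        print(completion.choices[0].message.content)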

    Why invest in SLMs? 

    SLMs enable generative AI capabilities in memory and compute constrained environments. For example, SLMs can be deployed directly on smartphones and several consumer-grade devices. On-device deployment can facilitate privacy and compliance for use cases that must adhere to regulatory requirements. 

    Other benefits of SLMs include lower latency due to inherently faster inference compared to an LLM of similar quality. SLMs also tend to perform better on specialized tasks correlated with their training data. However, to supplement generalization and adaptability to different tasks, you can use retrieval-augmented generation (RAG) or native function calling to build performant agentic systems.

    Phi-4-multimodal

    Phi-4-multimodal is a 5.6B-parameter model that accepts audio, image, and text inputs and reasons across them. This enables it to support use cases such as automated speech recognition (ASR), multimodal summarization, translation, OCR, and visual reasoning. The model was trained on 512 NVIDIA A100 80GB GPUs over 21 days.

    Figure 1 shows how you can preview your image data and ask Phi-4-multimodal visual QA questions in the NVIDIA API Catalog. You can also adjust parameters such as token limits, temperature, and sampling values, and generate sample code in Python, JavaScript, and Bash to help you integrate the model more easily into your applications.

    The GIF shows the steps of uploading an image to use Phi for visual QA. An image of a dog and a cat is uploaded and the user asks for a description. The chat window states, “There is a cat and a dog.” The user scrolls down to see which parameters, such as temperature, are editable, and scrolls down further to find the generated Python code.
    Figure 1. Visual QA demo in NVIDIA API Catalog
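
    Outside the browser, the same visual QA flow can be sketched in Python. The snippet below assumes the endpoint accepts OpenAI-style multimodal messages with the image passed as a base64 data URL; the exact request shape may differ, so treat the Python code generated in the catalog as the authoritative version.

        # Visual QA sketch: send an image plus a question in one chat turn.
        # The multimodal message format shown here is an assumption.
        import base64
        from openai import OpenAI

        client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY")

        with open("pets.png", "rb") as f:                    # hypothetical local image
            image_b64 = base64.b64encode(f.read()).decode()

        completion = client.chat.completions.create(
            model="microsoft/phi-4-multimodal-instruct",     # assumed model identifier
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
            max_tokens=512,
        )
        print(completion.choices[0].message.content)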

    You can also demo tool calling with a set of prebuilt agents. Figure 2 shows a tool that retrieves live weather data.

    The GIF shows the steps of using the NVIDIA Phi NIM in a chat window with a Tools box underneath that enables a prebuilt agent titled “get_current_weather”. This prebuilt agent returns information from a live weather service. The user types, “What is the weather in Houston, TX?” and sees a customized tools response box with JSON for the tool call, which then returns weather and humidity information.
    Figure 2. Tool-calling demo in NVIDIA API Catalog
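
    The same flow can be reproduced in code by passing a tool schema with the request. The get_current_weather definition below is illustrative (the demo’s tool is prebuilt in the catalog), and the endpoint and model name are the same assumptions as in the earlier sketches.

        # Tool-calling sketch mirroring the Figure 2 demo.
        import json
        from openai import OpenAI

        client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY")

        tools = [{
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "City and state, e.g. Houston, TX"},
                    },
                    "required": ["location"],
                },
            },
        }]

        response = client.chat.completions.create(
            model="microsoft/phi-4-multimodal-instruct",     # assumed model identifier
            messages=[{"role": "user", "content": "What is the weather in Houston, TX?"}],
            tools=tools,
        )

        # If the model chooses to call the tool, the structured call arrives here;
        # your application executes it and returns the result in a follow-up turn.
        for call in response.choices[0].message.tool_calls or []:
            print(call.function.name, json.loads(call.function.arguments))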

    Phi-4-mini

    Phi-4-mini is a text-only, dense, decoder-only Transformer model with 3.8B parameters that is optimized for chat. It includes a long-form context window of 128K tokens. This model was trained on 1024 NVIDIA A100 80GB GPUs over 14 days. 
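
    As a rough illustration of how the chat tuning and 128K-token context window can be exercised, the sketch below sends a long document for summarization. The model identifier microsoft/phi-4-mini-instruct and the endpoint are assumptions; confirm them against the model card on build.nvidia.com.

        # Long-context chat sketch for Phi-4-mini.
        from openai import OpenAI

        client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY")

        with open("report.txt") as f:                        # hypothetical long input document
            long_document = f.read()

        completion = client.chat.completions.create(
            model="microsoft/phi-4-mini-instruct",           # assumed model identifier
            messages=[
                {"role": "system", "content": "You are a concise assistant."},
                {"role": "user", "content": f"Summarize the key points:\n\n{long_document}"},
            ],
            max_tokens=400,
        )
        print(completion.choices[0].message.content)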

    For both models, the training data is intentionally focused on high-quality educational data and code, which gives the models a textbook-like quality. Text, speech, and vision benchmark data can be found in the model cards.

    Advancing community models

    NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. NVIDIA is committed to optimizing community software and open models such as Phi, which promotes AI transparency and lets users broadly share work in AI safety and resilience.

    Using the NVIDIA NeMo platform, these open models can be customized on proprietary data to be highly tuned and efficient for diverse AI workflows across any industry. 

    NVIDIA and Microsoft have a long-standing partnership that includes several collaborations driving GPU innovation on Azure, integrations and optimizations for PC developers using NVIDIA RTX GPUs, and research spanning generative AI to healthcare and life sciences.

    Get started today

    Bring your data and try out Phi-4 on the NVIDIA-accelerated platform at build.nvidia.com/microsoft.

    In this first multimodal sandbox for Phi-4-multimodal, you can try out text, image, and audio inputs, as well as sample tool calling, to see how this model will work for you in production.
