Large language models (LLMs) have permeated every industry and changed what is possible with technology. However, their massive size makes them impractical under the resource constraints that many companies face.
The rise of small language models (SLMs) bridges quality and cost by offering models with a smaller resource footprint. SLMs are a subset of language models that tend to focus on specific domains and are built with simpler neural architectures. And as models increasingly mimic how humans perceive the world around them, they must accept multiple modalities of data.
Microsoft has announced the next generation of open SLMs in the Phi family, with two new additions:
- Phi-4-mini
- Phi-4-multimodal
Phi-4-multimodal is the first multimodal model to join the family, accepting text, audio, and image inputs.
Both models are small enough for on-device deployment. This release builds on the December 2024 research-only release of the Phi-4 14B-parameter SLM and enables commercial use of the two new, smaller models.
The new models are available in Azure AI Foundry, Microsoft’s cloud AI platform for designing, customizing, and managing AI applications and agents.
You can test out each member of the Phi family through the NVIDIA API Catalog, which is the first sandbox environment to support every modality and tool calling for Phi-4-multimodal. Use the preview NIM microservice to integrate the model into your applications today.
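The preview endpoint on the API Catalog exposes an OpenAI-compatible API, so a minimal integration sketch might look like the following. The API key and model identifier below are placeholders; generate the exact snippet for your account on build.nvidia.com.

```python
from openai import OpenAI

# Connect to the NVIDIA API Catalog's OpenAI-compatible endpoint.
# Replace the API key and verify the model identifier on build.nvidia.com.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",
)

completion = client.chat.completions.create(
    model="microsoft/phi-4-multimodal-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize the benefits of SLMs in two sentences."}],
    max_tokens=256,
    temperature=0.2,
)
print(completion.choices[0].message.content)
```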
Why invest in SLMs?
SLMs enable generative AI capabilities in memory- and compute-constrained environments. For example, SLMs can be deployed directly on smartphones and other consumer-grade devices. On-device deployment can facilitate privacy and compliance for use cases that must adhere to regulatory requirements.
Other benefits of SLMs include lower latency, because inference is inherently faster than for an LLM of similar quality. SLMs also tend to perform better on specialized tasks closely related to their training data. To supplement generalization and adaptability to different tasks, you can use retrieval-augmented generation (RAG) or native function calling to build performant agentic systems.
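As a rough illustration of the RAG pattern, the sketch below stuffs retrieved context into an SLM prompt. The retriever, endpoint, and model identifier are assumptions for illustration; a production system would query a real vector store over your own documents.

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your own deployment.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_NVIDIA_API_KEY")

def retrieve(query: str) -> list[str]:
    # Placeholder retriever: a real system would search a vector store
    # (FAISS, Milvus, etc.) built over your proprietary documents.
    return [
        "Phi-4-mini supports a 128K-token context window.",
        "Phi-4-multimodal accepts text, audio, and image inputs.",
    ]

question = "Which Phi-4 model fits long-document summarization?"
context = "\n".join(retrieve(question))

completion = client.chat.completions.create(
    model="microsoft/phi-4-mini-instruct",  # assumed model ID
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].message.content)
```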
Phi-4-multimodal
Phi-4-multimodal is a 5.6B-parameter model that reasons over audio, image, and text inputs. This enables it to support use cases such as automated speech recognition (ASR), multimodal summarization, translation, optical character recognition (OCR), and visual reasoning. This model was trained on 512 NVIDIA A100-80GB GPUs over 21 days.
Figure 1 shows how you can preview your image data and run visual QA with Phi-4-multimodal in the NVIDIA API Catalog. You can also see how to adjust parameters such as token limits, temperature, and sampling values, and generate sample code in Python, JavaScript, and Bash to help you integrate the model more easily into your applications.
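The catalog generates ready-to-run snippets for you, but as a rough sketch, a visual QA request might look like the following. The model identifier and the inline image-embedding convention are assumptions; treat the sample code generated in the catalog as the source of truth.

```python
import base64
import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_NVIDIA_API_KEY",
    "Accept": "application/json",
}

# Encode a local image (hypothetical file) for inline transmission.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "microsoft/phi-4-multimodal-instruct",  # assumed model ID
    "messages": [{
        "role": "user",
        "content": f'What trend does this chart show? <img src="data:image/png;base64,{image_b64}" />',
    }],
    "max_tokens": 512,    # token limit
    "temperature": 0.2,   # lower values make output more deterministic
    "top_p": 0.7,         # nucleus sampling cutoff
}

response = requests.post(invoke_url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```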

You can also demo tool calling with a set of prebuilt agents. Figure 2 shows a tool that retrieves live weather data.
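Under the hood, an agent like this typically follows the standard function-calling flow: the model returns a structured tool call, your code executes it, and the result is passed back for the final answer. A minimal sketch, assuming OpenAI-style tool calling on the preview endpoint and an illustrative get_weather helper in place of a live weather API:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_NVIDIA_API_KEY")

# Hypothetical weather helper; a real agent would call a live weather API here.
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temperature_c": 21, "conditions": "partly cloudy"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Seattle right now?"}]
response = client.chat.completions.create(
    model="microsoft/phi-4-multimodal-instruct",  # assumed model ID
    messages=messages,
    tools=tools,
)

# The model responds with a structured tool call; execute it and return the result.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": get_weather(**args)})

# Second turn: the model turns the tool output into a natural-language answer.
final = client.chat.completions.create(
    model="microsoft/phi-4-multimodal-instruct",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
```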

Phi-4-mini
Phi-4-mini is a text-only, dense, decoder-only Transformer model with 3.8B parameters that is optimized for chat. It supports a long context window of 128K tokens. This model was trained on 1,024 NVIDIA A100-80GB GPUs over 14 days.
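A minimal chat sketch that takes advantage of that long context window might look like the following, again assuming the OpenAI-compatible preview endpoint; the model identifier and the local file are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_NVIDIA_API_KEY")

# The 128K-token context window lets you pass long documents directly in the
# prompt. long_report.txt is a hypothetical local file.
with open("long_report.txt") as f:
    report = f.read()

completion = client.chat.completions.create(
    model="microsoft/phi-4-mini-instruct",  # assumed model ID; verify on build.nvidia.com
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": f"Summarize the key findings of this report:\n\n{report}"},
    ],
    max_tokens=300,
    temperature=0.3,
)
print(completion.choices[0].message.content)
```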
For both models, the training data is intentionally focused on high-quality educational data and code, which gives the models a textbook-like quality. Text, speech, and vision benchmark data can be found in the model cards.
Advancing community models
NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. NVIDIA is committed to optimizing community software and open models such as Phi, which promotes AI transparency and lets users broadly share work in AI safety and resilience.
Using the NVIDIA NeMo platform, these open models can be customized on proprietary data so they are highly tuned and efficient for diverse AI workflows across any industry.
NVIDIA and Microsoft have a long-standing partnership that includes collaborations driving innovation on GPUs on Azure, integrations and optimizations for PC developers using NVIDIA RTX GPUs, and research spanning generative AI to healthcare and life sciences.
Get started today
Bring your data and try out Phi-4 on the NVIDIA-accelerated platform at build.nvidia.com/microsoft.
In the first multimodal sandbox for Phi-4-multimodal, you can try out text, image, and audio inputs, as well as sample tool calling, to see how this model will work for you in production.