In this blog post, Understanding Azure Phi-3 and how to use it across cloud and edge, we will unpack what Azure Phi-3 is, why it matters, and how you can put it to work quickly and safely.

Think of Azure Phi-3 as a family of small, efficient language models designed to deliver strong reasoning and coding capabilities at a fraction of the cost and latency of giant models. If GPT-4-class models are your Swiss Army knife for the hardest problems, Phi-3 is the fast, lightweight tool you’ll reach for most days—especially when you care about response time, budget, and running close to your data.

What is Azure Phi-3

Azure Phi-3 is Microsoft’s family of small language models (SLMs) available through the Azure AI model catalog and serverless APIs. The family includes variants optimised for instruction following, coding, and even vision (depending on availability in your region). Sizes range from roughly 3.8 billion parameters (mini) through 7 billion (small) to 14 billion (medium), with short and extended context options (for example, 4K and up to 128K tokens, depending on the specific model build).

Phi-3 models are tuned for practical enterprise workloads: high-quality summarisation, question answering, structured extraction, code assistance, and as the engine in retrieval-augmented generation (RAG). They’re built to run efficiently in the cloud, on AKS, and even on capable edge devices.

The technology behind Phi-3

Under the hood, Phi-3 uses a modern Transformer decoder architecture with careful training and alignment. Three ideas make it stand out for real-world use:

  • Small model, big quality: Training centred on high-quality, curated and synthetic data improves reasoning and instruction-following without bloating parameter count.
  • Efficient inference: Quantisation (e.g., 8-bit, 4-bit) and runtime optimisations through ONNX Runtime, TensorRT-LLM, and other accelerators enable low-latency responses on modest GPUs and even some CPUs.
  • Alignment and safety: Instruction tuning, preference optimisation, and guardrails improve helpfulness and reduce unsafe outputs. Azure provides content filtering, abuse monitoring, and governance features around the models.

The result: a responsive, cost-efficient model that’s simple to scale and easy to embed in existing applications and pipelines.

When to choose Phi-3 vs very large models

  • Choose Phi-3 when you need low latency, lower cost, predictable throughput, and good quality for common tasks (summarisation, classification, basic reasoning, code assistance, RAG on your data).
  • Choose larger models when you need state-of-the-art reasoning on novel or highly complex tasks, multi-step tool use at scale, or the best possible accuracy regardless of cost.

Ways to use Phi-3 in Azure

  • Serverless API (Models as a Service): Deploy a Phi-3 variant from Azure AI Studio with a single click. You get a secure HTTPS endpoint with an OpenAI-compatible chat completions API.
  • Azure Kubernetes Service (self-host): Package a Phi-3 container and run it in your cluster for maximum control, VNet isolation, and custom autoscaling.
  • Edge and on-device: Run quantised Phi-3 via ONNX Runtime on Windows (DirectML), Linux, or capable embedded devices for offline or near-data scenarios.

Step-by-step: Get a serverless Phi-3 endpoint

  1. Open Azure AI Studio and go to Model catalog. Search for Phi-3 (e.g., Phi-3-mini, Phi-3-small, Phi-3-medium).
  2. Select a model variant and click Deploy. Choose Serverless API and confirm the default throughput settings.
  3. Once deployed, open the Consume tab. You’ll see your endpoint URL, API key, and sample code.
  4. Test in the Studio’s playground, then copy the sample into your app.

Call the API from Python

The serverless endpoint exposes an OpenAI-style chat completions API. The exact headers and URL are provided in the Consume tab; they typically follow this shape:
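
Below is a minimal sketch using the requests library. The URL, auth header, and environment variable names (PHI3_ENDPOINT, PHI3_API_KEY) are placeholders; substitute the exact values shown in your Consume tab.

  import os
  import requests

  # Placeholder values - copy the real URL and key from the deployment's Consume tab
  endpoint = os.environ["PHI3_ENDPOINT"]   # e.g. https://<deployment>.<region>.models.ai.azure.com/chat/completions
  api_key = os.environ["PHI3_API_KEY"]

  payload = {
      "messages": [
          {"role": "system", "content": "You are a concise assistant."},
          {"role": "user", "content": "Summarise the benefits of small language models in three bullets."},
      ],
      "temperature": 0.2,
      "max_tokens": 256,
  }

  response = requests.post(
      endpoint,
      headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
      json=payload,
      timeout=60,
  )
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])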

Call the API from .NET (C#)
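
The same call sketched from .NET with HttpClient and System.Text.Json; again, the endpoint URL, auth header, and environment variable names are placeholders, and the exact values come from the Consume tab.

  // Minimal sketch; replace PHI3_ENDPOINT and PHI3_API_KEY with the values from the Consume tab.
  using System;
  using System.Net.Http;
  using System.Net.Http.Headers;
  using System.Text;
  using System.Text.Json;

  var endpoint = Environment.GetEnvironmentVariable("PHI3_ENDPOINT");
  var apiKey = Environment.GetEnvironmentVariable("PHI3_API_KEY");

  using var client = new HttpClient();
  client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

  var payload = new
  {
      messages = new object[]
      {
          new { role = "system", content = "You are a concise assistant." },
          new { role = "user", content = "Explain retrieval-augmented generation in two sentences." }
      },
      temperature = 0.2,
      max_tokens = 256
  };

  var response = await client.PostAsync(
      endpoint,
      new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json"));
  response.EnsureSuccessStatusCode();

  using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
  Console.WriteLine(doc.RootElement
      .GetProperty("choices")[0]
      .GetProperty("message")
      .GetProperty("content")
      .GetString());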

RAG with Phi-3

Phi-3 shines when paired with your data. A typical RAG setup on Azure:

  1. Ingest documents into Azure AI Search or a vector database (e.g., Azure Cosmos DB for MongoDB vCore with vectors).
  2. Embed content using an embedding model (from the Azure model catalog).
  3. Retrieve relevant passages at query time.
  4. Augment the prompt with retrieved context and send to Phi-3.

Because Phi-3 is efficient, you can afford to run more RAG traffic with lower latency while keeping costs predictable.
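
As an illustration, here is a minimal sketch of steps 3 and 4 using the azure-search-documents package for retrieval and the serverless Phi-3 endpoint from the earlier snippet. The index name, the "content" field, and the environment variable names are assumptions made for the example.

  import os
  import requests
  from azure.core.credentials import AzureKeyCredential
  from azure.search.documents import SearchClient

  # Assumed placeholders: an existing search index with a "content" field,
  # plus the PHI3_* variables used in the earlier snippet.
  search_client = SearchClient(
      endpoint=os.environ["SEARCH_ENDPOINT"],
      index_name="docs-index",
      credential=AzureKeyCredential(os.environ["SEARCH_API_KEY"]),
  )

  question = "What is our refund policy?"

  # Step 3: retrieve the most relevant passages for the question
  results = search_client.search(search_text=question, top=3)
  context = "\n\n".join(doc["content"] for doc in results)

  # Step 4: augment the prompt with the retrieved context and send it to Phi-3
  payload = {
      "messages": [
          {"role": "system", "content": "Answer using only the provided context. If unsure, say so."},
          {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
      ],
      "temperature": 0.2,
      "max_tokens": 300,
  }
  response = requests.post(
      os.environ["PHI3_ENDPOINT"],
      headers={"Authorization": f"Bearer {os.environ['PHI3_API_KEY']}"},
      json=payload,
      timeout=60,
  )
  print(response.json()["choices"][0]["message"]["content"])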

Fine-tuning and customisation

  • LoRA/QLoRA fine-tuning: Use Azure AI Studio or Azure Machine Learning to fine-tune Phi-3 with a small set of high-quality examples to lock in tone, format, or domain-specific knowledge.
  • Prompt engineering: Start with a clear system prompt. Define output schemas and specify examples. Use a temperature of 0.1–0.3 for more consistent outputs (see the structured-output sketch after this list).
  • Guardrails: Enable Azure safety filters. Add pattern checks and content validators (e.g., PII detection) in your app.
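
As a small illustration of the prompt-engineering point above, this sketch asks for a fixed JSON schema at low temperature and validates the reply before using it. The schema and field names are made up for the example, and the request itself reuses the pattern from the earlier Python snippet.

  import json

  # Hypothetical extraction schema for the example
  system_prompt = (
      "Extract the following fields from the ticket and reply with JSON only: "
      '{"summary": string, "priority": "low"|"medium"|"high", "product": string}'
  )

  def parse_reply(reply_text: str) -> dict:
      """Validate the model's JSON output before using it downstream."""
      data = json.loads(reply_text)  # raises ValueError on malformed JSON
      missing = {"summary", "priority", "product"} - data.keys()
      if missing:
          raise ValueError(f"Model output missing fields: {missing}")
      return data

  payload = {
      "messages": [
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": "Printer on floor 3 jams on every large job, blocking month-end invoicing."},
      ],
      "temperature": 0.1,  # low temperature keeps structured output stable
      "max_tokens": 200,
  }
  # Send `payload` with the same requests.post call shown earlier, then:
  # ticket = parse_reply(response.json()["choices"][0]["message"]["content"])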

Sizing, performance, and cost tips

  • Match model size to task: Start with mini for lightweight tasks; move to small or medium if you need stronger reasoning or longer contexts.
  • Quantise for speed: INT8/INT4 quantisation can reduce latency and memory with minimal quality loss for many tasks.
  • Batching and streaming: Enable token streaming for lower first-token latency; batch requests in backend services to improve throughput (a streaming sketch follows this list).
  • Monitor tokens: Track input and output tokens to control cost and ensure prompts fit within the model’s context window.
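
For the streaming point above, here is a rough sketch of consuming server-sent events from the endpoint, assuming it honours the OpenAI-style stream flag (check the Consume tab sample to confirm); the environment variables are the same placeholders as before.

  import json
  import os
  import requests

  payload = {
      "messages": [{"role": "user", "content": "Draft a two-line status update."}],
      "stream": True,   # assumed to be supported, as on OpenAI-style endpoints
      "max_tokens": 128,
  }

  with requests.post(
      os.environ["PHI3_ENDPOINT"],
      headers={"Authorization": f"Bearer {os.environ['PHI3_API_KEY']}"},
      json=payload,
      stream=True,
      timeout=60,
  ) as response:
      for line in response.iter_lines():
          if not line or not line.startswith(b"data: "):
              continue
          chunk = line[len(b"data: "):]
          if chunk == b"[DONE]":
              break
          delta = json.loads(chunk)["choices"][0]["delta"].get("content", "")
          print(delta, end="", flush=True)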

Security and governance

  • Data handling: For serverless endpoints, review data retention settings in Azure AI Studio; configure them to meet your compliance requirements.
  • Network isolation: Use Private Link, VNet integration, and managed identities to control access and eliminate public exposure where possible.
  • Observability: Log prompts and outputs (redacting sensitive data), track safety filter hits, and set up alerts on error rates and latency (a simple redaction sketch follows below).
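
As a small example of the observability point, here is a sketch of logging prompts and completions with a rough regex-based redaction pass; real deployments would typically use a dedicated PII-detection service instead.

  import logging
  import re

  logger = logging.getLogger("phi3.requests")

  EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
  PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

  def redact(text: str) -> str:
      # Rough placeholder redaction for the example, not a substitute for proper PII detection
      text = EMAIL.sub("<email>", text)
      return PHONE.sub("<phone>", text)

  def log_exchange(prompt: str, completion: str, latency_ms: float) -> None:
      # Log the redacted exchange together with latency for alerting dashboards
      logger.info(
          "prompt=%r completion=%r latency_ms=%.0f",
          redact(prompt), redact(completion), latency_ms,
      )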

Common pitfalls and how to avoid them

  • Overlong prompts: Trimming or summarising context improves quality and latency. Don’t rely on the model to ignore irrelevant text.
  • Unclear instructions: Use explicit system prompts and provide examples. For structured outputs, request JSON and validate it.
  • Assuming LLM parity: Phi-3 is strong for its size but not a drop-in replacement for the largest models on the hardest tasks. Test with your data.

Running Phi-3 on the edge

If you need offline inference or ultra-low latency near devices:

  • Export or obtain a quantised Phi-3 model compatible with ONNX Runtime (see the sketch after this list).
  • Use ONNX Runtime with DirectML (Windows), CUDA (NVIDIA), or CPU execution providers.
  • Cache system prompts and templates; stream tokens to the UI for responsiveness.
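
As an illustration, a sketch using the onnxruntime-genai Python package against a locally downloaded, quantised Phi-3 build. The model path is a placeholder and the generation API differs between package versions, so treat this as a starting point rather than the definitive call sequence.

  import onnxruntime_genai as og

  # Placeholder path to a quantised Phi-3 ONNX build downloaded locally
  model = og.Model("./phi3-mini-4k-instruct-onnx-int4")
  tokenizer = og.Tokenizer(model)
  stream = tokenizer.create_stream()

  # Phi-3 instruct models expect this chat template
  prompt = "<|user|>\nSummarise why small models suit edge devices.<|end|>\n<|assistant|>\n"

  params = og.GeneratorParams(model)
  params.set_search_options(max_length=256)

  generator = og.Generator(model, params)
  generator.append_tokens(tokenizer.encode(prompt))  # older package versions set params.input_ids instead

  # Stream tokens to the console as they are produced
  while not generator.is_done():
      generator.generate_next_token()
      print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)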

How CPI can help

We help teams design and ship production-grade Phi-3 solutions—serverless endpoints, AKS deployments, secure RAG architectures, prompt and data pipelines, and cost governance. If you’re in discovery, we run short, fixed-scope pilots that de-risk the path to value.

Next steps

  1. Pick a Phi-3 variant in Azure AI Studio and deploy a serverless endpoint.
  2. Test with the Python or .NET snippet above.
  3. Add RAG with Azure AI Search, then iterate on prompts, context size, and temperature.
  4. Introduce logging, safety filters, and performance monitoring before going live.

Azure Phi-3 makes practical AI more attainable: fast, affordable, and flexible across cloud and edge. If you want a second set of hands to get your solution into production, CloudPro Inc is ready to help.

