In this blog post, Monitor Azure AI Services with Metrics, Alerts, and Logging at Scale, we will unpack how to observe, alert on, and control the reliability and cost of your AI workloads on Azure.

Monitoring Azure AI Services is about more than charts. It’s how you keep latency low, errors rare, prompts safe, and costs predictable, especially when usage spikes. We’ll start with the big picture and then walk through concrete steps, code snippets, and queries you can use today.

High-level view

Azure AI Services (including Azure OpenAI and the Cognitive Services family like Vision, Speech, Language, and Translator) emit telemetry you can capture with Azure Monitor. Out of the box you get platform metrics (e.g., latency, call counts, throttles). With diagnostic settings, you can stream richer logs (requests, usage, content filter outcomes) into a Log Analytics workspace. From there, dashboards, Kusto queries, and alerts help you detect regressions and manage cost.

The idea: define clear service level indicators (SLIs)—success rate, p95 latency, throttle rate, token usage, safety filter hits—then automate collection, visualization, and alerts. No heroics, just a reliable loop.

The technology behind the monitoring stack

Under the hood, Azure Monitor has two main data paths:

  • Metrics: Lightweight time-series stored in the Azure Metrics platform (MDM). You can chart them, alert on them, and optionally copy them to logs.
  • Logs: Resource logs and application logs streamed to Log Analytics via Diagnostic Settings. These are queryable with KQL and power Workbooks and scheduled (log) alerts.

Key components you’ll use:

  • Diagnostic Settings: Turn on and route logs/metrics to Log Analytics, Storage, or Event Hub.
  • Log Analytics Workspace: Central data store for KQL queries, Workbooks, and alerts.
  • Azure Metrics and Alerts: Metric-based alerting for near-real-time conditions.
  • Application Insights (optional): If you own the client or API, capture end-to-end traces and correlate with service-side telemetry.
  • Action Groups: Where alerts send notifications (email, Teams/Webhook, ITSM, PagerDuty, SMS, etc.).

What to monitor for AI workloads

  • Availability and reliability: success rate, HTTP status breakdown, dependency failures.
  • Performance: p50/p95 latency by model/operation, cold-start patterns, regional variance.
  • Capacity and throttling: 429 rates, requests-per-minute vs. quota, token throughput.
  • Cost drivers: tokens by model, prompt vs. completion tokens, per-team/project usage.
  • Safety and quality: content filter triggers, blocked prompts, jailbreak attempts.
  • Change detection: model version shifts, sudden error bursts, schema changes.

Define SLIs and SLOs

Start with 3–5 SLIs. For most teams:

  • Success rate ≥ 99%
  • Latency p95 ≤ 3 seconds
  • Throttle rate ≤ 1%
  • Token spend within budget
  • Safety block rate stable and explained

Turn on diagnostics

Enable diagnostic settings on each Azure AI resource (Azure OpenAI or Cognitive Services account). Route logs to a Log Analytics workspace, and also send platform metrics to logs if you want unified KQL queries.

CLI: list and enable diagnostic categories
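
The sketch below assumes a Cognitive Services account named my-openai in resource group rg-ai and a workspace named law-ai; substitute your own resource IDs. The first command lists the log categories the resource exposes, the second routes all logs and metrics to Log Analytics.

# List the diagnostic categories available on the AI resource
az monitor diagnostic-settings categories list \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-ai/providers/Microsoft.CognitiveServices/accounts/my-openai"

# Route all logs and metrics to a Log Analytics workspace
# (older CLI versions may require listing categories individually instead of categoryGroup)
az monitor diagnostic-settings create \
  --name "ai-diagnostics" \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-ai/providers/Microsoft.CognitiveServices/accounts/my-openai" \
  --workspace "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-ai" \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'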

Tip: Only enable Request/Response logging if you must, and make sure prompts and completions are masked or excluded to protect sensitive data.

Build pragmatic dashboards

Use Azure Monitor Workbooks or Grafana to chart:

  • Success rate and throttles over time
  • p95 latency by model and operation
  • Tokens per minute and daily token totals by team
  • Safety filter blocks and categories

Workbooks can query both AzureMetrics and log tables in one place.

Useful KQL snippets

Column names vary slightly by service. Run a quick “take 10” query against the table in the Log Analytics UI to inspect your schema, then adapt the examples below.

Reliability: success and throttle rates
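
Here is a minimal sketch against AzureDiagnostics request logs. It assumes ResultSignature carries the HTTP status code for your resource (common for Cognitive Services categories); verify against a sample row before relying on it.

AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| summarize total = count(),
            serverErrors = countif(ResultSignature startswith "5"),
            throttles = countif(ResultSignature == "429")
  by bin(TimeGenerated, 5m)
| extend successRatePct = 100.0 * (total - serverErrors - throttles) / total,
         throttleRatePct = 100.0 * throttles / total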

Latency p95 from platform metrics

AzureMetrics
| where TimeGenerated > ago(1d)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where MetricName in ("Latency", "ServerLatency", "EndToEndLatency")
// AzureMetrics stores pre-aggregated values and does not carry metric dimensions such as model name,
// so this approximates p95 from interval averages and splits by resource; use metrics explorer or the
// request logs for a true per-model view.
| summarize p95LatencyMs = percentile(Average, 95) by bin(TimeGenerated, 5m), MetricName, Resource

Token usage and rough cost estimate
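
Exact token fields differ by service and API version, so treat the property names below (modelName, promptTokens, completionTokens) and the per-1K-token rates as placeholders; inspect a sample RequestResponse row and your price sheet, then adjust.

AzureDiagnostics
| where TimeGenerated > ago(1d)
| where Category == "RequestResponse"
| extend props = parse_json(Properties_s)
| extend model = tostring(props["modelName"]),
         promptTokens = toint(props["promptTokens"]),
         completionTokens = toint(props["completionTokens"])
| summarize promptTokens = sum(promptTokens), completionTokens = sum(completionTokens)
  by bin(TimeGenerated, 1h), model
// Placeholder rates per 1K tokens - substitute your model's actual pricing
| extend estCostUSD = promptTokens / 1000.0 * 0.0005 + completionTokens / 1000.0 * 0.0015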

Safety signal: content filter blocks

AzureDiagnostics
| where Category contains "ContentFilter" or Category contains "Safety"
| summarize blocks = count() by bin(TimeGenerated, 15m), FilterCategory = tostring(parse_json(Properties_s)["category"])

Alerts that matter

Prefer a small set of actionable alerts. Route to an Action Group that notifies a channel your team actually watches.

  • Reliability: Success rate < 99% over 10 minutes.
  • Performance: p95 latency > 3s over 10 minutes.
  • Capacity: Throttle rate > 1% for 5 minutes.
  • Cost: Daily token spend > budget for a team.
  • Safety: Sudden spike in content filter blocks.

Create a log alert with a scheduled query
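
A sketch using az monitor scheduled-query create for the throttle condition from the list above (expressed here as a raw 429 count to keep the query short). The workspace and action group IDs are placeholders, and flag names can differ between CLI versions, so check az monitor scheduled-query create --help first.

az monitor scheduled-query create \
  --name "ai-throttle-burst" \
  --resource-group "rg-monitor" \
  --scopes "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-ai" \
  --condition "count 'Throttles' > 50" \
  --condition-query Throttles="AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | where ResultSignature == '429'" \
  --evaluation-frequency 5m \
  --window-size 10m \
  --severity 2 \
  --action-groups "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/microsoft.insights/actionGroups/ag-ai-oncall" \
  --description "429 throttle burst on Azure AI resources"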

Control cost and quotas

  • Set Azure Cost Management budgets with email/webhook alerts tied to the resource group hosting AI services (see the Bicep sketch after this list).
  • Use per-team API keys or managed identities with dimension labels (CallerId, Project) so you can break down spend.
  • Track tokens per minute against model- and region-specific rate limits; alert before saturation.
  • Cache frequently used responses and keep prompts lean.
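
For the budget bullet above, here is a hedged Bicep sketch deployed at resource group scope. The amount, start date, notification threshold, and contact email are placeholders to replace with your own values.

// Monthly cost budget for the resource group hosting the AI services
resource aiBudget 'Microsoft.Consumption/budgets@2021-10-01' = {
  name: 'budget-ai-services'
  properties: {
    category: 'Cost'
    amount: 2000
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2025-01-01' // must be the first day of a month
    }
    notifications: {
      actualGreaterThan80Percent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: [
          'ai-platform-team@example.com'
        ]
      }
    }
  }
}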

Governance, privacy, and retention

  • Minimize sensitive data in logs. If you must capture prompts, use data collection rules (DCR) with transformations to hash or drop fields (see the sketch after this list).
  • Apply Azure Policy to require diagnostic settings on AI resources and to enforce log retention minimums.
  • Separate short-term hot logs (30–90 days) from long-term cold archives in Blob Storage.
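
For the DCR bullet above, a transformation is a restricted KQL statement over a virtual source table. A minimal sketch that drops the raw payload column before ingestion (the column name is illustrative; match your table's schema):

source
| project-away Properties_s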

Operate with runbooks

Document a few quick paths:

  • Latency spike runbook: check recent deployments, region health, model capacity announcements, and throttles; fail over to a secondary region or smaller model if needed.
  • Error burst runbook: capture sample failing requests, inspect content filter hits, roll back configuration toggles, and cordon noisy clients.
  • Cost overrun runbook: identify offending team via KQL, apply per-identity rate limits, and enable caching.

Automation and “monitoring as code”

Keep monitoring consistent across environments with IaC:

  • Use Bicep/Terraform to create Log Analytics, diagnostic settings, Workbooks, and alerts (see the Bicep sketch after this list).
  • Template your action groups and alert rules; parameterize team emails and thresholds.
  • Run smoke tests after deploys—issue a few test prompts and verify they land in logs and dashboards.
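
As an example of the first bullet, a minimal Bicep sketch that attaches a diagnostic setting to an existing Azure AI account; the account name and workspace parameter are placeholders.

param workspaceId string

// Existing Azure AI (Cognitive Services / Azure OpenAI) account
resource aiAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' existing = {
  name: 'my-openai'
}

// Route all logs and metrics to Log Analytics
resource diag 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'ai-diagnostics'
  scope: aiAccount
  properties: {
    workspaceId: workspaceId
    logs: [
      {
        categoryGroup: 'allLogs'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}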

Common pitfalls

  • Turning on verbose request/response logs without masking. Protect prompts and PII.
  • Only watching average latency. Track p95/p99; users feel the tail.
  • Alert noise. Tune thresholds and add time aggregation to avoid flapping.
  • No ownership tags. Without team/project labels you can’t attribute incidents or cost.
  • Ignoring 429s. Throttles often precede outages—alert early.

Wrapping up

When you combine Azure Monitor metrics with resource logs and a small set of thoughtful alerts, Azure AI Services become far more predictable. Pick your SLIs, wire up diagnostics, build a simple workbook, and add two or three high-signal alerts. The result is happier users, controlled costs, and fewer surprises.

If you want help standing this up across multiple subscriptions or regions, a short engagement to template diagnostics, workbooks, and alerts can pay for itself quickly—especially as usage scales.

