In this blog post, Monitor Azure AI Services with Metrics, Alerts, and Logging at Scale, we will unpack how to observe, alert on, and control the reliability and cost of your AI workloads on Azure.

Monitoring Azure AI Services is about more than charts. It’s how you keep latency low, errors rare, prompts safe, and costs predictable, especially when usage spikes. We’ll start with the big picture and then walk through concrete steps, code snippets, and queries you can use today.

High-level view

Azure AI Services (including Azure OpenAI and the Cognitive Services family like Vision, Speech, Language, and Translator) emit telemetry you can capture with Azure Monitor. Out of the box you get platform metrics (e.g., latency, call counts, throttles). With diagnostic settings, you can stream richer logs (requests, usage, content filter outcomes) into a Log Analytics workspace. From there, dashboards, Kusto queries, and alerts help you detect regressions and manage cost.

The idea: define clear service level indicators (SLIs)—success rate, p95 latency, throttle rate, token usage, safety filter hits—then automate collection, visualization, and alerts. No heroics, just a reliable loop.

The technology behind the monitoring stack

Under the hood, Azure Monitor has two main data paths:

  • Metrics: Lightweight time-series stored in the Azure Metrics platform (MDM). You can chart them, alert on them, and optionally copy them to logs.
  • Logs: Resource logs and application logs streamed to Log Analytics via Diagnostic Settings. These are queryable with KQL and power Workbooks and scheduled (log) alerts.

Key components you’ll use:

  • Diagnostic Settings: Turn on and route logs/metrics to Log Analytics, Storage, or Event Hub.
  • Log Analytics Workspace: Central data store for KQL queries, Workbooks, and alerts.
  • Azure Metrics and Alerts: Metric-based alerting for near-real-time conditions.
  • Application Insights (optional): If you own the client or API, capture end-to-end traces and correlate with service-side telemetry.
  • Action Groups: Where alerts send notifications (email, Teams/Webhook, ITSM, PagerDuty, SMS, etc.).

What to monitor for AI workloads

  • Availability and reliability: success rate, HTTP status breakdown, dependency failures.
  • Performance: p50/p95 latency by model/operation, cold-start patterns, regional variance.
  • Capacity and throttling: 429 rates, requests-per-minute vs. quota, token throughput.
  • Cost drivers: tokens by model, prompt vs. completion tokens, per-team/project usage.
  • Safety and quality: content filter triggers, blocked prompts, jailbreak attempts.
  • Change detection: model version shifts, sudden error bursts, schema changes.

Define SLIs and SLOs

Start with 3–5 SLIs. For most teams:

  • Success rate ≥ 99%
  • Latency p95 ≤ 3 seconds
  • Throttle rate ≤ 1%
  • Token spend within budget
  • Safety block rate stable and explained

Turn on diagnostics

Enable diagnostic settings on each Azure AI resource (Azure OpenAI or Cognitive Services account). Route logs to a Log Analytics workspace, and also send platform metrics to logs if you want unified KQL queries.

CLI: list and enable diagnostic categories
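
The sketch below assumes a Cognitive Services account named my-openai in resource group rg-ai and a workspace named law-ai; substitute your own resource IDs. The first command lists the log categories the resource exposes, the second routes all logs and metrics to Log Analytics.

# List the diagnostic categories available on the AI resource
az monitor diagnostic-settings categories list \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-ai/providers/Microsoft.CognitiveServices/accounts/my-openai"

# Route all logs and metrics to a Log Analytics workspace
# (older CLI versions may require listing categories individually instead of categoryGroup)
az monitor diagnostic-settings create \
  --name "ai-diagnostics" \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-ai/providers/Microsoft.CognitiveServices/accounts/my-openai" \
  --workspace "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-ai" \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'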

Tip: Only enable Request/Response logging if you must, and make sure prompts and completions are masked or excluded to protect sensitive data.

Build pragmatic dashboards

Use Azure Monitor Workbooks or Grafana to chart:

  • Success rate and throttles over time
  • p95 latency by model and operation
  • Tokens per minute and daily token totals by team
  • Safety filter blocks and categories

Workbooks can query both AzureMetrics and log tables in one place.

Useful KQL snippets

Column names vary slightly by service. Run a quick “take 10” query against the table in the Log Analytics UI to inspect your schema, then adapt the examples below.

Reliability: success and throttle rates
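
Here is a minimal sketch against AzureDiagnostics request logs. It assumes ResultSignature carries the HTTP status code for your resource (common for Cognitive Services categories); verify against a sample row before relying on it.

AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| summarize total = count(),
            serverErrors = countif(ResultSignature startswith "5"),
            throttles = countif(ResultSignature == "429")
  by bin(TimeGenerated, 5m)
| extend successRatePct = 100.0 * (total - serverErrors - throttles) / total,
         throttleRatePct = 100.0 * throttles / total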

Latency p95 from platform metrics

AzureMetrics
| where TimeGenerated > ago(1d)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where MetricName in ("Latency", "ServerLatency", "EndToEndLatency")
// AzureMetrics stores pre-aggregated values and does not carry metric dimensions such as model name,
// so this approximates p95 from interval averages and splits by resource; use metrics explorer or the
// request logs for a true per-model view.
| summarize p95LatencyMs = percentile(Average, 95) by bin(TimeGenerated, 5m), MetricName, Resource

Token usage and rough cost estimate
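
Exact token fields differ by service and API version, so treat the property names below (modelName, promptTokens, completionTokens) and the per-1K-token rates as placeholders; inspect a sample RequestResponse row and your price sheet, then adjust.

AzureDiagnostics
| where TimeGenerated > ago(1d)
| where Category == "RequestResponse"
| extend props = parse_json(Properties_s)
| extend model = tostring(props["modelName"]),
         promptTokens = toint(props["promptTokens"]),
         completionTokens = toint(props["completionTokens"])
| summarize promptTokens = sum(promptTokens), completionTokens = sum(completionTokens)
  by bin(TimeGenerated, 1h), model
// Placeholder rates per 1K tokens - substitute your model's actual pricing
| extend estCostUSD = promptTokens / 1000.0 * 0.0005 + completionTokens / 1000.0 * 0.0015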

Safety signal: content filter blocks

AzureDiagnostics
| where Category contains "ContentFilter" or Category contains "Safety"
| summarize blocks = count() by bin(TimeGenerated, 15m), FilterCategory = tostring(parse_json(Properties_s)["category"])

Alerts that matter

Prefer a small set of actionable alerts. Route to an Action Group that notifies a channel your team actually watches.

  • Reliability: Success rate < 99% over 10 minutes.
  • Performance: p95 latency > 3s over 10 minutes.
  • Capacity: Throttle rate > 1% for 5 minutes.
  • Cost: Daily token spend > budget for a team.
  • Safety: Sudden spike in content filter blocks.

Create a log alert with a scheduled query
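
A sketch using az monitor scheduled-query create for the throttle condition from the list above (expressed here as a raw 429 count to keep the query short). The workspace and action group IDs are placeholders, and flag names can differ between CLI versions, so check az monitor scheduled-query create --help first.

az monitor scheduled-query create \
  --name "ai-throttle-burst" \
  --resource-group "rg-monitor" \
  --scopes "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-ai" \
  --condition "count 'Throttles' > 50" \
  --condition-query Throttles="AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | where ResultSignature == '429'" \
  --evaluation-frequency 5m \
  --window-size 10m \
  --severity 2 \
  --action-groups "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/microsoft.insights/actionGroups/ag-ai-oncall" \
  --description "429 throttle burst on Azure AI resources"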

Control cost and quotas

  • Set Azure Cost Management budgets with email/webhook alerts tied to the resource group hosting AI services (see the Bicep sketch after this list).
  • Use per-team API keys or managed identities with dimension labels (CallerId, Project) so you can break down spend.
  • Track tokens per minute against model- and region-specific rate limits; alert before saturation.
  • Cache frequently used responses and keep prompts lean.
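
For the budget bullet above, here is a hedged Bicep sketch deployed at resource group scope. The amount, start date, notification threshold, and contact email are placeholders to replace with your own values.

// Monthly cost budget for the resource group hosting the AI services
resource aiBudget 'Microsoft.Consumption/budgets@2021-10-01' = {
  name: 'budget-ai-services'
  properties: {
    category: 'Cost'
    amount: 2000
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2025-01-01' // must be the first day of a month
    }
    notifications: {
      actualGreaterThan80Percent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: [
          'ai-platform-team@example.com'
        ]
      }
    }
  }
}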

Governance, privacy, and retention

  • Minimize sensitive data in logs. If you must capture prompts, use data collection rules (DCR) with transformations to hash or drop fields (see the sketch after this list).
  • Apply Azure Policy to require diagnostic settings on AI resources and to enforce log retention minimums.
  • Separate short-term hot logs (30–90 days) from long-term cold archives in Blob Storage.
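
For the DCR bullet above, a transformation is a restricted KQL statement over a virtual source table. A minimal sketch that drops the raw payload column before ingestion (the column name is illustrative; match your table's schema):

source
| project-away Properties_s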

Operate with runbooks

Document a few quick paths:

  • Latency spike runbook: check recent deployments, region health, model capacity announcements, and throttles; fail over to a secondary region or smaller model if needed.
  • Error burst runbook: capture sample failing requests, inspect content filter hits, roll back configuration toggles, and cordon noisy clients.
  • Cost overrun runbook: identify offending team via KQL, apply per-identity rate limits, and enable caching.

Automation and “monitoring as code”

Keep monitoring consistent across environments with IaC:

  • Use Bicep/Terraform to create Log Analytics, diagnostic settings, Workbooks, and alerts (see the Bicep sketch after this list).
  • Template your action groups and alert rules; parameterize team emails and thresholds.
  • Run smoke tests after deploys—issue a few test prompts and verify they land in logs and dashboards.
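
As an example of the first bullet, a minimal Bicep sketch that attaches a diagnostic setting to an existing Azure AI account; the account name and workspace parameter are placeholders.

param workspaceId string

// Existing Azure AI (Cognitive Services / Azure OpenAI) account
resource aiAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' existing = {
  name: 'my-openai'
}

// Route all logs and metrics to Log Analytics
resource diag 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'ai-diagnostics'
  scope: aiAccount
  properties: {
    workspaceId: workspaceId
    logs: [
      {
        categoryGroup: 'allLogs'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}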

Common pitfalls

  • Turning on verbose request/response logs without masking. Protect prompts and PII.
  • Only watching average latency. Track p95/p99; users feel the tail.
  • Alert noise. Tune thresholds and add time aggregation to avoid flapping.
  • No ownership tags. Without team/project labels you can’t attribute incidents or cost.
  • Ignoring 429s. Throttles often precede outages—alert early.

Wrapping up

When you combine Azure Monitor metrics with resource logs and a small set of thoughtful alerts, Azure AI Services become far more predictable. Pick your SLIs, wire up diagnostics, build a simple workbook, and add two or three high-signal alerts. The result is happier users, controlled costs, and fewer surprises.

If you want help standing this up across multiple subscriptions or regions, a short engagement to template diagnostics, workbooks, and alerts can pay for itself quickly—especially as usage scales.

