
In this blog post, Deploy a Model with TensorFlow Serving on Docker and Kubernetes, we will walk through how to package a TensorFlow model, serve it locally with Docker, and scale it on Kubernetes. It is aimed at technical teams who want a reliable, fast, and maintainable way to serve models in production.

At a high level, TensorFlow Serving is a purpose-built, high-performance inference server. It loads models in TensorFlow's SavedModel format, exposes standard REST and gRPC endpoints, and supports model versioning and batching out of the box. Compared to DIY Flask or FastAPI wrappers, it's faster to stand up, easier to operate, and designed for zero-downtime upgrades.

What is TensorFlow Serving

TensorFlow Serving (TF Serving) is a C++ server that:

  • Reads TensorFlow SavedModel directories (versioned as 1, 2, 3, …)
  • Serves predictions over HTTP/REST (default port 8501) and gRPC (default port 8500)
  • Hot-reloads new model versions and supports canarying/rollback
  • Optionally batches requests for higher throughput

Because it's optimized in C++ and tightly integrated with TensorFlow runtimes (CPU and GPU), you get strong performance without writing server code. Your team focuses on model training and packaging; TF Serving handles the serving.

Prerequisites

  • Docker installed locally
  • Python 3.9+ and TensorFlow for exporting a model
  • curl for quick REST testing

Step 1: Export a SavedModel

We'll create a simple Keras model and export it in the SavedModel format, versioned under models/my_model/1. TF Serving looks for numeric subfolders representing versions.
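
A minimal sketch, assuming TensorFlow 2.x with tf.keras; the toy architecture, feature count, and training step are illustrative:

```python
import tensorflow as tf

# Hypothetical toy model: 4 input features, 3 output classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,), name="features"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# In a real project you would train here (model.fit(...)).

# Export version 1; TF Serving scans models/my_model/ for numeric subfolders.
tf.saved_model.save(model, "models/my_model/1")
# On newer Keras 3 installs, model.export("models/my_model/1") is the equivalent.
```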

This export includes a default signature (serving_default) that TF Serving will use for inference.

Step 2: Serve locally with Docker

Run the official TF Serving container, mounting your model directory and exposing REST and gRPC ports:
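
A sketch, assuming the model was exported to ./models/my_model as in Step 1:

```bash
docker run --rm \
  -p 127.0.0.1:8500:8500 \
  -p 127.0.0.1:8501:8501 \
  -v "$(pwd)/models/my_model:/models/my_model" \
  -e MODEL_NAME=my_model \
  tensorflow/serving
```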

What this does:

  • Binds REST on localhost:8501 and gRPC on localhost:8500
  • Loads the highest numeric version under /models/my_model
  • Exposes the model under the name my_model

Step 3: Send a prediction

Use REST for a quick test:
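
A sketch, assuming the 4-feature toy model from Step 1:

```bash
curl -s -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
```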

You'll get back a JSON response containing predictions. In production, you can switch to gRPC for lower latency and better throughput, but REST is perfect for quick testing and many web services.

Step 4: Upgrade and roll back with versions

To deploy a new model version without downtime:

  • Export your updated model to models/my_model/2
  • Place it alongside version 1 on the same path
  • TF Serving will detect the new version and start serving it once loaded

Roll back by removing or disabling version 2; the server will return to serving the latest available version. You can tune how quickly it polls the filesystem with --file_system_poll_wait_seconds if needed.
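
For reference, a sketch of the on-disk layout during an upgrade and of passing the polling flag to the container (the interval value is illustrative):

```bash
# Versioned layout the server watches; version 2 is picked up automatically
# once its files are fully written.
#   models/my_model/
#     1/   saved_model.pb, variables/
#     2/   saved_model.pb, variables/

# Extra arguments after the image name are passed to the serving binary.
docker run --rm -p 127.0.0.1:8501:8501 \
  -v "$(pwd)/models/my_model:/models/my_model" \
  -e MODEL_NAME=my_model \
  tensorflow/serving \
  --file_system_poll_wait_seconds=5
```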

Step 5: Serve multiple models

For multi-model setups, point TF Serving at a model config file:
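
A sketch of a model config file (the second model's name and path are illustrative), passed to the server with --model_config_file=/models/models.config:

```
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
  }
  config {
    name: "other_model"
    base_path: "/models/other_model"
    model_platform: "tensorflow"
  }
}
```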

Step 6: Move to Kubernetes

On Kubernetes, mount your model directory from a PersistentVolume and expose a Service. A minimal example:
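
A sketch, assuming a PersistentVolumeClaim named my-model-pvc already holds the versioned model directory; names, labels, and resource values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-my-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving-my-model
  template:
    metadata:
      labels:
        app: tf-serving-my-model
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving
          args:
            - --model_name=my_model
            - --model_base_path=/models/my_model
          ports:
            - containerPort: 8500   # gRPC
            - containerPort: 8501   # REST
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 2Gi
          volumeMounts:
            - name: model-volume
              mountPath: /models/my_model
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: my-model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tf-serving-my-model
spec:
  selector:
    app: tf-serving-my-model
  ports:
    - name: grpc
      port: 8500
      targetPort: 8500
    - name: rest
      port: 8501
      targetPort: 8501
```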

Add an Ingress or API gateway with TLS, and consider autoscaling:
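
For autoscaling, a sketch of an autoscaling/v2 HorizontalPodAutoscaler targeting the Deployment above (assumes metrics-server is installed; thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving-my-model
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving-my-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```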

Performance and reliability tips

  • Batching: Enable batching to increase throughput under load (see the sketch after this list).
  • CPU vs GPU: For heavy models or large batches, use tensorflow/serving:latest-gpu with the NVIDIA Container Toolkit.
  • Model size and cold starts: Keep models lean, and pre-warm by sending a small request after rollout.
  • Versioning strategy: Always deploy to a new numeric folder (e.g., /2), test, then cut traffic. Keep N-1 for quick rollback.
  • Input validation: Enforce shapes and dtypes at your API edge to avoid malformed requests reaching TF Serving.
  • Observability: Log request IDs at the caller, track latency and error rates, and capture model version in every metric/event.
  • Security: Put TF Serving behind an Ingress or API gateway with TLS and authentication. Restrict direct access to ports 8500/8501.
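
On the batching tip, a sketch of enabling server-side batching with a batching parameters file; the values are illustrative and should be tuned against your latency budget:

```
# batching.config (text protobuf passed via --batching_parameters_file)
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
```

Start the server with --enable_batching=true --batching_parameters_file=/config/batching.config, mounting the file into the container.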

Common pitfalls

  • Signature mismatches: Ensure your client payload matches the SavedModel signature (serving_default). If in doubt, inspect with saved_model_cli show --dir <path> --all.
  • Wrong JSON shape: REST instances must match the model's expected shape. For a single vector input, wrap it as a list of lists (see the example after this list).
  • Mount paths: The container must see versioned subfolders under the base path (/models/my_model/1, /2, โ€ฆ).
  • Resource limits: Without CPU/memory limits in Kubernetes, noisy neighbors can cause latency spikes. Set requests/limits and autoscaling.
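
To illustrate the JSON shape pitfall, a request body for a model that expects a single vector of four features (shape assumed) looks like this:

```json
{"instances": [[1.0, 2.0, 3.0, 4.0]]}
```

whereas {"instances": [1.0, 2.0, 3.0, 4.0]} would be interpreted as four scalar instances and rejected.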

Why this approach works

TF Serving abstracts the serving layer with an optimized, battle-tested server. Docker makes it reproducible on a laptop, CI, or any cloud VM. Kubernetes adds elasticity, resilience, and a paved path to GitOps and blue/green rollouts. Together, they remove bespoke server code and let your team focus on model quality and business impact.

Wrap-up

You now have a clean path from a trained TensorFlow model to a production-ready, scalable serving stack. Start with Docker for fast iteration, then move to Kubernetes when you need high availability and autoscaling. If you want help adapting this for your environment (object storage model syncing, canarying, observability, security), CloudProinc.com.au can assist with reference architectures and hands-on implementation.

