Want a quick win with multimodal AI? In this post, you’ll build a tiny Python app that uses a Llama vision model to look at an image and tell you what it is. We’ll use Ollama to run the model locally (no paid keys required), then call it from Python. By the end, you’ll run the script on your machine and see real descriptions printed to your terminal.

What we’ll build

A command-line tool:

  • Input: an image file (JPEG/PNG).
  • Output: a short description of what’s in the image, generated by a Llama multimodal model.
  • Stack: Python + ollama (local LLM runtime) + Llama vision model.

Prerequisites

  1. Python 3.9+ and pip
  2. Ollama installed (macOS, Windows, or Linux):
    • Download and install from the Ollama site, then ensure ollama works in your terminal: ollama --version
  3. Pull a Llama vision model. At the time of writing, a good default is llama3.2-vision, which you can fetch with ollama pull llama3.2-vision. If your Ollama catalog differs, any vision-capable model (such as a llava variant) works. You can list installed models with ollama list.

Create a Python virtual environment
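
A minimal setup, assuming you name the environment .venv (any name works):

  python -m venv .venv
  source .venv/bin/activate   (macOS/Linux)
  .venv\Scripts\activate      (Windows)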

With the virtual environment active, install the dependencies:

  • ollama is the official Python client for the local Ollama server.
  • Pillow (PIL) lets us open/validate image files and optionally resize/convert.
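
Both are available from PyPI under those names:

  pip install ollama pillow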

The Python app

Create describe_image.py:
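
Here is a minimal sketch of the script, assuming the llama3.2-vision model pulled in the prerequisites and the official ollama Python client; change MODEL if you pulled a different vision model.

import sys
from pathlib import Path

import ollama
from PIL import Image, UnidentifiedImageError

MODEL = "llama3.2-vision"  # change this if you pulled a different vision model


def validate_image(path_str):
    """Check that the path exists and points to a readable image file."""
    image_path = Path(path_str)
    if not image_path.is_file():
        sys.exit(f"File not found: {image_path}")
    try:
        with Image.open(image_path) as img:
            img.verify()  # raises if the file is not a valid image
    except (UnidentifiedImageError, OSError):
        sys.exit(f"Not a valid image file: {image_path}")
    return image_path


def describe_image(image_path):
    """Send the image to the local vision model and return its description."""
    response = ollama.chat(
        model=MODEL,
        messages=[
            {
                "role": "user",
                "content": "Describe what is in this image in two or three sentences.",
                "images": [str(image_path)],  # the client reads and encodes the file
            }
        ],
    )
    return response["message"]["content"]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python describe_image.py <path-to-image>")
    image_path = validate_image(sys.argv[1])
    print(describe_image(image_path))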

What the code does

  • Validates the image path and ensures the file is actually an image.
  • Sends a message to the Llama vision model via ollama.chat(), attaching the image path in the message’s images list.
  • Prints the model’s natural-language description.

Run the app

  1. Make sure the Ollama service is running. On most systems, Ollama runs as a background service after installation; if it isn’t, start it with ollama serve and keep that window open (or run it as a service).
  2. In another terminal where your virtual environment is active, run python describe_image.py ./samples/cat.jpg, replacing ./samples/cat.jpg with any local JPG or PNG.

Example output
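
For a photo of a cat, the printed description might look something like this (illustrative only; the wording depends on the model, the image, and sampling):

  A gray tabby cat is curled up on a windowsill, with soft daylight coming through the window behind it.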

Your output will vary; that’s normal. If the response looks too long or too vague, lower the temperature (for example, pass options={"temperature": 0.2} to ollama.chat()) or tighten the prompt (e.g., “Return a single concise sentence.”).

Troubleshooting

  • Ollama error: model not found
    You likely didn’t pull the model. Run: ollama pull llama3.2-vision
  • Connection refused or timeouts
    Ensure the Ollama server is running: ollama serve (or restart it).
    On Windows/macOS, try quitting/relaunching the Ollama app.
  • Slow first run
    The first inference can be slower while the model weights are loaded into memory. Subsequent runs are faster.

Extending the app

  • Batch mode: Loop over a folder and describe every image.
  • Structured output: Ask the model to return JSON (e.g., { "objects": [...], "scene": "..." }) and parse it; see the sketch after this list.
  • Confidence: Prompt the model to include uncertainty (e.g., “If unsure, say you’re not certain.”).
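
As a sketch of the structured-output idea, the call below asks Ollama to constrain the reply to JSON via the format parameter; the key names and the sample image path are illustrative:

import json

import ollama

response = ollama.chat(
    model="llama3.2-vision",
    format="json",  # ask Ollama to return valid JSON only
    messages=[
        {
            "role": "user",
            "content": (
                'Return JSON with two keys: "objects", a list of the objects you see, '
                'and "scene", a one-sentence description of the scene.'
            ),
            "images": ["./samples/cat.jpg"],
        }
    ],
)

data = json.loads(response["message"]["content"])
print(data.get("objects"))
print(data.get("scene"))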

Recap

You just built a tiny—but powerful—Python app that uses a local Llama vision model to describe images. With Ollama handling the model runtime and a few lines of Python, you can experiment with multimodal AI entirely on your machine. Try different images, tweak the prompt, and explore richer outputs like object lists or captions tailored for your use case.

