Want a quick win with multimodal AI? In this post, you’ll build a tiny Python app that uses a Llama vision model to look at an image and tell you what it is. We’ll use Ollama to run the model locally (no paid keys required), then call it from Python. By the end, you’ll run the script on your machine and see real descriptions printed to your terminal.

What we’ll build

A command-line tool:

  • Input: an image file (JPEG/PNG).
  • Output: a short description of what’s in the image, generated by a Llama multimodal model.
  • Stack: Python + ollama (local LLM runtime) + Llama vision model.

Prerequisites

  1. Python 3.9+ and pip
  2. Ollama installed (macOS, Windows, or Linux):
    • Download and install from the Ollama site, then ensure ollama works in your terminal: ollama --version
  3. Pull a Llama vision model. At the time of writing, a good default is llama3.2-vision, which you can fetch with ollama pull llama3.2-vision. If your Ollama catalog differs, any vision-capable model (such as a llava variant) works. You can list installed models with ollama list.

Create a Python virtual environment
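
A minimal setup, assuming you name the environment .venv (any name works):

  python -m venv .venv
  source .venv/bin/activate   (macOS/Linux)
  .venv\Scripts\activate      (Windows)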

With the virtual environment active, install the dependencies:

  • ollama is the official Python client for the local Ollama server.
  • Pillow (PIL) lets us open/validate image files and optionally resize/convert.
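
Both are available from PyPI under those names:

  pip install ollama pillow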

The Python app

Create describe_image.py:
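
Here is a minimal sketch of the script, assuming the llama3.2-vision model pulled in the prerequisites and the official ollama Python client; change MODEL if you pulled a different vision model.

import sys
from pathlib import Path

import ollama
from PIL import Image, UnidentifiedImageError

MODEL = "llama3.2-vision"  # change this if you pulled a different vision model


def validate_image(path_str):
    """Check that the path exists and points to a readable image file."""
    image_path = Path(path_str)
    if not image_path.is_file():
        sys.exit(f"File not found: {image_path}")
    try:
        with Image.open(image_path) as img:
            img.verify()  # raises if the file is not a valid image
    except (UnidentifiedImageError, OSError):
        sys.exit(f"Not a valid image file: {image_path}")
    return image_path


def describe_image(image_path):
    """Send the image to the local vision model and return its description."""
    response = ollama.chat(
        model=MODEL,
        messages=[
            {
                "role": "user",
                "content": "Describe what is in this image in two or three sentences.",
                "images": [str(image_path)],  # the client reads and encodes the file
            }
        ],
    )
    return response["message"]["content"]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python describe_image.py <path-to-image>")
    image_path = validate_image(sys.argv[1])
    print(describe_image(image_path))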

What the code does

  • Validates the image path and ensures the file is actually an image.
  • Sends a message to the Llama vision model via ollama.chat(), attaching the image path in the message’s images list.
  • Prints the model’s natural-language description.

Run the app

  1. Make sure the Ollama service is running. On most systems, Ollama runs as a background service after installation; if it isn’t, start it with ollama serve and keep that window open (or run it as a service).
  2. In another terminal where your virtual environment is active, run python describe_image.py ./samples/cat.jpg, replacing ./samples/cat.jpg with any local JPG or PNG.

Example output
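
For a photo of a cat, the printed description might look something like this (illustrative only; the wording depends on the model, the image, and sampling):

  A gray tabby cat is curled up on a windowsill, with soft daylight coming through the window behind it.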

Your output will vary; that’s normal. If the response looks too long or too vague, lower the temperature (for example, pass options={"temperature": 0.2} to ollama.chat()) or tighten the prompt (e.g., “Return a single concise sentence.”).

Troubleshooting

  • Ollama error: model not found
    You likely didn’t pull the model. Run: ollama pull llama3.2-vision
  • Connection refused or timeouts
    Ensure the Ollama server is running: ollama serve (or restart it).
    On Windows/macOS, try quitting/relaunching the Ollama app.
  • Slow first run
    The first inference can be slower while the model weights are loaded into memory. Subsequent runs are faster.

Extending the app

  • Batch mode: Loop over a folder and describe every image.
  • Structured output: Ask the model to return JSON (e.g., { "objects": [...], "scene": "..." }) and parse it; see the sketch after this list.
  • Confidence: Prompt the model to include uncertainty (e.g., “If unsure, say you’re not certain.”).
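
As a sketch of the structured-output idea, the call below asks Ollama to constrain the reply to JSON via the format parameter; the key names and the sample image path are illustrative:

import json

import ollama

response = ollama.chat(
    model="llama3.2-vision",
    format="json",  # ask Ollama to return valid JSON only
    messages=[
        {
            "role": "user",
            "content": (
                'Return JSON with two keys: "objects", a list of the objects you see, '
                'and "scene", a one-sentence description of the scene.'
            ),
            "images": ["./samples/cat.jpg"],
        }
    ],
)

data = json.loads(response["message"]["content"])
print(data.get("objects"))
print(data.get("scene"))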

Recap

You just built a tiny—but powerful—Python app that uses a local Llama vision model to describe images. With Ollama handling the model runtime and a few lines of Python, you can experiment with multimodal AI entirely on your machine. Try different images, tweak the prompt, and explore richer outputs like object lists or captions tailored for your use case.

