In this blog post, "Integrate Tiktoken in Python Applications: A Step-by-Step Guide", we will explore what tokens are, why they matter for large language models, and how to integrate OpenAI’s Tiktoken library into your Python application with a simple, step-by-step example.
Before we touch any code, let’s set the scene. Language models don’t think in characters or words—they think in tokens. Tokens are small pieces of text (word fragments, words, punctuation, even whitespace) that the model processes. Managing tokens helps you control cost, avoid context-length errors, and improve reliability. Tiktoken is a fast tokenizer used by OpenAI models that lets you count, slice, and reason about tokens in your app.
Why tokens matter to your application
If you’ve ever seen a “context length exceeded” error, you’ve met the token limit. Every model has a maximum context window (e.g., 8k, 32k, 128k tokens). Your prompt plus the model’s response must fit inside it. Token-aware applications can:
- Prevent overruns by measuring and trimming inputs before sending them.
- Control costs by estimating token usage per request.
- Improve user experience by chunking long documents intelligently.
- Design predictable prompt budgets for streaming or multi-turn workflows.
What Tiktoken does under the hood
Tiktoken implements a fast, model-aware tokenizer often used with OpenAI models. It is built in Rust with Python bindings for speed. The core idea is a variant of Byte Pair Encoding (BPE):
- Text is first split by rules that respect spaces, punctuation, and Unicode.
- Frequent byte pairs are merged repeatedly to form a vocabulary of tokens.
- Common patterns (e.g., “ing”, “tion”, or “ the”) become single tokens; rare sequences break into multiple tokens.
Different models use different vocabularies and merge rules. That’s why picking the right encoding for your model matters. Tiktoken provides encoding_for_model to choose the correct encoder when you know the target model.
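Once Tiktoken is installed (covered in the next section), you can observe this merge behaviour directly. The snippet below is a minimal sketch that decodes each token id individually to show how the general-purpose cl100k_base encoding splits different kinds of text; the exact splits vary by encoding.

import tiktoken

# Minimal sketch: inspect how the cl100k_base encoding splits text.
# Exact splits differ between encodings; multi-byte characters may decode to
# replacement characters when a single token covers only part of a character.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the cat sat on the mat", "internationalization", "🚀 tokens!"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")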
Step-by-step setup and quickstart
1. Install Tiktoken
pip install tiktoken
That’s it. No extra system dependencies required for most environments.
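If you want to double-check the install, a quick sanity check like this should work in most environments:

# Quick sanity check that Tiktoken is importable and working
import tiktoken
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
print(tiktoken.get_encoding("cl100k_base").encode("hello world"))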
2. Choose the right encoding
Models have associated encodings. Tiktoken can pick the right one for many OpenAI models. If it doesn’t recognize the model, you can fall back to a general-purpose encoding such as cl100k_base.
import tiktoken

def get_encoder(model: str = "gpt-4o-mini"):
    """Return an encoder for the specified model, with a safe fallback."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback works for many recent chat/completions models
        return tiktoken.get_encoding("cl100k_base")
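To confirm which encoding was selected, you can inspect the name attribute of the returned encoding object; a recognized model resolves via encoding_for_model, while an unknown name (the made-up one below) triggers the fallback:

# Recognized model: resolved by encoding_for_model (name depends on your tiktoken version)
print(get_encoder("gpt-4o-mini").name)

# Unknown model name (made up for illustration): falls back to cl100k_base
print(get_encoder("my-internal-model").name)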
3. Count tokens for plain text
def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    enc = get_encoder(model)
    return len(enc.encode(text))

sample = "Hello from CloudPro! Let's keep prompts predictable and affordable."
print(count_tokens(sample))
You now have the most important primitive: the ability to count tokens.
Build a token-aware helper for your app
Let’s create a small utility that manages prompt budgets, truncates input safely, chunks long text, and estimates cost. You can drop this into any Python app—CLI, web service, or batch job.
Token budgeting and safe truncation
from dataclasses import dataclass
from typing import List

@dataclass
class TokenBudget:
    model: str
    max_context: int               # model's max tokens (prompt + completion)
    target_response_tokens: int
    price_in_per_token: float      # USD per input token (example values)
    price_out_per_token: float     # USD per output token (example values)

    def max_prompt_tokens(self) -> int:
        # Leave headroom for the model's response
        return max(1, self.max_context - self.target_response_tokens)

    def estimate_cost_usd(self, prompt_tokens: int) -> float:
        # Very rough estimate: input + reserved output
        return round(prompt_tokens * self.price_in_per_token +
                     self.target_response_tokens * self.price_out_per_token, 6)


def truncate_to_tokens(text: str, max_tokens: int, model: str) -> str:
    enc = get_encoder(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])
Note: Prices change frequently. Plug in current pricing for your provider and target model.
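As a quick illustration of how these pieces fit together (the context size and prices below are placeholder values, not real pricing):

# Placeholder budget values for illustration only
budget = TokenBudget(
    model="gpt-4o-mini",
    max_context=128_000,
    target_response_tokens=800,
    price_in_per_token=0.000005,
    price_out_per_token=0.000015,
)

long_text = "some very long user-supplied text " * 5000
safe_prompt = truncate_to_tokens(long_text, budget.max_prompt_tokens(), budget.model)

prompt_tokens = count_tokens(safe_prompt, budget.model)
print(prompt_tokens, budget.estimate_cost_usd(prompt_tokens))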
Chunk long documents by tokens
Chunking by characters can split words awkwardly and doesn’t map to model limits. Chunking by tokens is much safer.
def chunk_by_tokens(text: str, chunk_size: int, overlap: int, model: str) -> List[str]:
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    enc = get_encoder(model)
    toks = enc.encode(text)
    chunks = []
    start = 0
    while start < len(toks):
        end = min(start + chunk_size, len(toks))
        chunk = enc.decode(toks[start:end])
        chunks.append(chunk)
        if end == len(toks):
            break
        start = end - overlap
    return chunks
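For example, splitting a long document and checking each chunk's size with the count_tokens helper from earlier:

document = "Token-aware chunking keeps every slice within model limits. " * 500
chunks = chunk_by_tokens(document, chunk_size=1000, overlap=100, model="gpt-4o-mini")

print(f"{len(chunks)} chunks")
for i, c in enumerate(chunks):
    # Roughly chunk_size tokens each; re-encoding a decoded slice can shift
    # the count slightly at chunk boundaries.
    print(i, count_tokens(c, "gpt-4o-mini"))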
Approximate chat message counting
Chat messages include some structural tokens (role, message boundaries). The exact accounting varies by model, so treat this as an approximation for budgeting only.
from typing import Dict

def count_chat_tokens(messages: List[Dict[str, str]], model: str = "gpt-4o-mini") -> int:
    enc = get_encoder(model)
    # Heuristic overhead per message and for priming the reply
    tokens_per_message_overhead = 3
    tokens_for_reply_priming = 2
    total = 0
    for m in messages:
        total += tokens_per_message_overhead
        total += len(enc.encode(m.get("role", "")))
        total += len(enc.encode(m.get("content", "")))
        if "name" in m and m["name"]:
            total += len(enc.encode(m["name"]))
    return total + tokens_for_reply_priming
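For example:

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this report in three bullet points."},
]

# Approximate count for budgeting; compare against server-reported usage
print(count_chat_tokens(messages))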
If you need exact counts, send a small test request to your provider and compare server-reported token usage against your local estimate, then tune the overhead constants for your model.
End-to-end example: A small, practical CLI
Let’s put it all together. This example reads a text file, enforces a prompt budget, splits overflow into token-sized chunks, and prints token counts and an estimated cost. Replace model, context, and prices with values appropriate to your environment.
import argparse

DEFAULT_MODEL = "gpt-4o-mini"      # Example only; choose your actual target
MAX_CONTEXT = 128_000              # Example context window
TARGET_COMPLETION_TOKENS = 800     # Reserve for the model's answer
PRICE_IN = 0.000005                # Example USD per input token
PRICE_OUT = 0.000015               # Example USD per output token

def main():
    parser = argparse.ArgumentParser(description="Token-aware prompt budgeting")
    parser.add_argument("file", help="Path to input text file")
    parser.add_argument("--model", default=DEFAULT_MODEL)
    parser.add_argument("--chunk", type=int, default=4000, help="Chunk size in tokens")
    parser.add_argument("--overlap", type=int, default=200, help="Token overlap between chunks")
    args = parser.parse_args()

    with open(args.file, "r", encoding="utf-8") as f:
        text = f.read()

    # Count all tokens
    total_tokens = count_tokens(text, args.model)
    print(f"Total tokens in file: {total_tokens}")

    # Create a budget and enforce it
    budget = TokenBudget(
        model=args.model,
        max_context=MAX_CONTEXT,
        target_response_tokens=TARGET_COMPLETION_TOKENS,
        price_in_per_token=PRICE_IN,
        price_out_per_token=PRICE_OUT,
    )
    max_prompt = budget.max_prompt_tokens()

    if total_tokens <= max_prompt:
        prompt = text
        prompt_tokens = total_tokens
        print("Fits within prompt budget.")
    else:
        print("Prompt too large; chunking by tokens...")
        chunks = chunk_by_tokens(text, args.chunk, args.overlap, args.model)
        # Start with the first chunk and expand until we hit the budget
        enc = get_encoder(args.model)
        selected_tokens = []
        for c in chunks:
            toks = enc.encode(c)
            if len(selected_tokens) + len(toks) > max_prompt:
                break
            selected_tokens.extend(toks)
        prompt = enc.decode(selected_tokens)
        prompt_tokens = len(selected_tokens)
        print(f"Using {prompt_tokens} tokens for the prompt after chunking.")

    cost_est = budget.estimate_cost_usd(prompt_tokens)
    print(f"Estimated request cost (USD): ${cost_est}")

    # For demonstration, preview the first 400 characters of the prompt
    print("\n--- Prompt preview (first 400 chars) ---")
    print(prompt[:400])

if __name__ == "__main__":
    main()
From here, you can pass prompt to your LLM client of choice. Because you counted and constrained tokens beforehand, you’ll avoid context overruns and you’ll know the approximate cost.
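For example, a hypothetical send_prompt helper using the OpenAI Python SDK (a sketch, assuming openai>=1.0 and an OPENAI_API_KEY in your environment; adapt it to your own client):

from openai import OpenAI   # assumes the official OpenAI SDK, openai>=1.0

def send_prompt(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Sketch: send the budgeted prompt and compare server vs local token accounting."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=TARGET_COMPLETION_TOKENS,
    )
    # The provider reports its own token accounting; compare it with your local estimate
    print("server-reported prompt tokens:", response.usage.prompt_tokens)
    return response.choices[0].message.content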
Integration tips for production
- Always use encoding_for_model when possible. If the model is unknown, fall back to a well-supported base encoding.
- Leave a generous buffer for the model’s reply. If you request 1,000 tokens back, don’t pack your prompt to the exact remaining capacity; keep some headroom.
- Be careful with pasted logs, code, or binary-like content. Non-ASCII sequences can explode token counts.
- Normalize newlines consistently. For example, convert \r\n to \n to keep counts consistent across platforms.
- Cache encoders and avoid repeated encoding_for_model calls in hot paths; see the sketch after this list.
- Measure and compare. For critical workloads, compare local counts to the provider’s usage reports and adjust heuristics.
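For the caching tip, here is a minimal sketch that wraps the earlier fallback logic in functools.lru_cache so each encoder is built only once per model (get_encoder_cached is an illustrative name):

from functools import lru_cache
import tiktoken

@lru_cache(maxsize=None)
def get_encoder_cached(model: str = "gpt-4o-mini"):
    """Same fallback logic as get_encoder, but the encoder is built once per model."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")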
Common pitfalls
- Assuming words ≈ tokens. In English, 1 token is roughly ¾ of a word on average, but this varies; emoji or CJK characters may tokenize very differently (see the comparison after this list).
- Using character-based chunking. It’s easy but unreliable. Prefer token-based chunking for anything that must fit a context limit.
- Copying chat-token formulas blindly. Structural overhead differs across models and versions. Use approximations for budgeting only and validate with real responses.
- Forgetting to update encodings for new models. When you switch models, re-check encoders and budgets.
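To make the first pitfall concrete, a quick comparison using the count_tokens helper from earlier (counts will vary by encoding):

# Character count vs token count for different scripts
for text in ["The quick brown fox jumps over the lazy dog",
             "🚀🔥🎉🚀🔥🎉",
             "東京は今日も晴れです"]:
    print(f"{len(text)} chars -> {count_tokens(text)} tokens: {text!r}")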
Testing your integration
- Create fixtures with small, medium, and very large inputs. Verify your helper truncates or chunks correctly.
- Write unit tests around count_tokens, truncate_to_tokens, and chunk_by_tokens with tricky inputs (emoji, code blocks, long URLs); a sample test is sketched after this list.
- Smoke-test with your LLM provider and confirm server-side token usage matches your expectations.
Wrap up
Tiktoken gives your Python app the superpower to think like the model thinks—at the token level. With a few utilities for counting, truncation, and chunking, you can avoid context limit errors, make costs predictable, and keep user experience smooth. The examples above are intentionally minimal so you can drop them into your stack—CLI, FastAPI, or workers—and adapt them to your models and budgets.
If you’d like help productionising token-aware pipelines, the team at CloudProinc.com.au regularly builds reliable, cost-efficient LLM systems for engineering and product teams. Happy building!