In this blog post, "Integrate Tiktoken in Python Applications: A Step-by-Step Guide", we will explore what tokens are, why they matter for large language models, and how to integrate OpenAI’s Tiktoken library into your Python application with a simple, step-by-step example.
Before we touch any code, let’s set the scene. Language models don’t think in characters or words—they think in tokens. Tokens are small pieces of text (word fragments, words, punctuation, even whitespace) that the model processes. Managing tokens helps you control cost, avoid context-length errors, and improve reliability. Tiktoken is a fast tokenizer used by OpenAI models that lets you count, slice, and reason about tokens in your app.
Why tokens matter to your application
If you’ve ever seen a “context length exceeded” error, you’ve met the token limit. Every model has a maximum context window (e.g., 8k, 32k, 128k tokens). Your prompt plus the model’s response must fit inside it. Token-aware applications can:
- Prevent overruns by measuring and trimming inputs before sending them.
- Control costs by estimating token usage per request.
- Improve user experience by chunking long documents intelligently.
- Design predictable prompt budgets for streaming or multi-turn workflows.
What Tiktoken does under the hood
Tiktoken implements a fast, model-aware tokenizer often used with OpenAI models. It is built in Rust with Python bindings for speed. The core idea is a variant of Byte Pair Encoding (BPE):
- Text is first split by rules that respect spaces, punctuation, and Unicode.
- Frequent byte pairs are merged repeatedly to form a vocabulary of tokens.
- Common patterns (e.g., “ing”, “tion”, or “ the”) become single tokens; rare sequences break into multiple tokens.
Different models use different vocabularies and merge rules. That’s why picking the right encoding for your model matters. Tiktoken provides encoding_for_model to choose the correct encoder when you know the target model.
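Once Tiktoken is installed (covered in the next section), you can observe this merge behaviour directly. The snippet below is a minimal sketch that decodes each token id individually to show how the general-purpose cl100k_base encoding splits different kinds of text; the exact splits vary by encoding.

import tiktoken

# Minimal sketch: inspect how the cl100k_base encoding splits text.
# Exact splits differ between encodings; multi-byte characters may decode to
# replacement characters when a single token covers only part of a character.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the cat sat on the mat", "internationalization", "🚀 tokens!"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")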
Step-by-step setup and quickstart
1. Install Tiktoken
pip install tiktoken
That’s it. No extra system dependencies required for most environments.
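If you want to double-check the install, a quick sanity check like this should work in most environments:

# Quick sanity check that Tiktoken is importable and working
import tiktoken
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
print(tiktoken.get_encoding("cl100k_base").encode("hello world"))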
2. Choose the right encoding
Models have associated encodings. Tiktoken can pick the right one for many OpenAI models. If it doesn’t recognize the model, you can fall back to a general-purpose encoding such as cl100k_base.
import tiktoken

def get_encoder(model: str = "gpt-4o-mini"):
    """Return an encoder for the specified model, with a safe fallback."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback works for many recent chat/completions models
        return tiktoken.get_encoding("cl100k_base")
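To confirm which encoding was selected, you can inspect the name attribute of the returned encoding object; a recognized model resolves via encoding_for_model, while an unknown name (the made-up one below) triggers the fallback:

# Recognized model: resolved by encoding_for_model (name depends on your tiktoken version)
print(get_encoder("gpt-4o-mini").name)

# Unknown model name (made up for illustration): falls back to cl100k_base
print(get_encoder("my-internal-model").name)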
3. Count tokens for plain text
def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    enc = get_encoder(model)
    return len(enc.encode(text))

sample = "Hello from CloudPro! Let's keep prompts predictable and affordable."
print(count_tokens(sample))
You now have the most important primitive: the ability to count tokens.
Build a token-aware helper for your app
Let’s create a small utility that manages prompt budgets, truncates input safely, chunks long text, and estimates cost. You can drop this into any Python app—CLI, web service, or batch job.
Token budgeting and safe truncation
from dataclasses import dataclass
from typing import List

@dataclass
class TokenBudget:
    model: str
    max_context: int               # model's max tokens (prompt + completion)
    target_response_tokens: int
    price_in_per_token: float      # USD per input token (example values)
    price_out_per_token: float     # USD per output token (example values)

    def max_prompt_tokens(self) -> int:
        # Leave headroom for the model's response
        return max(1, self.max_context - self.target_response_tokens)

    def estimate_cost_usd(self, prompt_tokens: int) -> float:
        # Very rough estimate: input + reserved output
        return round(prompt_tokens * self.price_in_per_token +
                     self.target_response_tokens * self.price_out_per_token, 6)


def truncate_to_tokens(text: str, max_tokens: int, model: str) -> str:
    enc = get_encoder(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])
Note: Prices change frequently. Plug in current pricing for your provider and target model.
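As a quick illustration of how these pieces fit together (the context size and prices below are placeholder values, not real pricing):

# Placeholder budget values for illustration only
budget = TokenBudget(
    model="gpt-4o-mini",
    max_context=128_000,
    target_response_tokens=800,
    price_in_per_token=0.000005,
    price_out_per_token=0.000015,
)

long_text = "some very long user-supplied text " * 5000
safe_prompt = truncate_to_tokens(long_text, budget.max_prompt_tokens(), budget.model)

prompt_tokens = count_tokens(safe_prompt, budget.model)
print(prompt_tokens, budget.estimate_cost_usd(prompt_tokens))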
Chunk long documents by tokens
Chunking by characters can split words awkwardly and doesn’t map to model limits. Chunking by tokens is much safer.
def chunk_by_tokens(text: str, chunk_size: int, overlap: int, model: str) -> List[str]:
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    enc = get_encoder(model)
    toks = enc.encode(text)
    chunks = []
    start = 0
    while start < len(toks):
        end = min(start + chunk_size, len(toks))
        chunk = enc.decode(toks[start:end])
        chunks.append(chunk)
        if end == len(toks):
            break
        start = end - overlap
    return chunks
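For example, splitting a long document and checking each chunk's size with the count_tokens helper from earlier:

document = "Token-aware chunking keeps every slice within model limits. " * 500
chunks = chunk_by_tokens(document, chunk_size=1000, overlap=100, model="gpt-4o-mini")

print(f"{len(chunks)} chunks")
for i, c in enumerate(chunks):
    # Roughly chunk_size tokens each; re-encoding a decoded slice can shift
    # the count slightly at chunk boundaries.
    print(i, count_tokens(c, "gpt-4o-mini"))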
Approximate chat message counting
Chat messages include some structural tokens (role, message boundaries). The exact accounting varies by model, so treat this as an approximation for budgeting only.
from typing import Dict

def count_chat_tokens(messages: List[Dict[str, str]], model: str = "gpt-4o-mini") -> int:
    enc = get_encoder(model)
    # Heuristic overhead per message and for priming the reply
    tokens_per_message_overhead = 3
    tokens_for_reply_priming = 2
    total = 0
    for m in messages:
        total += tokens_per_message_overhead
        total += len(enc.encode(m.get("role", "")))
        total += len(enc.encode(m.get("content", "")))
        if "name" in m and m["name"]:
            total += len(enc.encode(m["name"]))
    return total + tokens_for_reply_priming
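For example:

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this report in three bullet points."},
]

# Approximate count for budgeting; compare against server-reported usage
print(count_chat_tokens(messages))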
If you need exact counts, send a small test request to your provider and compare server-reported token usage against your local estimate, then tune the overhead constants for your model.
End-to-end example: A small, practical CLI
Let’s put it all together. This example reads a text file, enforces a prompt budget, splits overflow into token-sized chunks, and prints token counts and an estimated cost. Replace model, context, and prices with values appropriate to your environment.
import argparse

DEFAULT_MODEL = "gpt-4o-mini"      # Example only; choose your actual target
MAX_CONTEXT = 128_000              # Example context window
TARGET_COMPLETION_TOKENS = 800     # Reserve for the model's answer
PRICE_IN = 0.000005                # Example USD per input token
PRICE_OUT = 0.000015               # Example USD per output token

def main():
    parser = argparse.ArgumentParser(description="Token-aware prompt budgeting")
    parser.add_argument("file", help="Path to input text file")
    parser.add_argument("--model", default=DEFAULT_MODEL)
    parser.add_argument("--chunk", type=int, default=4000, help="Chunk size in tokens")
    parser.add_argument("--overlap", type=int, default=200, help="Token overlap between chunks")
    args = parser.parse_args()

    with open(args.file, "r", encoding="utf-8") as f:
        text = f.read()

    # Count all tokens
    total_tokens = count_tokens(text, args.model)
    print(f"Total tokens in file: {total_tokens}")

    # Create a budget and enforce it
    budget = TokenBudget(
        model=args.model,
        max_context=MAX_CONTEXT,
        target_response_tokens=TARGET_COMPLETION_TOKENS,
        price_in_per_token=PRICE_IN,
        price_out_per_token=PRICE_OUT,
    )
    max_prompt = budget.max_prompt_tokens()

    if total_tokens <= max_prompt:
        prompt = text
        prompt_tokens = total_tokens
        print("Fits within prompt budget.")
    else:
        print("Prompt too large; chunking by tokens...")
        chunks = chunk_by_tokens(text, args.chunk, args.overlap, args.model)
        # Start with the first chunk and expand until we hit the budget
        enc = get_encoder(args.model)
        selected_tokens = []
        for c in chunks:
            toks = enc.encode(c)
            if len(selected_tokens) + len(toks) > max_prompt:
                break
            selected_tokens.extend(toks)
        prompt = enc.decode(selected_tokens)
        prompt_tokens = len(selected_tokens)
        print(f"Using {prompt_tokens} tokens for the prompt after chunking.")

    cost_est = budget.estimate_cost_usd(prompt_tokens)
    print(f"Estimated request cost (USD): ${cost_est}")

    # For demonstration, preview the first 400 characters of the prompt
    print("\n--- Prompt preview (first 400 chars) ---")
    print(prompt[:400])

if __name__ == "__main__":
    main()
From here, you can pass prompt to your LLM client of choice. Because you counted and constrained tokens beforehand, you’ll avoid context overruns and you’ll know the approximate cost.
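For example, a hypothetical send_prompt helper using the OpenAI Python SDK (a sketch, assuming openai>=1.0 and an OPENAI_API_KEY in your environment; adapt it to your own client):

from openai import OpenAI   # assumes the official OpenAI SDK, openai>=1.0

def send_prompt(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Sketch: send the budgeted prompt and compare server vs local token accounting."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=TARGET_COMPLETION_TOKENS,
    )
    # The provider reports its own token accounting; compare it with your local estimate
    print("server-reported prompt tokens:", response.usage.prompt_tokens)
    return response.choices[0].message.content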
Integration tips for production
- Always use encoding_for_model when possible. If the model is unknown, fall back to a well-supported base encoding.
- Leave a generous buffer for the model’s reply. If you request 1,000 tokens back, don’t pack your prompt to the exact remaining capacity; keep some headroom.
- Be careful with pasted logs, code, or binary-like content. Non-ASCII sequences can explode token counts.
- Normalize newlines consistently. For example, convert \r\n to \n to keep counts consistent across platforms.
- Cache encoders and avoid repeated encoding_for_model calls in hot paths; see the sketch after this list.
- Measure and compare. For critical workloads, compare local counts to the provider’s usage reports and adjust heuristics.
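For the caching tip, here is a minimal sketch that wraps the earlier fallback logic in functools.lru_cache so each encoder is built only once per model (get_encoder_cached is an illustrative name):

from functools import lru_cache
import tiktoken

@lru_cache(maxsize=None)
def get_encoder_cached(model: str = "gpt-4o-mini"):
    """Same fallback logic as get_encoder, but the encoder is built once per model."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")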
Common pitfalls
- Assuming words ≈ tokens. In English, 1 token is roughly ¾ of a word on average, but this varies; emoji or CJK characters may tokenize very differently (see the comparison after this list).
- Using character-based chunking. It’s easy but unreliable. Prefer token-based chunking for anything that must fit a context limit.
- Copying chat-token formulas blindly. Structural overhead differs across models and versions. Use approximations for budgeting only and validate with real responses.
- Forgetting to update encodings for new models. When you switch models, re-check encoders and budgets.
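To make the first pitfall concrete, a quick comparison using the count_tokens helper from earlier (counts will vary by encoding):

# Character count vs token count for different scripts
for text in ["The quick brown fox jumps over the lazy dog",
             "🚀🔥🎉🚀🔥🎉",
             "東京は今日も晴れです"]:
    print(f"{len(text)} chars -> {count_tokens(text)} tokens: {text!r}")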
Testing your integration
- Create fixtures with small, medium, and very large inputs. Verify your helper truncates or chunks correctly.
- Write unit tests around count_tokens, truncate_to_tokens, and chunk_by_tokens with tricky inputs (emoji, code blocks, long URLs); a sample test is sketched after this list.
- Smoke-test with your LLM provider and confirm server-side token usage matches your expectations.
Wrap up
Tiktoken gives your Python app the superpower to think like the model thinks—at the token level. With a few utilities for counting, truncation, and chunking, you can avoid context limit errors, make costs predictable, and keep user experience smooth. The examples above are intentionally minimal so you can drop them into your stack—CLI, FastAPI, or workers—and adapt them to your models and budgets.
If you’d like help productionising token-aware pipelines, the team at CloudProinc.com.au regularly builds reliable, cost-efficient LLM systems for engineering and product teams. Happy building!