In this blog post, Preparing Input Text for Training LLMs that Perform in Production, we will walk through the decisions and steps that make training data truly useful. Whether you’re pretraining from scratch or fine-tuning an existing model, disciplined data prep is where most of the performance is won.
Preparing Input Text for Training LLMs that Perform in Production starts with a simple idea: the model learns to predict the next token. Everything you do—cleaning, deduping, chunking, and formatting—should increase the signal-to-noise ratio of those tokens. We’ll keep things practical and friendly, with clear explanations and copy-pasteable snippets you can adapt to your pipeline.
The technology behind modern LLM training
Large Language Models are transformer networks trained on next-token prediction. Text is converted into tokens by a tokenizer (commonly BPE, WordPiece, or SentencePiece). The model attends over a finite context window—thousands of tokens—and updates weights to minimize prediction error. Quality and structure of the token stream matter:
- Tokenization: Determines how text is split. Consistency affects chunking, special tokens, and chat formatting.
- Objective: Pretraining uses raw text. Supervised fine-tuning (SFT) uses instruction/response pairs. RLHF or DPO layers preference signals on top.
- Context window: Long documents must be chunked; boundaries and overlaps matter.
- Data mixture and weighting: Metadata lets you weight sources and balance domains.
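To make the token-level view concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the openly available gpt2 tokenizer as a stand-in for your target model’s tokenizer. It shows how text becomes tokens and why token counts, not character counts, determine what fits in the context window.
from transformers import AutoTokenizer

# Assumption: gpt2 is only a stand-in; swap in your target model's tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Restart the service with: systemctl restart mysvc"
ids = tok.encode(text)
print(f"{len(text)} characters -> {len(ids)} tokens")
print(tok.convert_ids_to_tokens(ids))  # inspect how the tokenizer split the text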
Why data preparation matters
Badly prepared data leads to brittle models: memorization from duplicates, confusion from mixed formats, or hallucinations from noisy sources. Good prep yields stable loss curves, better generalization, and fewer surprises in production.
Decide your dataset shape before you start
For pretraining-style data
- Use JSONL with a `text` field.
- Insert explicit document separators (e.g., `<|doc|>`) so the model learns boundaries.
- Keep metadata in side fields for weighting and auditing.
{"text": "<|doc|>Title\nBody...\n" , "source": "kb", "lang": "en", "license": "CC-BY"}
For instruction-tuning (chat) data
- Prefer a `messages` array with roles: `system`, `user`, `assistant`.
- Apply the tokenizer’s chat template so special tokens are correct.
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I restart the service?"},
    {"role": "assistant", "content": "Run `systemctl restart mysvc`."}
  ],
  "source": "support_runbook",
  "lang": "en"
}
Cleaning and normalization that pays off
- HTML to text: Strip tags but preserve structure (headings, lists, code blocks). Avoid including boilerplate (menus, cookie banners).
- Unicode normalization: Apply NFKC to unify visually similar code points. Fix mojibake with ftfy-like tools.
- Whitespace control: Collapse repeated spaces; normalize newlines to `\n`.
- Language filtering: Keep only target languages or label them accurately.
- PII and secrets: Redact or drop. Decide your policy upfront and apply consistently.
- Length filtering: Drop extremely short fragments and extremely long outliers, which are usually junk.
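As a small illustration of the normalization steps above, here is a sketch of a cleanup helper, assuming the ftfy package for mojibake repair (the normalize_text name and the exact regexes are our own choices, not a standard API):
import re
import unicodedata
import ftfy

def normalize_text(s: str) -> str:
    # Repair mojibake first, then unify visually similar code points with NFKC.
    s = ftfy.fix_text(s)
    s = unicodedata.normalize("NFKC", s)
    # Normalize newlines, collapse repeated spaces/tabs and excess blank lines.
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    s = re.sub(r"[ \t]{2,}", " ", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip()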
Deduplication and diversity
Duplicates skew training statistics, waste compute, and promote memorization. Use both exact and near-duplicate detection:
- Exact dedup: Hash normalized text (e.g., SHA-1).
- Near-dup: MinHash/SimHash over token shingles to catch minor variations.
- Boilerplate removal: Dedup at paragraph or section level, not only at document level.
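The pipeline later in this post deduplicates at the document level; as a complement, here is a rough sketch of paragraph-level boilerplate removal that drops any normalized paragraph repeated across many documents (the function names and the min_repeats threshold are our own choices):
import hashlib
from collections import Counter

def paragraph_hash(p: str) -> str:
    return hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).hexdigest()

def drop_repeated_paragraphs(docs, min_repeats=3):
    # First pass: count how often each normalized paragraph appears across the corpus.
    counts = Counter()
    for doc in docs:
        for p in doc.split("\n\n"):
            if p.strip():
                counts[paragraph_hash(p)] += 1
    # Second pass: keep only paragraphs that are not corpus-wide boilerplate.
    cleaned = []
    for doc in docs:
        kept = [p for p in doc.split("\n\n") if p.strip() and counts[paragraph_hash(p)] < min_repeats]
        cleaned.append("\n\n".join(kept))
    return cleaned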
Chunking for the model’s context window
Chunk by tokens, not characters. Keep chunks coherent:
- Target length: e.g., 1,024–4,096 tokens depending on model/context.
- Overlap: 50–200 tokens can preserve continuity for long narratives.
- Respect blocks: Don’t split inside code fences or tables; close blocks before cutting.
- Add markers: Use `<|doc|>` or end-of-turn tokens to teach structure.
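The end-to-end pipeline below chunks purely by token count; if your documents contain fenced code blocks, one way to respect them is to split text into whole blocks first and then pack blocks into chunks, roughly as in this sketch (split_blocks and pack_blocks are hypothetical helpers; a single oversized block would still need token-level splitting):
import re

def split_blocks(text):
    # Keep fenced code blocks intact; split everything else on blank lines.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    blocks = []
    for part in parts:
        if part.startswith("```"):
            blocks.append(part)
        else:
            blocks.extend(p for p in part.split("\n\n") if p.strip())
    return blocks

def pack_blocks(text, tokenizer, max_tokens=2048):
    # Greedily pack whole blocks into chunks without crossing the token budget.
    chunks, current, current_len = [], [], 0
    for block in split_blocks(text):
        n = len(tokenizer.encode(block))
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks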
Metadata that unlocks control
Attach fields like `source`, `domain`, `lang`, `license`, `timestamp`, `quality_score`, and `pii_redacted`. This enables:
- Mixture weighting (e.g., upweight docs and downweight forums).
- Auditable provenance for governance and takedowns.
- Targeted evaluation slices by domain or timeframe.
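Once those fields exist, mixture weighting can be as simple as sampling records with per-source probabilities, along the lines of this sketch (the source names and weights are illustrative only):
import random

random.seed(42)
# Relative sampling weights per source; purely illustrative numbers.
source_weights = {"kb": 3.0, "docs": 2.0, "forum": 0.5}

def weighted_sample(records, n):
    # Samples with replacement, so upweighted sources appear more often in the mix.
    weights = [source_weights.get(r.get("source"), 1.0) for r in records]
    return random.choices(records, weights=weights, k=n)

# Example usage: mix = weighted_sample(cleaned_records, 100_000)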
Instruction fine-tuning specifics
- Consistency: One schema and punctuation style for answers.
- Coverage: Include reasoning, tools, code, and error handling patterns your product needs.
- Difficulty curriculum: Start simple, include moderately hard tasks, avoid contrived corner cases unless product-relevant.
- No leakage: Keep eval-like prompts out of training.
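A cheap leakage guard is to hash your eval prompts and drop any training record whose first user turn matches exactly, as in this sketch (exact match only; near-duplicate prompts need fuzzier matching, for example MinHash as used in the pipeline below):
import hashlib

def norm_hash(s: str) -> str:
    return hashlib.sha1(" ".join(s.lower().split()).encode("utf-8")).hexdigest()

def remove_eval_overlap(train_records, eval_prompts):
    # Assumes chat-style records with a `messages` array; the first user turn is the prompt.
    eval_hashes = {norm_hash(p) for p in eval_prompts}
    kept = []
    for rec in train_records:
        user_turns = [m["content"] for m in rec["messages"] if m["role"] == "user"]
        if user_turns and norm_hash(user_turns[0]) in eval_hashes:
            continue
        kept.append(rec)
    return kept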
Train/validation/test splits the right way
- Split by document, not by chunk, to avoid leakage.
- Deduplicate across the entire corpus before splitting.
- Stratify by source/language/length so each split mirrors production.
Governance and safety
- Licensing: Track licenses and respect usage terms.
- PII: Redact or remove. Consider hashing or placeholder tokens for structured IDs.
- Safety categories: Label sensitive content to control sampling or train refusal behavior.
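Regex-based redaction only catches the most obvious identifiers, but it illustrates the placeholder-token approach. This is a minimal sketch; production pipelines usually pair it with a dedicated PII detection tool:
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b\d[\d\s().-]{7,}\d\b")

def redact_pii(text: str) -> str:
    # Replace matches with stable placeholder tokens so sentence structure is preserved.
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

print(redact_pii("Contact jane.doe@example.com or call 555-123-4567."))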
A practical pipeline you can adapt
# pip install datasets beautifulsoup4 ftfy langdetect datasketch transformers
from datasets import load_dataset, Dataset
from bs4 import BeautifulSoup
from langdetect import detect
from datasketch import MinHash, MinHashLSH
from transformers import AutoTokenizer
import unicodedata, hashlib, json, re
# 1) Load raw HTML/text data (example: local JSONL with {"html":..., "url":...})
raw = load_dataset("json", data_files={"train": "raw.jsonl"})["train"]
def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # remove obvious boilerplate
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    text = soup.get_text("\n")
    # normalize whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

def normalize_unicode(s):
    s = unicodedata.normalize("NFKC", s)
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    return s
def clean_record(rec):
    txt = rec.get("text") or html_to_text(rec.get("html", ""))
    txt = normalize_unicode(txt)
    if len(txt) < 100:  # length filter
        return None
    try:
        lang = detect(txt)
    except Exception:
        lang = "unk"
    if lang != "en":  # keep only English for this example
        return None
    return {
        "text": f"<|doc|>\n{txt}\n",
        "source": rec.get("url", "unknown"),
        "lang": lang,
    }
cleaned = [c for c in (clean_record(r) for r in raw) if c]
# 2) Exact dedup by normalized hash
seen = set()
unique = []
for r in cleaned:
    h = hashlib.sha1(r["text"].encode("utf-8")).hexdigest()
    if h not in seen:
        seen.add(h)
        unique.append(r)
# 3) Near-duplicate removal with MinHash on 5-gram word shingles
lsh = MinHashLSH(threshold=0.8, num_perm=64)
minhashes = []
for i, r in enumerate(unique):
    m = MinHash(num_perm=64)
    tokens = re.findall(r"\w+", r["text"].lower())
    shingles = [" ".join(tokens[j:j+5]) for j in range(max(1, len(tokens) - 4))]
    for s in shingles:
        m.update(s.encode("utf-8"))
    lsh.insert(str(i), m)
    minhashes.append(m)

keep_mask = [True] * len(unique)
for i, m in enumerate(minhashes):
    if not keep_mask[i]:
        continue
    near = lsh.query(m)
    for j in near:
        j = int(j)
        if j > i:
            keep_mask[j] = False
deduped = [r for r, k in zip(unique, keep_mask) if k]
# 4) Token-aware chunking
model_id = "meta-llama/Llama-3.1-8B" # example; use your target tokenizer
_tok = AutoTokenizer.from_pretrained(model_id)
MAX_TOK = 2048
OVERLAP = 128
def chunk_text(record):
    text = record["text"]
    ids = _tok.encode(text)
    chunks = []
    start = 0
    while start < len(ids):
        end = min(start + MAX_TOK, len(ids))
        piece = _tok.decode(ids[start:end], skip_special_tokens=False)
        chunks.append({
            "text": piece,
            "source": record["source"],
            "lang": record["lang"],
        })
        if end == len(ids):
            break
        start = end - OVERLAP  # maintain continuity
    return chunks

chunked = []
for r in deduped:
    chunked.extend(chunk_text(r))
# 5) Train/val/test split by source (document-level)
from collections import defaultdict
by_source = defaultdict(list)
for r in chunked:
    by_source[r["source"]].append(r)
sources = list(by_source.keys())
import random
random.seed(42)
random.shuffle(sources)
n = len(sources)
train_s, val_s, test_s = sources[:int(0.9*n)], sources[int(0.9*n):int(0.95*n)], sources[int(0.95*n):]
def gather(sources):
    out = []
    for s in sources:
        out.extend(by_source[s])
    return out
train, val, test = gather(train_s), gather(val_s), gather(test_s)
# 6) Save JSONL
for name, data in [("train.jsonl", train), ("val.jsonl", val), ("test.jsonl", test)]:
    with open(name, "w", encoding="utf-8") as f:
        for r in data:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
print("Saved chunks:", len(train), len(val), len(test))
Applying a chat template for instruction data
from transformers import AutoTokenizer
import json
# Suppose each record has a `messages` array as shown earlier
chat_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
def render_chat(record):
    rendered = chat_tokenizer.apply_chat_template(
        record["messages"], tokenize=False, add_generation_prompt=False
    )
    return {"text": rendered, **{k: v for k, v in record.items() if k != "messages"}}

with open("sft_raw.jsonl", "r", encoding="utf-8") as inp, open("sft_prepared.jsonl", "w", encoding="utf-8") as out:
    for line in inp:
        rec = json.loads(line)
        out.write(json.dumps(render_chat(rec), ensure_ascii=False) + "\n")
Common pitfalls to avoid
- Token-agnostic chunking: Splitting by characters causes variable token lengths and broken structures.
- Inconsistent schemas: Mixing prompt/completion records with chat-formatted records, without clear separation, confuses the model.
- Leakage: Randomly splitting chunks rather than documents leaks content across train and eval.
- Overlapping duplicates: Dedup only at document level misses repeated paragraphs and boilerplate.
- Missing special tokens: Not using the tokenizer’s chat template leads to mismatched turn boundaries.
- Ignoring licensing/PII: Legal and privacy risks can derail deployments.
A quick checklist
- Define your dataset shape: raw text vs chat messages.
- Normalize Unicode, strip boilerplate, and fix whitespace.
- Filter by language and length; redact PII.
- Deduplicate exact and near-duplicates across the whole corpus.
- Chunk by tokens with sensible overlap; mark document boundaries.
- Attach useful metadata for weighting and audits.
- Split by document for train/val/test and verify no leakage.
- Use chat templates for instruction data; verify special tokens.
- Log every transform for reproducibility and rollback.
Closing thoughts
LLM quality is a direct reflection of the tokens you feed the model. With consistent schemas, careful normalization, robust deduplication, and token-aware chunking, you’ll get cleaner signals, smoother training, and better production behavior. Start simple, instrument your pipeline, and iterate with small evaluations. Your future self—and your model—will thank you.