In this blog post, Preparing Input Text for Training LLMs that Perform in Production, we will walk through the decisions and steps that make training data truly useful. Whether you’re pretraining from scratch or fine-tuning an existing model, disciplined data prep is where most of the performance is won.
Preparing Input Text for Training LLMs that Perform in Production starts with a simple idea: the model learns to predict the next token. Everything you do—cleaning, deduping, chunking, and formatting—should increase the signal-to-noise ratio of those tokens. We’ll keep things practical and friendly, with clear explanations and copy-pasteable snippets you can adapt to your pipeline.
The technology behind modern LLM training
Large Language Models are transformer networks trained on next-token prediction. Text is converted into tokens by a tokenizer (commonly BPE, WordPiece, or SentencePiece). The model attends over a finite context window—thousands of tokens—and updates weights to minimize prediction error. Quality and structure of the token stream matter:
- Tokenization: Determines how text is split. Consistency affects chunking, special tokens, and chat formatting.
- Objective: Pretraining uses raw text. Supervised fine-tuning (SFT) uses instruction/response pairs. RLHF or DPO layers preference signals on top.
- Context window: Long documents must be chunked; boundaries and overlaps matter.
- Data mixture and weighting: Metadata lets you weight sources and balance domains.
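To make the token-level view concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the openly available gpt2 tokenizer as a stand-in for your target model’s tokenizer. It shows how text becomes tokens and why token counts, not character counts, determine what fits in the context window.
from transformers import AutoTokenizer

# Assumption: gpt2 is only a stand-in; swap in your target model's tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Restart the service with: systemctl restart mysvc"
ids = tok.encode(text)
print(f"{len(text)} characters -> {len(ids)} tokens")
print(tok.convert_ids_to_tokens(ids))  # inspect how the tokenizer split the text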
Why data preparation matters
Badly prepared data leads to brittle models: memorization from duplicates, confusion from mixed formats, or hallucinations from noisy sources. Good prep yields stable loss curves, better generalization, and fewer surprises in production.
Decide your dataset shape before you start
For pretraining-style data
- Use JSONL with a `text` field.
- Insert explicit document separators (e.g., `<|doc|>`) so the model learns boundaries.
- Keep metadata in side fields for weighting and auditing.
{"text": "<|doc|>Title\nBody...\n" , "source": "kb", "lang": "en", "license": "CC-BY"}
For instruction-tuning (chat) data
- Prefer a `messages` array with roles: `system`, `user`, `assistant`.
- Apply the tokenizer’s chat template so special tokens are correct.
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I restart the service?"},
    {"role": "assistant", "content": "Run `systemctl restart mysvc`."}
  ],
  "source": "support_runbook",
  "lang": "en"
}
Cleaning and normalization that pays off
- HTML to text: Strip tags but preserve structure (headings, lists, code blocks). Avoid including boilerplate (menus, cookie banners).
- Unicode normalization: Apply NFKC to unify visually similar code points. Fix mojibake with ftfy-like tools.
- Whitespace control: Collapse repeated spaces; normalize newlines to `\n`.
- Language filtering: Keep only target languages or label them accurately.
- PII and secrets: Redact or drop. Decide your policy upfront and apply consistently.
- Length filtering: Drop extremely short fragments and extremely long outliers, which are usually junk.
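As a small illustration of the normalization steps above, here is a sketch of a cleanup helper, assuming the ftfy package for mojibake repair (the normalize_text name and the exact regexes are our own choices, not a standard API):
import re
import unicodedata
import ftfy

def normalize_text(s: str) -> str:
    # Repair mojibake first, then unify visually similar code points with NFKC.
    s = ftfy.fix_text(s)
    s = unicodedata.normalize("NFKC", s)
    # Normalize newlines, collapse repeated spaces/tabs and excess blank lines.
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    s = re.sub(r"[ \t]{2,}", " ", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip()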
Deduplication and diversity
Duplicates skew training statistics, waste compute, and promote memorization. Use both exact and near-duplicate detection:
- Exact dedup: Hash normalized text (e.g., SHA-1).
- Near-dup: MinHash/SimHash over token shingles to catch minor variations.
- Boilerplate removal: Dedup at paragraph or section level, not only at document level.
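The pipeline later in this post deduplicates at the document level; as a complement, here is a rough sketch of paragraph-level boilerplate removal that drops any normalized paragraph repeated across many documents (the function names and the min_repeats threshold are our own choices):
import hashlib
from collections import Counter

def paragraph_hash(p: str) -> str:
    return hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).hexdigest()

def drop_repeated_paragraphs(docs, min_repeats=3):
    # First pass: count how often each normalized paragraph appears across the corpus.
    counts = Counter()
    for doc in docs:
        for p in doc.split("\n\n"):
            if p.strip():
                counts[paragraph_hash(p)] += 1
    # Second pass: keep only paragraphs that are not corpus-wide boilerplate.
    cleaned = []
    for doc in docs:
        kept = [p for p in doc.split("\n\n") if p.strip() and counts[paragraph_hash(p)] < min_repeats]
        cleaned.append("\n\n".join(kept))
    return cleaned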
Chunking for the model’s context window
Chunk by tokens, not characters. Keep chunks coherent:
- Target length: e.g., 1,024–4,096 tokens depending on model/context.
- Overlap: 50–200 tokens can preserve continuity for long narratives.
- Respect blocks: Don’t split inside code fences or tables; close blocks before cutting.
- Add markers: Use `<|doc|>` or end-of-turn tokens to teach structure.
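The end-to-end pipeline below chunks purely by token count; if your documents contain fenced code blocks, one way to respect them is to split text into whole blocks first and then pack blocks into chunks, roughly as in this sketch (split_blocks and pack_blocks are hypothetical helpers; a single oversized block would still need token-level splitting):
import re

def split_blocks(text):
    # Keep fenced code blocks intact; split everything else on blank lines.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    blocks = []
    for part in parts:
        if part.startswith("```"):
            blocks.append(part)
        else:
            blocks.extend(p for p in part.split("\n\n") if p.strip())
    return blocks

def pack_blocks(text, tokenizer, max_tokens=2048):
    # Greedily pack whole blocks into chunks without crossing the token budget.
    chunks, current, current_len = [], [], 0
    for block in split_blocks(text):
        n = len(tokenizer.encode(block))
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks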
Metadata that unlocks control
Attach fields like `source`, `domain`, `lang`, `license`, `timestamp`, `quality_score`, and `pii_redacted`. This enables:
- Mixture weighting (e.g., upweight docs and downweight forums).
- Auditable provenance for governance and takedowns.
- Targeted evaluation slices by domain or timeframe.
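Once those fields exist, mixture weighting can be as simple as sampling records with per-source probabilities, along the lines of this sketch (the source names and weights are illustrative only):
import random

random.seed(42)
# Relative sampling weights per source; purely illustrative numbers.
source_weights = {"kb": 3.0, "docs": 2.0, "forum": 0.5}

def weighted_sample(records, n):
    # Samples with replacement, so upweighted sources appear more often in the mix.
    weights = [source_weights.get(r.get("source"), 1.0) for r in records]
    return random.choices(records, weights=weights, k=n)

# Example usage: mix = weighted_sample(cleaned_records, 100_000)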
Instruction fine-tuning specifics
- Consistency: One schema and punctuation style for answers.
- Coverage: Include reasoning, tools, code, and error handling patterns your product needs.
- Difficulty curriculum: Start simple, include moderately hard tasks, avoid contrived corner cases unless product-relevant.
- No leakage: Keep eval-like prompts out of training.
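A cheap leakage guard is to hash your eval prompts and drop any training record whose first user turn matches exactly, as in this sketch (exact match only; near-duplicate prompts need fuzzier matching, for example MinHash as used in the pipeline below):
import hashlib

def norm_hash(s: str) -> str:
    return hashlib.sha1(" ".join(s.lower().split()).encode("utf-8")).hexdigest()

def remove_eval_overlap(train_records, eval_prompts):
    # Assumes chat-style records with a `messages` array; the first user turn is the prompt.
    eval_hashes = {norm_hash(p) for p in eval_prompts}
    kept = []
    for rec in train_records:
        user_turns = [m["content"] for m in rec["messages"] if m["role"] == "user"]
        if user_turns and norm_hash(user_turns[0]) in eval_hashes:
            continue
        kept.append(rec)
    return kept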
Train/validation/test splits the right way
- Split by document, not by chunk, to avoid leakage.
- Deduplicate across the entire corpus before splitting.
- Stratify by source/language/length so each split mirrors production.
Governance and safety
- Licensing: Track licenses and respect usage terms.
- PII: Redact or remove. Consider hashing or placeholder tokens for structured IDs.
- Safety categories: Label sensitive content to control sampling or train refusal behavior.
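Regex-based redaction only catches the most obvious identifiers, but it illustrates the placeholder-token approach. This is a minimal sketch; production pipelines usually pair it with a dedicated PII detection tool:
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b\d[\d\s().-]{7,}\d\b")

def redact_pii(text: str) -> str:
    # Replace matches with stable placeholder tokens so sentence structure is preserved.
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

print(redact_pii("Contact jane.doe@example.com or call 555-123-4567."))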
A practical pipeline you can adapt
# pip install datasets beautifulsoup4 ftfy langdetect datasketch transformers
from datasets import load_dataset, Dataset
from bs4 import BeautifulSoup
from langdetect import detect
from datasketch import MinHash, MinHashLSH
from transformers import AutoTokenizer
import unicodedata, hashlib, json, re
# 1) Load raw HTML/text data (example: local JSONL with {"html":..., "url":...})
raw = load_dataset("json", data_files={"train": "raw.jsonl"})["train"]
def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # remove obvious boilerplate
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    text = soup.get_text("\n")
    # normalize whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

def normalize_unicode(s):
    s = unicodedata.normalize("NFKC", s)
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    return s
def clean_record(rec):
    txt = rec.get("text") or html_to_text(rec.get("html", ""))
    txt = normalize_unicode(txt)
    if len(txt) < 100:  # length filter
        return None
    try:
        lang = detect(txt)
    except Exception:
        lang = "unk"
    if lang != "en":  # keep only English for this example
        return None
    return {
        "text": f"<|doc|>\n{txt}\n",
        "source": rec.get("url", "unknown"),
        "lang": lang,
    }
cleaned = [c for c in (clean_record(r) for r in raw) if c]
# 2) Exact dedup by normalized hash
seen = set()
unique = []
for r in cleaned:
    h = hashlib.sha1(r["text"].encode("utf-8")).hexdigest()
    if h not in seen:
        seen.add(h)
        unique.append(r)
# 3) Near-duplicate removal with MinHash on 5-gram word shingles
lsh = MinHashLSH(threshold=0.8, num_perm=64)
minhashes = []
for i, r in enumerate(unique):
    m = MinHash(num_perm=64)
    tokens = re.findall(r"\w+", r["text"].lower())
    shingles = [" ".join(tokens[j:j+5]) for j in range(max(1, len(tokens) - 4))]
    for s in shingles:
        m.update(s.encode("utf-8"))
    lsh.insert(str(i), m)
    minhashes.append(m)

keep_mask = [True] * len(unique)
for i, m in enumerate(minhashes):
    if not keep_mask[i]:
        continue
    near = lsh.query(m)
    for j in near:
        j = int(j)
        if j > i:
            keep_mask[j] = False
deduped = [r for r, k in zip(unique, keep_mask) if k]
# 4) Token-aware chunking
model_id = "meta-llama/Llama-3.1-8B" # example; use your target tokenizer
_tok = AutoTokenizer.from_pretrained(model_id)
MAX_TOK = 2048
OVERLAP = 128
def chunk_text(record):
    text = record["text"]
    ids = _tok.encode(text)
    chunks = []
    start = 0
    while start < len(ids):
        end = min(start + MAX_TOK, len(ids))
        piece = _tok.decode(ids[start:end], skip_special_tokens=False)
        chunks.append({
            "text": piece,
            "source": record["source"],
            "lang": record["lang"],
        })
        if end == len(ids):
            break
        start = end - OVERLAP  # maintain continuity
    return chunks

chunked = []
for r in deduped:
    chunked.extend(chunk_text(r))
# 5) Train/val/test split by source (document-level)
from collections import defaultdict
by_source = defaultdict(list)
for r in chunked:
    by_source[r["source"]].append(r)
sources = list(by_source.keys())
import random
random.seed(42)
random.shuffle(sources)
n = len(sources)
train_s, val_s, test_s = sources[:int(0.9*n)], sources[int(0.9*n):int(0.95*n)], sources[int(0.95*n):]
def gather(sources):
    out = []
    for s in sources:
        out.extend(by_source[s])
    return out
train, val, test = gather(train_s), gather(val_s), gather(test_s)
# 6) Save JSONL
for name, data in [("train.jsonl", train), ("val.jsonl", val), ("test.jsonl", test)]:
    with open(name, "w", encoding="utf-8") as f:
        for r in data:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
print("Saved chunks:", len(train), len(val), len(test))
Applying a chat template for instruction data
from transformers import AutoTokenizer
import json
# Suppose each record has a `messages` array as shown earlier
chat_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
def render_chat(record):
    rendered = chat_tokenizer.apply_chat_template(
        record["messages"], tokenize=False, add_generation_prompt=False
    )
    return {"text": rendered, **{k: v for k, v in record.items() if k != "messages"}}

with open("sft_raw.jsonl", "r", encoding="utf-8") as inp, open("sft_prepared.jsonl", "w", encoding="utf-8") as out:
    for line in inp:
        rec = json.loads(line)
        out.write(json.dumps(render_chat(rec), ensure_ascii=False) + "\n")
Common pitfalls to avoid
- Token-agnostic chunking: Splitting by characters causes variable token lengths and broken structures.
- Inconsistent schemas: Mixing prompt/completion records with chat-formatted records, without clear separation, confuses the model.
- Leakage: Randomly splitting chunks rather than documents leaks content across train and eval.
- Overlapping duplicates: Dedup only at document level misses repeated paragraphs and boilerplate.
- Missing special tokens: Not using the tokenizer’s chat template leads to mismatched turn boundaries.
- Ignoring licensing/PII: Legal and privacy risks can derail deployments.
A quick checklist
- Define your dataset shape: raw text vs chat messages.
- Normalize Unicode, strip boilerplate, and fix whitespace.
- Filter by language and length; redact PII.
- Deduplicate exact and near-duplicates across the whole corpus.
- Chunk by tokens with sensible overlap; mark document boundaries.
- Attach useful metadata for weighting and audits.
- Split by document for train/val/test and verify no leakage.
- Use chat templates for instruction data; verify special tokens.
- Log every transform for reproducibility and rollback.
Closing thoughts
LLM quality is a direct reflection of the tokens you feed the model. With consistent schemas, careful normalization, robust deduplication, and token-aware chunking, you’ll get cleaner signals, smoother training, and better production behavior. Start simple, instrument your pipeline, and iterate with small evaluations. Your future self—and your model—will thank you.