Understanding Word Embeddings for Search, NLP, and Analytics

In this blog post we will unpack what embeddings are, how they work under the hood, and how your team can use them in real products without getting lost in jargon.

At a high level, a word embedding is a compact numerical representation of meaning. It turns words (or tokens) into vectors—lists of numbers—so that similar words sit close together in a geometric space. This simple idea powers smarter search, better classification, and robust analytics. Instead of matching exact strings, systems compare meanings using distances between vectors.

Think of embeddings as a map of language: “doctor” ends up near “physician”, “hospital”, and “nurse”. The map is learned from data, not hand-written. Once you have it, you can measure similarity, cluster topics, or feed the vectors into models that need language understanding.

What is a word embedding?

A word embedding assigns each token a dense vector like [0.12, -0.48, …]. Words that show up in similar contexts receive similar vectors. This follows the distributional hypothesis: words used in similar ways have related meanings. Unlike one-hot encodings, embeddings are low-dimensional (e.g., 100–1024 values) and capture semantic relationships.

Why embeddings matter

  • Search and retrieval: Find documents by meaning, not just keywords. Great for synonyms and misspellings.
  • Classification: Feed embeddings into models for intent detection, routing, or sentiment.
  • Clustering and analytics: Group similar texts, detect topics, and explore corpora.
  • Recommendation: Match queries to products, FAQs to answers, or tickets to solutions.
  • RAG for LLMs: Retrieve semantically relevant chunks to ground model responses.

The technology behind embeddings

Under the hood, embeddings are learned so that words that co-occur in similar contexts get closer in vector space. There are a few major families:

Predictive models (word2vec)

  • CBOW: Predict a target word from surrounding context words.
  • Skip-gram: Predict surrounding words from a target word. Works well for smaller datasets and rare words.

Training uses stochastic gradient descent with a softmax approximation such as negative sampling. Negative sampling trains the model to pull real word–context pairs together and push randomly sampled pairs apart, which is far cheaper than computing a full softmax over the vocabulary.
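To make the objective concrete, here is a minimal numpy sketch of the negative-sampling loss for a single word–context pair; the vectors and the choice of five negatives are illustrative placeholders rather than trained values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 100

# Toy parameter vectors: one for the target word, one for the true context
# word, and k vectors for randomly sampled "negative" words.
target = rng.normal(scale=0.1, size=dim)
context = rng.normal(scale=0.1, size=dim)
negatives = rng.normal(scale=0.1, size=(5, dim))  # k = 5 negative samples

# Negative-sampling loss: reward a high dot product for the real pair
# and a low dot product for the random pairs.
loss = -np.log(sigmoid(target @ context)) \
       - np.sum(np.log(sigmoid(-negatives @ target)))
print(f"negative-sampling loss for one pair: {loss:.3f}")
```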

Count-based models (GloVe)

GloVe builds a large co-occurrence matrix (how often word i appears with word j) and factorises it. The factorisation compresses counts into dense vectors that preserve global statistics. It often captures linear relations like king − man + woman ≈ queen in static settings.
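You rarely need to train GloVe from scratch. The sketch below loads pretrained vectors through Gensim's downloader and runs the analogy check; the 100-dimensional Wikipedia/Gigaword model is an assumed starting point you can swap for another.

```python
import gensim.downloader as api

# Download pretrained GloVe vectors (Wikipedia + Gigaword, 100 dimensions).
glove = api.load("glove-wiki-gigaword-100")

# Nearest neighbours by cosine similarity.
print(glove.most_similar("doctor", topn=5))

# The classic linear-analogy check: king - man + woman ≈ ?
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```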

Subword-aware models (fastText)

fastText represents a word as a bag of character n-grams. This helps with rare words, typos, and morphological variants (e.g., “connect”, “connected”, “connecting”) because related forms share subword pieces. It reduces out-of-vocabulary issues in real systems.
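A minimal Gensim fastText sketch on a toy corpus shows that even an out-of-vocabulary word gets a usable vector from its character n-grams; the corpus and hyperparameters are placeholders.

```python
from gensim.models import FastText

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["we", "connect", "the", "service", "to", "the", "database"],
    ["the", "service", "connected", "after", "a", "retry"],
    ["connecting", "clients", "requires", "an", "api", "key"],
]

# min_n/max_n control the character n-gram range used for subwords.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=50)

# "connectivity" never appears in the corpus, but fastText still builds a
# vector for it from shared n-grams such as "con", "nect", "ecti".
print(model.wv["connectivity"].shape)
print(model.wv.similarity("connect", "connected"))
```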

Contextual embeddings (transformers like BERT)

Static embeddings give one vector per word type, regardless of sentence. Contextual embeddings give a different vector per occurrence using transformers. The word “bank” in “river bank” differs from “bank account”. Models such as BERT, RoBERTa, and modern LLMs produce token or sentence-level embeddings that adapt to context and usually perform best for search and retrieval.
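To see this in practice, the sketch below compares the token-level BERT vectors for “bank” in two sentences; the checkpoint choice and the simple token lookup are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the last-hidden-state vector for the token 'bank'."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("we walked along the river bank")
v_money = bank_vector("she opened a bank account")

# Same word type, different vectors: cosine similarity is well below 1.
cos = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine('river bank', 'bank account') = {cos.item():.3f}")
```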

How vectors are used

  • Similarity: Cosine similarity is the go-to metric. Values near 1 indicate high similarity.
  • Nearest neighbours: Find top-k closest vectors to a query for retrieval or recommendations.
  • Composition: Average word vectors to represent a sentence or document (simple, surprisingly strong); a short sketch follows this list. For stronger results, use sentence embedding models trained for that purpose.
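As a quick illustration, here is a numpy sketch of cosine similarity and mean-pooled sentence vectors, assuming you already have a word-to-vector lookup such as the Gensim models above.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: values near 1 mean the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_vector(tokens, word_vectors, dim=100):
    """Mean-pool word vectors; skip tokens the model does not know."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Usage (word_vectors could be the GloVe or word2vec vectors loaded earlier):
# q = sentence_vector("how do i reset my password".split(), word_vectors)
# d = sentence_vector("steps to recover a forgotten password".split(), word_vectors)
# print(cosine(q, d))
```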

Practical steps to adopt embeddings

  1. Choose an approach
    • Need lightweight, explainable, and local? Use pretrained word2vec/GloVe/fastText.
    • Need best relevance and robustness? Use contextual embeddings (e.g., BERT-based sentence models).
    • Domain-specific language (legal, medical)? Fine-tune or adapt on in-domain text.
  2. Prepare your data
    • Clean text: normalise whitespace, standardise casing where appropriate, strip boilerplate.
    • Tokenise consistently: whitespace + punctuation rules or a library tokenizer.
    • Chunk long documents: 200–500 tokens per chunk works well for retrieval.
  3. Train or download
    • Download a reputable pretrained model to start fast.
    • If training: set embedding dimension (100–768), context window (2–10), min count (e.g., 5), and optimise with negative sampling.
  4. Evaluate
    • Intrinsic: word similarity and analogy tests for sanity checks.
    • Extrinsic: measure downstream KPIs (search click-through, F1 for classification).
    • Bias and safety: audit for stereotypes and sensitive associations.
  5. Deploy
    • Serve vectors via an API or embed at indexing time.
    • Use a vector database or ANN index (FAISS, ScaNN, Milvus) for fast similarity search; a FAISS sketch follows these steps.
    • Version your models and embeddings; monitor drift and recalibrate as data evolves.
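For step 5, here is a minimal FAISS sketch using an exact inner-product index over L2-normalised vectors (equivalent to cosine similarity); the dimension and random data are placeholders for your real embeddings.

```python
import faiss
import numpy as np

dim = 384                                                  # must match your embedding model
doc_vecs = np.random.rand(10_000, dim).astype("float32")   # placeholder document embeddings
query_vec = np.random.rand(1, dim).astype("float32")       # placeholder query embedding

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(doc_vecs)
faiss.normalize_L2(query_vec)

index = faiss.IndexFlatIP(dim)   # exact search; consider an ANN index (e.g., HNSW) at scale
index.add(doc_vecs)

scores, ids = index.search(query_vec, 5)   # top-5 nearest documents
print(ids[0], scores[0])
```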

Quick code examples

Train and use word2vec with Gensim
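A minimal sketch of training word2vec on a toy corpus and querying it; the corpus, hyperparameters, and file name are placeholders you would swap for your own tokenised documents.

```python
from gensim.models import Word2Vec

# Toy corpus: in practice, stream tokenised sentences from your own documents.
sentences = [
    ["patient", "visited", "the", "doctor", "at", "the", "hospital"],
    ["the", "physician", "reviewed", "the", "patient", "chart"],
    ["the", "nurse", "scheduled", "a", "follow", "up", "visit"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window
    min_count=1,       # keep every word in this toy corpus (use ~5 on real data)
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling
    epochs=50,
)

print(model.wv.most_similar("doctor", topn=3))      # nearest neighbours
print(model.wv.similarity("doctor", "physician"))   # cosine similarity
model.save("word2vec.model")                        # reuse later without retraining
```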

Contextual sentence embeddings with Transformers
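A minimal sketch using mean pooling over a transformer's last hidden state; the checkpoint name (a compact sentence-embedding model) is an assumption you can replace with your preferred model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A small, widely used sentence-embedding checkpoint.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool token embeddings into one L2-normalised vector per text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)           # ignore padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(pooled, dim=1)

docs = embed(["How do I reset my password?",
              "Steps to recover a forgotten password",
              "Quarterly revenue grew by 4%"])
query = embed(["password reset help"])

# With normalised vectors, the dot product is the cosine similarity.
print(query @ docs.T)
```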

Design choices that matter

  • Dimension: 100–300 is common for static embeddings; 384–1024 for sentence models. Higher dims capture nuance but cost memory and latency.
  • Context window: Small windows capture syntax; larger windows capture topics.
  • Tokenisation: Consistency is key. For multilingual or noisy data, subword models are robust.
  • Index choice: Approximate nearest neighbour (ANN) scales to millions of vectors with millisecond latency.

Limits and pitfalls

  • Polysemy: Static embeddings conflate senses (bank as shore vs finance). Contextual models fix this.
  • Out-of-vocabulary: Classic models fail on unseen words; subword/contextual models help.
  • Bias: Embeddings reflect training data. Audit and mitigate with debiasing, filters, and governance.
  • Domain drift: Over time, meanings shift. Re-embed content after model updates or major data changes.
  • Over-indexing on analogies: Vector arithmetic examples are illustrative, not guaranteed.

Operational tips

  • Version everything: model, tokeniser, vector index, and the data snapshot.
  • Cache embeddings: Precompute for documents; compute queries on demand.
  • Compression: Use float16 or 8-bit quantisation to cut memory; validate impact on quality.
  • Hybrid search: Combine keyword and vector scores for the best relevance and explainability; a small sketch follows these tips.
  • Monitoring: Track similarity distributions, retrieval diversity, and downstream KPIs.
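One simple way to apply the hybrid-search tip is a weighted blend of normalised keyword (e.g., BM25) and vector scores; the 0.5 weight below is an assumption you would tune against your gold set.

```python
def hybrid_score(bm25: float, vector: float, alpha: float = 0.5) -> float:
    """Blend a keyword score and a cosine score, both pre-normalised to [0, 1]."""
    return alpha * bm25 + (1 - alpha) * vector

# Example: a document with a moderate keyword match but strong semantic match.
print(hybrid_score(bm25=0.4, vector=0.9))  # 0.65
```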

A simple evaluation recipe

  1. Create a labelled set of queries with relevant documents (gold set).
  2. Index document embeddings; compute query embeddings.
  3. Measure nDCG@k, Recall@k, and MRR (a minimal sketch follows this recipe), and compare against a keyword-only baseline.
  4. Run bias checks on sensitive terms and topics.
  5. Stress-test with noisy queries, typos, and domain-specific jargon.
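Here is a minimal sketch of Recall@k and MRR over a gold set; the variable names and toy results are hypothetical, with each run holding the ranked document IDs returned for a query plus its gold relevant IDs.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of gold-relevant documents found in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval results for two queries from the gold set.
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),
    (["d5", "d9", "d2"], {"d9"}),
]
print("Recall@3:", sum(recall_at_k(r, g, 3) for r, g in runs) / len(runs))
print("MRR:", sum(reciprocal_rank(r, g) for r, g in runs) / len(runs))
```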

When to train your own vs reuse

  • Reuse a public model when your domain is general and latency or compute is tight.
  • Fine-tune when your data has unique language (medical, legal, fintech) or you need top-tier relevance.
  • Train from scratch only with large corpora and a clear gap in existing models.

Key takeaways

  • Embeddings turn text into vectors so systems can reason about meaning.
  • Word2vec/GloVe/fastText are light and useful; transformers deliver best relevance.
  • Evaluate on your real tasks, not just toy benchmarks.
  • Operational excellence—versioning, indexing, monitoring—is as important as the model.

If you’re planning vector search, RAG, or classification at scale, start small with a strong sentence model, measure impact, then decide whether domain adaptation is worth the investment. With thoughtful design, embeddings turn raw text into actionable signals that improve search, automation, and analytics.

