Embeddings 101: Representing Text as Vectors
Embeddings have become a foundational technology in modern natural language processing (NLP). These numerical representations of text let machines capture semantic relationships and compare, search, and cluster language in ways that keyword matching alone cannot. In this guide, we'll explore what embeddings are, how they work, and how to use them effectively.
What Are Embeddings?
At their core, embeddings are dense vectors (lists of numbers) that represent the meaning of a piece of text. Unlike simpler representations like one-hot encoding, embeddings capture semantic relationships in a continuous vector space, where:
- Similar texts have similar vectors (closer in the vector space)
- Different meanings of the same word can have different representations based on context
- Semantic relationships can be discovered through vector operations
A well-trained embedding model places the vectors for "king" and "queen" close together, with "king" roughly as far from "man" as "queen" is from "woman".
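In practice, "similar" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and the values below are invented purely for illustration):
# Example: cosine similarity between toy vectors (values invented for illustration)
import numpy as np
def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
king = np.array([0.8, 0.65, 0.1])
queen = np.array([0.75, 0.7, 0.15])
cat = np.array([0.1, 0.2, 0.9])
print(f"king vs queen: {cosine_similarity(king, queen):.2f}")  # high: related concepts
print(f"king vs cat:   {cosine_similarity(king, cat):.2f}")    # low: unrelated concepts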
The Semantic Space
Embedding models map words or phrases into a high-dimensional space (typically anywhere from 100 to 1,000+ dimensions), where the position of each item encodes its meaning. In this space:
- Distance represents semantic dissimilarity
- Direction can represent relationships between concepts
- Clustering reveals groups of related terms
This geometric interpretation allows for powerful operations like discovering analogies. The classic example is: "king" - "man" + "woman" ≈ "queen", which can be calculated directly with the vector representations.
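You can check this analogy directly with pretrained vectors. A quick sketch using gensim's downloader (this assumes gensim is installed and a small pretrained GloVe model can be downloaded; the top result is typically, though not guaranteed to be, "queen"):
# Example: checking the analogy with pretrained GloVe vectors via gensim's downloader
import gensim.downloader as api
vectors = api.load("glove-wiki-gigaword-50")  # downloads pretrained 50-dimensional GloVe vectors
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # the top hit is usually 'queen'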
Creating Text Embeddings
There are several approaches to generating embeddings:
1. Word-Level Embeddings
Word2Vec, GloVe, and FastText are popular word embedding techniques that learn vector representations based on word co-occurrence patterns.
# Example: Using Gensim to train Word2Vec embeddings
from gensim.models import Word2Vec
# Sample corpus (normally this would be much larger)
corpus = [
["king", "queen", "palace", "royal", "crown"],
["man", "woman", "person", "human", "people"],
["cat", "dog", "pet", "animal", "fur"]
]
# Train the model
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
# Access embeddings
king_vector = model.wv['king']
print(f"Shape of 'king' vector: {king_vector.shape}")
# Find similar words
similar_words = model.wv.most_similar('queen', topn=3)
print(f"Words similar to 'queen': {similar_words}")
2. Contextual Embeddings
More recent models such as BERT, RoBERTa, and GPT produce contextual embeddings, where the same word can have different vectors depending on its context.
# Example: Using Hugging Face Transformers to get BERT embeddings
from transformers import AutoTokenizer, AutoModel
import torch
# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Prepare sentences
sentences = [
"The bank is closed today.",
"The river bank is muddy."
]
# Tokenize and get embeddings
for sentence in sentences:
# Tokenize and prepare for model
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
# Get model output
with torch.no_grad():
outputs = model(**inputs)
    # Use the [CLS] token embedding as a simple sentence representation
    # (mean pooling over all token embeddings is another common choice)
sentence_embedding = outputs.last_hidden_state[:, 0, :].numpy()
print(f"Shape of embedding for '{sentence}': {sentence_embedding.shape}")
3. Sentence and Document Embeddings
For longer pieces of text, specialized models like Sentence-BERT (SBERT) or OpenAI's text-embedding models can generate embeddings that capture the meaning of entire sentences or documents.
# Example: Using Sentence Transformers for sentence embeddings
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample sentences
sentences = [
"This is an example sentence for embedding.",
"Each sentence will be converted to a vector.",
"Similar sentences should have similar embeddings."
]
# Generate embeddings
embeddings = model.encode(sentences)
# Print shape information
print(f"Number of sentences: {len(sentences)}")
print(f"Shape of embeddings: {embeddings.shape}")
Working with Embeddings
Once you have embeddings, you can use them for various NLP tasks:
Semantic Search
Embeddings enable searching for content based on meaning rather than just keywords:
import numpy as np
from sentence_transformers import SentenceTransformer, util
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Corpus of documents
corpus = [
"Artificial intelligence is transforming industries.",
"Machine learning models require significant data.",
"Natural language processing helps computers understand text.",
"Deep learning is a subset of machine learning.",
"Computer vision allows machines to interpret visual information."
]
# Convert corpus to embeddings
corpus_embeddings = model.encode(corpus)
# Query
query = "How do computers understand language?"
query_embedding = model.encode(query)
# Calculate similarity
similarities = util.cos_sim(query_embedding, corpus_embeddings)[0]
# Get top results
top_results = np.argsort(-similarities).tolist()
for idx in top_results[:2]:
print(f"Similarity: {similarities[idx]:.4f} | {corpus[idx]}")
Text Classification
Embeddings can serve as features for classification models:
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
import numpy as np
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample dataset
texts = [
"I love this product, it works great!",
"The customer service was excellent",
"This is the worst purchase I've ever made",
"Completely disappointed with the quality",
"Highly recommend this to everyone",
"Terrible experience, would not buy again"
]
labels = [1, 1, 0, 0, 1, 0] # 1 = positive, 0 = negative
# Generate embeddings
embeddings = model.encode(texts)
# Train a classifier
classifier = LogisticRegression()
classifier.fit(embeddings, labels)
# New examples
new_texts = [
"I'm happy with my purchase",
"Don't waste your money on this"
]
new_embeddings = model.encode(new_texts)
# Predict
predictions = classifier.predict(new_embeddings)
for text, pred in zip(new_texts, predictions):
sentiment = "positive" if pred == 1 else "negative"
print(f"Text: '{text}' | Predicted sentiment: {sentiment}")
Clustering and Topic Modeling
Embeddings can help discover patterns and group similar content:
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample documents
documents = [
"The economy grew by 3% last quarter according to reports.",
"Stock markets reached new highs amid positive earnings.",
"Scientists discover new species in the Amazon rainforest.",
"Conservation efforts help protect endangered wildlife.",
"New smartphone features advanced camera technology.",
"Tech company releases innovative wearable devices."
]
# Generate embeddings
embeddings = model.encode(documents)
# Cluster documents
num_clusters = 3
clustering_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)  # fixed seed for reproducible clusters
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
# Group documents by cluster
clustered_documents = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
clustered_documents[cluster_id].append(documents[sentence_id])
# Print clusters
for i, cluster in enumerate(clustered_documents):
print(f"Cluster {i+1}:")
for doc in cluster:
print(f" - {doc}")
print()
Evaluating Embedding Quality
The quality of embeddings can be evaluated in several ways:
Intrinsic Evaluation
These methods directly test properties of the embeddings:
# Example: Word analogy tasks
def test_analogy(model, a, b, c, expected):
    """Test whether a : b :: c : expected holds (model is a gensim KeyedVectors object, e.g. w2v_model.wv)"""
result = model.most_similar(positive=[b, c], negative=[a], topn=1)
print(f"{a} is to {b} as {c} is to {result[0][0]} (expected: {expected})")
return result[0][0] == expected
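# Example usage, assuming `vectors` is a gensim KeyedVectors object
# (e.g. the pretrained GloVe vectors loaded via gensim's downloader earlier):
test_analogy(vectors, "man", "king", "woman", "queen")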
# Example: Semantic similarity correlations
from scipy.stats import spearmanr
def evaluate_similarity_correlation(model, similarity_dataset):
"""Evaluate correlation between embedding similarity and human judgments"""
human_scores = []
model_scores = []
for item in similarity_dataset:
word1, word2, human_score = item
# Calculate cosine similarity between word vectors
model_score = model.similarity(word1, word2)
human_scores.append(human_score)
model_scores.append(model_score)
correlation, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.4f} (p-value: {p_value:.4f})")
return correlation
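A tiny, hand-made dataset illustrates the expected input format: word pairs with human similarity ratings. The scores below are invented, not drawn from a real benchmark, and `vectors` is again assumed to be a gensim KeyedVectors object:
# Example usage with invented similarity scores (not a real benchmark)
similarity_dataset = [
    ("king", "queen", 8.5),
    ("cat", "dog", 7.0),
    ("king", "cabbage", 0.5),
]
evaluate_similarity_correlation(vectors, similarity_dataset)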
Extrinsic Evaluation
These methods test embeddings on downstream tasks:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
def evaluate_on_classification(embeddings, labels, test_size=0.3):
"""Evaluate embeddings on a classification task"""
# Split data
X_train, X_test, y_train, y_test = train_test_split(
embeddings, labels, test_size=test_size, random_state=42
)
# Train classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
# Evaluate
predictions = classifier.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
return classifier
Choosing the Right Embedding Model
There are several factors to consider when selecting an embedding model:
1. Task Requirements
Different tasks have different needs:
- Semantic similarity: Models like Sentence-BERT excel at capturing similarity between texts
- Classification/NER: Contextual models like BERT or RoBERTa often perform well
- Multilingual tasks: Consider models trained on multiple languages
- Domain-specific needs: Some models are fine-tuned for specific fields like medicine or law
2. Performance Considerations
Practical concerns include:
- Embedding dimension: Higher dimensions can capture more information but require more storage (see the sizing sketch after this list)
- Computation time: Larger models often produce better embeddings but are slower to run
- Memory usage: Some models require significant RAM, especially for large batches
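A useful rule of thumb for storage: a float32 embedding costs 4 bytes per dimension, so raw storage grows linearly with both corpus size and dimension. A back-of-the-envelope sketch, using an arbitrary corpus size and two common dimensions purely as examples:
# Example: estimating raw storage for float32 embeddings
def embedding_storage_mb(num_vectors, dimension, bytes_per_value=4):
    """Approximate storage in MB for num_vectors embeddings of a given dimension."""
    return num_vectors * dimension * bytes_per_value / (1024 ** 2)
# One million documents at two common embedding sizes (illustrative only)
print(f"384 dimensions:  {embedding_storage_mb(1_000_000, 384):.0f} MB")   # ~1465 MB
print(f"1536 dimensions: {embedding_storage_mb(1_000_000, 1536):.0f} MB")  # ~5859 MB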
3. Using Pre-trained vs. Custom Embeddings
You can choose between:
- Pre-trained models: Ready-to-use embeddings from providers like OpenAI or Hugging Face
- Fine-tuned models: Adapting existing models to your specific domain
- Custom-trained models: Building embeddings from scratch for your data
# Example: Using OpenAI's embeddings (requires the openai>=1.0 client library)
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def get_openai_embedding(text):
    """Get an embedding from the OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
# Example texts
texts = ["Understanding embeddings is essential for modern NLP",
"Vector representations enable semantic search and more"]
# Get embeddings
embeddings = [get_openai_embedding(text) for text in texts]
print(f"Embedding dimension: {len(embeddings[0])}")
Best Practices for Working with Embeddings
To get the most out of embeddings in your applications:
1. Preprocessing
Text preprocessing can significantly impact embedding quality. Heavier cleanup (lowercasing, stripping punctuation, removing stopwords) matters most for classic word embeddings; contextual models apply their own tokenizers and usually need only light cleaning:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def preprocess_text(text, remove_stopwords=False):
"""Basic text preprocessing for embeddings"""
# Convert to lowercase
text = text.lower()
# Remove special characters
text = re.sub(r'[^\w\s]', '', text)
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Optionally remove stopwords
if remove_stopwords:
stop_words = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in stop_words])
return text
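For example, the helper above turns a noisy string into a normalized one:
# Example usage of preprocess_text
print(preprocess_text("Hello, World!!  This is NLP."))
# -> "hello world this is nlp"
print(preprocess_text("Hello, World!!  This is NLP.", remove_stopwords=True))
# -> "hello world nlp"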
2. Caching Embeddings
Computing embeddings can be expensive, so caching them is often beneficial:
import os
import json
import hashlib
import numpy as np
class EmbeddingCache:
"""Simple cache for embeddings"""
def __init__(self, cache_dir="embedding_cache"):
self.cache_dir = cache_dir
if not os.path.exists(cache_dir):
os.makedirs(cache_dir)
def _get_cache_key(self, text, model_name):
"""Generate a cache key from text and model name"""
text_hash = hashlib.md5(text.encode()).hexdigest()
return f"{model_name}_{text_hash}.npy"
def get(self, text, model_name):
"""Retrieve embedding from cache if available"""
key = self._get_cache_key(text, model_name)
cache_path = os.path.join(self.cache_dir, key)
if os.path.exists(cache_path):
return np.load(cache_path)
return None
def save(self, text, embedding, model_name):
"""Save embedding to cache"""
key = self._get_cache_key(text, model_name)
cache_path = os.path.join(self.cache_dir, key)
np.save(cache_path, embedding)
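A typical pattern is to check the cache before calling the model and store the result on a miss. A minimal usage sketch with a Sentence Transformers model (the model name here is just an example):
# Example usage of EmbeddingCache
from sentence_transformers import SentenceTransformer
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
cache = EmbeddingCache()
text = "Embeddings are expensive enough to be worth caching."
embedding = cache.get(text, model_name)
if embedding is None:
    embedding = model.encode(text)  # compute only on a cache miss
    cache.save(text, embedding, model_name)
print(embedding.shape)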
3. Dimensionality Reduction
For visualizing or improving efficiency, you might reduce embedding dimensions:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(embeddings, labels):
"""Visualize embeddings in 2D using PCA"""
# Reduce to 2 dimensions
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
unique_labels = set(labels)
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
for label, color in zip(unique_labels, colors):
indices = [i for i, l in enumerate(labels) if l == label]
plt.scatter(
reduced_embeddings[indices, 0],
reduced_embeddings[indices, 1],
c=[color],
label=label
)
plt.legend()
plt.title("Embedding visualization")
plt.xlabel("PCA dimension 1")
plt.ylabel("PCA dimension 2")
plt.show()
Advanced Topics
As you become more comfortable with embeddings, consider exploring:
Hybrid Search
Combining embeddings with traditional keyword search for better results:
import numpy as np
def hybrid_search(query, documents, model, keyword_weight=0.3):
"""Combine semantic and keyword search"""
# Semantic search
query_embedding = model.encode(query)
document_embeddings = model.encode(documents)
semantic_scores = []
for doc_emb in document_embeddings:
# Calculate cosine similarity
similarity = np.dot(query_embedding, doc_emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
)
semantic_scores.append(similarity)
# Keyword search (simple implementation)
keyword_scores = []
query_terms = set(query.lower().split())
for doc in documents:
doc_terms = set(doc.lower().split())
# Jaccard similarity
intersection = len(query_terms.intersection(doc_terms))
union = len(query_terms.union(doc_terms))
score = intersection / union if union > 0 else 0
keyword_scores.append(score)
# Combine scores
combined_scores = [
(1 - keyword_weight) * sem_score + keyword_weight * kw_score
for sem_score, kw_score in zip(semantic_scores, keyword_scores)
]
# Return indices sorted by combined score
return sorted(range(len(combined_scores)), key=lambda i: combined_scores[i], reverse=True)
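Reusing the Sentence Transformers model and the small corpus from the semantic search example above, calling the function might look like this:
# Example usage of hybrid_search (model and corpus as defined in the semantic search example)
ranked_indices = hybrid_search("How do computers understand language?", corpus, model)
for idx in ranked_indices[:2]:
    print(corpus[idx])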
Fine-tuning Embedding Models
Adapting pre-trained models to specific domains:
# Simplified example of fine-tuning Sentence-BERT
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
def fine_tune_embedding_model(model_name, train_examples, epochs=3):
"""Fine-tune a sentence embedding model on custom data"""
# Load base model
model = SentenceTransformer(model_name)
# Prepare training examples
train_data = [
InputExample(texts=[ex[0], ex[1]], label=float(ex[2]))
for ex in train_examples
]
# Define dataloader
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
# Define loss
train_loss = losses.CosineSimilarityLoss(model)
# Train model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=epochs,
warmup_steps=100,
show_progress_bar=True
)
return model
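The train_examples format assumed above is an iterable of (text_a, text_b, similarity_score) triples, with scores in [0, 1] for CosineSimilarityLoss. A toy call, with made-up data far too small for real fine-tuning, might look like this:
# Example call with a toy dataset (real fine-tuning needs many more pairs)
train_examples = [
    ("How do I reset my password?", "Steps to change your account password", 0.9),
    ("How do I reset my password?", "Best hiking trails near Denver", 0.1),
]
fine_tuned_model = fine_tune_embedding_model("all-MiniLM-L6-v2", train_examples, epochs=1)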
Conclusion
Text embeddings have revolutionized natural language processing by providing a way to represent semantic meaning in a format computers can process. From simple word vectors to powerful contextual embeddings, these representations enable a wide range of applications including semantic search, classification, clustering, and much more.
As you implement embeddings in your projects, remember that the choice of model and preprocessing steps should align with your specific use case. By following best practices and understanding the fundamentals outlined in this guide, you'll be well-equipped to leverage the power of embeddings in your NLP applications.