Monday, April 1, 2024

Embeddings 101: Representing Text as Vectors

Abdellatif Abdelfattah

Embeddings have become a foundational technology in modern natural language processing (NLP). These numerical representations of text enable machines to capture semantic relationships and perform operations on language that would otherwise be impossible. In this guide, we'll explore what embeddings are, how they work, and how to use them effectively.

What Are Embeddings?

At their core, embeddings are dense vectors (lists of numbers) that represent the meaning of a piece of text. Unlike simpler representations like one-hot encoding, embeddings capture semantic relationships in a continuous vector space, where:

  • Similar texts have similar vectors (closer in the vector space)
  • Different meanings of the same word can have different representations based on context
  • Semantic relationships can be discovered through vector operations

A properly trained embedding model places the vectors for "king" and "queen" close together, and the offset from "man" to "king" roughly mirrors the offset from "woman" to "queen".
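
To make the first point above concrete, a sentence embedding model (here the same all-MiniLM-L6-v2 model used in later examples) will typically score a paraphrase as far more similar to a sentence than an unrelated one; the snippet below is a minimal sketch of that behavior:

# Minimal sketch: similar texts get similar (closer) vectors
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
a = "A cat sat on the mat."
b = "A kitten rested on the rug."
c = "Stock markets fell sharply today."
emb = model.encode([a, b, c])

print(f"similarity(a, b): {util.cos_sim(emb[0], emb[1]).item():.2f}")  # relatively high
print(f"similarity(a, c): {util.cos_sim(emb[0], emb[2]).item():.2f}")  # relatively low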

The Semantic Space

Embedding models map words or phrases into a high-dimensional space (typically anywhere from 100 to 1,000+ dimensions), where the position of each item encodes its meaning. In this space:

  • Distance represents semantic dissimilarity
  • Direction can represent relationships between concepts
  • Clustering reveals groups of related terms

This geometric interpretation allows for powerful operations like discovering analogies. The classic example is: "king" - "man" + "woman" ≈ "queen", which can be calculated directly with the vector representations.
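
In practice, such analogies can be tested with pre-trained word vectors. As a quick sketch, Gensim's downloader provides several pre-trained sets; the GloVe model named below is one option and is downloaded on first use:

# Example: Solving "king" - "man" + "woman" with pre-trained GloVe vectors
import gensim.downloader as api

# Load 50-dimensional GloVe vectors (downloaded on first run)
glove = api.load("glove-wiki-gigaword-50")

result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] for these vectors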

Creating Text Embeddings

There are several approaches to generating embeddings:

1. Word-Level Embeddings

Word2Vec, GloVe, and FastText are popular word embedding techniques that learn vector representations based on word co-occurrence patterns.

# Example: Using Gensim to train Word2Vec embeddings
from gensim.models import Word2Vec

# Sample corpus (normally this would be much larger)
corpus = [
    ["king", "queen", "palace", "royal", "crown"],
    ["man", "woman", "person", "human", "people"],
    ["cat", "dog", "pet", "animal", "fur"]
]

# Train the model
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)

# Access embeddings
king_vector = model.wv['king']
print(f"Shape of 'king' vector: {king_vector.shape}")

# Find similar words
similar_words = model.wv.most_similar('queen', topn=3)
print(f"Words similar to 'queen': {similar_words}")

2. Contextual Embeddings

More modern approaches like BERT, RoBERTa, and GPT models generate contextual embeddings, where the same word can have different vectors depending on its context.

# Example: Using Hugging Face Transformers to get BERT embeddings
from transformers import AutoTokenizer, AutoModel
import torch

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Prepare sentences
sentences = [
    "The bank is closed today.",
    "The river bank is muddy."
]

# Tokenize and get embeddings
for sentence in sentences:
    # Tokenize and prepare for model
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    
    # Get model output
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Use the [CLS] token embedding as sentence representation
    sentence_embedding = outputs.last_hidden_state[:, 0, :].numpy()
    print(f"Shape of embedding for '{sentence}': {sentence_embedding.shape}")

3. Sentence and Document Embeddings

For longer pieces of text, specialized models like Sentence-BERT (SBERT) or OpenAI's text-embedding models can generate embeddings that capture the meaning of entire sentences or documents.

# Example: Using Sentence Transformers for sentence embeddings
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentences
sentences = [
    "This is an example sentence for embedding.",
    "Each sentence will be converted to a vector.",
    "Similar sentences should have similar embeddings."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Print shape information
print(f"Number of sentences: {len(sentences)}")
print(f"Shape of embeddings: {embeddings.shape}")

Working with Embeddings

Once you have embeddings, you can use them for various NLP tasks:

Semantic Search

Embeddings enable searching for content based on meaning rather than just keywords:

import numpy as np
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus of documents
corpus = [
    "Artificial intelligence is transforming industries.",
    "Machine learning models require significant data.",
    "Natural language processing helps computers understand text.",
    "Deep learning is a subset of machine learning.",
    "Computer vision allows machines to interpret visual information."
]

# Convert corpus to embeddings
corpus_embeddings = model.encode(corpus)

# Query
query = "How do computers understand language?"
query_embedding = model.encode(query)

# Calculate cosine similarity (convert the resulting tensor to NumPy for sorting)
similarities = util.cos_sim(query_embedding, corpus_embeddings)[0].cpu().numpy()

# Get top results
top_results = np.argsort(-similarities).tolist()
for idx in top_results[:2]:
    print(f"Similarity: {similarities[idx]:.4f} | {corpus[idx]}")

Text Classification

Embeddings can serve as features for classification models:

from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample dataset
texts = [
    "I love this product, it works great!",
    "The customer service was excellent",
    "This is the worst purchase I've ever made",
    "Completely disappointed with the quality",
    "Highly recommend this to everyone",
    "Terrible experience, would not buy again"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Generate embeddings
embeddings = model.encode(texts)

# Train a classifier
classifier = LogisticRegression()
classifier.fit(embeddings, labels)

# New examples
new_texts = [
    "I'm happy with my purchase",
    "Don't waste your money on this"
]
new_embeddings = model.encode(new_texts)

# Predict
predictions = classifier.predict(new_embeddings)
for text, pred in zip(new_texts, predictions):
    sentiment = "positive" if pred == 1 else "negative"
    print(f"Text: '{text}' | Predicted sentiment: {sentiment}")

Clustering and Topic Modeling

Embeddings can help discover patterns and group similar content:

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample documents
documents = [
    "The economy grew by 3% last quarter according to reports.",
    "Stock markets reached new highs amid positive earnings.",
    "Scientists discover new species in the Amazon rainforest.",
    "Conservation efforts help protect endangered wildlife.",
    "New smartphone features advanced camera technology.",
    "Tech company releases innovative wearable devices."
]

# Generate embeddings
embeddings = model.encode(documents)

# Cluster documents
num_clusters = 3
clustering_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

# Group documents by cluster
clustered_documents = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_documents[cluster_id].append(documents[sentence_id])

# Print clusters
for i, cluster in enumerate(clustered_documents):
    print(f"Cluster {i+1}:")
    for doc in cluster:
        print(f"  - {doc}")
    print()

Evaluating Embedding Quality

The quality of embeddings can be evaluated in several ways:

Intrinsic Evaluation

These methods directly test properties of the embeddings:

# Example: Word analogy tasks (here, model is a gensim KeyedVectors object, e.g. the .wv attribute of a trained Word2Vec model)
def test_analogy(model, a, b, c, expected):
    """Test if a:b :: c:? = expected"""
    result = model.most_similar(positive=[b, c], negative=[a], topn=1)
    print(f"{a} is to {b} as {c} is to {result[0][0]} (expected: {expected})")
    return result[0][0] == expected

# Example: Semantic similarity correlations
from scipy.stats import spearmanr

def evaluate_similarity_correlation(model, similarity_dataset):
    """Evaluate correlation between embedding similarity and human judgments"""
    human_scores = []
    model_scores = []
    
    for item in similarity_dataset:
        word1, word2, human_score = item
        # Calculate cosine similarity between word vectors
        model_score = model.similarity(word1, word2)
        
        human_scores.append(human_score)
        model_scores.append(model_score)
    
    correlation, p_value = spearmanr(human_scores, model_scores)
    print(f"Spearman correlation: {correlation:.4f} (p-value: {p_value:.4f})")
    return correlation
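
Here, similarity_dataset is expected to be a list of (word1, word2, human_score) triples, in the style of benchmarks such as WordSim-353. A tiny made-up invocation (the scores below are illustrative, not taken from any real dataset) could look like:

# Toy dataset of human-judged word pairs (scores on a 0-10 scale)
toy_dataset = [
    ("car", "automobile", 9.5),
    ("coast", "shore", 9.0),
    ("noon", "string", 0.5),
]

# e.g., evaluate the pre-trained GloVe vectors loaded earlier
evaluate_similarity_correlation(glove, toy_dataset)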

Extrinsic Evaluation

These methods test embeddings on downstream tasks:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def evaluate_on_classification(embeddings, labels, test_size=0.3):
    """Evaluate embeddings on a classification task"""
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=test_size, random_state=42
    )
    
    # Train classifier
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X_train, y_train)
    
    # Evaluate
    predictions = classifier.predict(X_test)
    report = classification_report(y_test, predictions)
    print(report)
    
    return classifier

Choosing the Right Embedding Model

There are several factors to consider when selecting an embedding model:

1. Task Requirements

Different tasks have different needs:

  • Semantic similarity: Models like Sentence-BERT excel at capturing similarity between texts
  • Classification/NER: Contextual models like BERT or RoBERTa often perform well
  • Multilingual tasks: Consider models trained on multiple languages
  • Domain-specific needs: Some models are fine-tuned for specific fields like medicine or law

2. Performance Considerations

Practical concerns include:

  • Embedding dimension: Higher dimensions can capture more information but require more storage (see the quick check after this list)
  • Computation time: Larger models often produce better embeddings but are slower to run
  • Memory usage: Some models require significant RAM, especially for large batches
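
As a quick check of the storage trade-off, you can inspect a model's output dimensionality directly; the two model names below are just examples of a smaller and a larger Sentence Transformers model:

# Compare the output dimensions of two sentence embedding models
from sentence_transformers import SentenceTransformer

for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    m = SentenceTransformer(name)
    print(f"{name}: {m.get_sentence_embedding_dimension()} dimensions")  # e.g., 384 vs. 768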

3. Using Pre-trained vs. Custom Embeddings

You can choose between:

  • Pre-trained models: Ready-to-use embeddings from providers like OpenAI or Hugging Face
  • Fine-tuned models: Adapting existing models to your specific domain
  • Custom-trained models: Building embeddings from scratch for your data

# Example: Using OpenAI's embeddings
import openai
import os

openai.api_key = os.environ.get("OPENAI_API_KEY")

def get_openai_embedding(text):
    """Get embeddings from OpenAI API"""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example texts
texts = ["Understanding embeddings is essential for modern NLP",
         "Vector representations enable semantic search and more"]

# Get embeddings
embeddings = [get_openai_embedding(text) for text in texts]
print(f"Embedding dimension: {len(embeddings[0])}")

Best Practices for Working with Embeddings

To get the most out of embeddings in your applications:

1. Preprocessing

Text preprocessing can significantly impact embedding quality:

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def preprocess_text(text, remove_stopwords=False):
    """Basic text preprocessing for embeddings"""
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Optionally remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stop_words])
    
    return text
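
For example, with stopword removal enabled the function strips punctuation and common words (a quick illustration, not a recommendation to always remove stopwords):

text = "The QUICK, brown fox -- jumped!"
print(preprocess_text(text, remove_stopwords=True))
# Output: "quick brown fox jumped"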

2. Caching Embeddings

Computing embeddings can be expensive, so caching them is often beneficial:

import os
import json
import hashlib
import numpy as np

class EmbeddingCache:
    """Simple cache for embeddings"""
    def __init__(self, cache_dir="embedding_cache"):
        self.cache_dir = cache_dir
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)
    
    def _get_cache_key(self, text, model_name):
        """Generate a cache key from text and model name"""
        text_hash = hashlib.md5(text.encode()).hexdigest()
        return f"{model_name}_{text_hash}.npy"
    
    def get(self, text, model_name):
        """Retrieve embedding from cache if available"""
        key = self._get_cache_key(text, model_name)
        cache_path = os.path.join(self.cache_dir, key)
        
        if os.path.exists(cache_path):
            return np.load(cache_path)
        return None
    
    def save(self, text, embedding, model_name):
        """Save embedding to cache"""
        key = self._get_cache_key(text, model_name)
        cache_path = os.path.join(self.cache_dir, key)
        np.save(cache_path, embedding)
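
A possible usage pattern, assuming a Sentence Transformers model, is to check the cache before encoding and fall back to the model on a miss:

# Example: wrapping encode() calls with the cache
from sentence_transformers import SentenceTransformer

model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
cache = EmbeddingCache()

text = "Embeddings can be cached to avoid recomputation."
embedding = cache.get(text, model_name)
if embedding is None:
    embedding = model.encode(text)
    cache.save(text, embedding, model_name)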

3. Dimensionality Reduction

For visualizing or improving efficiency, you might reduce embedding dimensions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_embeddings(embeddings, labels):
    """Visualize embeddings in 2D using PCA"""
    # Reduce to 2 dimensions
    pca = PCA(n_components=2)
    reduced_embeddings = pca.fit_transform(embeddings)
    
    # Plot
    plt.figure(figsize=(10, 8))
    unique_labels = set(labels)
    colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
    
    for label, color in zip(unique_labels, colors):
        indices = [i for i, l in enumerate(labels) if l == label]
        plt.scatter(
            reduced_embeddings[indices, 0],
            reduced_embeddings[indices, 1],
            c=[color],
            label=label
        )
    
    plt.legend()
    plt.title("Embedding visualization")
    plt.xlabel("PCA dimension 1")
    plt.ylabel("PCA dimension 2")
    plt.show()
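
Reusing the document embeddings and cluster assignments from the clustering example above, the function can be called like this:

# Visualize the clustered documents in 2D
visualize_embeddings(embeddings, cluster_assignment)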

Advanced Topics

As you become more comfortable with embeddings, consider exploring:

Hybrid Search

Combining embeddings with traditional keyword search for better results:

import numpy as np

def hybrid_search(query, documents, model, keyword_weight=0.3):
    """Combine semantic and keyword search"""
    # Semantic search
    query_embedding = model.encode(query)
    document_embeddings = model.encode(documents)
    
    semantic_scores = []
    for doc_emb in document_embeddings:
        # Calculate cosine similarity
        similarity = np.dot(query_embedding, doc_emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
        )
        semantic_scores.append(similarity)
    
    # Keyword search (simple implementation)
    keyword_scores = []
    query_terms = set(query.lower().split())
    for doc in documents:
        doc_terms = set(doc.lower().split())
        # Jaccard similarity
        intersection = len(query_terms.intersection(doc_terms))
        union = len(query_terms.union(doc_terms))
        score = intersection / union if union > 0 else 0
        keyword_scores.append(score)
    
    # Combine scores
    combined_scores = [
        (1 - keyword_weight) * sem_score + keyword_weight * kw_score
        for sem_score, kw_score in zip(semantic_scores, keyword_scores)
    ]
    
    # Return indices sorted by combined score
    return sorted(range(len(combined_scores)), key=lambda i: combined_scores[i], reverse=True)
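
For example, reusing the corpus and model from the semantic search example earlier, the top results can be retrieved like this:

# Rank documents by the combined semantic + keyword score
ranked = hybrid_search("How do computers understand language?", corpus, model)
for idx in ranked[:2]:
    print(corpus[idx])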

Fine-tuning Embedding Models

Adapting pre-trained models to specific domains:

# Simplified example of fine-tuning Sentence-BERT
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

def fine_tune_embedding_model(model_name, train_examples, epochs=3):
    """Fine-tune a sentence embedding model on custom data"""
    # Load base model
    model = SentenceTransformer(model_name)
    
    # Prepare training examples
    train_data = [
        InputExample(texts=[ex[0], ex[1]], label=float(ex[2]))
        for ex in train_examples
    ]
    
    # Define dataloader
    train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
    
    # Define loss
    train_loss = losses.CosineSimilarityLoss(model)
    
    # Train model
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=100,
        show_progress_bar=True
    )
    
    return model
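
The train_examples are (text, text, similarity score) triples; a toy invocation (the pairs and scores below are made up purely for illustration) might look like:

# Hypothetical similarity-labeled pairs (scores in [0, 1])
train_examples = [
    ("The patient has a fever.", "The patient is running a temperature.", 0.9),
    ("The patient has a fever.", "The invoice was paid late.", 0.1),
]

fine_tuned = fine_tune_embedding_model("all-MiniLM-L6-v2", train_examples, epochs=1)
fine_tuned.save("my-fine-tuned-model")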

Conclusion

Text embeddings have revolutionized natural language processing by providing a way to represent semantic meaning in a format computers can process. From simple word vectors to powerful contextual embeddings, these representations enable a wide range of applications including semantic search, classification, clustering, and much more.

As you implement embeddings in your projects, remember that the choice of model and preprocessing steps should align with your specific use case. By following best practices and understanding the fundamentals outlined in this guide, you'll be well-equipped to leverage the power of embeddings in your NLP applications.