Monday, July 22, 2024

How to Implement Semantic Search Without a PhD

Abdellatif Abdelfattah

Semantic search has transformed how we find information in large document collections, moving beyond simple keyword matching to understanding the actual meaning behind our queries. While implementing semantic search used to require advanced machine learning knowledge, modern tools and libraries have made it accessible to developers of all skill levels.

Understanding Semantic Search

At its core, semantic search works by converting text into vector representations (embeddings) that capture meaning. Similar meanings result in similar vectors, allowing us to find relevant documents by measuring vector similarity.
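
To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure used in the examples later in this post. The three-dimensional vectors are made-up toy values, not output from a real embedding model (real models return hundreds or thousands of dimensions), and the function name `cosine_sim` is just an illustrative choice.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values chosen for illustration only)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(f"cat vs kitten:  {cosine_sim(cat, kitten):.3f}")
print(f"cat vs invoice: {cosine_sim(cat, invoice):.3f}")
```

The first pair points in nearly the same direction and scores close to 1, while the second pair scores close to 0. A semantic search engine ranks documents by exactly this kind of score.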

The key components of a semantic search system include:

  1. Embedding model: Converts text to vectors
  2. Vector database: Stores and enables efficient search of these vectors
  3. Retrieval mechanism: Finds the most similar vectors to a query
  4. Ranking system: Orders results by relevance

A Simple Implementation in Python

Let's build a practical semantic search system using widely available tools:

import os
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

class SimpleSemanticSearch:
    def __init__(self, embedding_model="text-embedding-3-small"):
        self.embedding_model = embedding_model
        self.documents = []
        self.embeddings = []
        
    def add_documents(self, documents):
        """Add documents to the search index"""
        # Generate embeddings first so a failed API call cannot leave
        # self.documents and self.embeddings out of sync
        new_embeddings = []
        for doc in tqdm(documents, desc="Embedding documents"):
            embedding = self._get_embedding(doc)
            new_embeddings.append(embedding)
        
        self.documents.extend(documents)
        self.embeddings.extend(new_embeddings)
        print(f"Added {len(documents)} documents to index")
        
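    def search(self, query, top_k=5):
        """Search for documents similar to the query"""
        # Nothing indexed yet: cosine_similarity would fail on an empty list
        if not self.embeddings:
            return []
        
        query_embedding = self._get_embedding(query)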
        
        # Calculate similarity between query and all documents
        similarities = cosine_similarity(
            [query_embedding], 
            self.embeddings
        )[0]
        
        # Get indices of top-k most similar documents
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        # Return top-k documents and their similarity scores
        results = []
        for idx in top_indices:
            results.append({
                "document": self.documents[idx],
                # Convert NumPy float to a plain float (e.g. for JSON output)
                "similarity": float(similarities[idx])
            })
        
        return results
    
    def _get_embedding(self, text):
        """Generate embedding for a text using OpenAI API"""
        response = client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding

# Example usage
if __name__ == "__main__":
    # Sample documents (in a real scenario, these might be loaded from files)
    documents = [
        "Machine learning is a subfield of artificial intelligence that focuses on data and algorithms.",
        "Natural language processing allows computers to understand human language.",
        "Deep learning uses neural networks with many layers.",
        "Python is a popular programming language for data science and AI.",
        "Vector databases store and retrieve high-dimensional vectors efficiently."
    ]
    
    # Create search engine and add documents
    search_engine = SimpleSemanticSearch()
    search_engine.add_documents(documents)
    
    # Search example
    query = "How do computers understand text?"
    results = search_engine.search(query, top_k=2)
    
    print(f"\nSearch results for: '{query}'")
    for i, result in enumerate(results, 1):
        print(f"{i}. {result['document']} (Similarity: {result['similarity']:.4f})")

This code creates a simple but functional semantic search engine that:

  1. Takes a collection of documents
  2. Generates embeddings using OpenAI's model
  3. Performs similarity search on those embeddings
  4. Returns the most relevant documents for a query
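
The class above makes one API call per document. The OpenAI embeddings endpoint also accepts a list of inputs, so requests can be batched to cut down on round trips. The sketch below shows only the batching logic: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the block runs without an API key. In real use you would pass a function that calls `client.embeddings.create(model=..., input=batch)` and returns `[item.embedding for item in response.data]`.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts in groups of batch_size, preserving input order.

    embed_batch is any callable that maps a list of strings to a list of
    vectors (one per string, in the same order).
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned the wrong number of vectors")
        embeddings.extend(vectors)
    return embeddings

# Stand-in embedder so the sketch runs offline (not a real model)
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(text)), float(text.count(" "))] for text in batch]

docs = [f"document number {i}" for i in range(7)]
vectors = embed_in_batches(docs, fake_embed, batch_size=3)
print(len(vectors), "vectors from", len(calls), "calls")
```

The batch size of 100 is an arbitrary default for illustration. Check the current API limits on inputs per request and tokens per input before choosing a value.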

Scaling Up: Vector Databases

For production systems with larger document collections, you'll want to use a proper vector database instead of the in-memory approach above. Popular options include:

  • Pinecone: Fully managed vector database with great performance
  • Weaviate: Open-source vector search engine with filtering capabilities
  • Qdrant: Self-hostable vector search with extensive filtering options
  • Chroma: Lightweight embedding database designed for RAG applications

Here's how you might modify our example to use Chroma:

import chromadb
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Initialize Chroma client
chroma_client = chromadb.Client()

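# Create the collection, or reuse it if it already exists
# (create_collection alone raises an error when the name is taken)
collection = chroma_client.get_or_create_collection(
    name="document_collection",
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)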

def get_embedding(text, model="text-embedding-3-small"):
    """Generate embeddings using OpenAI API"""
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

# Sample documents with IDs
documents = [
    "Machine learning is a subfield of artificial intelligence that focuses on data and algorithms.",
    "Natural language processing allows computers to understand human language.",
    "Deep learning uses neural networks with many layers.",
    "Python is a popular programming language for data science and AI.",
    "Vector databases store and retrieve high-dimensional vectors efficiently."
]

# Add documents to the collection
document_ids = [f"doc_{i}" for i in range(len(documents))]
embeddings = [get_embedding(doc) for doc in documents]

collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=document_ids
)

# Perform a search
query = "How do computers understand text?"
query_embedding = get_embedding(query)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)

print(f"Search results for: '{query}'")
for i, (doc_id, document, distance) in enumerate(zip(
        results['ids'][0], results['documents'][0], results['distances'][0]
    ), 1):
    similarity = 1 - distance  # Cosine distance is 1 - cosine similarity
    print(f"{i}. {document} (Similarity: {similarity:.4f})")

Handling Large Documents: Chunking

Real-world documents are often long and need to be broken into smaller pieces (chunks) before embedding. A simple approach:

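def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks of approximately chunk_size characters"""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]
    
    chunks = []
    start = 0
    
    while start < len(text):
        # Find the end of the chunk, never running past the end of the text
        end = min(start + chunk_size, len(text))
        
        # If we're not at the end of the text, try to find a natural break point
        if end < len(text):
            # Look for the last period, question mark, or exclamation point
            last_punct = max(text.rfind(p, start, end) for p in ('. ', '? ', '! '))
            # Break there only if the next start (end - overlap) still moves
            # forward; otherwise an early break point would loop forever
            if last_punct != -1 and last_punct + 2 - overlap > start:
                end = last_punct + 2  # +2 to include the punctuation and space
        
        # Add the chunk to our list
        chunks.append(text[start:end])
        
        # Stop once the final chunk reaches the end of the text
        if end >= len(text):
            break
        
        # Move the start position, accounting for overlap
        start = end - overlap
    
    return chunks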
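
Once documents are chunked, each chunk needs to point back to its source so that a search hit can be traced to the original document. Here is a minimal sketch of that bookkeeping. The names `split_fixed` and `build_chunk_records` and the record fields are hypothetical, and the splitter is a plain fixed-size window used only to keep the block self-contained. A sentence-aware chunker could be swapped in.

```python
def split_fixed(text, chunk_size=1000, overlap=200):
    """Fixed-size character windows; a stand-in for a smarter chunker."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    # Stop early enough that the last window is not just leftover overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def build_chunk_records(docs, chunk_size=1000, overlap=200):
    """Return one record per chunk with an ID and a pointer to its source."""
    records = []
    for doc_id, text in docs.items():
        for n, chunk in enumerate(split_fixed(text, chunk_size, overlap)):
            records.append({
                "id": f"{doc_id}_chunk_{n}",
                "source": doc_id,
                "text": chunk,
            })
    return records

# Tiny sizes so the windows are easy to follow
docs = {"guide": "a" * 25, "faq": "b" * 8}
records = build_chunk_records(docs, chunk_size=10, overlap=2)
for record in records:
    print(record["id"], len(record["text"]))
```

With Chroma, the `id` values could be passed as `ids` and the `source` fields as `metadatas` when calling `collection.add`, so each result carries its origin.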

Improving Search Quality

To enhance your semantic search, consider these techniques:

  1. Hybrid search: Combine semantic search with keyword matching for better results on specific terms

  2. Re-ranking: Use a more powerful model to re-rank the top results from your initial search

  3. Query expansion: Automatically expand queries to include related terms or concepts

  4. Filtering: Allow users to filter results by metadata (date, author, category, etc.)
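
As one way to approach the hybrid search idea above, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) across the lists it appears in. The function name, the document IDs, and the two input rankings are illustrative assumptions; k=60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher fused score means more relevant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results from a keyword search and a semantic search
keyword_hits = ["doc_3", "doc_1", "doc_4"]
semantic_hits = ["doc_1", "doc_2", "doc_3"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because RRF uses only rank positions, it avoids having to put keyword scores and cosine similarities on the same scale. Documents that appear in both lists rise to the top.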

Conclusion

Building a semantic search system no longer requires advanced degrees or specialized knowledge. With modern tools and APIs, you can implement sophisticated semantic search capabilities in a few hundred lines of code.

The examples provided here give you a starting point that you can build upon, customize, and scale according to your specific needs. As you grow more comfortable with these concepts, you can explore more advanced techniques and optimizations to create an even more powerful search experience.

Remember that the quality of your embeddings plays a crucial role in search performance, so experimenting with different embedding models can yield significant improvements as your system evolves.