How to Implement Semantic Search Without a PhD
Semantic search has transformed how we find information in large document collections, moving beyond simple keyword matching to understanding the actual meaning behind our queries. While implementing semantic search used to require advanced machine learning knowledge, modern tools and libraries have made it accessible to developers of all skill levels.
Understanding Semantic Search
At its core, semantic search works by converting text into vector representations (embeddings) that capture meaning. Similar meanings result in similar vectors, allowing us to find relevant documents by measuring vector similarity.
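To make that concrete, here is a minimal sketch of the similarity measurement itself, using cosine similarity on made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the math is identical):

import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: near 1.0 means similar direction (similar meaning)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings", for illustration only
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_sim(dog, puppy))  # ~0.98: similar meaning
print(cosine_sim(dog, car))    # ~0.02: unrelated meaning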
The key components of a semantic search system include:
- Embedding model: Converts text to vectors
- Vector database: Stores and enables efficient search of these vectors
- Retrieval mechanism: Finds the most similar vectors to a query
- Ranking system: Orders results by relevance
A Simple Implementation in Python
Let's build a practical semantic search system using widely available tools:
import os

import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

class SimpleSemanticSearch:
    def __init__(self, embedding_model="text-embedding-3-small"):
        self.embedding_model = embedding_model
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents):
        """Add documents to the search index."""
        self.documents.extend(documents)
        # Generate embeddings for the new documents
        new_embeddings = []
        for doc in tqdm(documents, desc="Embedding documents"):
            embedding = self._get_embedding(doc)
            new_embeddings.append(embedding)
        self.embeddings.extend(new_embeddings)
        print(f"Added {len(documents)} documents to index")

    def search(self, query, top_k=5):
        """Search for documents similar to the query."""
        query_embedding = self._get_embedding(query)
        # Calculate similarity between the query and all documents
        similarities = cosine_similarity(
            [query_embedding],
            self.embeddings
        )[0]
        # Get indices of the top-k most similar documents
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        # Return the top-k documents and their similarity scores
        results = []
        for idx in top_indices:
            results.append({
                "document": self.documents[idx],
                "similarity": similarities[idx]
            })
        return results

    def _get_embedding(self, text):
        """Generate an embedding for a text using the OpenAI API."""
        response = client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding

# Example usage
if __name__ == "__main__":
    # Sample documents (in a real scenario, these might be loaded from files)
    documents = [
        "Machine learning is a subfield of artificial intelligence that focuses on data and algorithms.",
        "Natural language processing allows computers to understand human language.",
        "Deep learning uses neural networks with many layers.",
        "Python is a popular programming language for data science and AI.",
        "Vector databases store and retrieve high-dimensional vectors efficiently."
    ]

    # Create the search engine and add documents
    search_engine = SimpleSemanticSearch()
    search_engine.add_documents(documents)

    # Search example
    query = "How do computers understand text?"
    results = search_engine.search(query, top_k=2)

    print(f"\nSearch results for: '{query}'")
    for i, result in enumerate(results, 1):
        print(f"{i}. {result['document']} (Similarity: {result['similarity']:.4f})")
This code creates a simple but functional semantic search engine that:
- Takes a collection of documents
- Generates embeddings using OpenAI's model
- Performs similarity search on those embeddings
- Returns the most relevant documents for a query
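One easy optimization: the code above makes one API call per document, but OpenAI's embeddings endpoint also accepts a list of strings, so documents can be embedded in batches. Here is a sketch of a batched helper you could add to SimpleSemanticSearch (the batch size of 100 is an arbitrary choice, not an API requirement):

    def _get_embeddings_batched(self, texts, batch_size=100):
        """Embed many texts with far fewer API calls by sending them in batches."""
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = client.embeddings.create(
                model=self.embedding_model,
                input=batch  # the endpoint accepts a list of strings
            )
            # response.data preserves the order of the inputs
            all_embeddings.extend(item.embedding for item in response.data)
        return all_embeddings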
Scaling Up: Vector Databases
For production systems with larger document collections, you'll want to use a proper vector database instead of the in-memory approach above. Popular options include:
- Pinecone: Fully managed vector database with great performance
- Weaviate: Open-source vector search engine with filtering capabilities
- Qdrant: Self-hostable vector search with extensive filtering options
- Chroma: Lightweight embedding database designed for RAG applications
Here's how you might modify our example to use Chroma:
import os

import chromadb
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Initialize an in-memory Chroma client
chroma_client = chromadb.Client()

# Create a collection configured for cosine similarity
collection = chroma_client.create_collection(
    name="document_collection",
    metadata={"hnsw:space": "cosine"}  # use cosine distance for the index
)

def get_embedding(text, model="text-embedding-3-small"):
    """Generate an embedding using the OpenAI API."""
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

# Sample documents with IDs
documents = [
    "Machine learning is a subfield of artificial intelligence that focuses on data and algorithms.",
    "Natural language processing allows computers to understand human language.",
    "Deep learning uses neural networks with many layers.",
    "Python is a popular programming language for data science and AI.",
    "Vector databases store and retrieve high-dimensional vectors efficiently."
]

# Add documents to the collection
document_ids = [f"doc_{i}" for i in range(len(documents))]
embeddings = [get_embedding(doc) for doc in documents]

collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=document_ids
)

# Perform a search
query = "How do computers understand text?"
query_embedding = get_embedding(query)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)

print(f"Search results for: '{query}'")
for i, (doc_id, document, distance) in enumerate(zip(
    results['ids'][0], results['documents'][0], results['distances'][0]
), 1):
    similarity = 1 - distance  # cosine distance = 1 - cosine similarity
    print(f"{i}. {document} (Similarity: {similarity:.4f})")
Handling Large Documents: Chunking
Real-world documents are often long and need to be broken into smaller pieces (chunks) before embedding, both because embedding models limit how much input they accept and because smaller chunks give retrieval finer granularity. A simple approach:
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]

    chunks = []
    start = 0
    while start < len(text):
        # Tentative end of the chunk
        end = start + chunk_size
        # If we're not at the end of the text, try to break at a sentence boundary
        if end < len(text):
            # Use the latest sentence-ending punctuation in the chunk, but only
            # if it doesn't leave the chunk pathologically short
            best_break = max(text[start:end].rfind(p) for p in ['. ', '? ', '! '])
            if best_break > chunk_size // 2:
                end = start + best_break + 2  # +2 keeps the punctuation and space
        # Add the chunk to our list
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by the overlap, but always move forward to avoid an
        # infinite loop if the break point lands too early
        start = max(end - overlap, start + 1)
    return chunks
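Chunking slots in just before indexing: split each long document into chunks, then index the chunks in place of the full texts. A short usage sketch with the SimpleSemanticSearch class from earlier (long_document is a placeholder for your own text):

long_document = "..."  # placeholder: a long article, report, manual, etc.

chunks = chunk_text(long_document, chunk_size=1000, overlap=200)

search_engine = SimpleSemanticSearch()
search_engine.add_documents(chunks)  # each chunk is embedded and retrieved independently

The overlap matters: without it, a sentence straddling a chunk boundary would be split in two, and neither chunk's embedding would capture it fully.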
Improving Search Quality
To enhance your semantic search, consider these techniques:
- Hybrid search: Combine semantic search with keyword matching for better results on specific terms (a sketch follows this list)
- Re-ranking: Use a more powerful model to re-rank the top results from your initial search
- Query expansion: Automatically expand queries to include related terms or concepts
- Filtering: Allow users to filter results by metadata (date, author, category, etc.)
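Hybrid search is the easiest of these to prototype. Here is a minimal sketch built on the SimpleSemanticSearch class from earlier; keyword_score is a deliberately crude stand-in for a real lexical scorer like BM25, and the 0.7/0.3 weighting is an arbitrary starting point to tune, not a recommendation:

def keyword_score(query, document):
    """Fraction of query words that appear in the document (crude stand-in for BM25)."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

def hybrid_search(engine, query, top_k=5, semantic_weight=0.7):
    """Blend semantic similarity with keyword overlap and re-sort."""
    # Over-fetch semantic candidates, then re-score them with the blended score
    candidates = engine.search(query, top_k=top_k * 3)
    for result in candidates:
        kw = keyword_score(query, result["document"])
        result["score"] = (semantic_weight * result["similarity"]
                           + (1 - semantic_weight) * kw)
    candidates.sort(key=lambda r: r["score"], reverse=True)
    return candidates[:top_k]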
Conclusion
Building a semantic search system no longer requires advanced degrees or specialized knowledge. With modern tools and APIs, you can implement sophisticated semantic search capabilities in a few hundred lines of code.
The examples provided here give you a starting point that you can build upon, customize, and scale according to your specific needs. As you grow more comfortable with these concepts, you can explore more advanced techniques and optimizations to create an even more powerful search experience.
Remember that the quality of your embeddings plays a crucial role in search performance, so experimenting with different embedding models can yield significant improvements as your system evolves.