Overview
Hybrid Search combines multiple retrieval approaches:
- Keyword-based search (sparse retrieval)
- Semantic search (dense retrieval)
- Fusion of results using weighted scoring
- Dynamic adjustment of retrieval strategies (a sketch of this follows the usage example below)
Implementation Example
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

class HybridRetriever:
    def __init__(self, documents):
        self.documents = documents
        # Sparse index: TF-IDF rows are L2-normalized by default,
        # so dot products with a query vector are cosine similarities.
        self.tfidf = TfidfVectorizer()
        self.tfidf_matrix = self.tfidf.fit_transform(documents)
        # Dense index: sentence embeddings for semantic matching
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.embedder.encode(documents)

    def search(self, query, alpha=0.7):
        # Sparse retrieval (TF-IDF)
        query_vec = self.tfidf.transform([query])
        sparse_scores = (query_vec @ self.tfidf_matrix.T).toarray()[0]

        # Dense retrieval (embeddings)
        query_embedding = self.embedder.encode(query)
        dense_scores = cosine_similarity(
            query_embedding.reshape(1, -1),
            self.embeddings
        )[0]

        # Fuse the two score lists with a weighted sum;
        # alpha controls the dense/sparse trade-off.
        combined_scores = alpha * dense_scores + (1 - alpha) * sparse_scores

        # Rank documents by combined score, best first
        top_indices = np.argsort(combined_scores)[::-1]
        return [(self.documents[i], combined_scores[i])
                for i in top_indices]

# Usage
documents = [...]  # Your document collection
retriever = HybridRetriever(documents)
results = retriever.search("RAG implementation")
```
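The overview also mentions dynamic adjustment of retrieval strategies, which the example above does not implement. A minimal sketch of that idea, assuming the `HybridRetriever` defined above and a hypothetical `choose_alpha` heuristic (the thresholds and weights below are illustrative assumptions, not tuned values), is to shift the dense/sparse weight based on simple query features:

```python
def choose_alpha(query):
    # Hypothetical heuristic (assumption): short, keyword-like queries
    # lean on sparse matching; longer natural-language queries lean on
    # dense semantic matching.
    tokens = query.split()
    if len(tokens) <= 2:
        return 0.3  # favor TF-IDF / exact terms
    if any(tok.isupper() or tok.isdigit() for tok in tokens):
        return 0.4  # likely identifiers, codes, or acronyms
    return 0.8      # favor semantic similarity

query = "RAG implementation checklist"
results = retriever.search(query, alpha=choose_alpha(query))
```

In practice, the adjustment signal could come from query classification or past click data rather than surface features; the point is only that alpha need not be fixed.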
When to Use
- When both precision and recall are important
- For queries that benefit from both keyword and semantic matching
- When dealing with diverse document types and query patterns
- When you need to balance exact keyword matches against conceptual understanding (see the example below)
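As a rough illustration of the last point, the `alpha` parameter from the example above can be biased toward either end depending on the query pattern. The specific values and queries here are assumptions for illustration, not tuned recommendations:

```python
# Mostly exact/keyword matching: weight the sparse TF-IDF scores
exact_results = retriever.search("TfidfVectorizer fit_transform error", alpha=0.2)

# Mostly conceptual matching: weight the dense embedding scores
conceptual_results = retriever.search(
    "how do I ground LLM answers in my own documents", alpha=0.9
)
```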