Hybrid Search and RAG on Amazon Reviews
Summary
This project builds an information retrieval and question-answering system over the Musical Instruments category of the Amazon Reviews 2023 dataset. It implements BM25 (keyword-based) and semantic search (embedding-based) retrievers, combines them into a hybrid retriever, and then layers a Retrieval-Augmented Generation (RAG) pipeline on top to answer shopping-style queries using real reviews.
An interactive Shiny app lets users choose BM25, semantic, or hybrid retrieval, submit natural-language queries, inspect retrieved reviews, and view LLM-generated answers grounded in that context. The work also explores scaling the corpus from a 500-document sample up to 135,701 documents and compares different Llama 3 models for answer quality.
Data
The system operates on the Musical Instruments subset of the Amazon Reviews 2023 dataset, working with 135,701 review–product pairs after joining review text with product metadata.
Each document passed to the retrievers is formed by concatenating three fields into a single text string:
- product_title — product name
- review_title — user-written review headline
- review_text — full review body
Additional metadata such as ratings, average_rating, and price are retained for display in the app but not used directly for retrieval scoring.
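As a minimal sketch of the document construction described above, the three fields can be joined into one string. The function name `build_document`, the dictionary record layout, the single-space separator, and the sample record are assumptions made for illustration; the project may join the fields differently.

```python
def build_document(record: dict) -> str:
    """Concatenate the three retrieval fields into one text string."""
    fields = ("product_title", "review_title", "review_text")
    # Skip missing or empty fields so the result has no stray separators.
    parts = [str(record[f]).strip() for f in fields if record.get(f)]
    return " ".join(parts)


# Made-up record, for illustration only (not taken from the dataset).
example = {
    "product_title": "Clip-On Guitar Tuner",
    "review_title": "Works great",
    "review_text": "Accurate and easy to read on stage.",
    "price": 12.99,  # kept for display, not part of the retrieval text
}
doc = build_document(example)
```

Keeping metadata such as price out of the concatenated string means it cannot influence BM25 or embedding scores, which matches the design described above.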
Retrieval Design
Preprocessing
Text preprocessing is applied consistently to both the corpus and incoming queries: lowercasing, punctuation removal, whitespace normalization, and stopword removal. The English stopword list is deliberately minimal (e.g. “the”, “and”, “is”) to preserve recall on specific product queries. The processed corpus is saved to a Parquet file for reuse.
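The preprocessing steps could look roughly like the sketch below. The function name `preprocess` and the three-word `STOPWORDS` set are assumptions; the project's actual stopword list is not shown here.

```python
import string

# Hypothetical minimal stopword list; the real list may contain more words.
STOPWORDS = {"the", "and", "is"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, normalize whitespace, drop stopwords."""
    text = text.lower().translate(_PUNCT_TABLE)
    # str.split() with no argument collapses any run of whitespace.
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]
```

Applying the same function to the corpus and to each query keeps the two token vocabularies aligned, which is what keyword matching depends on.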
BM25 (keyword-based)
BM25 retrieval is implemented with rank_bm25 (BM25Okapi). Documents are pre-tokenized and stored alongside the processed corpus so that index builds do not need to re-tokenize all 135k documents. The BM25 index is serialized and loaded at app startup.
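A hedged sketch of building, serializing, and querying such an index is shown below. It assumes pickle as the serialization format and uses hypothetical helper names (`build_and_save_bm25`, `load_bm25`, `top_k`); only `BM25Okapi` and its `get_scores` method come from rank_bm25.

```python
import pickle
from pathlib import Path


def build_and_save_bm25(tokenized_docs, path):
    """Build a BM25Okapi index from pre-tokenized documents and pickle it."""
    # Imported lazily so the pure-Python helper below works without the package.
    from rank_bm25 import BM25Okapi

    index = BM25Okapi(tokenized_docs)
    with Path(path).open("wb") as fh:
        pickle.dump(index, fh)
    return index


def load_bm25(path):
    """Load a previously pickled index, e.g. at app startup."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

At query time, `index.get_scores(query_tokens)` returns one score per document, and `top_k` turns those scores into a ranked list of document positions.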
Semantic Search (embeddings)
Semantic search uses the sentence-transformers model all-MiniLM-L6-v2 to embed documents and queries, with FAISS (IndexFlatIP) as the vector index. Embeddings are L2-normalized so cosine similarity can be computed via inner product. The embedding pipeline is optimized for scale by avoiding unnecessary array copies and increasing batch size.
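The reason normalization matters can be checked on toy vectors: once both vectors have unit length, their inner product equals their cosine similarity. The sketch below uses plain Python lists and made-up three-dimensional vectors for clarity; in the pipeline the same operation would be applied to the embedding arrays before they are added to the index.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a query embedding and a document embedding.
query = [1.0, 2.0, 2.0]
document = [2.0, 0.0, 1.0]
inner = dot(l2_normalize(query), l2_normalize(document))
```

Because of this identity, an inner-product index over normalized embeddings ranks documents exactly as cosine similarity would.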
Hybrid Retrieval
A HybridRetriever combines BM25 and semantic search using Reciprocal Rank Fusion (RRF). Rather than ranking the full 135k-document corpus on every query, each retriever returns a capped candidate set (e.g. max(top_k × 10, 200)) and RRF fuses them, which preserves quality while keeping latency manageable.
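A minimal sketch of the fusion step follows. RRF scores each document as the sum of 1 / (k + rank) over the rankings it appears in. The constant `RRF_K = 60` is a commonly used default and an assumption here, since the value used in the project is not stated; the function names and the tie-breaking rule are also illustrative.

```python
from collections import defaultdict

RRF_K = 60  # assumed smoothing constant; not confirmed by the project


def candidate_pool_size(top_k: int) -> int:
    """Cap on candidates requested from each retriever before fusion."""
    return max(top_k * 10, 200)


def rrf_fuse(rankings, top_k, k=RRF_K):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_k]


# Toy rankings of document ids from the two retrievers.
bm25_ranking = [3, 1, 7]
semantic_ranking = [1, 9, 3]
fused = rrf_fuse([bm25_ranking, semantic_ranking], top_k=3)
```

RRF uses only rank positions, so the BM25 and cosine scores never need to be rescaled onto a common range before they are combined.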
RAG Pipeline & LLM Choice
On top of retrieval, the project implements two RAG pipelines:
- SemanticRAGPipeline — uses the FAISS semantic retriever
- HybridRAGPipeline — uses the hybrid BM25 + semantic retriever
Each RAG pipeline follows a three-stage pattern: retrieve top-k documents, build a structured context from reviews and product titles, and send the query plus context to an LLM for answer generation.
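The three stages can be sketched as below. The retriever and the LLM call are passed in as plain callables so the pattern is visible without any API client; the names `build_context`, `answer`, `fake_retrieve`, and `echo_llm`, the prompt wording, and the sample review are all assumptions made for illustration.

```python
def build_context(docs: list[dict]) -> str:
    """Format retrieved reviews as numbered blocks the LLM can cite."""
    blocks = [
        f"[{i}] Product: {doc['product_title']}\nReview: {doc['review_text']}"
        for i, doc in enumerate(docs, start=1)
    ]
    return "\n\n".join(blocks)


def answer(query, retrieve, generate, top_k=5):
    """Retrieve top-k documents, build the context, and query the LLM."""
    docs = retrieve(query, top_k)
    context = build_context(docs)
    prompt = (
        "Answer the question using only the reviews below. "
        "Cite reviews by their bracketed number.\n\n"
        f"Reviews:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)


# Stand-ins for a real retriever and a hosted LLM call.
def fake_retrieve(query, top_k):
    return [{"product_title": "Mic Stand", "review_text": "Sturdy on stage."}][:top_k]


def echo_llm(prompt):
    return prompt  # a real implementation would call the model API here


result = answer("durable microphone stand", fake_retrieve, echo_llm)
```

Swapping `fake_retrieve` for the semantic or hybrid retriever is the only difference between the two pipelines under this sketch.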
Several Llama 3 models served through Groq were evaluated, including llama-3.1-8b-instant and llama-3.3-70b-versatile, on realistic shopping queries (e.g. “best acoustic guitar for beginners under $100”, “durable microphone stand for live performances”). The 70B model consistently produced more precise, well-cited answers and was chosen as the default in the final pipeline.
Evaluation & Limitations
The hybrid RAG pipeline was evaluated qualitatively on five queries spanning three types (keyword, semantic, and complex multi-constraint), with answers scored on accuracy, completeness, and fluency. With the larger corpus, the pipeline produced substantive, grounded answers for all five queries, whereas earlier experiments on a smaller index often failed because relevant documents were missing from the index.
Two key limitations emerged. First, the retriever cannot enforce all query constraints simultaneously (e.g. “for rock” and “under $300” for a beginner electric guitar). Second, the LLM tends to generate an answer from weak or loosely related context rather than explicitly stating that a constraint cannot be verified. Suggested improvements include adding a cross-encoder re-ranker, tightening prompts to enforce constraint checking, and using metadata filters (e.g. restricting to electric guitars) before retrieval.
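One way the suggested metadata filter might work is to narrow the set of eligible documents before any ranking happens. The sketch below filters on the price field; the function name `allowed_ids_by_price`, the record layout, and the choice to drop records with no price are assumptions, and this is a proposed improvement rather than something the current pipeline does.

```python
def allowed_ids_by_price(records: dict, max_price: float) -> set:
    """Return ids of records whose price is known and at most max_price."""
    return {
        doc_id
        for doc_id, rec in records.items()
        # Records with no price are excluded because the limit cannot be verified.
        if rec.get("price") is not None and rec["price"] <= max_price
    }


def restrict(ranking: list, allowed: set) -> list:
    """Keep only allowed ids, preserving the retriever's order."""
    return [doc_id for doc_id in ranking if doc_id in allowed]


# Made-up records, for illustration only.
records = {
    1: {"product_title": "Electric Guitar A", "price": 249.0},
    2: {"product_title": "Electric Guitar B", "price": 899.0},
    3: {"product_title": "Electric Guitar C", "price": None},
}
allowed = allowed_ids_by_price(records, max_price=300)
filtered = restrict([2, 3, 1], allowed)
```

A hard filter like this guarantees the price constraint is met, at the cost of discarding reviews for products whose price is missing from the metadata.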
Scaling & Deployment Considerations
To scale beyond 100k documents, the workflow optimizes corpus storage, BM25 index construction, and embedding generation, reducing redundant work and memory usage. Indexes are saved as artifacts that can be loaded into memory on app startup.
The project also outlines a cloud deployment plan: storing raw and processed data plus indexes in S3, running the containerized Shiny app on EC2 behind a load balancer, and periodically rebuilding indexes via scheduled jobs (e.g. AWS Lambda) when new products are added. LLM inference is assumed to use hosted APIs rather than self-hosted models to avoid GPU infrastructure overhead.