§ 3 Retrieval Accuracy Benchmark
Retrieval-Augmented Generation (RAG) promises to extend LLM memory indefinitely, but at what cost to precision? This module exposes the hidden tradeoffs: how chunk size, overlap, top-K depth, and corpus scale interact to determine whether the right memory is retrieved. Every slider movement pulls precision, recall, F1, and MRR from a matrix of 500+ precomputed benchmark configurations.
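These four quality metrics can each be computed per query from the retrieved ranking and a gold-standard set of relevant chunks. A minimal sketch (the chunk IDs and gold set below are hypothetical examples, not the module's actual benchmark data):

```python
def precision_recall_f1(retrieved, relevant):
    """retrieved: ranked list of chunk IDs; relevant: set of gold chunk IDs."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant hit; MRR averages this over all queries."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical top-5 retrieval for one query
retrieved = ["c3", "c7", "c1", "c9", "c4"]
relevant = {"c1", "c3", "c8"}
p, r, f = precision_recall_f1(retrieved, relevant)
rr = reciprocal_rank(retrieved, relevant)  # first relevant hit is at rank 1
```

Note that precision is measured against the K results returned while recall is measured against all relevant chunks, which is why the two pull in opposite directions as K grows.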
Retrieval Parameters
Chunk Size: Number of tokens per text chunk. Smaller chunks are more precise; larger chunks provide more context per result. Options: 64, 128, 256, 512, 1024 tokens (current setting: 256).
Chunk Overlap: Percentage of overlap between adjacent chunks. Higher overlap improves recall by ensuring boundary content is not missed. Range: 0%–50% (current setting: 20%).
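For intuition, fixed-size chunking with percentage overlap amounts to sliding a window over the token stream. An illustrative sketch (not necessarily how the module chunks text internally):

```python
def chunk_tokens(tokens, chunk_size=256, overlap_pct=20):
    """Slide a chunk_size window over `tokens`, advancing by
    chunk_size minus the overlap so adjacent chunks share content."""
    step = max(1, chunk_size - chunk_size * overlap_pct // 100)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already covers the end of the stream
    return chunks

tokens = list(range(1000))             # stand-in for a tokenized document
chunks = chunk_tokens(tokens, 256, 20)
# 20% of 256 = 51 overlapping tokens, so the window advances 205 per step
```

As overlap rises the step shrinks, producing more chunks per document: exactly the recall-versus-token-cost tradeoff the slider exposes.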
Top-K Results: Number of most-similar chunks returned by the retrieval step. Higher K improves recall but may dilute precision. Options: 1, 3, 5, 10, 20 (current setting: 5).
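The top-K step itself is nearest-neighbor search over chunk embeddings. A brute-force sketch using cosine similarity (the vectors here are toy placeholders for real embedding-model output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Indices of the k chunks most similar to the query, best first."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]

query = [1.0, 0.0, 1.0]
chunk_vecs = [[1.0, 0.0, 1.0],   # same direction as the query
              [0.0, 1.0, 0.0],   # orthogonal to the query
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]]
nearest = top_k(query, chunk_vecs, k=2)
```

Real systems replace the linear scan with an approximate index, but the K cutoff, and its effect on precision and recall, works the same way.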
Embedding Model: Dimensionality of the embedding model. Larger models capture more semantic nuance but cost more to compute.
Corpus Size: Total number of documents in the search corpus. Larger corpora make retrieval harder. Options: 10, 100, 1K, 10K.
Benchmark Results
- Precision: 0.857
- Recall: 0.699
- F1 Score: 0.770
- MRR: 0.773
- Latency: 13.3 ms
- Tokens Used: 1,280
[Figure 4 and Figure 5 (Query Complexity): interactive charts omitted]
Key Insights
- Strong F1 score (0.77): precision and recall are well balanced at this configuration.
§ 3.5 Validate Live: Test Retrieval Relevance
Provide a query and text chunks (separated by ---). The LLM ranks each chunk by relevance, simulating what a RAG system must decide when choosing which chunks to inject into context.
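A lightweight stand-in for that ranking step can be sketched with naive term overlap in place of LLM scoring (the `---` separator matches the format described above; the scoring function and example texts are purely illustrative):

```python
def rank_chunks(query, blob):
    """Split `blob` on '---' separators and order chunks by how many
    query terms each contains (a crude proxy for LLM relevance scores)."""
    chunks = [c.strip() for c in blob.split("---") if c.strip()]
    q_terms = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_terms & set(c.lower().split())),
                  reverse=True)

blob = """vector databases store embeddings
---
the weather was pleasant in spring
---
embeddings map text to vectors"""
ranked = rank_chunks("how do embeddings map text", blob)
```

The LLM version makes the same decision with far more semantic nuance, but the output contract is identical: an ordering over candidate chunks that determines what gets injected into context.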