§ 3 Retrieval Accuracy Benchmark

Retrieval-Augmented Generation (RAG) promises to extend LLM memory indefinitely, but at what cost to precision? This module exposes the hidden tradeoffs: how chunk size, overlap, top-k depth, and corpus scale interact to determine whether the right memory is retrieved. Every slider movement recomputes precision, recall, F1, and MRR across a matrix of 500+ precomputed benchmark configurations.
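A minimal sketch of how the four headline metrics can be computed for a single retrieval run. The function and the toy chunk ids below are illustrative, not the module's actual implementation: `retrieved` is the ranked top-k list, `relevant` the gold set.

```python
def precision_recall_f1_mrr(retrieved, relevant):
    """Compute precision, recall, F1, and MRR for one ranked retrieval."""
    relevant = set(relevant)
    hits = [cid for cid in retrieved if cid in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MRR: reciprocal rank of the first relevant result (ranks are 1-indexed).
    mrr = 0.0
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            mrr = 1.0 / rank
            break
    return precision, recall, f1, mrr

# Toy example: top-k = 5 retrieved chunks, 3 gold-relevant chunks.
p, r, f1, mrr = precision_recall_f1_mrr(
    retrieved=["c3", "c7", "c1", "c9", "c2"],
    relevant={"c3", "c1", "c8"})
# p = 0.4 (2 of 5 retrieved are relevant), r = 2/3, mrr = 1.0
```

Averaging these per-query values over a query set yields the aggregate numbers reported in the results panel below.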

Retrieval Parameters

  • Chunk Size (64, 128, 256, 512, or 1024 tokens; current: 256): number of tokens per text chunk. Smaller chunks are more precise; larger chunks provide more context per result.

  • Chunk Overlap (0–50%; current: 20%): percentage of overlap between adjacent chunks. Higher overlap improves recall by ensuring boundary content is not missed.

  • Top-K Results (1, 3, 5, 10, or 20; current: 5): number of most-similar chunks returned by the retrieval step. Higher K improves recall but may dilute precision.

  • Embedding Model: dimensionality of the embedding model. Larger models capture more semantic nuance but cost more to compute.

  • Corpus Size (10 to 10K documents; current: 500 docs): total number of documents in the search corpus. Larger corpora make retrieval harder.
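The first two sliders correspond to a fixed-size chunker with overlapping windows. A sketch, with whitespace words standing in for tokens (a real tokenizer would differ) and parameter names mirroring the controls above:

```python
def chunk_tokens(tokens, chunk_size=256, overlap_pct=20):
    """Split a token list into fixed-size chunks with percentage overlap."""
    # Stride = chunk size minus the overlapping portion.
    stride = max(1, chunk_size - chunk_size * overlap_pct // 100)
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 1000 tokens at the current settings: stride = 256 - 51 = 205,
# so adjacent chunks share 51 tokens.
chunks = chunk_tokens(list(range(1000)), chunk_size=256, overlap_pct=20)
```

Note the recall mechanism the Chunk Overlap tooltip describes: a sentence straddling a chunk boundary appears intact in at least one of the two overlapping windows.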

Benchmark Results

  • Precision: 0.857
  • Recall: 0.699
  • F1 Score: 0.770
  • MRR: 0.773
  • Latency: 13.3 ms
  • Tokens Used: 1,280

Figure 4

Precision, Recall, and F1 as a function of top-k retrieval depth. Increasing k trades precision for recall; the F1 curve reveals the optimal balance point.
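The shape of the Figure 4 curves can be reproduced on a made-up ranked list (the list below is illustrative; the real curves come from the precomputed benchmark runs). As k grows, recall climbs toward 1 while precision decays, and F1 peaks in between:

```python
# "r" marks a relevant chunk in a toy ranked retrieval of 10 results.
ranked = ["r", "x", "r", "x", "x", "r", "x", "x", "x", "x"]
n_relevant = 3

curve = []
for k in (1, 3, 5, 10):
    hits = ranked[:k].count("r")
    precision = hits / k
    recall = hits / n_relevant
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    curve.append((k, round(precision, 3), round(recall, 3), round(f1, 3)))

# precision falls monotonically, recall rises, and F1 peaks at k = 3.
```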

Figure 5

F1 score as a function of chunk size, stratified by corpus size. The peak shifts rightward for larger corpora, revealing that optimal chunk size depends on collection scale.


Key Insights

  • Strong F1 score (0.77): precision and recall are well balanced at this configuration.

§ 3.5 Validate Live: Test Retrieval Relevance

Provide a query and text chunks (separated by ---). The LLM ranks each chunk by relevance, simulating what a RAG system must decide when choosing which chunks to inject into context.
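A minimal stand-in for what the widget does with its input, assuming the `---` separator described above. Here a bag-of-words overlap score substitutes for the LLM relevance judgment the module actually uses; the function name and sample text are illustrative.

```python
def rank_chunks(query, chunk_text):
    """Split chunk_text on '---' and rank chunks by query-term overlap."""
    q = set(query.lower().split())
    chunks = [c.strip() for c in chunk_text.split("---") if c.strip()]
    # Score = fraction of query terms present in the chunk (crude proxy
    # for the LLM's relevance ranking).
    scored = [(len(q & set(c.lower().split())) / len(q), c) for c in chunks]
    return sorted(scored, key=lambda sc: sc[0], reverse=True)

ranking = rank_chunks(
    "chunk overlap recall",
    "Overlap improves recall at chunk boundaries.\n---\n"
    "Latency grows with corpus size.\n---\n"
    "Top-k depth dilutes precision.")
# The boundary-overlap chunk ranks first: it contains all three query terms.
```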
