Knowledge-base retrieval benchmark
ENGRA-KB-v2 paper-scale suite (550 docs / 400 queries) plus ENGRA-KB-v1 smoke: compare BM25, chunk-RAG, and structured atom indexing on identical corpora.
Published test results
ENGRA-KB-v2 (paper scale) compares Structured atom indexing against BM25 and chunk-RAG; the embedder sweep compares MiniLM, bge, and E5 on the same 400 queries; ENGRA-KB-v1 is the smoke suite; the MTEB runs validate on public corpora.
Published benchmark scores below are loaded from repository fixtures and update when the JSON is refreshed.
ENGRA-KB-v2
ENGRA-KB-v2 reference scores
Compares BM25, traditional chunk-RAG, and Engra Structured atom on 550 KB articles and 400 queries — paper-scale suite (use v1 for smoke tests).
- Suite: ENGRA-KB-v2
- Queries: 400
- Embedder: Xenova/all-MiniLM-L6-v2
- Run date: Jun 1, 2026, 7:23 PM
Advantage vs BM25 (Recall@5)
- Structured atom Recall@5 0.502 vs BM25 0.412 (+0.090)
- Structured atom MRR 0.307 vs BM25 0.258; dense chunk-RAG (fixed-512) Recall@5 is 0.360
- Atom latency 2.0 ms/q, below fixed-512 RAG (2.5 ms/q), with Scope/Topic metadata retained.
By Query Category (Task success @ contextK=5)
From agent-task-results.json: whether top-K retrieval contains extractive gold spans. This is closer to downstream agent answerability.
| System | R@1 | R@3 | R@5 | R@10 | MRR | nDCG@1 | nDCG@3 | nDCG@5 | nDCG@10 | ms/q | Chunks |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Structured atom: Engra atom model, one vector per doc with title / scope / topic in the embed source (top Recall@5) | 0.145 | 0.378 | 0.502 | 0.752 | 0.307 | 0.145 | 0.261 | 0.309 | 0.392 | 2.0 | 550 |
| BM25: lexical baseline, fast keyword hits, weak on paraphrases and Chinese queries | 0.128 | 0.345 | 0.412 | 0.520 | 0.258 | 0.128 | 0.245 | 0.273 | 0.307 | 2.1 | 555 |
| RAG sentence merge: sentence-boundary chunks merged to ~480 chars; nearly the same scores as fixed-512 here | 0.092 | 0.235 | 0.362 | 0.545 | 0.204 | 0.092 | 0.161 | 0.210 | 0.268 | 2.2 | 557 |
| RAG fixed-512: most common RAG default, 512-char splits, typical LangChain / LlamaIndex setup | 0.090 | 0.237 | 0.360 | 0.547 | 0.204 | 0.090 | 0.161 | 0.209 | 0.269 | 2.5 | 555 |
| RAG 256+128 overlap: PDF-style smaller chunks with overlap; slightly lower Recall@5 on short docs | 0.077 | 0.195 | 0.295 | 0.542 | 0.183 | 0.077 | 0.137 | 0.176 | 0.254 | 2.8 | 1324 |
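For readers who want to check how the columns relate, here is a minimal sketch of Recall@k, MRR, and nDCG@k. It assumes binary relevance with a single gold document per query (which would be consistent with R@1 equalling nDCG@1 in every row above) and MRR computed over the full retrieved list; the type and function names are illustrative and are not taken from the benchmark scripts.

```typescript
// Sketch only: assumes one gold document per query and binary relevance.
// A rank is the 1-based position of the gold document, or null if it was not retrieved.
type Rank = number | null;

function log2(x: number): number {
  return Math.log(x) / Math.LN2;
}

// Fraction of queries whose gold document appears in the top k.
function recallAtK(ranks: Rank[], k: number): number {
  const hits = ranks.filter((r) => r !== null && r <= k).length;
  return hits / ranks.length;
}

// Mean reciprocal rank; a missing gold document contributes 0.
function mrr(ranks: Rank[]): number {
  const sum = ranks.reduce<number>((acc, r) => acc + (r === null ? 0 : 1 / r), 0);
  return sum / ranks.length;
}

// With a single relevant document the ideal DCG is 1,
// so nDCG@k reduces to the discounted gain at the gold rank.
function ndcgAtK(ranks: Rank[], k: number): number {
  const sum = ranks.reduce<number>(
    (acc, r) => acc + (r !== null && r <= k ? 1 / log2(r + 1) : 0),
    0,
  );
  return sum / ranks.length;
}

// Four hypothetical queries: gold doc at rank 1, rank 3, not found, rank 7.
const exampleRanks: Rank[] = [1, 3, null, 7];
console.log(recallAtK(exampleRanks, 5)); // 0.5
console.log(recallAtK(exampleRanks, 1), ndcgAtK(exampleRanks, 1)); // 0.25 0.25
console.log(mrr(exampleRanks).toFixed(3));
```

Under these assumptions nDCG@1 and Recall@1 are always equal, and nDCG@k rewards a gold document at rank 2 more than one at rank 5 even though both count the same toward Recall@5.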
How to read these numbers
- BM25 reaches ~86% Recall@5 on lexical queries but lags on semantic and multilingual categories.
- Traditional fixed-512 / sentence chunk-RAG stays near ~22–39% Recall@5 on semantic queries on this corpus.
- Structured atom (title + scope + topic, one vector per doc) leads on semantic (~68%) and disambiguation (~89%) — hierarchical metadata helps ranking.
- Smaller overlapping chunks (256+128) score lowest on every metric except Recall@10 and have the highest latency, which suggests chunk size should match document length.
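The difference between the indexing strategies above can be sketched as follows. This is an illustration under stated assumptions, not the Engra implementation: the `KbDoc` field names, the `Title:` / `Scope:` / `Topic:` prefix layout, and the reading of "256+128" as 256-character chunks with a 128-character overlap are all hypothetical.

```typescript
// Hypothetical document shape; field names are assumptions for illustration.
interface KbDoc {
  title: string;
  scope: string;
  topic: string;
  body: string;
}

// Structured atom: one embed source per document, metadata first, then the body.
function buildAtomSource(doc: KbDoc): string {
  return [`Title: ${doc.title}`, `Scope: ${doc.scope}`, `Topic: ${doc.topic}`, doc.body].join("\n");
}

// Chunk-RAG: fixed-size character chunks; consecutive chunks share `overlap` characters.
function chunkFixed(text: string, size: number, overlap = 0): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

// A short, 300-character article body (placeholder text).
const shortBody = new Array(301).join("x");
console.log(chunkFixed(shortBody, 512).length);      // 1
console.log(chunkFixed(shortBody, 256, 128).length); // 2
```

On short articles a 512-character splitter yields roughly one chunk per document, while the smaller overlapping splitter multiplies the chunk count without adding new information. That is one plausible reading of the chunk counts in the table (555 vs 1324).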
Run npm run benchmark:memory:v2 locally; smoke suite: npm run benchmark:memory (v1). JSON: scripts/memory-benchmarks/fixtures/engra-kb-v2/reference-results.json.
Stronger embedders
Embedder comparison (ENGRA-KB-v2)
Same corpus and pipelines (BM25 / RAG fixed-512 / Structured atom) under MiniLM, bge-small-en-v1.5, and multilingual-e5-small. MiniLM is the product default; stronger embedders mainly lift Atom and semantic queries.
- Query subset: full (400)
- Queries: 400
- Documents: 550
- Updated: Jun 2, 2026, 5:06 AM
| Embedder | BM25 R@5 | RAG R@5 | Atom R@5 | Atom−RAG |
|---|---|---|---|---|
| MiniLM-L6-v2 (Xenova/all-MiniLM-L6-v2) | 0.412 | 0.355 | 0.505 | +15.0 pp |
| bge-small-en-v1.5 (Xenova/bge-small-en-v1.5) | 0.412 | 0.325 | 0.515 | +19.0 pp |
| multilingual-e5-small (Xenova/multilingual-e5-small), best Atom | 0.412 | 0.367 | 0.552 | +18.5 pp |
Macro Recall@5 over all query categories. See ENGRA-KB-v2 above for per-system detail (MiniLM default).
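As a rough illustration of what "macro" means here and how the Atom−RAG column is derived, the sketch below averages per-category hit rates (so each category counts equally regardless of how many queries it has) and formats a percentage-point difference. The `QueryResult` shape and the category labels are assumptions for the example, not the fixture schema.

```typescript
// Hypothetical per-query record; the real fixture schema may differ.
interface QueryResult {
  category: string; // e.g. "lexical", "semantic" (illustrative labels)
  hitAt5: boolean;  // true if a relevant document was in the top 5
}

// Macro Recall@5: compute recall per category, then average the category recalls.
function macroRecallAt5(results: QueryResult[]): number {
  const buckets: Record<string, { hits: number; total: number }> = {};
  for (const r of results) {
    if (!buckets[r.category]) buckets[r.category] = { hits: 0, total: 0 };
    buckets[r.category].hits += r.hitAt5 ? 1 : 0;
    buckets[r.category].total += 1;
  }
  const categories = Object.keys(buckets);
  if (categories.length === 0) return 0;
  const sum = categories.reduce((acc, c) => acc + buckets[c].hits / buckets[c].total, 0);
  return sum / categories.length;
}

// Difference between two recall values, expressed in percentage points.
function ppDelta(a: number, b: number): string {
  const pp = (a - b) * 100;
  return `${pp >= 0 ? "+" : ""}${pp.toFixed(1)} pp`;
}

// Three easy lexical queries and one missed semantic query:
// micro recall would be 0.75, macro recall is (1.0 + 0.0) / 2 = 0.5.
const sample: QueryResult[] = [
  { category: "lexical", hitAt5: true },
  { category: "lexical", hitAt5: true },
  { category: "lexical", hitAt5: true },
  { category: "semantic", hitAt5: false },
];
console.log(macroRecallAt5(sample)); // 0.5
console.log(ppDelta(0.505, 0.355));  // +15.0 pp (MiniLM row above)
```

Macro averaging keeps a large, easy category from masking weak performance on a small, hard one, which is why macro figures can differ slightly from the per-system table.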
How to read
- BM25 is embedder-agnostic: its R@5 is identical in every row (0.412, about 41.2%).
- bge / E5 raise Atom macro Recall@5 from 50.5% (MiniLM) to 51.5% (bge) and 55.2% (E5); chunk-RAG gains are modest (bge is lower than MiniLM on RAG), showing structured atoms matter beyond swapping models.
- E5 + Atom reaches ~77% Recall@5 on semantic queries (full-run logs) vs ~14% BM25 and ~19% chunk-RAG.
- Reproduce: npm run benchmark:memory:embedder-sweep -- --subset full
npm run benchmark:memory:embedder-sweep
MTEB embed baseline
Industry alignment · not product advantage
MTEB embedding baseline validation
Checks whether Agent Memory's default embedder (all-MiniLM-L6-v2) reproduces official MTEB leaderboard scores. Delta ≈ 0 means alignment succeeded.
The embed baseline table compares local runs vs the official leaderboard. Pipeline comparison runs BM25, chunk-RAG, and Atom on the same MTEB corpora below.
MTEB corpora · pipeline comparison (ndcg@10)
BM25, RAG fixed-512, and Structured atom on each MTEB test split (same strategies as ENGRA-KB-v1, default embedder all-MiniLM-L6-v2).
| Task | BM25 | RAG 512 | Atom |
|---|---|---|---|
| Financial QA retrieval | 0.200 | 0.357 | 0.350 |
| Nutrition / biomedical | 0.268 | 0.278 | 0.307 |
| Scientific fact retrieval | 0.594 | 0.642 | 0.647 |
| Mean ndcg@10 (3 tasks) | 0.354 | 0.426 | 0.435 |
Embed baseline (vs leaderboard)
- MTEB model: sentence-transformers/all-MiniLM-L6-v2
- Engra local equivalent: Xenova/all-MiniLM-L6-v2
- Tasks: 3
- Updated: May 31, 2026, 9:13 PM
- Mean ndcg@10: 0.443
| Task | Local ndcg@10 | Leaderboard ndcg@10 | Delta | map@10 | mrr@10 | recall@10 |
|---|---|---|---|---|---|---|
| SciFact (scientific fact retrieval) | 0.645 | 0.645 | 0.0000 | 0.596 | 0.605 | 0.783 |
| NFCorpus (nutrition / biomedical) | 0.316 | 0.316 | -0.0001 | 0.110 | 0.505 | 0.155 |
| FiQA2018 (financial QA retrieval) | 0.369 | 0.369 | 0.0000 | 0.291 | 0.445 | 0.441 |
How to read
- Embed baseline: delta ≈ 0 means the default embedder matches official MTEB scores.
- Pipeline comparison: isolates chunking/index strategy differences. Scores are document-level binary ndcg@10, so compare the relative ranking of the pipelines rather than absolute values against the leaderboard.
- Dense methods (RAG / Atom) usually beat BM25 on semantic tasks; Atom is often close to or above fixed-size RAG on passage corpora.
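One way to read "document-level binary ndcg@10" for a chunked pipeline is sketched below: chunk hits are collapsed to their parent documents (keeping each document at the position of its best-scoring chunk), and nDCG is then computed with relevance 0 or 1. This is an assumption about the scoring procedure, not a description of the benchmark script; `ChunkHit` and both function names are hypothetical.

```typescript
// Hypothetical retrieval hit: a chunk, its parent document, and a similarity score.
interface ChunkHit {
  docId: string;
  score: number; // higher is better
}

// Collapse a chunk ranking to a document ranking, keeping the best chunk per document.
function toDocRanking(hits: ChunkHit[]): string[] {
  const sorted = hits.slice().sort((a, b) => b.score - a.score);
  const seen: Record<string, boolean> = {};
  const docs: string[] = [];
  for (const h of sorted) {
    if (!seen[h.docId]) {
      seen[h.docId] = true;
      docs.push(h.docId);
    }
  }
  return docs;
}

// Binary nDCG@k: gain is 1 for a relevant document, 0 otherwise.
function binaryNdcgAtK(ranking: string[], relevant: string[], k: number): number {
  const discount = (i: number) => 1 / (Math.log(i + 2) / Math.LN2); // i is 0-based
  let dcg = 0;
  ranking.slice(0, k).forEach((docId, i) => {
    if (relevant.indexOf(docId) !== -1) dcg += discount(i);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(k, relevant.length); i++) ideal += discount(i);
  return ideal === 0 ? 0 : dcg / ideal;
}

// Two chunks of d1 outrank d2, but d1 occupies only one slot in the document ranking.
const hits: ChunkHit[] = [
  { docId: "d1", score: 0.9 },
  { docId: "d1", score: 0.8 },
  { docId: "d2", score: 0.7 },
  { docId: "d3", score: 0.6 },
];
const docRanking = toDocRanking(hits);
console.log(docRanking); // [ 'd1', 'd2', 'd3' ]
console.log(binaryNdcgAtK(docRanking, ["d2"], 10).toFixed(3));
```

Collapsing to documents keeps a pipeline with many small chunks from being penalized or rewarded for returning several chunks of the same document. Because the leaderboard uses the dataset's own relevance judgments, the pipeline numbers remain comparable only with each other.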
How this relates to ENGRA-KB-v1
ENGRA-KB-v1 is a controlled enterprise KB; MTEB pipeline comparison validates the same chunking/index strategies on public retrieval corpora.
Embed baseline: npm run benchmark:mteb · Pipeline: npm run benchmark:mteb:pipeline