Knowledge-base retrieval benchmark

Two suites, the paper-scale ENGRA-KB-v2 (550 docs / 400 queries) and the ENGRA-KB-v1 smoke set, compare BM25, chunk-RAG, and structured atom indexing on identical corpora.


Why this benchmark

Most KB/RAG products lack reproducible public baselines. ENGRA-KB-v2 ships 550 synthetic enterprise articles and 400 queries, comparable in scale to a BEIR subtask (SciFact has ~300 queries), and is runnable offline; ENGRA-KB-v1 (12 docs / 30 queries) remains for fast smoke tests.

Systems under test

ID	System	Family	Product analogue
bm25	BM25	Lexical	Confluence / Notion keyword find
rag-chunk-512	RAG fixed-512	Traditional RAG	LangChain default TextSplitter(512)
rag-chunk-256-overlap	RAG 256+128 overlap	Traditional RAG	Common PDF / long-doc ingest pipelines
rag-sentence	RAG sentence	Traditional RAG	Sentence-boundary chunks merged to ~480 chars
atom-structured	Structured atom	Structured KB	Engra Library / Scope / Topic / Atom single-vector model
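
The three chunk-RAG variants in the table differ only in how a document is split before embedding. A minimal sketch of those splitting strategies follows. It assumes sizes are counted in characters (the table only states units for the sentence variant) and uses hypothetical helper names `chunkFixed` and `chunkSentences`; the benchmark's actual splitters may differ.

```typescript
// Assumption: sizes are measured in characters; the real pipeline may count tokens.

/** Fixed-size windows with optional overlap (stride = size - overlap). */
function chunkFixed(text: string, size: number, overlap = 0): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

/** Split on sentence boundaries, then greedily merge sentences up to maxChars. */
function chunkSentences(text: string, maxChars = 480): string[] {
  // Naive splitter: a sentence is a run of text ending in '.', '!' or '?'.
  const raw: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
  const sentences = raw.map((s) => s.trim()).filter((s) => s.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      chunks.push(current); // adding this sentence would exceed the budget
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Illustrative mapping to the system IDs above:
//   rag-chunk-512          ~ chunkFixed(doc, 512)
//   rag-chunk-256-overlap  ~ chunkFixed(doc, 256, 128)
//   rag-sentence           ~ chunkSentences(doc, 480)
```

With a 128-character overlap on 256-character windows, every character away from the document edges lands in two chunks, which roughly doubles index size in exchange for fewer facts being cut at a boundary.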

Metrics

Recall@K: Whether any labeled relevant document appears in top K (document-level, BEIR-style)
MRR: Mean reciprocal rank of the first relevant document
nDCG@K: Rank-aware gain; relevant docs ranked higher score better
ms/q: Mean retrieval latency per query (local CPU, no LLM generation)
Task success: Agent proxy metric; whether the top-K context contains the extractive gold span (`benchmark:memory:agent-task`)
Token F1: Gemini end-to-end QA metric; token-level F1 of the generated answer against the gold span (`benchmark:memory:gemini-qa`)
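
The Recall@K, MRR, and nDCG@K definitions above can be sketched per query as follows. This is an illustrative implementation, not the benchmark's own code: the function names are hypothetical, relevance is assumed to be binary, and chunk-level hits are assumed to be collapsed to unique document IDs (first occurrence wins) before scoring. Suite-level numbers would then be the mean over all queries.

```typescript
/** Collapse a chunk-level ranking to a document-level one, keeping first hits. */
function toDocRanking(chunkDocIds: string[]): string[] {
  const seen = new Set<string>();
  const docs: string[] = [];
  for (const id of chunkDocIds) {
    if (!seen.has(id)) {
      seen.add(id);
      docs.push(id);
    }
  }
  return docs;
}

/** 1 if any relevant doc appears in the top K, else 0 (the any-hit definition). */
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).some((id) => relevant.has(id)) ? 1 : 0;
}

/** 1 / rank of the first relevant doc, or 0 if none was retrieved. */
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

/** Binary-gain nDCG@K: DCG of this ranking divided by DCG of an ideal ranking. */
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const discount = (i: number): number => 1 / Math.log2(i + 2); // i is 0-based
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += discount(i);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) ideal += discount(i);
  return ideal === 0 ? 0 : dcg / ideal;
}

// Toy example: the first relevant document sits at rank 2.
const rankedIds = toDocRanking(["d3", "d3", "d1", "d7"]); // ["d3", "d1", "d7"]
const goldIds = new Set(["d1", "d9"]);
console.log(recallAtK(rankedIds, goldIds, 1)); // 0
console.log(recallAtK(rankedIds, goldIds, 3)); // 1
console.log(reciprocalRank(rankedIds, goldIds)); // 0.5
console.log(ndcgAtK(rankedIds, goldIds, 3).toFixed(3)); // "0.387"
```

Note that this any-hit Recall@K saturates at 1 as soon as one relevant document is found, so nDCG@K is the metric here that still rewards retrieving the remaining relevant documents.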

Datasets: ENGRA-KB-v2 (primary) / v1 (smoke)

ENGRA-KB-v2: 550 English enterprise KB articles (55 topics across 11 departments × 10 variants each)
400 queries: lexical 120, semantic 126, disambiguation 44, Chinese 80, cross-scope 30
ENGRA-KB-v1: 12 docs / 30 queries for sub-minute smoke runs
Fixtures: scripts/memory-benchmarks/fixtures/engra-kb-v2/ (primary) and engra-kb-v1/
Regenerate v2: npm run benchmark:memory:generate-v2
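
Because the five query categories are unevenly sized, an overall mean can hide weak categories. The sketch below records the v2 query mix stated above and averages any per-query score by category. The `ScoredQuery` shape and the function names are assumptions for illustration; the actual fixture schema is not described here.

```typescript
type QueryCategory =
  | "lexical"
  | "semantic"
  | "disambiguation"
  | "chinese"
  | "cross-scope";

// Query counts as listed for ENGRA-KB-v2.
const V2_QUERY_MIX: Record<QueryCategory, number> = {
  lexical: 120,
  semantic: 126,
  disambiguation: 44,
  chinese: 80,
  "cross-scope": 30,
};

const V2_CATEGORIES = Object.keys(V2_QUERY_MIX) as QueryCategory[];
const V2_TOTAL_QUERIES = V2_CATEGORIES.reduce((sum, c) => sum + V2_QUERY_MIX[c], 0);

/** Fraction of the suite that falls in one category. */
function categoryShare(category: QueryCategory): number {
  return V2_QUERY_MIX[category] / V2_TOTAL_QUERIES;
}

// Hypothetical per-query result row: a category tag plus any metric value.
interface ScoredQuery {
  category: QueryCategory;
  score: number;
}

/** Mean score per category, so small categories are not hidden by large ones. */
function meanByCategory(rows: ScoredQuery[]): Map<QueryCategory, number> {
  const sums = new Map<QueryCategory, { total: number; n: number }>();
  for (const row of rows) {
    const acc = sums.get(row.category) || { total: 0, n: 0 };
    acc.total += row.score;
    acc.n += 1;
    sums.set(row.category, acc);
  }
  const means = new Map<QueryCategory, number>();
  sums.forEach((acc, category) => means.set(category, acc.total / acc.n));
  return means;
}

console.log(V2_TOTAL_QUERIES); // 400
console.log(categoryShare("chinese")); // 0.2
```

Reporting `meanByCategory` next to the overall mean makes it visible when, for example, the 30 cross-scope queries behave very differently from the 120 lexical ones.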

How to run

npm run benchmark:memory:v2 — paper-scale ENGRA-KB-v2
npm run benchmark:memory:ablation — atom metadata ablation
npm run benchmark:memory:agent-task — context sufficiency proxy (no LLM)
npm run benchmark:memory:gemini-qa -- --subset full — Gemini end-to-end QA (requires GEMINI_API_KEY)
npm run benchmark:memory — ENGRA-KB-v1 smoke (default --suite engra-kb-v1)
npm run benchmark:mteb:install && npm run benchmark:mteb — MTEB retrieval (SciFact / NFCorpus / FiQA2018)
Fast smoke (BM25 only): npm run benchmark:memory:lexical
Default embedder: Xenova/all-MiniLM-L6-v2 (MTEB equivalent: sentence-transformers/all-MiniLM-L6-v2)
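
The agent-task and gemini-qa runs listed above report task success and token F1 against a gold span. The sketch below shows one plausible way to compute both. The normalization (lowercasing, stripping common ASCII punctuation, whitespace tokens), the default `k = 5`, and the function names are assumptions; the scripts themselves may normalize differently, and whitespace tokenization is not meaningful for Chinese text.

```typescript
/** Assumed normalization: lowercase, drop common punctuation, split on whitespace. */
function tokenize(input: string): string[] {
  return input
    .toLowerCase()
    .replace(/[.,;:!?"'()\[\]{}]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

/** SQuAD-style token F1: harmonic mean of precision and recall over shared tokens. */
function tokenF1(prediction: string, gold: string): number {
  const pred = tokenize(prediction);
  const ref = tokenize(gold);
  if (pred.length === 0 || ref.length === 0) {
    return pred.length === ref.length ? 1 : 0;
  }
  // Count gold tokens so a repeated token is only matched as often as it occurs.
  const remaining = new Map<string, number>();
  ref.forEach((t) => remaining.set(t, (remaining.get(t) || 0) + 1));
  let overlap = 0;
  pred.forEach((t) => {
    const left = remaining.get(t) || 0;
    if (left > 0) {
      overlap += 1;
      remaining.set(t, left - 1);
    }
  });
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

/** Context-sufficiency proxy: does any of the top-k contexts contain the gold span? */
function taskSuccess(contexts: string[], goldSpan: string, k = 5): boolean {
  const needle = tokenize(goldSpan).join(" ");
  if (needle.length === 0) return false;
  // Pad with spaces so matches respect token boundaries ("art" must not hit "start").
  return contexts
    .slice(0, k)
    .some((c) => ` ${tokenize(c).join(" ")} `.includes(` ${needle} `));
}

// Toy example: 3 shared tokens, precision 3/4, recall 3/5.
console.log(tokenF1("Reset the VPN token", "reset VPN token via portal").toFixed(3)); // "0.667"
console.log(taskSuccess(["Reset the VPN token via the portal."], "VPN token")); // true
```

Task success needs no model call, which is why it can serve as a cheap proxy: it only checks whether a reader could find the answer in the retrieved context, not whether it actually would.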

Relation to industry benchmarks

MTEB: SciFact / NFCorpus / FiQA2018 retrieval subtasks are integrated; scores are compared against the official MTEB leaderboard
ENGRA-KB-v2: paper / product chunking comparison (BM25 vs chunk-RAG vs Atom)
Gemini E2E QA: frozen reader over top-5 context — see published results page
ENGRA-KB-v1: fast pipeline smoke during development
RAGAS: measures generation faithfulness, which is out of scope here; both ENGRA-KB suites evaluate retrieval only and produce no LLM answers

Caveats

Synthetic corpus — not representative of real customer traffic; cite suite version and embedder when publishing numbers
Gemini QA with gemini-embedding-2 is not directly comparable to MiniLM retrieval tables
Production stacks add rerankers, hybrid search, ACL filters; this benchmark isolates the retrieval core
Chinese coverage is limited: 80 of the 400 v2 queries are Chinese and the corpus is English, so treat multilingual conclusions carefully
