Knowledge-base retrieval benchmark
The ENGRA-KB-v2 paper-scale suite (550 docs / 400 queries) and the ENGRA-KB-v1 smoke suite compare BM25, chunk-RAG, and structured atom indexing on identical corpora.
Why this benchmark
Most KB/RAG products lack reproducible public baselines. ENGRA-KB-v2 ships 550 synthetic enterprise articles and 400 queries, comparable in scale to a BEIR subtask (SciFact has ~300 queries), and runs fully offline. ENGRA-KB-v1 (12 docs / 30 queries) remains for fast smoke tests.
Systems under test
| ID | System | Family | Product analogue |
|---|---|---|---|
| bm25 | BM25 | Lexical | Confluence / Notion keyword find |
| rag-chunk-512 | RAG fixed-512 | Traditional RAG | LangChain default TextSplitter(512) |
| rag-chunk-256-overlap | RAG 256+128 overlap | Traditional RAG | Common PDF / long-doc ingest pipelines |
| rag-sentence | RAG sentence | Traditional RAG | Sentence-boundary chunks merged to ~480 chars |
| atom-structured | Structured atom | Structured KB | Engra Persona / Scope / Topic / Atom single-vector model |
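The three chunk-RAG rows differ only in how a document is cut before embedding. A minimal sketch of those strategies follows, assuming sizes are measured in characters (the table gives "~480 chars" for the sentence variant; the units for 512 and 256+128 are not stated). The function names `chunkFixed` and `chunkSentences` are hypothetical and not taken from the benchmark code.

```typescript
// Assumption: sizes are character counts; the real suite may count tokens instead.

/** Fixed-size windows; neighbouring chunks share `overlap` characters. */
function chunkFixed(text: string, size: number, overlap = 0): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

/** Split on sentence-ending punctuation, then greedily merge up to `maxChars`. */
function chunkSentences(text: string, maxChars = 480): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Close the current chunk when adding this sentence would exceed the budget.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}

// rag-chunk-512:          chunkFixed(doc, 512)
// rag-chunk-256-overlap:  chunkFixed(doc, 256, 128)
// rag-sentence:           chunkSentences(doc, 480)
```

With 128 characters of overlap on a 256-character window, every character except those near the document edges lands in two chunks, which roughly doubles index size in exchange for fewer facts split across a chunk boundary.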
Metrics
- Recall@K: whether any labeled relevant document appears in the top K (document-level, BEIR-style)
- MRR: mean reciprocal rank of the first relevant document
- nDCG@K: rank-aware gain; relevant docs ranked higher score better
- ms/q: mean retrieval latency per query (local CPU, no LLM generation)
- Task success: agent proxy; whether the top-K context contains the extractive gold span (`benchmark:memory:agent-task`)
- Token F1: Gemini E2E QA; token F1 vs gold span (`benchmark:memory:gemini-qa`)
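The ranking metrics and token F1 above can be sketched per query as follows. This is an illustration under stated assumptions, not the suite's scoring code: `ranked` is assumed to be a list of document IDs already deduplicated to document level, relevance is binary, and tokenization for F1 is plain lowercase whitespace splitting. All function names are hypothetical.

```typescript
/** 1 if any relevant document is in the top k, else 0 (document-level hit). */
function recallAtK(ranked: string[], relevant: string[], k: number): number {
  return ranked.slice(0, k).some((id) => relevant.indexOf(id) !== -1) ? 1 : 0;
}

/** Reciprocal rank of the first relevant document; 0 if none was retrieved. */
function reciprocalRank(ranked: string[], relevant: string[]): number {
  for (let i = 0; i < ranked.length; i++) {
    if (relevant.indexOf(ranked[i]) !== -1) return 1 / (i + 1);
  }
  return 0;
}

/** nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG. */
function ndcgAtK(ranked: string[], relevant: string[], k: number): number {
  // 1 / log2(rank + 2) for a 0-based rank, written with natural logs.
  const discount = (rank: number) => Math.log(2) / Math.log(rank + 2);
  let dcg = 0;
  ranked.slice(0, k).forEach((id, rank) => {
    if (relevant.indexOf(id) !== -1) dcg += discount(rank);
  });
  let idcg = 0;
  for (let rank = 0; rank < Math.min(relevant.length, k); rank++) idcg += discount(rank);
  return idcg === 0 ? 0 : dcg / idcg;
}

/** Token-level F1 between a predicted answer and the gold span. */
function tokenF1(prediction: string, gold: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const pred = tokenize(prediction);
  const remaining = tokenize(gold);
  const goldCount = remaining.length;
  if (pred.length === 0 || goldCount === 0) return pred.length === goldCount ? 1 : 0;
  let overlap = 0;
  for (const token of pred) {
    const j = remaining.indexOf(token);
    if (j !== -1) {
      overlap++;
      remaining.splice(j, 1); // each gold token can be matched only once
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / goldCount;
  return (2 * precision * recall) / (precision + recall);
}

/** Suite-level numbers (Recall@K, MRR, nDCG@K) are means over all queries. */
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}
```

For a query whose only relevant document is ranked second, `reciprocalRank` gives 0.5 and `recallAtK` gives 0 at K=1 but 1 at K=2. Averaging `reciprocalRank` over all queries yields MRR.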
Datasets: ENGRA-KB-v2 (primary) / v1 (smoke)
- ENGRA-KB-v2: 550 English enterprise KB articles (55 topics across 11 departments × 10 variants each)
- 400 queries: lexical 120, semantic 126, disambiguation 44, Chinese 80, cross-scope 30
- ENGRA-KB-v1: 12 docs / 30 queries for sub-minute smoke runs
- Fixtures: scripts/memory-benchmarks/fixtures/engra-kb-v2/ (primary) and engra-kb-v1/
- Regenerate v2: npm run benchmark:memory:generate-v2
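The v2 query mix is uneven, so an overall score is dominated by the larger categories. The short sketch below turns the category counts listed above into shares and shows a count-weighted mean. The object keys and the helper name `weightedOverall` are hypothetical, and no scores are implied.

```typescript
// Query counts per category, copied from the ENGRA-KB-v2 description.
const v2QueryCounts: { [category: string]: number } = {
  lexical: 120,
  semantic: 126,
  disambiguation: 44,
  chinese: 80,
  "cross-scope": 30,
};

const v2QueryTotal = Object.keys(v2QueryCounts).reduce(
  (sum, category) => sum + v2QueryCounts[category],
  0
);

// Percentage of the suite that each category represents.
const v2QueryShares: { [category: string]: number } = {};
for (const category of Object.keys(v2QueryCounts)) {
  v2QueryShares[category] = (100 * v2QueryCounts[category]) / v2QueryTotal;
}

/** Combine per-category metric values into one number, weighted by query count. */
function weightedOverall(perCategory: { [category: string]: number }): number {
  let weighted = 0;
  let counted = 0;
  for (const category of Object.keys(perCategory)) {
    const count = v2QueryCounts[category] || 0; // unknown categories carry no weight
    weighted += perCategory[category] * count;
    counted += count;
  }
  return counted === 0 ? 0 : weighted / counted;
}
```

Reporting per-category numbers next to the weighted overall keeps the smaller categories, such as cross-scope at 30 queries, from being hidden by the larger ones.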
How to run
- `npm run benchmark:memory:v2`: paper-scale ENGRA-KB-v2
- `npm run benchmark:memory:ablation`: atom metadata ablation
- `npm run benchmark:memory:agent-task`: context sufficiency proxy (no LLM)
- `npm run benchmark:memory:gemini-qa -- --subset full`: Gemini end-to-end QA (requires GEMINI_API_KEY)
- `npm run benchmark:memory`: ENGRA-KB-v1 smoke (default `--suite engra-kb-v1`)
- `npm run benchmark:mteb:install && npm run benchmark:mteb`: MTEB retrieval (SciFact / NFCorpus / FiQA2018)
- Fast smoke (BM25 only): `npm run benchmark:memory:lexical`
- Default embedder: Xenova/all-MiniLM-L6-v2 (MTEB equivalent: sentence-transformers/all-MiniLM-L6-v2)
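Once texts are embedded, dense retrieval reduces to a nearest-neighbour search over vectors. The sketch below assumes vectors are already computed (all-MiniLM-L6-v2 produces 384-dimensional embeddings; toy 3-dimensional vectors stand in here) and that a document's score is the best score among its indexed units. The types and function names are hypothetical, and the brute-force scan is an illustration rather than the suite's actual index.

```typescript
// Hypothetical record shape: one stored vector per indexed unit (chunk or atom).
type IndexedUnit = { docId: string; vector: number[] };
type ScoredDoc = { docId: string; score: number };

/** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

/** Score every unit, keep the best unit per document, return the top k documents. */
function topKDocuments(query: number[], index: IndexedUnit[], k: number): ScoredDoc[] {
  const best: { [docId: string]: number } = {};
  for (const unit of index) {
    const score = cosineSimilarity(query, unit.vector);
    if (!(unit.docId in best) || score > best[unit.docId]) best[unit.docId] = score;
  }
  return Object.keys(best)
    .map((docId) => ({ docId, score: best[docId] }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

/** Mean wall-clock milliseconds per query (millisecond timer resolution). */
function meanLatencyMs(queries: number[][], retrieve: (q: number[]) => unknown): number {
  if (queries.length === 0) return 0;
  const start = Date.now();
  for (const q of queries) retrieve(q);
  return (Date.now() - start) / queries.length;
}

// Toy index: kb-1 is split into two units, the others into one each.
const toyIndex: IndexedUnit[] = [
  { docId: "kb-1", vector: [1, 0, 0] },
  { docId: "kb-1", vector: [0, 0, 1] },
  { docId: "kb-2", vector: [0, 1, 0] },
  { docId: "kb-3", vector: [0.7, 0.7, 0] },
];
const toyHits = topKDocuments([1, 0.1, 0], toyIndex, 2); // kb-1 first, then kb-3
```

Collapsing units to their parent document before ranking is what makes chunk-based and atom-based systems comparable under the document-level metrics.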
Relation to industry benchmarks
- MTEB: SciFact / NFCorpus / FiQA2018 retrieval subtasks are integrated; scores are compared against the official MTEB leaderboard
- ENGRA-KB-v2: paper / product chunking comparison (BM25 vs chunk-RAG vs Atom)
- Gemini E2E QA: frozen reader over top-5 context — see published results page
- ENGRA-KB-v1: fast pipeline smoke during development
- RAGAS: measures generation faithfulness, which is out of scope here; both ENGRA-KB suites evaluate retrieval only and produce no LLM answers
Caveats
- Synthetic corpus — not representative of real customer traffic; cite suite version and embedder when publishing numbers
- Gemini QA with gemini-embedding-2 is not directly comparable to MiniLM retrieval tables
- Production stacks add rerankers, hybrid search, ACL filters; this benchmark isolates the retrieval core
- Only 80 of the 400 v2 queries are Chinese; treat multilingual conclusions with caution