

Knowledge-base retrieval benchmark

The paper-scale ENGRA-KB-v2 suite (550 docs / 400 queries) and the ENGRA-KB-v1 smoke suite compare BM25, chunk-RAG, and structured atom indexing on identical corpora.


Why this benchmark

Most KB/RAG products lack reproducible public baselines. ENGRA-KB-v2 ships 550 synthetic enterprise articles and 400 queries, comparable in scale to a BEIR subtask (e.g. SciFact, ~300 queries), and runs fully offline. ENGRA-KB-v1 (12 docs / 30 queries) remains available for fast smoke tests.

Systems under test

| ID | System | Family | Product analogue |
| --- | --- | --- | --- |
| bm25 | BM25 | Lexical | Confluence / Notion keyword find |
| rag-chunk-512 | RAG fixed-512 | Traditional RAG | LangChain default TextSplitter(512) |
| rag-chunk-256-overlap | RAG 256+128 overlap | Traditional RAG | Common PDF / long-doc ingest pipelines |
| rag-sentence | RAG sentence | Traditional RAG | Sentence-boundary chunks merged to ~480 chars |
| atom-structured | Structured atom | Structured KB | Engra Persona / Scope / Topic / Atom single-vector model |
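
The three chunk-RAG strategies in the table can be illustrated with a minimal sketch. This is not the suite's actual implementation: the function names `chunkFixed` and `chunkSentences` are hypothetical, and sizes are assumed to be counted in characters (the suite may count tokens for the fixed-size variants).

```typescript
// Hypothetical helpers illustrating the chunk-RAG strategies; not the suite's code.
// Assumption: sizes are measured in characters.

// Fixed-size windows with optional overlap.
// rag-chunk-512 ~ chunkFixed(text, 512); rag-chunk-256-overlap ~ chunkFixed(text, 256, 128).
function chunkFixed(text: string, size: number, overlap = 0): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once the current window already reaches the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

// Greedy sentence merging: append sentences until adding the next one
// would exceed the target length (rag-sentence uses roughly 480 chars).
// The regex is a naive splitter on '.', '!' and '?'; a sentence longer
// than the target is kept whole as its own oversized chunk.
function chunkSentences(text: string, target = 480): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length > 0 && current.length + sentence.length > target) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Overlap trades index size for boundary robustness: with a 128-character overlap, a fact that straddles one window boundary still appears intact in the neighbouring chunk, at the cost of roughly twice as many chunks to embed.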

Metrics

  • Recall@K: Whether any labeled relevant document appears in top K (document-level, BEIR-style)
  • MRR: Mean reciprocal rank of the first relevant document
  • nDCG@K: Rank-aware gain; relevant docs ranked higher score better
  • ms/q: Mean retrieval latency per query (local CPU, no LLM generation)
  • Task success: Agent proxy: whether top-K context contains extractive gold span (`benchmark:memory:agent-task`)
  • Token F1: Gemini E2E QA: token F1 vs gold span (`benchmark:memory:gemini-qa`)
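
The metric definitions above can be made concrete with a short sketch. It assumes binary relevance labels, unique document IDs, and whitespace tokenization for Token F1; all function names are hypothetical and the suite's own scoring code may differ (for example in tokenization or in how ties are handled). Per-query values are averaged over all queries to obtain the reported numbers.

```typescript
// Hypothetical per-query metric helpers; `ranked` holds retrieved document IDs
// in rank order, `relevant` holds the labeled relevant IDs (assumed unique).

function isRelevant(id: string, relevant: string[]): boolean {
  return relevant.indexOf(id) !== -1;
}

// Recall@K as defined above: 1 if any relevant document is in the top K, else 0.
function recallAtK(ranked: string[], relevant: string[], k: number): number {
  return ranked.slice(0, k).some((id) => isRelevant(id, relevant)) ? 1 : 0;
}

// Reciprocal rank of the first relevant document; MRR is its mean over queries.
function reciprocalRank(ranked: string[], relevant: string[]): number {
  for (let i = 0; i < ranked.length; i++) {
    if (isRelevant(ranked[i] ?? "", relevant)) return 1 / (i + 1);
  }
  return 0;
}

// nDCG@K with binary gains: DCG divided by the DCG of an ideal ranking.
function ndcgAtK(ranked: string[], relevant: string[], k: number): number {
  const discount = (i: number) => Math.LN2 / Math.log(i + 2); // 1 / log2(rank + 1)
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (isRelevant(id, relevant)) dcg += discount(i);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(k, relevant.length); i++) ideal += discount(i);
  return ideal === 0 ? 0 : dcg / ideal;
}

// Task-success proxy: does any retrieved context contain the gold span verbatim?
function containsGoldSpan(contexts: string[], goldSpan: string): boolean {
  return contexts.some((c) => c.indexOf(goldSpan) !== -1);
}

// Token-level F1 between a predicted answer and the gold span.
// Assumption: lowercase + whitespace tokenization.
function tokenF1(prediction: string, gold: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const pred = tokenize(prediction);
  const ref = tokenize(gold);
  if (pred.length === 0 || ref.length === 0) return 0;
  // Prototype-free object so tokens like "constructor" are counted safely.
  const counts: Record<string, number> = Object.create(null);
  for (const t of ref) counts[t] = (counts[t] ?? 0) + 1;
  let overlap = 0;
  for (const t of pred) {
    const remaining = counts[t] ?? 0;
    if (remaining > 0) {
      overlap++;
      counts[t] = remaining - 1;
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}
```

Note that this Recall@K is a hit-rate style measure, following the definition given here; some toolkits instead report the fraction of all relevant documents found in the top K, so the two should not be mixed in one table.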

Datasets: ENGRA-KB-v2 (primary) / v1 (smoke)

  • ENGRA-KB-v2: 550 English enterprise KB articles (55 topics across 11 departments × 10 variants each)
  • 400 queries: lexical 120, semantic 126, disambiguation 44, Chinese 80, cross-scope 30
  • ENGRA-KB-v1: 12 docs / 30 queries for sub-minute smoke runs
  • Fixtures: scripts/memory-benchmarks/fixtures/engra-kb-v2/ (primary) and engra-kb-v1/
  • Regenerate v2: npm run benchmark:memory:generate-v2
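
As a quick sanity check on the query mix listed above, the sketch below sums the per-category counts and derives each category's share. The counts are copied from the list; the names `queryMix`, `totalQueries`, and `categoryShares` are hypothetical and not part of the fixtures.

```typescript
// Query counts per category, copied from the ENGRA-KB-v2 description above.
const queryMix: Record<string, number> = {
  lexical: 120,
  semantic: 126,
  disambiguation: 44,
  chinese: 80,
  "cross-scope": 30,
};

// Total number of queries across all categories.
function totalQueries(mix: Record<string, number>): number {
  return Object.keys(mix).reduce((sum, key) => sum + (mix[key] ?? 0), 0);
}

// Fraction of the suite that each category represents.
function categoryShares(mix: Record<string, number>): Record<string, number> {
  const total = totalQueries(mix);
  const shares: Record<string, number> = {};
  for (const key of Object.keys(mix)) {
    shares[key] = total === 0 ? 0 : (mix[key] ?? 0) / total;
  }
  return shares;
}

console.log(totalQueries(queryMix)); // 400
console.log(categoryShares(queryMix));
```

Because the categories are unevenly sized, an overall average is dominated by the lexical and semantic queries; reporting metrics per category alongside the aggregate avoids hiding weak results on the smaller groups.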

How to run

  1. npm run benchmark:memory:v2 — paper-scale ENGRA-KB-v2
  2. npm run benchmark:memory:ablation — atom metadata ablation
  3. npm run benchmark:memory:agent-task — context sufficiency proxy (no LLM)
  4. npm run benchmark:memory:gemini-qa -- --subset full — Gemini end-to-end QA (requires GEMINI_API_KEY)
  5. npm run benchmark:memory — ENGRA-KB-v1 smoke (default --suite engra-kb-v1)
  6. npm run benchmark:mteb:install && npm run benchmark:mteb — MTEB retrieval (SciFact / NFCorpus / FiQA2018)
  7. Fast smoke (BM25 only): npm run benchmark:memory:lexical
  8. Default embedder: Xenova/all-MiniLM-L6-v2 (MTEB equivalent: sentence-transformers/all-MiniLM-L6-v2)
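
For scripted runs, the first four steps above could be chained from a small Node script. The sketch below only builds the command strings and skips the Gemini step when `GEMINI_API_KEY` is unset; the npm script names come from the list above, while the `Step` type, `v2Steps`, and `planCommands` are hypothetical.

```typescript
// One benchmark step: an npm script, extra CLI args, and an optional
// environment variable that must be set for the step to be runnable.
interface Step {
  script: string;
  args: string[];
  requiresEnv?: string;
}

// ENGRA-KB-v2 steps in the order listed above.
const v2Steps: Step[] = [
  { script: "benchmark:memory:v2", args: [] },
  { script: "benchmark:memory:ablation", args: [] },
  { script: "benchmark:memory:agent-task", args: [] },
  {
    script: "benchmark:memory:gemini-qa",
    args: ["--subset", "full"],
    requiresEnv: "GEMINI_API_KEY",
  },
];

// Build the shell commands to run, dropping steps whose required
// environment variable is missing or empty.
function planCommands(env: Record<string, string | undefined>): string[] {
  return v2Steps
    .filter((s) => s.requiresEnv === undefined || Boolean(env[s.requiresEnv]))
    .map((s) => {
      const base = ["npm", "run", s.script];
      // npm forwards everything after "--" to the underlying script.
      const extra = s.args.length > 0 ? ["--"].concat(s.args) : [];
      return base.concat(extra).join(" ");
    });
}
```

In a real script, `planCommands(process.env)` could be passed line by line to `execSync` from `node:child_process`; keeping the planning step pure makes it easy to check which suites will run before spending API quota.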

Relation to industry benchmarks

  • MTEB: SciFact / NFCorpus / FiQA2018 retrieval subtasks are integrated; scores are compared against the official MTEB leaderboard
  • ENGRA-KB-v2: paper / product chunking comparison (BM25 vs chunk-RAG vs Atom)
  • Gemini E2E QA: frozen reader over top-5 context — see published results page
  • ENGRA-KB-v1: fast pipeline smoke during development
  • RAGAS: measures generation faithfulness and is not covered here; both ENGRA-KB suites evaluate retrieval only and produce no LLM answers

Caveats

  • Synthetic corpus — not representative of real customer traffic; cite suite version and embedder when publishing numbers
  • Gemini QA with gemini-embedding-2 is not directly comparable to MiniLM retrieval tables
  • Production stacks add rerankers, hybrid search, ACL filters; this benchmark isolates the retrieval core
  • Only 80 of the 400 queries are Chinese; treat multilingual conclusions with caution
