印格

Agent Memory

Knowledge-base retrieval benchmark

ENGRA-KB-v2 paper-scale suite (550 docs / 400 queries) plus ENGRA-KB-v1 smoke: compare BM25, chunk-RAG, and structured atom indexing on identical corpora.

Published test results

ENGRA-KB-v2 (paper scale) shows where structured atoms beat BM25 and chunk-RAG; the embedder sweep compares MiniLM, bge, and E5 on all 400 queries; ENGRA-KB-v1 is the smoke suite; MTEB validates the embedder and pipelines on public corpora.

The benchmark scores below are loaded from repository fixtures and update whenever the JSON is refreshed.

ENGRA-KB-v2: Paper / product advantage

550 docs · 400 queries · 5 systems

Download JSON
Embedder sweep: Stronger embedders

MiniLM · bge-small · E5 · full 400 queries

Download JSON
ENGRA-KB-v1: Fast local check

12 docs · 30 queries · smoke suite

Download JSON
MTEB retrieval-v1: Embed check + pipeline

SciFact · NFCorpus · FiQA2018

Download JSON

ENGRA-KB-v2

ENGRA-KB-v2 reference scores

Compares BM25, traditional chunk-RAG, and Engra Structured atom on 550 KB articles and 400 queries — paper-scale suite (use v1 for smoke tests).

Suite: ENGRA-KB-v2
Queries: 400
Embedder: Xenova/all-MiniLM-L6-v2
Run date: Jun 1, 2026, 7:23 PM

Advantage vs BM25 (Recall@5)

  • Structured atom Recall@5 0.502 vs BM25 0.412 (+0.090)
  • Structured atom MRR 0.307 vs BM25 0.258; dense chunk-RAG (fixed-512) reaches Recall@5 0.360.
  • Atom latency 2.0 ms/q, below fixed-512 RAG (2.5 ms/q), with Scope/Topic metadata retained.

By Query Category (Task success @ contextK=5)

From agent-task-results.json: whether top-K retrieval contains extractive gold spans. This is closer to downstream agent answerability.

A = Atom · R = RAG fixed-512 · B = BM25

| Category | Atom | RAG fixed-512 | BM25 |
| --- | --- | --- | --- |
| Semantic | 0.698 | 0.246 | 0.143 |
| Lexical | 0.367 | 0.517 | 0.892 |
| Multilingual | 0.188 | 0.188 | 0.113 |
| Cross-scope | 0.500 | 0.400 | 0.167 |
| Disambiguation | 0.068 | 0.000 | 0.023 |
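
The task-success check described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `RetrievedChunk` shape, the whitespace and case normalization, and the rule that one matching gold span is enough are all assumptions, since the schema of agent-task-results.json is not shown here.

```typescript
// Hypothetical shape for one retrieved item; not taken from the repository.
interface RetrievedChunk {
  docId: string;
  text: string;
}

// Lowercase and collapse whitespace so formatting differences do not hide a match.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Assumption: a query succeeds when at least one gold span appears verbatim
// (after normalization) in any of the top-K retrieved texts.
function taskSuccessAtK(
  ranked: RetrievedChunk[],
  goldSpans: string[],
  contextK = 5
): boolean {
  const context = ranked.slice(0, contextK).map((c) => normalize(c.text));
  return goldSpans.some((span) => {
    const needle = normalize(span);
    return needle.length > 0 && context.some((text) => text.includes(needle));
  });
}

// Fraction of queries in one category that succeed.
function successRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}
```

Averaging `taskSuccessAtK` per category with `successRate` would give one cell of the category table for one system.
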
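| System | R@1 | R@3 | R@5 | R@10 | MRR | nDCG@1 | nDCG@3 | nDCG@5 | nDCG@10 | ms/q | Chunks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Structured atom (top Recall@5) | 0.145 | 0.378 | 0.502 | 0.752 | 0.307 | 0.145 | 0.261 | 0.309 | 0.392 | 2.0 | 550 |
| BM25 | 0.128 | 0.345 | 0.412 | 0.520 | 0.258 | 0.128 | 0.245 | 0.273 | 0.307 | 2.1 | 555 |
| RAG sentence merge | 0.092 | 0.235 | 0.362 | 0.545 | 0.204 | 0.092 | 0.161 | 0.210 | 0.268 | 2.2 | 557 |
| RAG fixed-512 | 0.090 | 0.237 | 0.360 | 0.547 | 0.204 | 0.090 | 0.161 | 0.209 | 0.269 | 2.5 | 555 |
| RAG 256+128 overlap | 0.077 | 0.195 | 0.295 | 0.542 | 0.183 | 0.077 | 0.137 | 0.176 | 0.254 | 2.8 | 1324 |

  • Structured atom: Engra atom model, one vector per doc with title / scope / topic in the embed source.
  • BM25: lexical baseline with fast keyword hits, weak on paraphrases and Chinese queries.
  • RAG sentence merge: sentence-boundary chunks merged to ~480 chars; nearly the same scores as fixed-512 here.
  • RAG fixed-512: the most common RAG default, 512-char splits as in a typical LangChain / LlamaIndex setup.
  • RAG 256+128 overlap: PDF-style smaller chunks with overlap; slightly lower Recall@5 on short docs.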
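
For readers unfamiliar with the column headers, here is a minimal sketch of Recall@K, reciprocal rank (averaged over queries to give MRR), and binary nDCG@K. It assumes a document-level ranking with no duplicate IDs and binary relevance; the function names are illustrative and are not taken from the benchmark scripts.

```typescript
// Share of the relevant documents that appear in the top K.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// 1 / rank of the first relevant document, or 0 if none was retrieved.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Binary nDCG@K: gain 1 for a relevant document, discounted by log2(rank + 1),
// divided by the DCG of an ideal ranking that puts all relevant documents first.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let idcg = 0;
  const idealHits = Math.min(k, relevant.size);
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

// Average a per-query metric over the whole query set.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

When each query has exactly one relevant document, Recall@1 and nDCG@1 coincide under these definitions, which is consistent with the identical R@1 and nDCG@1 columns in the table.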

How to read these numbers

  • BM25 reaches ~86% Recall@5 on lexical queries but lags on semantic and multilingual categories.
  • Traditional fixed-512 / sentence chunk-RAG stays near ~22–39% Recall@5 on semantic queries on this corpus.
  • Structured atom (title + scope + topic, one vector per doc) leads on semantic (~68%) and disambiguation (~89%) — hierarchical metadata helps ranking.
  • Smaller overlapping chunks (256+128) score lowest overall — chunk size must match document length.
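
The index units being compared can be sketched as below. This is a hedged illustration only: the `KbDoc` fields, the "Title: / Scope: / Topic:" layout of the atom embed source, and the reading of "256+128" as 256-character chunks with a 128-character overlap are assumptions, not details confirmed by the benchmark code.

```typescript
// Hypothetical document shape; field names are illustrative.
interface KbDoc {
  id: string;
  title: string;
  scope: string;
  topic: string;
  body: string;
}

interface IndexUnit {
  docId: string;
  text: string; // the string that would be sent to the embedder
}

// Structured atom: one unit per document, with metadata in the embed source.
// The exact text layout is an assumption.
function toAtom(doc: KbDoc): IndexUnit {
  const text = `Title: ${doc.title}\nScope: ${doc.scope}\nTopic: ${doc.topic}\n\n${doc.body}`;
  return { docId: doc.id, text };
}

// Fixed-size character chunks; overlap = 0 gives plain fixed-size splits.
function chunkFixed(doc: KbDoc, size: number, overlap = 0): IndexUnit[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap;
  const units: IndexUnit[] = [];
  for (let start = 0; start < doc.body.length; start += stride) {
    units.push({ docId: doc.id, text: doc.body.slice(start, start + size) });
    // Stop once the current chunk already reaches the end of the body.
    if (start + size >= doc.body.length) break;
  }
  return units;
}
```

Under these assumptions a 300-character article yields one fixed-512 chunk but two 256+128 chunks, each carrying only part of the article and none of its metadata, which illustrates why the chunk count grows on short documents.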

Run npm run benchmark:memory:v2 locally; smoke suite: npm run benchmark:memory (v1). JSON: scripts/memory-benchmarks/fixtures/engra-kb-v2/reference-results.json.

Download full JSON

Stronger embedders

Embedder comparison (ENGRA-KB-v2)

Same corpus and pipelines (BM25 / RAG fixed-512 / Structured atom) under MiniLM, bge-small-en-v1.5, and multilingual-e5-small. MiniLM is the product default; stronger embedders mainly lift Atom and semantic queries.

Query subset: full (400)
Queries: 400
Documents: 550
Updated: Jun 2, 2026, 5:06 AM
Full 400 queries · best combo: multilingual-e5-small + Atom, Recall@5 0.552 (nDCG@10 0.438; +18.5 pp Recall@5 vs RAG)
| Embedder | Model ID | BM25 R@5 | RAG R@5 | Atom R@5 | Atom−RAG |
| --- | --- | --- | --- | --- | --- |
| MiniLM-L6-v2 | Xenova/all-MiniLM-L6-v2 | 0.412 | 0.355 | 0.505 | +15.0 pp |
| bge-small-en-v1.5 | Xenova/bge-small-en-v1.5 | 0.412 | 0.325 | 0.515 | +19.0 pp |
| multilingual-e5-small (best Atom) | Xenova/multilingual-e5-small | 0.412 | 0.367 | 0.552 | +18.5 pp |

Macro Recall@5 over all query categories. See ENGRA-KB-v2 above for per-system detail (MiniLM default).
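
As a rough illustration of the macro average mentioned above, the sketch below averages per-category hit rates and contrasts that with a micro average over all queries. It also shows how an Atom−RAG gap in percentage points follows from two Recall@5 values. The `QueryResult` shape and function names are assumptions for illustration, not the sweep script's own code.

```typescript
// Hypothetical per-query record: hit = the relevant document appeared in the top 5.
interface QueryResult {
  category: string;
  hit: boolean;
}

// Macro average: compute the hit rate per category, then average the categories,
// so small categories weigh as much as large ones.
function macroRecall(results: QueryResult[]): number {
  const byCategory = new Map<string, { hits: number; total: number }>();
  for (const r of results) {
    const agg = byCategory.get(r.category) ?? { hits: 0, total: 0 };
    agg.hits += r.hit ? 1 : 0;
    agg.total += 1;
    byCategory.set(r.category, agg);
  }
  if (byCategory.size === 0) return 0;
  let sum = 0;
  byCategory.forEach((agg) => {
    sum += agg.hits / agg.total;
  });
  return sum / byCategory.size;
}

// Micro average: every query counts equally, regardless of category.
function microRecall(results: QueryResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.hit).length / results.length;
}

// Difference between two rates in percentage points, rounded to one decimal.
function deltaPp(a: number, b: number): number {
  return Math.round((a - b) * 1000) / 10;
}
```

For example, `deltaPp(0.505, 0.355)` gives 15, matching the +15.0 pp reported for MiniLM. The two averages diverge whenever categories differ in size, which is one possible reason a macro figure can differ slightly from a per-system score computed another way.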

How to read

  • BM25 is embedder-agnostic, so its R@5 is identical across embedders (0.412, ~41.2%).
  • bge / E5 raise Atom macro Recall@5 from 50.5% (MiniLM) to 51.5% (bge) and 55.2% (E5); chunk-RAG moves only from 35.5% to 36.7% with E5 and drops to 32.5% with bge, suggesting structured atoms matter beyond swapping models.
  • E5 + Atom reaches ~77% Recall@5 on semantic queries (full-run logs) vs ~14% BM25 and ~19% chunk-RAG.
  • Reproduce: npm run benchmark:memory:embedder-sweep -- --subset full

npm run benchmark:memory:embedder-sweep

Download full JSON

Checks whether Agent Memory's default embedder (all-MiniLM-L6-v2) reproduces official MTEB leaderboard scores. Delta ≈ 0 means alignment succeeded.

MTEB embed baseline

Industry alignment · not product advantage

MTEB embedding baseline validation

The embed baseline table compares local runs vs the official leaderboard. Pipeline comparison runs BM25, chunk-RAG, and Atom on the same MTEB corpora below.

MTEB corpora · pipeline comparison (ndcg@10)

BM25, RAG fixed-512, and Structured atom on each MTEB test split (same strategies as ENGRA-KB-v1, default embedder all-MiniLM-L6-v2).

Mean ndcg@10 (3 tasks): BM25 0.354 · RAG 0.426 · Atom 0.435

| Task | BM25 | RAG 512 | Atom |
| --- | --- | --- | --- |
| Financial QA retrieval | 0.200 | 0.357 | 0.350 |
| Nutrition / biomedical | 0.268 | 0.278 | 0.307 |
| Scientific fact retrieval | 0.594 | 0.642 | 0.647 |
| Mean ndcg@10 (3 tasks) | 0.354 | 0.426 | 0.435 |

Embed baseline (vs leaderboard)

MTEB model: sentence-transformers/all-MiniLM-L6-v2
Engra local equivalent: Xenova/all-MiniLM-L6-v2
Tasks: 3
Updated: May 31, 2026, 9:13 PM
Mean ndcg@10: 0.443
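| Task | Description | Local ndcg@10 | Leaderboard ndcg@10 | Delta | map@10 | mrr@10 | recall@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SciFact | Scientific fact retrieval | 0.645 | 0.645 | 0.0000 | 0.596 | 0.605 | 0.783 |
| NFCorpus | Nutrition / biomedical | 0.316 | 0.316 | -0.0001 | 0.110 | 0.505 | 0.155 |
| FiQA2018 | Financial QA retrieval | 0.369 | 0.369 | 0.0000 | 0.291 | 0.445 | 0.441 |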

How to read

  • Embed baseline: delta ≈ 0 means the default embedder matches official MTEB scores.
  • Pipeline comparison: chunking/index strategy differences; document-level binary ndcg@10 — compare relative ranking, not absolute vs leaderboard.
  • Dense methods (RAG / Atom) usually beat BM25 on semantic tasks; Atom is often close to or above fixed-size RAG on passage corpora.
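
Two of the ideas in this list can be sketched briefly: collapsing chunk-level hits into a document-level ranking before scoring, and checking that a local score aligns with the leaderboard. Both functions are illustrative assumptions. In particular, ranking each document by its best-scoring chunk and the 0.001 tolerance are choices made for this sketch, not values confirmed by the benchmark scripts.

```typescript
// Hypothetical chunk-level search hit.
interface ChunkHit {
  docId: string;
  score: number; // higher is better
}

// Collapse chunk hits to a document ranking: each document keeps the position
// of its best-scoring chunk, and later chunks from the same document are dropped.
function collapseToDocs(hits: ChunkHit[]): string[] {
  const sorted = hits.slice().sort((a, b) => b.score - a.score);
  const seen = new Set<string>();
  const docs: string[] = [];
  for (const hit of sorted) {
    if (!seen.has(hit.docId)) {
      seen.add(hit.docId);
      docs.push(hit.docId);
    }
  }
  return docs;
}

// "Delta ≈ 0": treat the local run as aligned when it is within a small tolerance.
// The default tolerance is an assumption.
function isAligned(local: number, leaderboard: number, tolerance = 0.001): boolean {
  return Math.abs(local - leaderboard) <= tolerance;
}
```

The resulting document list can then be scored with a binary ndcg@10, so chunked and one-vector-per-document pipelines are compared on the same document-level footing.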

How this relates to ENGRA-KB-v1

ENGRA-KB-v1 is a controlled enterprise KB; MTEB pipeline comparison validates the same chunking/index strategies on public retrieval corpora.

View ENGRA-KB-v2 results →

Embed baseline: npm run benchmark:mteb · Pipeline: npm run benchmark:mteb:pipeline
