OpenViking Benchmark Update: User Memory, Agent Memory, and Knowledge Base QA

The latest OpenViking benchmark update answers a narrow product question: when OpenViking is added as the context layer for agents and RAG workflows, do accuracy, latency, and token cost move in the right direction at the same time?

User Memory

80%+

LoCoMo accuracy across OpenClaw, Hermes, and Claude Code integrations.

Agent Memory

+11.87pp

Airline task accuracy lift on tau2-bench after adding experience memory.

Knowledge Base QA

91.00%

HotpotQA accuracy with OpenViking top-20 retrieval at 0.23s retrieval latency.

I: User Memory

LoCoMo tests whether an agent can answer questions that depend on long-range conversation history. The important part is that OpenViking was not measured on one bespoke agent. It was attached to OpenClaw, Hermes, and Claude Code, and all three crossed 80% accuracy.

Integration	Accuracy	Avg. query time	Input tokens
OpenClaw native memory	24.20%	95.14s	392,559,404
OpenClaw + OpenViking	82.08%	38.8s	37,423,456
Hermes native memory	33.38%	82.4s	79,228,398
Hermes + OpenViking	82.86%	27.9s	52,026,755
Claude Code auto-memory	57.21%	49.1s	353,306,422
Claude Code + OpenViking	80.32%	20.4s	129,968,899

The efficiency gains matter as much as the accuracy gains. Compared with each native baseline, average query time dropped by roughly 58% to 66%, and input token use dropped by roughly 34% to 91%.

Agent	Accuracy lift	Latency reduction	Token reduction
OpenClaw	24.20% → 82.08% (3.39×)	-59.22%	-91.0%
Hermes	33.38% → 82.86% (2.48×)	-66.10%	-34.3%
Claude Code	57.21% → 80.32% (1.40×)	-58.45%	-63.2%
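The derived columns follow from the raw measurements by simple arithmetic. The sketch below reproduces the Claude Code row from the two tables above; the function names are illustrative and are not part of OpenViking or its benchmark harness.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


def ratio(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


# Claude Code auto-memory vs. Claude Code + OpenViking (values from the first table)
latency_change = percent_change(49.1, 20.4)                # avg. query time, seconds
token_change = percent_change(353_306_422, 129_968_899)    # input tokens
accuracy_ratio = ratio(57.21, 80.32)                       # accuracy, percent

print(f"latency:  {latency_change:.2f}%")
print(f"tokens:   {token_change:.1f}%")
print(f"accuracy: {accuracy_ratio:.2f}x")
```

The accuracy figure is a ratio of the two accuracies, not an additive gain: 1.40× means the integrated accuracy is 1.40 times the baseline.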

II: Agent Memory

User memory answers “what does this user care about?” Agent memory answers a different question: “what has this agent learned from previous work that should change the next attempt?” The new results cover both an economic simulation (ClawWork) and task success (tau2-bench).

ClawWork

+69.34%

Net income after 50 tasks increased from $2,269.77 to $3,843.74.

ClawWork

-22.8%

Average hourly token use dropped from 1,030.3K/h to 872.4K/h.

tau2-bench

+6.87pp / +11.87pp

Retail and Airline accuracy improved after adding OpenViking experience memory.

Setting	Retail accuracy	Airline accuracy
LLM without memory	70.94%	54.38%
LLM + OpenViking experience memory	77.81% (+6.87pp)	66.25% (+11.87pp)
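The lifts in this table are reported in percentage points (pp), which is the absolute difference between two accuracies, not a relative percentage. A minimal sketch of the distinction, using the tau2-bench values above; the helper names are illustrative only.

```python
def percentage_point_gain(baseline: float, treated: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return treated - baseline


def relative_gain(baseline: float, treated: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    return (treated - baseline) / baseline * 100.0


# tau2-bench accuracies: (LLM without memory, LLM + OpenViking experience memory)
results = {"Retail": (70.94, 77.81), "Airline": (54.38, 66.25)}

for domain, (baseline, treated) in results.items():
    pp = percentage_point_gain(baseline, treated)
    rel = relative_gain(baseline, treated)
    print(f"{domain}: +{pp:.2f}pp absolute, +{rel:.1f}% relative to baseline")
```

Because the Airline baseline is lower, the same calculation gives it a larger relative gain than its percentage-point lift alone suggests.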

III: Knowledge Base QA

Knowledge-base QA is where the trade-off between accuracy, token cost, and latency becomes visible. Some systems buy accuracy by spending many tokens per question or by accepting high retrieval latency. OpenViking aims for a practical middle point: strong accuracy with low retrieval latency and controlled indexing cost.

HotpotQA

Method	Pattern	Accuracy	Tokens / QA	Latency / QA
Naive RAG	Vector	62.50%	1,290	0.11s
HippoRAG 2	Vector + KG	61.00%	726	20s
LightRAG	Vector + KG	89.00%	28,443	75s
LangChain SQL	SQL agent	78.00%	4,776	132s
OpenViking top-5	Vector	72.75%	3,154	0.22s
OpenViking top-20	Vector	91.00%	12,533	0.23s
Nanobot + OpenViking	Vector + agent	87.00%	71,300	61.6s
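One way to read a three-way trade-off like this is a Pareto check: a method is dominated if another method is at least as good on accuracy, tokens, and latency, and strictly better on at least one. The sketch below applies that check to the HotpotQA rows. It is an illustrative reading aid, not part of the benchmark methodology, and it assumes the table's columns are directly comparable across methods.

```python
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    tokens: int      # tokens per QA, lower is better
    latency: float   # seconds per QA, lower is better


# HotpotQA rows copied from the table above
METHODS = [
    Method("Naive RAG", 62.50, 1_290, 0.11),
    Method("HippoRAG 2", 61.00, 726, 20.0),
    Method("LightRAG", 89.00, 28_443, 75.0),
    Method("LangChain SQL", 78.00, 4_776, 132.0),
    Method("OpenViking top-5", 72.75, 3_154, 0.22),
    Method("OpenViking top-20", 91.00, 12_533, 0.23),
    Method("Nanobot + OpenViking", 87.00, 71_300, 61.6),
]


def dominates(a: Method, b: Method) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.tokens <= b.tokens
                and a.latency <= b.latency)
    better = (a.accuracy > b.accuracy or a.tokens < b.tokens
              or a.latency < b.latency)
    return no_worse and better


def pareto_front(methods: list[Method]) -> list[str]:
    """Names of the methods that no other method dominates."""
    return [m.name for m in methods
            if not any(dominates(other, m) for other in methods)]


print(pareto_front(METHODS))
```

Under these assumptions, OpenViking top-20 dominates LightRAG and Nanobot + OpenViking, while the cheaper methods stay on the front because they use fewer tokens or less time at lower accuracy.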

Single-turn RAG Average

Across FinanceBench, NaturalQuestions, ClapNQ, Qasper, and SyllabusQA, OpenViking reaches 66.87% average accuracy with 0.19s retrieval latency. It uses 8.67M tokens for indexing, about 13.8% of LightRAG's 62.7M in this comparison.

Method	Average accuracy	Indexing tokens	Tokens / QA	Retrieval latency
Naive RAG	53.93%	2,755,356	1,435	0.13s
PageIndex	36.75%	5,609,206	710,480	84.60s
HippoRAG 2	44.50%	124,963,618	637	18.83s
LightRAG	76.00%	62,705,469	27,035	9.19s
OpenViking	66.87%	8,671,538	3,060	0.19s
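The indexing and per-question columns can be combined into a rough total token budget. The sketch below assumes indexing is a one-time cost and that per-question tokens add linearly. That simple cost model and the 1,000-question workload are hypothetical, not something the benchmark reports.

```python
# (indexing tokens, tokens per QA) copied from the table above
COSTS = {
    "Naive RAG": (2_755_356, 1_435),
    "PageIndex": (5_609_206, 710_480),
    "HippoRAG 2": (124_963_618, 637),
    "LightRAG": (62_705_469, 27_035),
    "OpenViking": (8_671_538, 3_060),
}


def indexing_ratio(method: str, reference: str) -> float:
    """Indexing tokens of `method` as a fraction of those of `reference`."""
    return COSTS[method][0] / COSTS[reference][0]


def total_tokens(method: str, num_questions: int) -> int:
    """One-time indexing cost plus linear per-question cost (assumed model)."""
    index_tokens, tokens_per_qa = COSTS[method]
    return index_tokens + num_questions * tokens_per_qa


print(f"OpenViking / LightRAG indexing: {indexing_ratio('OpenViking', 'LightRAG'):.1%}")

# Hypothetical workload of 1,000 questions
for name in COSTS:
    print(f"{name}: {total_tokens(name, 1_000):,} tokens")
```

A model like this shows how the ranking shifts with workload size: methods with cheap indexing but expensive queries look worse as the number of questions grows.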

What These Numbers Say

The pattern is consistent: OpenViking is strongest when the product needs a context-management loop. Long user history, previous agent attempts, and large knowledge bases all need a layer that can store, narrow, retrieve, and reuse context without forcing every token into the prompt.

The research track is moving in the same direction. The VikingMem paper has been accepted to VLDB 2026, and OpenViking makes part of that context-database work available as an open-source system that developers can try today.