
OpenViking Benchmark Update: User Memory, Agent Memory, and Knowledge Base QA

May 2026 OpenViking benchmark update across LoCoMo user memory, ClawWork and tau2-bench agent memory, and HotpotQA / single-turn RAG knowledge-base QA.

The latest OpenViking benchmark update answers a narrow product question: when OpenViking is added as the context layer for agents and RAG workflows, do accuracy, latency, and token cost move in the right direction at the same time?

User Memory: 80%+
LoCoMo accuracy across OpenClaw, Hermes, and Claude Code integrations.

Agent Memory: +11.87pp
Airline task accuracy lift on tau2-bench after adding experience memory.

Knowledge Base QA: 91.00%
HotpotQA accuracy with OpenViking top-20 retrieval at 0.23s retrieval latency.

I: User Memory

LoCoMo tests whether an agent can answer questions that depend on long-range conversation history. The important part is that OpenViking was not measured on one bespoke agent. It was attached to OpenClaw, Hermes, and Claude Code, and all three crossed 80% accuracy.

| Integration | Accuracy | Avg. query time | Input tokens |
| --- | --- | --- | --- |
| OpenClaw native memory | 24.20% | 95.14s | 392,559,404 |
| OpenClaw + OpenViking | 82.08% | 38.8s | 37,423,456 |
| Hermes native memory | 33.38% | 82.4s | 79,228,398 |
| Hermes + OpenViking | 82.86% | 27.9s | 52,026,755 |
| Claude Code auto-memory | 57.21% | 49.1s | 353,306,422 |
| Claude Code + OpenViking | 80.32% | 20.4s | 129,968,899 |

The efficiency movement is just as important as the accuracy movement. Compared with each native baseline, latency dropped by roughly 58% to 66%, while token use dropped by 34% to 91%.

| Agent | Accuracy lift | Latency change | Token change |
| --- | --- | --- | --- |
| OpenClaw | 24.20% → 82.08% (3.39×) | -59.22% | -91.0% |
| Hermes | 33.38% → 82.86% (2.48×) | -66.10% | -34.3% |
| Claude Code | 57.21% → 80.32% (1.40×) | -58.45% | -63.2% |
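
The derived columns follow from the raw LoCoMo table by simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical and not part of OpenViking, and the inputs are copied from the table above. Recomputed values may differ slightly from the published column, for example where the published figures were rounded.

```python
# Illustrative recomputation of the derived LoCoMo columns from the raw table.
# All names below are hypothetical helpers, not part of OpenViking.

def accuracy_ratio(before: float, after: float) -> float:
    """How many times higher the new accuracy is than the baseline."""
    return after / before

def pct_change(before: float, after: float) -> float:
    """Signed percentage change; a negative value means a reduction."""
    return (after - before) / before * 100.0

# agent: (native, with OpenViking), each as
# (accuracy %, avg. query time in seconds, input tokens)
LOCOMO = {
    "OpenClaw": ((24.20, 95.14, 392_559_404), (82.08, 38.8, 37_423_456)),
    "Hermes": ((33.38, 82.4, 79_228_398), (82.86, 27.9, 52_026_755)),
    "Claude Code": ((57.21, 49.1, 353_306_422), (80.32, 20.4, 129_968_899)),
}

for agent, (native, with_ov) in LOCOMO.items():
    print(
        f"{agent}: accuracy {accuracy_ratio(native[0], with_ov[0]):.2f}x, "
        f"latency {pct_change(native[1], with_ov[1]):+.2f}%, "
        f"tokens {pct_change(native[2], with_ov[2]):+.1f}%"
    )
```

The accuracy figure in parentheses is a ratio of the two accuracies, while the latency and token figures are relative changes against the native baseline.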

II: Agent Memory

User memory answers “what does this user care about?” Agent memory answers a different question: “what has this agent learned from previous work that should change the next attempt?” The new results look at both economic simulation and task success.

ClawWork: +69.34%
Net income after 50 tasks increased from $2,269.77 to $3,843.74.

ClawWork: -22.8%
Average hourly token use dropped from 1,030.3K/h to 872.4K/h.

tau2-bench: +6.87pp / +11.87pp
Retail and Airline accuracy improved after adding OpenViking experience memory.

| Setting | Retail accuracy | Airline accuracy |
| --- | --- | --- |
| LLM without memory | 70.94% | 54.38% |
| LLM + OpenViking experience memory | 77.81% (+6.87pp) | 66.25% (+11.87pp) |
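
The agent-memory results mix two units: ClawWork net income is reported as a relative change (%), while the tau2-bench lifts are reported in percentage points (pp). The short sketch below, with hypothetical helper names, shows the difference using the figures above.

```python
# Illustrative only: percentage-point lift vs. relative change.
# Helper names are hypothetical, not part of OpenViking or tau2-bench.

def pp_lift(baseline_pct: float, treated_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return treated_pct - baseline_pct

def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100.0

# tau2-bench accuracy (%), without memory -> with experience memory
retail = (70.94, 77.81)
airline = (54.38, 66.25)

print(f"Retail:  {pp_lift(*retail):+.2f}pp ({relative_change(*retail):+.2f}% relative)")
print(f"Airline: {pp_lift(*airline):+.2f}pp ({relative_change(*airline):+.2f}% relative)")

# ClawWork net income in dollars after 50 tasks
print(f"ClawWork net income: {relative_change(2269.77, 3843.74):+.2f}%")
```

A percentage-point lift is the fairer headline for accuracy because it does not depend on how low the baseline was.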

III: Knowledge Base QA

Knowledge-base QA is where the trade-off becomes visible. Some systems push accuracy by spending many tokens or accepting high retrieval latency. OpenViking aims for a practical point: strong accuracy with low retrieval latency and controlled indexing cost.

HotpotQA

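| Method | Pattern | Accuracy | Tokens / QA | Latency / QA |
| --- | --- | --- | --- | --- |
| Naive RAG | Vector | 62.50% | 1,290 | 0.11s |
| HippoRAG 2 | Vector + KG | 61.00% | 726 | 20s |
| LightRAG | Vector + KG | 89.00% | 28,443 | 75s |
| LangChain SQL | SQL agent | 78.00% | 4,776 | 132s |
| OpenViking top-5 | Vector | 72.75% | 3,154 | 0.22s |
| OpenViking top-20 | Vector | 91.00% | 12,533 | 0.23s |
| Nanobot + OpenViking | Vector + agent | 87.00% | 71,300 | 61.6s |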
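
One way to make the trade-off concrete is to check which rows are Pareto-optimal, meaning no other method is at least as good on accuracy, tokens per QA, and latency while being strictly better on one of them. The sketch below is an illustrative reading of the HotpotQA table, not part of the benchmark methodology; the `Row` type and function names are assumptions introduced here, and the values are transcribed from the table above.

```python
# Illustrative Pareto-front check over the HotpotQA table.
# Row, dominates, and pareto_front are hypothetical names for this sketch.
from typing import NamedTuple

class Row(NamedTuple):
    method: str
    accuracy: float     # percent, higher is better
    tokens_per_qa: int  # lower is better
    latency_s: float    # seconds per QA, lower is better

ROWS = [
    Row("Naive RAG", 62.50, 1_290, 0.11),
    Row("HippoRAG 2", 61.00, 726, 20.0),
    Row("LightRAG", 89.00, 28_443, 75.0),
    Row("LangChain SQL", 78.00, 4_776, 132.0),
    Row("OpenViking top-5", 72.75, 3_154, 0.22),
    Row("OpenViking top-20", 91.00, 12_533, 0.23),
    Row("Nanobot + OpenViking", 87.00, 71_300, 61.6),
]

def dominates(a: Row, b: Row) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.tokens_per_qa <= b.tokens_per_qa
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.tokens_per_qa < b.tokens_per_qa
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better

def pareto_front(rows: list[Row]) -> list[Row]:
    """Rows that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]

for row in pareto_front(ROWS):
    print(row.method)
```

With the values as transcribed, OpenViking top-20 dominates both LightRAG and Nanobot + OpenViking, while the cheaper configurations stay on the front because they use fewer tokens or less time.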

Single-turn RAG Average

Across FinanceBench, NaturalQuestions, ClapNQ, Qasper, and SyllabusQA, OpenViking reaches 66.87% average accuracy with 0.19s retrieval latency. Its indexing token cost is 8.67M, about 13.8% of LightRAG in this comparison.

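| Method | Average accuracy | Indexing tokens | Tokens / QA | Retrieval latency |
| --- | --- | --- | --- | --- |
| Naive RAG | 53.93% | 2,755,356 | 1,435 | 0.13s |
| PageIndex | 36.75% | 5,609,206 | 710,480 | 84.60s |
| HippoRAG 2 | 44.50% | 124,963,618 | 637 | 18.83s |
| LightRAG | 76.00% | 62,705,469 | 27,035 | 9.19s |
| OpenViking | 66.87% | 8,671,538 | 3,060 | 0.19s |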
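
The 13.8% figure is the ratio of the two indexing-token counts. The sketch below recomputes it and, as an added assumption not made in the benchmark itself, estimates total token spend as indexing cost plus per-question cost over a hypothetical number of queries. The names are illustrative and the values are transcribed from the table above.

```python
# Illustrative cost arithmetic for the single-turn RAG table.
# COSTS, indexing_share, and total_tokens are hypothetical names for this sketch.

# method: (indexing tokens, tokens per QA)
COSTS = {
    "Naive RAG": (2_755_356, 1_435),
    "PageIndex": (5_609_206, 710_480),
    "HippoRAG 2": (124_963_618, 637),
    "LightRAG": (62_705_469, 27_035),
    "OpenViking": (8_671_538, 3_060),
}

def indexing_share(method: str, reference: str = "LightRAG") -> float:
    """Indexing tokens of `method` as a percentage of `reference`."""
    return COSTS[method][0] / COSTS[reference][0] * 100.0

def total_tokens(method: str, n_queries: int) -> int:
    """One-time indexing cost plus per-question cost for n_queries questions."""
    indexing, per_qa = COSTS[method]
    return indexing + n_queries * per_qa

print(f"OpenViking indexing cost: {indexing_share('OpenViking'):.1f}% of LightRAG")
for method in COSTS:
    # 1,000 queries is an arbitrary example workload, not a benchmark setting.
    print(f"{method}: {total_tokens(method, 1_000):,} tokens for 1,000 queries")
```

Which method is cheapest overall depends on query volume, since indexing is paid once and the per-question cost is paid on every query.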

What These Numbers Say

The pattern is consistent: OpenViking is strongest when the product needs a context-management loop. Long user history, previous agent attempts, and large knowledge bases all need a layer that can store, narrow, retrieve, and reuse context without forcing every token into the prompt.

The research track is moving in the same direction. The VikingMem paper has been accepted at VLDB 2026, and OpenViking makes part of that context-database work available as an open-source system that developers can try today.