The latest OpenViking benchmark update answers a narrow product question: when OpenViking is added as the context layer for agents and RAG workflows, do accuracy, latency, and token cost move in the right direction at the same time?
I: User Memory
LoCoMo tests whether an agent can answer questions that depend on long-range conversation history. The important part is that OpenViking was not measured on a single bespoke agent. It was attached to OpenClaw, Hermes, and Claude Code, and all three integrations crossed 80% accuracy.
| Integration | Accuracy | Avg. query time | Input tokens |
|---|---|---|---|
| OpenClaw native memory | 24.20% | 95.14s | 392,559,404 |
| OpenClaw + OpenViking | 82.08% | 38.8s | 37,423,456 |
| Hermes native memory | 33.38% | 82.4s | 79,228,398 |
| Hermes + OpenViking | 82.86% | 27.9s | 52,026,755 |
| Claude Code auto-memory | 57.21% | 49.1s | 353,306,422 |
| Claude Code + OpenViking | 80.32% | 20.4s | 129,968,899 |
The efficiency gains matter as much as the accuracy gains. Compared with each native baseline, average query latency dropped by roughly 58% to 66%, while input-token use dropped by roughly 34% to 90%.
| Agent | Accuracy lift | Latency change | Token change |
|---|---|---|---|
| OpenClaw | 24.20% → 82.08% (3.39×) | -59.22% | -90.5% |
| Hermes | 33.38% → 82.86% (2.48×) | -66.10% | -34.3% |
| Claude Code | 57.21% → 80.32% (1.40×) | -58.45% | -63.2% |
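
The relative changes can be re-derived from the raw LoCoMo table. The following is a minimal sketch, not part of the benchmark harness: the names `RESULTS`, `relative_reduction`, and `summarize` are illustrative, and the inputs are the rounded values printed in the first table, so the last digit may differ slightly from figures computed on unrounded measurements.

```python
# Figures copied from the LoCoMo table: (accuracy %, avg. query seconds, input tokens).
# All names in this block are illustrative, not part of OpenViking.
RESULTS = {
    "OpenClaw": {
        "native": (24.20, 95.14, 392_559_404),
        "openviking": (82.08, 38.8, 37_423_456),
    },
    "Hermes": {
        "native": (33.38, 82.4, 79_228_398),
        "openviking": (82.86, 27.9, 52_026_755),
    },
    "Claude Code": {
        "native": (57.21, 49.1, 353_306_422),
        "openviking": (80.32, 20.4, 129_968_899),
    },
}


def relative_reduction(before, after):
    """Return the percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


def summarize(native, openviking):
    """Compare one native baseline against the same agent with OpenViking."""
    acc0, lat0, tok0 = native
    acc1, lat1, tok1 = openviking
    return {
        "accuracy_ratio": acc1 / acc0,  # a multiple, e.g. 3.39x
        "latency_reduction_pct": relative_reduction(lat0, lat1),
        "token_reduction_pct": relative_reduction(tok0, tok1),
    }


for agent, runs in RESULTS.items():
    s = summarize(runs["native"], runs["openviking"])
    print(
        f"{agent}: {s['accuracy_ratio']:.2f}x accuracy, "
        f"-{s['latency_reduction_pct']:.1f}% latency, "
        f"-{s['token_reduction_pct']:.1f}% tokens"
    )
```

The accuracy figure is a ratio of the two accuracies, while the latency and token figures are relative reductions against the native baseline, so the three columns are not directly comparable with one another.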
II: Agent Memory
User memory answers “what does this user care about?” Agent memory answers a different question: “what has this agent learned from previous work that should change the next attempt?” The new results look at both economic simulation and task success.
| Setting | Retail accuracy | Airline accuracy |
|---|---|---|
| LLM without memory | 70.94% | 54.38% |
| LLM + OpenViking experience memory | 77.81% (+6.87pp) | 66.25% (+11.87pp) |
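
The lifts in this table are reported in percentage points (pp), which is an absolute difference and not a relative improvement. As a small sketch of the distinction, using only the four accuracies from the table (the helper name `lift` is illustrative):

```python
def lift(baseline_pct, treated_pct):
    """Return (percentage-point gain, relative gain in percent)."""
    pp = treated_pct - baseline_pct      # absolute difference in points
    relative = pp / baseline_pct * 100   # gain as a share of the baseline
    return pp, relative


# Accuracies copied from the table: LLM without memory vs. with experience memory.
retail_pp, retail_rel = lift(70.94, 77.81)
airline_pp, airline_rel = lift(54.38, 66.25)

print(f"Retail:  +{retail_pp:.2f}pp ({retail_rel:.1f}% relative)")
print(f"Airline: +{airline_pp:.2f}pp ({airline_rel:.1f}% relative)")
```

Because the airline baseline is lower, the same table implies a larger relative gain there than in retail, on top of the larger percentage-point gain.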
III: Knowledge Base QA
Knowledge-base QA is where the trade-off becomes visible. Some systems push accuracy by spending many tokens or accepting high retrieval latency. OpenViking aims for a practical point: strong accuracy with low retrieval latency and controlled indexing cost.
HotpotQA
| Method | Pattern | Accuracy | Tokens / QA | Latency / QA |
|---|---|---|---|---|
| Naive RAG | Vector | 62.50% | 1,290 | 0.11s |
| HippoRAG 2 | Vector + KG | 61.00% | 726 | 20s |
| LightRAG | Vector + KG | 89.00% | 28,443 | 75s |
| LangChain SQL | SQL agent | 78.00% | 4,776 | 132s |
| OpenViking top-5 | Vector | 72.75% | 3,154 | 0.22s |
| OpenViking top-20 | Vector | 91.00% | 12,533 | 0.23s |
| Nanobot + OpenViking | Vector + agent | 87.00% | 71,300 | 61.6s |
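
One way to read the trade-off in the HotpotQA table is to ask which rows are not beaten on all three columns at once. The sketch below applies a standard Pareto-dominance check to the table's own numbers. It is an illustration only: the `Row` type and function names are assumptions, and the check ignores everything the table does not measure, such as indexing cost or setup effort.

```python
from typing import NamedTuple


class Row(NamedTuple):
    method: str
    accuracy: float  # percent, higher is better
    tokens: int      # tokens per QA, lower is better
    latency: float   # seconds per QA, lower is better


# Values copied from the HotpotQA table.
ROWS = [
    Row("Naive RAG", 62.50, 1_290, 0.11),
    Row("HippoRAG 2", 61.00, 726, 20.0),
    Row("LightRAG", 89.00, 28_443, 75.0),
    Row("LangChain SQL", 78.00, 4_776, 132.0),
    Row("OpenViking top-5", 72.75, 3_154, 0.22),
    Row("OpenViking top-20", 91.00, 12_533, 0.23),
    Row("Nanobot + OpenViking", 87.00, 71_300, 61.6),
]


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.tokens <= b.tokens
        and a.latency <= b.latency
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.tokens < b.tokens
        or a.latency < b.latency
    )
    return no_worse and strictly_better


def pareto_front(rows):
    """Keep only rows that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


front = [r.method for r in pareto_front(ROWS)]
print(front)
```

On these three columns alone, the rows that survive are Naive RAG, HippoRAG 2, LangChain SQL, and both OpenViking vector configurations. Each remaining row is matched or beaten on accuracy, tokens, and latency simultaneously by OpenViking top-20.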
Single-turn RAG Average
Across FinanceBench, NaturalQuestions, ClapNQ, Qasper, and SyllabusQA, OpenViking reaches 66.87% average accuracy with 0.19s retrieval latency. Its indexing token cost is 8.67M, about 13.8% of LightRAG in this comparison.
| Method | Average accuracy | Indexing tokens | Tokens / QA | Retrieval latency |
|---|---|---|---|---|
| Naive RAG | 53.93% | 2,755,356 | 1,435 | 0.13s |
| PageIndex | 36.75% | 5,609,206 | 710,480 | 84.60s |
| HippoRAG 2 | 44.50% | 124,963,618 | 637 | 18.83s |
| LightRAG | 76.00% | 62,705,469 | 27,035 | 9.19s |
| OpenViking | 66.87% | 8,671,538 | 3,060 | 0.19s |
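
The "about 13.8% of LightRAG" figure comes straight from the indexing-token column. The sketch below recomputes that share for every method and, as a simplifying assumption that is not part of the benchmark, adds up indexing and per-query tokens for a hypothetical workload size. The function names and the query count are illustrative, and the total treats indexing tokens and query tokens as directly comparable, which may not hold if they are billed differently.

```python
# Values copied from the single-turn RAG table.
INDEXING_TOKENS = {
    "Naive RAG": 2_755_356,
    "PageIndex": 5_609_206,
    "HippoRAG 2": 124_963_618,
    "LightRAG": 62_705_469,
    "OpenViking": 8_671_538,
}
TOKENS_PER_QA = {
    "Naive RAG": 1_435,
    "PageIndex": 710_480,
    "HippoRAG 2": 637,
    "LightRAG": 27_035,
    "OpenViking": 3_060,
}


def indexing_share(method, reference="LightRAG"):
    """Indexing tokens of `method` as a percentage of the reference method."""
    return INDEXING_TOKENS[method] / INDEXING_TOKENS[reference] * 100


def total_tokens(method, num_queries):
    """One-off indexing cost plus per-query cost for a hypothetical workload."""
    return INDEXING_TOKENS[method] + TOKENS_PER_QA[method] * num_queries


for method in INDEXING_TOKENS:
    print(f"{method}: {indexing_share(method):.1f}% of LightRAG indexing tokens")

# Hypothetical workload of 1,000 questions, chosen only for illustration.
print(total_tokens("OpenViking", 1_000), total_tokens("LightRAG", 1_000))
```

A combined total like this is only a rough guide, but it shows why the article reports indexing tokens and tokens per QA as separate columns: a method that is cheap to index can still be expensive per query, and vice versa.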
What These Numbers Say
The pattern is consistent: OpenViking is strongest when the product needs a context-management loop. Long user history, previous agent attempts, and large knowledge bases all need a layer that can store, narrow, retrieve, and reuse context without forcing every token into the prompt.
The research track is moving in the same direction. The VikingMem paper has been accepted to VLDB 2026, and OpenViking exposes part of that context-database direction as an open-source system that developers can try today.

