# OpenViking Benchmark Update: User Memory, Agent Memory, and Knowledge Base QA

Published: 2026-05-29

OpenViking's May 2026 benchmark update focuses on three scenarios: long-conversation user memory, reusable agent experience memory, and knowledge-base QA. The main question is whether OpenViking can improve accuracy while reducing latency and token cost.

## User Memory on LoCoMo

OpenViking was evaluated as an external memory layer for OpenClaw, Hermes, and Claude Code on LoCoMo long-conversation QA.

Results:

- OpenClaw native memory: 24.20% accuracy, 95.14s average query time, 392,559,404 total input tokens.
- OpenClaw + OpenViking: 82.08% accuracy, 38.8s average query time, 37,423,456 total input tokens.
- Hermes native memory: 33.38% accuracy, 82.4s average query time, 79,228,398 total input tokens.
- Hermes + OpenViking: 82.86% accuracy, 27.9s average query time, 52,026,755 total input tokens.
- Claude Code auto-memory: 57.21% accuracy, 49.1s average query time, 353,306,422 total input tokens.
- Claude Code + OpenViking: 80.32% accuracy, 20.4s average query time, 129,968,899 total input tokens.

Improvements versus native baselines (latency is average query time; tokens are total input tokens):

- OpenClaw: 24.20% to 82.08% accuracy, 59.22% lower latency, 90.5% fewer tokens.
- Hermes: 33.38% to 82.86% accuracy, 66.10% lower latency, 34.3% fewer tokens.
- Claude Code: 57.21% to 80.32% accuracy, 58.45% lower latency, 63.2% fewer tokens.
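The percentages above can be recomputed from the raw results. The following is a minimal sketch that assumes each reduction is defined as `(baseline - with_openviking) / baseline`; the names `RESULTS`, `pct_reduction`, and `summarize` are illustrative and not part of any OpenViking tooling. Because the listed query times are rounded to one or two decimals, recomputed values can differ slightly from the reported ones.

```python
# Figures copied from the LoCoMo results listed above:
# (accuracy %, average query time in seconds, total input tokens).
RESULTS = {
    "OpenClaw": {
        "native": (24.20, 95.14, 392_559_404),
        "openviking": (82.08, 38.8, 37_423_456),
    },
    "Hermes": {
        "native": (33.38, 82.4, 79_228_398),
        "openviking": (82.86, 27.9, 52_026_755),
    },
    "Claude Code": {
        "native": (57.21, 49.1, 353_306_422),
        "openviking": (80.32, 20.4, 129_968_899),
    },
}


def pct_reduction(baseline: float, value: float) -> float:
    """Relative reduction versus the baseline, in percent (assumed definition)."""
    return (baseline - value) / baseline * 100.0


def summarize(results=RESULTS) -> dict:
    """Derive accuracy gain (points) and latency/token reductions (percent)."""
    rows = {}
    for agent, runs in results.items():
        base_acc, base_lat, base_tok = runs["native"]
        ov_acc, ov_lat, ov_tok = runs["openviking"]
        rows[agent] = {
            "accuracy_gain_points": ov_acc - base_acc,
            "latency_reduction_pct": pct_reduction(base_lat, ov_lat),
            "token_reduction_pct": pct_reduction(base_tok, ov_tok),
        }
    return rows


if __name__ == "__main__":
    for agent, row in summarize().items():
        print(
            f"{agent}: +{row['accuracy_gain_points']:.2f} points, "
            f"{row['latency_reduction_pct']:.2f}% lower latency, "
            f"{row['token_reduction_pct']:.1f}% fewer tokens"
        )
```

Keeping the raw tuples next to the derived values makes it easy to re-check the summary whenever a benchmark run is updated.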

## Agent Experience Memory

The agent experience memory benchmarks test whether an agent can reuse what it learned from previous attempts.

ClawWork result:

- LLM only: $2,269.77 net income after 50 tasks, 1,030.3K tokens per hour.
- LLM + OpenViking: $3,843.74 net income after 50 tasks, 872.4K tokens per hour.

tau2-bench result:

- LLM without memory: 70.94% Retail accuracy, 54.38% Airline accuracy.
- LLM + OpenViking experience memory: 77.81% Retail accuracy, 66.25% Airline accuracy.
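The ClawWork and tau2-bench figures use different units, so it helps to keep relative change (for income and token rate) separate from percentage-point gain (for accuracy). A small sketch of both calculations follows; the helper names `relative_change_pct` and `point_gain` are illustrative assumptions, and the inputs are copied from the lists above.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = decrease)."""
    return (after - before) / before * 100.0


def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute gain in percentage points between two accuracy values."""
    return after_pct - before_pct


# ClawWork: net income in USD after 50 tasks, and thousands of tokens per hour.
CLAWWORK_INCOME = (2269.77, 3843.74)
CLAWWORK_KTOKENS_PER_HOUR = (1030.3, 872.4)

# tau2-bench: accuracy (%) without memory vs. with OpenViking experience memory.
TAU2 = {"Retail": (70.94, 77.81), "Airline": (54.38, 66.25)}

if __name__ == "__main__":
    print(f"ClawWork income change: {relative_change_pct(*CLAWWORK_INCOME):+.1f}%")
    print(
        "ClawWork token-rate change: "
        f"{relative_change_pct(*CLAWWORK_KTOKENS_PER_HOUR):+.1f}%"
    )
    for domain, (before, after) in TAU2.items():
        print(f"tau2-bench {domain}: {point_gain(before, after):+.2f} points")
```

Reporting accuracy in points rather than relative percent avoids overstating gains on the lower Airline baseline.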

## Knowledge Base QA

HotpotQA result:

- Naive RAG: 62.50% accuracy, 1,290 tokens per QA, 0.11s latency per QA.
- HippoRAG 2: 61.00% accuracy, 726 tokens per QA, 20s latency per QA.
- LightRAG: 89.00% accuracy, 28,443 tokens per QA, 75s latency per QA.
- LangChain SQL agent: 78.00% accuracy, 4,776 tokens per QA, 132s latency per QA.
- OpenViking top-5: 72.75% accuracy, 3,154 tokens per QA, 0.22s latency per QA.
- OpenViking top-20: 91.00% accuracy, 12,533 tokens per QA, 0.23s latency per QA.
- Nanobot + OpenViking: 87.00% accuracy, 71,300 tokens per QA, 61.6s latency per QA.
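Since the HotpotQA list trades accuracy against tokens and latency, one way to read it is as a three-way Pareto comparison: a system is "dominated" if another system is at least as good on all three metrics and strictly better on at least one. The sketch below is an illustrative analysis and not part of the benchmark itself; the `System` type and the `dominates` and `pareto_front` helpers are hypothetical names, and the figures are copied from the list above.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    tokens: int      # tokens per QA, lower is better
    latency: float   # seconds, lower is better


SYSTEMS = [
    System("Naive RAG", 62.50, 1_290, 0.11),
    System("HippoRAG 2", 61.00, 726, 20.0),
    System("LightRAG", 89.00, 28_443, 75.0),
    System("LangChain SQL agent", 78.00, 4_776, 132.0),
    System("OpenViking top-5", 72.75, 3_154, 0.22),
    System("OpenViking top-20", 91.00, 12_533, 0.23),
    System("Nanobot + OpenViking", 87.00, 71_300, 61.6),
]


def dominates(a: System, b: System) -> bool:
    """True if `a` is no worse than `b` on every metric and better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.tokens <= b.tokens
        and a.latency <= b.latency
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.tokens < b.tokens
        or a.latency < b.latency
    )
    return no_worse and strictly_better


def pareto_front(systems: list[System]) -> list[System]:
    """Systems that no other system dominates."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


if __name__ == "__main__":
    for s in pareto_front(SYSTEMS):
        print(f"{s.name}: {s.accuracy}% / {s.tokens} tokens / {s.latency}s")
```

Dominance treats the three metrics as equally important and applies no weighting, so it only filters out options that lose on every axis; choosing among the remaining systems still depends on the product's accuracy, cost, and latency budget.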

Single-turn RAG average across FinanceBench, NaturalQuestions, ClapNQ, Qasper, and SyllabusQA:

- Naive RAG: 53.93% average accuracy, 2,755,356 indexing tokens, 1,435 tokens per QA, 0.13s retrieval latency.
- PageIndex: 36.75% average accuracy, 5,609,206 indexing tokens, 710,480 tokens per QA, 84.60s retrieval latency.
- HippoRAG 2: 44.50% average accuracy, 124,963,618 indexing tokens, 637 tokens per QA, 18.83s retrieval latency.
- LightRAG: 76.00% average accuracy, 62,705,469 indexing tokens, 27,035 tokens per QA, 9.19s retrieval latency.
- OpenViking: 66.87% average accuracy, 8,671,538 indexing tokens, 3,060 tokens per QA, 0.19s retrieval latency.
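The indexing and per-QA token columns can be combined into a rough total-cost estimate. The sketch below assumes indexing is a one-time cost and that the per-QA token count stays constant across queries; neither assumption is stated in the benchmark, and the helper names `total_tokens` and `break_even_queries` are illustrative.

```python
from typing import Optional

# (indexing tokens, tokens per QA), copied from the single-turn list above.
COSTS = {
    "Naive RAG": (2_755_356, 1_435),
    "PageIndex": (5_609_206, 710_480),
    "HippoRAG 2": (124_963_618, 637),
    "LightRAG": (62_705_469, 27_035),
    "OpenViking": (8_671_538, 3_060),
}


def total_tokens(system: str, n_queries: int) -> int:
    """One-time indexing cost plus per-QA cost for `n_queries` questions."""
    indexing, per_qa = COSTS[system]
    return indexing + n_queries * per_qa


def break_even_queries(a: str, b: str) -> Optional[float]:
    """Query count at which `a` and `b` have equal total tokens.

    Returns None when there is no positive crossover, i.e. one system is
    cheaper at every query count under the assumptions above.
    """
    idx_a, qa_a = COSTS[a]
    idx_b, qa_b = COSTS[b]
    if qa_a == qa_b:
        return None
    n = (idx_b - idx_a) / (qa_a - qa_b)
    return n if n > 0 else None


if __name__ == "__main__":
    for other in ("PageIndex", "HippoRAG 2", "LightRAG", "Naive RAG"):
        n = break_even_queries("OpenViking", other)
        label = "no crossover" if n is None else f"~{n:,.0f} queries"
        print(f"OpenViking vs {other}: {label}")
```

A linear cost model like this ignores accuracy entirely, so it is best used alongside the accuracy column rather than as a ranking on its own.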

## Takeaway

The benchmark pattern is consistent: OpenViking is most useful when the product needs a context-management loop. Long user history, previous agent attempts, and large knowledge bases all benefit from a layer that can store, narrow, retrieve, and reuse context without forcing every token into the prompt.

The VikingMem paper has been accepted to VLDB 2026. OpenViking open-sources part of that context-database direction so developers can try it today.