[D] Your Agent, Their Asset: Real-world safety evaluation of OpenClaw agents (CIK poisoning raises attack success to ~64–74%)
Paper: https://arxiv.org/abs/2604.04759
This paper presents a real-world safety evaluation of OpenClaw, a personal AI agent with access to Gmail, Stripe, and the local filesystem.
The authors introduce a taxonomy of persistent agent state:
- Capability (skills / executable code)
- Identity (persona, trust configuration)
- Knowledge (memory)
They evaluate 12 attack scenarios on a live system across multiple models.
Key results:
- baseline attack success rate: ~10–36.7%
- after poisoning a single dimension (CIK): ~64–74%
- even the strongest model shows >3× increase in vulnerability
- best defense still leaves Capability attacks at ~63.8%
- file protection reduces attacks (~97%) but also blocks legitimate updates at similar rates
The paper argues these vulnerabilities are structural, not model-specific.
One interpretation is that current defenses mostly operate at the behavior or context level:
- prompt-level alignment
- monitoring / logging
- state protection mechanisms
But execution remains reachable once the system state is compromised.
This suggests a different framing:
proposal -> authorization -> execution
where authorization is evaluated deterministically:
(intent, state, policy) -> ALLOW / DENY
and execution is only reachable if explicitly authorized.
Curious how others interpret this:
Is this primarily a persistent state poisoning problem?
A capability isolation / sandboxing problem?
Or evidence that agent systems need a stronger execution-time control layer?
[link] [comments]
Want to read more?
Check out the full article on the original site