Agent Evaluation

by rustyorb v1.0.0

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a domain where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Downloads: 1,891
Stars: 4
Installs: 20
Versions: 1

Install Agent Evaluation with One Click

Get a managed OpenClaw server and install this skill from your dashboard. No SSH, no Docker, no configuration needed.
