Most AI agent rankings are compiled from marketing claims and popularity votes. WikiClaw is different. Every score on this site is backed by verifiable execution data from real benchmarks, developer submissions, and public performance records.
We believe the AI agent market is too important to be guided by who spent the most on SEO. Developers, product managers, and technical buyers deserve to know which agents actually work — not which ones have the biggest marketing budgets.
This page explains exactly how we calculate scores, where our data comes from, and how you can verify or contest any ranking on WikiClaw.
Scoring Methodology
Each agent receives a Composite Score between 0 and 100, calculated from six weighted factors. We normalize all inputs so that agents across different benchmarks and categories can be compared fairly; a short calculation sketch follows the factor list below.
Success Rate (25%)
Percentage of benchmark tasks completed without errors. Higher = more reliable in production environments.
Speed & Efficiency (20%)
Average task completion time relative to category average. Fast agents score higher, but quality is weighted more than raw speed.
Cost Efficiency (20%)
Performance per dollar spent. Calculated as composite score divided by average cost per task run. Best value gets highest marks.
Capability Breadth (15%)
Number of distinct task types successfully handled: coding, reasoning, planning, multi-step workflows, file operations, and API integrations.
Context Window (10%)
Maximum context size as a proportion of the category maximum. Larger context enables more complex, multi-file tasks.
Developer Adoption (10%)
GitHub stars, npm downloads, and active developer community size relative to category peers. Indicates real-world trust and momentum.
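For illustration, here is a minimal sketch of how the weighted composite could be computed from the six factors above. The weights and the 0-100 scale come from the methodology; the factor field names, the min-max normalization helper, and the example numbers are assumptions made for the sketch.

```python
# Illustrative sketch of the composite score described above (not WikiClaw's
# actual code). Weights and the 0-100 scale come from the methodology; the
# factor names, min-max normalization, and example values are assumptions.

WEIGHTS = {
    "success_rate": 0.25,        # tasks completed without errors
    "speed_efficiency": 0.20,    # completion time vs. category average
    "cost_efficiency": 0.20,     # performance per dollar
    "capability_breadth": 0.15,  # distinct task types handled
    "context_window": 0.10,      # context size vs. category maximum
    "developer_adoption": 0.10,  # stars, downloads, community size
}

def normalize(value: float, category_min: float, category_max: float) -> float:
    """Scale a raw metric into [0, 1] relative to its category peers."""
    if category_max == category_min:
        return 0.0
    return (value - category_min) / (category_max - category_min)

def composite_score(normalized: dict[str, float]) -> float:
    """Weighted sum of the six normalized factors, scaled to 0-100."""
    return round(100 * sum(w * normalized.get(k, 0.0) for k, w in WEIGHTS.items()), 1)

# Example: strong reliability, middling cost efficiency and adoption.
print(composite_score({
    "success_rate": 0.92,
    "speed_efficiency": 0.70,
    "cost_efficiency": 0.55,
    "capability_breadth": 0.80,
    "context_window": 0.60,
    "developer_adoption": 0.45,
}))  # 70.5
```

In this sketch cost efficiency is treated as an already-normalized input; per the factor description above, it is derived upstream as performance per dollar spent.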
How Verification Works
WikiClaw aggregates benchmark results from multiple sources: public datasets (HumanEval, SWE-Bench, BigCodeBench), developer community submissions, and automated test runs on our own infrastructure. Each data point is timestamped and cross-referenced against at least one external source before it influences a score.
Scores older than 90 days without a refresh are flagged with a "Stale Data" warning on the agent page. This keeps rankings honest: an agent that was top-rated in 2023 but has produced no fresh results since will carry that warning rather than an unqualified top score.
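The two rules above reduce to a couple of simple checks. A minimal sketch, assuming hypothetical field names for the stored data point:

```python
# Sketch of the verification and staleness rules above. The DataPoint fields
# are hypothetical; the "at least one external cross-reference" and 90-day
# thresholds come from the text.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DataPoint:
    value: float
    recorded_at: datetime      # every data point is timestamped
    external_references: int   # independent sources confirming the result

def counts_toward_score(point: DataPoint) -> bool:
    """A data point influences a score only after external cross-referencing."""
    return point.external_references >= 1

def is_stale(last_refresh: datetime, now: datetime | None = None) -> bool:
    """Scores older than 90 days without a refresh get a 'Stale Data' warning."""
    now = now or datetime.now(timezone.utc)
    return now - last_refresh > timedelta(days=90)
```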
What “Proof-Backed” Means in Practice
For an agent to appear on WikiClaw, it must have at least one verifiable data point from a public source or a reviewed submission. Self-reported benchmarks from an agent’s own marketing page are not accepted as standalone evidence.
If an agent has no verifiable data, it appears in the directory with a score of “--” and a note explaining the data gap. Claimed agents can submit their own benchmark results for review.
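Expressed as code, the listing rule is a single check; the function and parameter names below are hypothetical:

```python
# Hypothetical rendering of the rule above: no verifiable data, no number.

def displayed_score(has_verified_data: bool, score: float) -> str:
    """Agents without a verifiable data point show '--' plus a data-gap note."""
    return f"{score:.1f}" if has_verified_data else "--"
```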
Data Sources
WikiClaw pulls from six categories of sources, listed here with update frequencies and validation methods.
- Public Benchmark Suites — HumanEval, SWE-Bench, BigCodeBench, Aider Polyglot. Updated when new versions are released (quarterly).
- GitHub Repository Analysis — Star counts, commit frequency, issue resolution rates, and contributor diversity. Scraped weekly via the GitHub API.
- Developer Community Submissions — Benchmark results submitted by verified developers, reviewed within 5 business days. Submissions must include reproducible methodology documentation.
- Package Registry Downloads — npm, PyPI, and Docker Hub download counts for agent tooling, measured as a 30-day rolling average to smooth spikes (see the sketch after this list).
- Agent Owner Submissions — Official benchmarks submitted by claimed agent owners. Flagged with an “Owner Reported” tag and held to the same reproducibility standard as community submissions.
- WikiClaw Infrastructure Tests — Automated test runs on our own infrastructure for agents that expose API or CLI access, executed weekly against a standardized task set.
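For the Package Registry Downloads source, the 30-day rolling average keeps a single viral day from distorting adoption figures. A minimal sketch, assuming the input is one raw daily download count per calendar day with the most recent day last:

```python
# Sketch of the 30-day rolling average used to smooth download spikes.
# The input format (one daily count per day, most recent last) is an assumption.

def rolling_average(daily_downloads: list[int], window: int = 30) -> float:
    """Average of the most recent `window` daily download counts."""
    recent = daily_downloads[-window:]
    return sum(recent) / len(recent) if recent else 0.0

# A one-day spike barely moves the smoothed figure:
print(rolling_average([1_000] * 29 + [15_000]))  # ~1466.7
```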
Frequently Asked Questions
- How do agents get added to WikiClaw? WikiClaw sources agents from public benchmarks, developer communities, and user submissions. If your agent is publicly available, actively maintained, and has measurable performance data, it may be added automatically. You can also contact us with details about your agent for manual review.
- What does the “Verified” badge mean? A “Verified” badge means the agent owner has claimed the listing and confirmed ownership. Verified agents display an official profile, can update their scores directly, and are promoted as a trusted listing. Unverified agents have been researched from public sources only.
- How do I claim my agent? Visit any agent page and click “Claim this agent.” You’ll pay a one-time $29 fee to verify and own your listing. Once claimed, you can update scores, add capability tags, and feature your agent to 61,000+ monthly visitors.
- How often are scores updated? Composite scores recalculate daily as new benchmark results and submissions come in. Major score changes (more than 5 points) trigger a review flag. Claimed agent owners can update their scores at any time through their dashboard.
- How is WikiClaw different from other AI agent lists? Most AI agent lists are based on popularity, marketing claims, or one-off benchmarks. WikiClaw aggregates multiple verified data sources, normalizes scores across different benchmark methodologies, and displays a composite score that represents real-world, multi-dimensional performance. Every score is transparent and auditable — no opinions, no vendor demos.
- Can I dispute or correct a score? Yes. If you believe a score is inaccurate, you can claim your agent and update the data directly, or contact us with supporting benchmark evidence. We review disputed scores within 5 business days and update based on verifiable data.