Three scenarios. Four stacks. Ten runs each. Every result is signed with HMAC-SHA256 and available as downloadable JSON. If Sturna underperforms on a metric, you'll see it here.
Each scenario uses its own eval rubric (keyword coverage, citation grounding, correct determination). Every number links to a signed JSON evidence file.
| Metric | Sturna | LangChain + GPT-4o | AutoGen + GPT-4o | CrewAI + GPT-4o |
|---|---|---|---|---|
| Loading results… | | | | |
| Metric | Sturna | LangChain + GPT-4o | AutoGen + GPT-4o | CrewAI + GPT-4o |
|---|---|---|---|---|
| Loading results… | | | | |
| Metric | Sturna | LangChain + GPT-4o | AutoGen + GPT-4o | CrewAI + GPT-4o |
|---|---|---|---|---|
| Loading results… | | | | |
Clone the repo, install dependencies, run one command. Results land in public/benchmarks-vs/evidence/ as HMAC-signed JSON files. Requires OPENAI_API_KEY.
Each question/task/document is scored against a defined rubric: keyword coverage (% of required
concepts present), citation grounding (% of claims traceable to source material), and
hallucination detection (presence of specific false numerical claims). Rubrics are open source
in scripts/benchmarks/eval/rubrics/.
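For illustration, here is a minimal Node.js sketch of how a rubric check along these lines could be scored. The rubric shape (requiredKeywords, falseClaims) and the scoring function are hypothetical, and citation grounding is omitted for brevity; the canonical rubrics and eval code are what ships in scripts/benchmarks/eval/rubrics/.

```js
// Hypothetical rubric shape and scoring sketch; the real rubrics live in
// scripts/benchmarks/eval/rubrics/ and may differ in shape and detail.
function scoreAnswer(answer, rubric) {
  const text = answer.toLowerCase();

  // Keyword coverage: % of required concepts present in the answer.
  const hits = rubric.requiredKeywords.filter((k) => text.includes(k.toLowerCase()));
  const keywordCoverage = (hits.length / rubric.requiredKeywords.length) * 100;

  // Hallucination detection: flag any known-false numerical claims.
  const matchedFalseClaims = rubric.falseClaims.filter((c) => text.includes(c.toLowerCase()));

  return {
    keywordCoverage,
    hallucinated: matchedFalseClaims.length > 0,
    matchedFalseClaims,
  };
}

// Example usage with a made-up rubric entry.
console.log(
  scoreAnswer("Lead time rose to 14 weeks due to port congestion.", {
    requiredKeywords: ["lead time", "port congestion"],
    falseClaims: ["22 weeks"],
  })
);
```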
Every result file is signed with HMAC-SHA256. The signature covers the full JSON payload, excluding the signature field itself. The signing key comes from the BENCHMARK_SIGNING_KEY env var (or is derived from ADMIN_SECRET). Verification code lives in scripts/benchmarks/sign.js.
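A minimal verification sketch, assuming the signature is stored hex-encoded in a top-level signature field and that the payload serializes the same way it did when signed; scripts/benchmarks/sign.js is the canonical implementation.

```js
// Hypothetical verifier; see scripts/benchmarks/sign.js for the canonical code.
// Assumes a hex-encoded "signature" field and matching JSON serialization.
const crypto = require("node:crypto");
const fs = require("node:fs");

function verifyEvidence(filePath, key = process.env.BENCHMARK_SIGNING_KEY) {
  const { signature, ...payload } = JSON.parse(fs.readFileSync(filePath, "utf8"));

  // Re-sign the payload with the signature field removed, as described above.
  const expected = crypto
    .createHmac("sha256", key)
    .update(JSON.stringify(payload))
    .digest("hex");

  return crypto.timingSafeEqual(Buffer.from(signature, "hex"), Buffer.from(expected, "hex"));
}

// Illustrative file name only; point this at any downloaded evidence file.
console.log(verifyEvidence("public/benchmarks-vs/evidence/example.json"));
```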
Model: gpt-4o-2024-11-20 for all stacks.
Temperature: 0. Max tokens: 1024. N=10 runs per scenario per stack.
Hardware: single Node.js process, sequential runs.
Token costs at GPT-4o pricing: $2.50/1M input, $10.00/1M output tokens.
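As a rough illustration of what a single run costs at that pricing (the token counts below are made up):

```js
// Per-run cost at the GPT-4o pricing quoted above:
// $2.50 per 1M input tokens, $10.00 per 1M output tokens.
function runCostUsd(inputTokens, outputTokens) {
  return (inputTokens / 1e6) * 2.5 + (outputTokens / 1e6) * 10.0;
}

// Example: 3,000 input + 800 output tokens -> $0.0075 + $0.0080 = $0.0155.
console.log(runCostUsd(3000, 800).toFixed(4)); // "0.0155"
```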
If Sturna underperforms on any metric, those results are published as-is. Credibility comes from accuracy, not from the scoreboard. The supply chain scenario shows the smallest gap: competitors are within 10 points. We don't cherry-pick scenarios where we look best.