The most credible proof of a capability is using it on the hardest target. We ran a full technical and commercial audit of Sturna using Sturna — 201 competing agents, 8 audit dimensions, real findings, real fixes shipped the same day.
An external investor audit rated Sturna D+. The verdict: every performance claim on the platform — "201 agents," "86% success rate," "self-healing," "60% token savings" — was unverifiable. No methodology, no reproducible tests, no data trail. The numbers existed; the proof didn't.
The irony wasn't lost. A platform designed to orchestrate autonomous agents for complex analytical tasks had never turned those agents on itself.
Use Sturna's own 201-agent platform to audit Sturna across 8 dimensions: claim verifiability, technical architecture, security, data quality, growth infrastructure, product quality, operational reliability, and competitive positioning. Surface findings, prioritize fixes, and ship proof artifacts.
A self-audit using your own tool is the hardest kind of test — the system has to be objective about itself, surface uncomfortable truths, and produce actionable output rather than marketing copy. We structured the audit as a competitive agent task with explicit scoring criteria.
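For readers who want the shape of the task rather than just the outcome, here is a minimal sketch of how the eight dimensions and their required outputs could be represented. The dimension names are the real ones from the brief; the `AuditDimension` class and its fields are illustrative, not Sturna's internal schema.

```python
from dataclasses import dataclass, field

@dataclass
class AuditDimension:
    """One graded dimension of the self-audit (illustrative structure)."""
    name: str
    grade: str | None = None                               # letter grade assigned by the winning agent
    findings: list[str] = field(default_factory=list)      # evidence-backed findings
    remediations: list[str] = field(default_factory=list)  # fixes, ordered by business impact

# The eight dimensions named in the task brief.
AUDIT_DIMENSIONS = [
    AuditDimension(name) for name in (
        "claim verifiability", "technical architecture", "security",
        "data quality", "growth infrastructure", "product quality",
        "operational reliability", "competitive positioning",
    )
]
```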
Before the agent auction opened, the platform ingested the full codebase, all marketing pages, the architecture documentation, the investor audit report, all benchmark JSON files, and the git history into the RAG corpus. In total, 2,390 entries were indexed, chunked, and embedded, giving agents factual grounding for every claim they evaluated.
This matters: agents auditing claims can hallucinate without grounding. Forcing every evaluation through the RAG layer meant findings had to cite specific files, functions, or data rows — not vibes.
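To make that constraint concrete, here is a small sketch of what "every finding must cite the corpus" can look like in code. The `Citation` and `Finding` structures and the `validate_finding` check are assumptions for illustration, not the platform's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    source: str    # e.g. a file path, table name, or benchmark JSON in the ingested corpus
    locator: str   # function name, line range, or data row
    excerpt: str   # the text the finding is grounded in

@dataclass
class Finding:
    dimension: str
    statement: str
    citations: list[Citation]

def validate_finding(finding: Finding, corpus_sources: set[str]) -> None:
    """Reject any finding that isn't grounded in the ingested corpus."""
    if not finding.citations:
        raise ValueError(f"Ungrounded finding rejected: {finding.statement!r}")
    for cite in finding.citations:
        if cite.source not in corpus_sources:
            raise ValueError(f"Citation points outside the corpus: {cite.source}")
```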
The intent was broadcast as: "Conduct a technical audit of the Sturna platform across 8 dimensions. For each dimension: cite evidence from the codebase or database, assign a letter grade, and enumerate specific remediation priorities ordered by business impact."
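A rough sketch of how that broadcast might look from a client's point of view is below. The endpoint, payload fields, and `broadcast_intent` helper are hypothetical; only the intent text itself is taken from the audit.

```python
import requests  # generic REST client; the endpoint and payload below are hypothetical

INTENT = (
    "Conduct a technical audit of the Sturna platform across 8 dimensions. "
    "For each dimension: cite evidence from the codebase or database, assign a "
    "letter grade, and enumerate specific remediation priorities ordered by business impact."
)

def broadcast_intent(base_url: str, api_key: str) -> dict:
    """Broadcast the audit intent so agents can bid on it (illustrative call)."""
    resp = requests.post(
        f"{base_url}/intents",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "intent": INTENT,
            "grounding": "rag-required",                     # every evaluation must cite the corpus
            "outputs": ["grade", "findings", "remediations"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # the auction opens once the intent is accepted
```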
All 201 agents submitted proposals. The tiered auction filtered them down to the 10 highest-scoring specialists across the technical analysis, security review, competitive benchmarking, and growth infrastructure dimensions. Each dimension was then assigned to the agent with the best confidence-to-cost ratio for that specific task type.
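One way to read "best confidence-to-cost ratio" is sketched below. The `Proposal` fields, the shortlist size, and the tie-breaking are assumptions about how such a selection could work, not a description of Sturna's auction internals.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    agent_id: str
    dimension: str        # which audit dimension the agent is bidding on
    confidence: float     # the agent's calibrated confidence for this task type, 0..1
    cost: float           # quoted cost in tokens or credits
    auction_score: float  # score from the tiered auction round

def assign_dimensions(proposals: list[Proposal], shortlist_size: int = 10) -> dict[str, Proposal]:
    """Shortlist the top-scoring specialists, then give each dimension to the
    shortlisted proposal with the best confidence-to-cost ratio."""
    shortlist = sorted(proposals, key=lambda p: p.auction_score, reverse=True)[:shortlist_size]
    winners: dict[str, Proposal] = {}
    for p in shortlist:
        ratio = p.confidence / max(p.cost, 1e-9)
        current = winners.get(p.dimension)
        if current is None or ratio > current.confidence / max(current.cost, 1e-9):
            winners[p.dimension] = p
    return winners
```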
| Finding | Dimension | Priority | Status |
|---|---|---|---|
| All performance claims unverifiable — no methodology published | Claim Verifiability | P0 | Resolved — /benchmarks |
| SQL bug in velocity detection: off-by-one in the time-window calculation (see the sketch after this table) | Technical Architecture | P0 | Fixed same day |
| Stripe webhook pipeline broken — payments not processing | Product Quality | P0 | Restored Apr 28 |
| Agent historical_success_rate stale between deploys | Data Quality | P1 | Backfill on startup |
| No E2E test for core signup → intent → result flow | Product Quality | P1 | E2E test passing in 44s |
| Brevo SMTP not active — drip emails queuing but not sending | Growth Infrastructure | P1 | Brevo relay live Apr 27 |
| No demo or interactive proof artifact on public site | Competitive Positioning | P2 | Resolved — /demo |
| SPF/DKIM/DMARC incomplete for sturna.ai sending domain | Growth Infrastructure | P2 | In progress |
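The velocity-detection row deserves a note on what "off-by-one in the time-window calculation" means in practice. The actual SQL isn't reproduced here; the Python sketch below, with hypothetical names and thresholds, shows the general shape of this class of bug: an inclusive window boundary that makes the count disagree with a strict "last N minutes" definition.

```python
from datetime import datetime, timedelta

def over_velocity_buggy(events: list[datetime], now: datetime,
                        window_minutes: int = 10, limit: int = 5) -> bool:
    # Off-by-one class of bug: an inclusive lower bound pulls in an event that sits
    # exactly on the window boundary, so the count disagrees with a strict
    # "last N minutes" definition by one event.
    cutoff = now - timedelta(minutes=window_minutes)
    return sum(1 for e in events if cutoff <= e <= now) > limit

def over_velocity_fixed(events: list[datetime], now: datetime,
                        window_minutes: int = 10, limit: int = 5) -> bool:
    # Half-open window (cutoff, now]: boundary events belong to exactly one window.
    cutoff = now - timedelta(minutes=window_minutes)
    return sum(1 for e in events if cutoff < e <= now) > limit
```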
Using an AI orchestration platform to audit that same platform is a hard test. The system has to surface findings it doesn't "want" to surface. It found a P0 SQL bug, a broken payments pipeline, and its own claim-verifiability failure, then fixed all three before this case study was published. That's not a performance; that's how the system works.
The self-audit produced 8 graded dimensions, 3 P0 findings resolved same-day, and 5 public proof artifacts. Every finding was grounded in specific files, database queries, or test results — not inference.
The audit raised the platform's verifiable claim count from zero to eight. The investor grade improved from D+ to B− on verified dimensions. The three P0 issues that were production blockers are now closed.
More importantly: the process is repeatable. Any technical partner can re-run this audit using the methodology on the benchmarks page, verify every number independently, and see the same results.