Show HN: Scorecard – Evaluate LLMs like Waymo simulates cars
Hey HN! I built self-driving simulation and evaluation tooling at Waymo. Now I'm building Scorecard to bring that approach to agent evaluation: reproducible, automated scoring for AI systems. Scorecard lets you:
- Run LLM-as-judge evals on agent workflows: test tool usage, multi-step reasoning, and task completion in CI/CD or in a playground (rough sketch of the pattern after this list).
- Debug failures with OpenTelemetry traces: see which tool failed, why your agent looped, and where reasoning went wrong.
- Collaborate on datasets, simulated agents, and evaluation metrics.
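
To make that concrete, here's a rough sketch of the LLM-as-judge pattern wired into a CI test and wrapped in an OpenTelemetry span so failures show up in traces. This is illustrative Python, not Scorecard's SDK; judge(), call_llm, run_agent, and the score threshold are placeholder names (the real API is in the docs linked below).

    # Illustrative sketch only; call_llm() and run_agent() are stand-ins,
    # not part of Scorecard's SDK.
    import json
    from opentelemetry import trace

    tracer = trace.get_tracer("agent-evals")

    JUDGE_PROMPT = """You are grading an AI agent's answer.
    Task: {task}
    Agent answer: {answer}
    Return JSON: {{"score": <1-5>, "reasoning": "..."}}"""

    def judge(task, answer, call_llm):
        # Record the judgement as a span so a low score can be tied back
        # to the exact tool call or reasoning step in the trace.
        with tracer.start_as_current_span("llm_as_judge") as span:
            raw = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
            verdict = json.loads(raw)
            span.set_attribute("eval.score", verdict["score"])
            span.set_attribute("eval.reasoning", verdict["reasoning"])
            return verdict

    def test_refund_workflow(call_llm, run_agent):
        # CI assertion: fail the build if the judge scores the run below 4.
        task = "Process a refund for order 1234"
        assert judge(task, run_agent(task), call_llm)["score"] >= 4

In CI you'd point call_llm and run_agent at your real model client and agent; the span attributes are what let you jump from a failing score straight to the trace of that run.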
Try it out → https://app.scorecard.io (free tier, no payment required!)
Docs → https://docs.scorecard.io
We’re a small team (4 people), we just raised $3.75M, and early customers are already using Scorecard for evals in legal tech. We're on a mission to squash non-deterministic bugs. What's the weirdest LLM output you've seen?