Show HN: Scorecard – Evaluate LLMs like Waymo simulates cars
Hey HN! I built self-driving simulation and evaluation tooling at Waymo. Now I'm building Scorecard to bring that approach to agent evaluation: reproducible, automated scoring for AI systems. Scorecard lets you:
- Run LLM-as-judge evals on agent workflows: test tool usage, multi-step reasoning, and task completion in CI/CD or in a playground (rough sketch of the pattern after this list).
- Debug failures with OpenTelemetry traces: see which tool failed, why your agent looped, and where reasoning went wrong.
- Collaborate on datasets, simulated agents, and evaluation metrics.
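
To make that concrete, here's a rough sketch of the LLM-as-judge pattern wired into a CI test and wrapped in an OpenTelemetry span so failures show up in traces. This is illustrative Python, not Scorecard's SDK; judge(), call_llm, run_agent, and the score threshold are placeholder names (the real API is in the docs linked below).

    # Illustrative sketch only; call_llm() and run_agent() are stand-ins,
    # not part of Scorecard's SDK.
    import json
    from opentelemetry import trace

    tracer = trace.get_tracer("agent-evals")

    JUDGE_PROMPT = """You are grading an AI agent's answer.
    Task: {task}
    Agent answer: {answer}
    Return JSON: {{"score": <1-5>, "reasoning": "..."}}"""

    def judge(task, answer, call_llm):
        # Record the judgement as a span so a low score can be tied back
        # to the exact tool call or reasoning step in the trace.
        with tracer.start_as_current_span("llm_as_judge") as span:
            raw = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
            verdict = json.loads(raw)
            span.set_attribute("eval.score", verdict["score"])
            span.set_attribute("eval.reasoning", verdict["reasoning"])
            return verdict

    def test_refund_workflow(call_llm, run_agent):
        # CI assertion: fail the build if the judge scores the run below 4.
        task = "Process a refund for order 1234"
        assert judge(task, run_agent(task), call_llm)["score"] >= 4

In CI you'd point call_llm and run_agent at your real model client and agent; the span attributes are what let you jump from a failing score straight to the trace of that run.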
Try it out → https://app.scorecard.io (free tier, no payment required!)
Docs → https://docs.scorecard.io
We’re a small team (4 people), we just raised $3.75M, and early customers are already using Scorecard for evals in legal tech. We're on a mission to squash non-deterministic bugs. What's the weirdest LLM output you've seen?