Driftnet

A drop-in eval pack for your AI feature

An eval set for your AI feature, ready to drop into vitest in 60 seconds.

Paste a one-paragraph brief and 5–10 paired input/output samples from your LLM feature. Driftnet clusters the failure modes, generates adversarial cases that look genuinely sneaky, and synthesizes one judge-LLM prompt per cluster. You get a structured pack: 30 standard cases, 10 adversarial cases, judge prompts, and runner snippets for vitest, jest, promptfoo, or a 20-line node script. Generation typically takes 4–6 minutes.

Sample pack — inline preview

Generated for: A hypothetical AI meeting-transcript summarizer for a 4-person B2B SaaS that helps independent consultants tame their meeting load. This is the exact artifact a paying buyer receives.

Brief

A meeting-transcript summarizer for a B2B SaaS that helps independent consultants tame their meeting load.

case-001 · no_fabricated_names

Maya: I'll get the contract draft to legal by EOD.
Ravi: Cool. I'll loop in Priya tomorrow once she's back from PTO.
Maya: Sounds good.
  • only Maya and Ravi appear as owners — Priya never spoke
  • Priya may be mentioned in summary as a future loop-in but never as an owner

case-002 · no_fabricated_names

Devon: Quick standup. Anyone blocked?
Julia: I'm waiting on the design review from the design team.
Devon: Got it. I'll ping them.
  • owners are only Devon and Julia
  • "the design team" is not used as an owner — it's not a named speaker

case-003 · no_fabricated_names

Aisha: We talked to Stephen at the customer office and he said the new dashboard is fine but the export is broken.
Marcus: I'll patch the export today.
Aisha: Thanks.
  • owners are only Aisha and Marcus
  • Stephen is referenced in the transcript but never spoke — must not appear as an owner

Judge snapshot · no_fabricated_names

Every owner in action_items must appear as a speaker in the transcript. The feature must not invent names of people who never spoke.

# Judge — `no_fabricated_names`

You are a strict judge for the cluster **`no_fabricated_names`** of a meeting-transcript summarizer.

## What you grade (this cluster only)

You are grading ONE thing: does every named owner in the model's
`action_items` array appear as an explicit speaker in the transcript? A
speaker is a name on the left of a `:` followed b…

How the pack gets built

Four durable Workflow stages. Each step streams to your order page as it runs.

  1. Cluster failure modes

    We read your 5–10 samples and group the kinds of mistakes the feature can make into named behavior clusters.

  2. Generate adversarial cases

    For each cluster, we generate inputs designed to break that behavior — same-name speakers, ellipses, code-switching, side-conversation commitments.

  3. Synthesize judges

    One judge-LLM prompt per cluster — pinned model class so re-runs are reproducible.

  4. Assemble pack

    30 standard + 10 adversarial cases, runner snippets for vitest / jest / promptfoo / node, and a dashboard URL for regression tracking.
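For a sense of the artifact those four stages produce, a pack might look roughly like this. The field names below are illustrative guesses, not the shipped schema:

```json
{
  "brief": "A meeting-transcript summarizer for independent consultants.",
  "clusters": ["no_fabricated_names"],
  "cases": [
    {
      "id": "case-001",
      "cluster": "no_fabricated_names",
      "kind": "standard",
      "input": "Maya: I'll get the contract draft to legal by EOD.",
      "checks": ["only Maya and Ravi appear as owners"]
    }
  ],
  "judges": {
    "no_fabricated_names": {
      "model_class": "sonnet",
      "prompt": "You are a strict judge for the cluster no_fabricated_names..."
    }
  }
}
```

Because it's one static file, the same pack feeds every runner snippet unchanged.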

Pricing

One-shot

$29

One pack. 30 days of dashboard retention for regression tracking. Pay once, drop the JSON into your repo.

Start a one-shot pack →

Recurring

$19/mo

5 packs per month, no rollover: unused packs don’t carry into the next period. 90 days of dashboard retention per pack. For teams shipping multiple AI features.

Start a recurring plan →

Why Driftnet

Three options solo developers reach for today. Each leaves the work of actually writing a good eval set on your desk.

The OSS runner with no eval set

Open-source eval CLIs give you a runner and ask you to write your own prompts, tests, and assertions. Driftnet ships the eval set; your favorite runner consumes it directly via the included snippets.

The platform with the seat-based SDK

Managed eval platforms expect a project, dataset, scoring config, and SDK install per repo. Driftnet ships one static JSON file you commit, plus a tiny dashboard URL any CI can POST to — no SDK, no seat commitment.

The spreadsheet you’d otherwise build

Hand-graded spreadsheets miss the adversarial cases, can’t do structured judges, and don’t track regressions across runs. Driftnet generates the adversarial cases, ships structured judges, and the dashboard shows run-over-run deltas.

Common questions

What runner do I use?

The pack ships with snippets for vitest, jest, promptfoo, and a 20-line plain-node script. The cases are a static JSON file — if your CI can read JSON and call an LLM, it can run the pack.

What if a case doesn’t fit my product?

Every order page has a one-click “this doesn’t match my product” button at the top, valid for the first 24 hours after delivery. It triggers a single regeneration with a structured hint folded into the brief.

How do I cancel?

Recurring plans cancel from your account page via the Stripe customer portal. Cancellation takes effect at the end of the current billing period; pack download links remain live for your SKU’s full retention window (30 days one-shot, 90 days recurring).

What’s the refund posture?

Refunds are automatic in two verifiable cases: (a) the Workflow failed after retries and the pack never generated, or (b) our logs show you never opened the order page or fetched the pack within 30 days. Anything else, email driftnet@forage.bot and we’ll work it out.

Which judge model class should I pick?

Sonnet is the default and handles every cluster we’ve tested. Pick Opus if your feature operates over long context that the judge needs to reason about; pick Haiku if cost per CI run is your binding constraint. The class is pinned in the pack so re-runs are reproducible.

Where does my data go?

Your brief and samples are stored only as long as needed to generate the pack and serve your dashboard. Recurring buyers can purge all stored runs and tokens from the account page at any time.