Eval harness, owned by your team
We co-author your evals on day three. Goldens, adversarial sets, drift checks, and a CI gate that blocks any deploy that regresses past your tolerance.
Twelve weeks from kickoff to a fine-tuned model your eval harness accepts and your CISO will sign off on. No prompts leave your boundary, no weights leave your repo, no surprise renewal at month thirteen.
From kickoff to a model your eval gates accept.
Median latency vs. the off-the-shelf baseline we replaced.
Customer prompts in third-party hands. VPC-only by default.
LoRA / QLoRA against your task. Preference data collection, reward modelling, DPO/IPO when the use case warrants. We tell you when not to fine-tune.
Where prompts come from, where outputs go, what runs in your VPC vs. ours. Auditors and CISOs leave the kickoff with the diagram they wanted.
Deploys behind your VPC (vLLM, TGI, Triton, or hosted — your call). Metrics, traces, and prompt-replay built in. PagerDuty wired to your existing rotation.
Telemetry pull, cost model, capability gap analysis. We leave with a written hypothesis and a kill-criterion both sides agreed to up front.
Before any training, we author the gates. Goldens, adversarial sets, drift detectors, latency budgets. If the harness is wrong, the model is wrong.
Best-in-class off-the-shelf model becomes the floor. Iterate against your evals. Each run produces a leaderboard your team can read.
Adversarial robustness, jailbreak defence, PII redaction, prompt injection mitigations. Soak tests against synthetic and real traffic.
Canary deploy, then full rollout behind a feature flag. We hold the pager for 90 days while your team takes over the runbook with us.
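For readers who want to picture the gate from the steps above, here is a minimal sketch of a check that blocks a deploy when the candidate regresses past tolerance or misses its latency budget. Everything in it is an illustrative assumption, not our harness: the names `GateConfig`, `EvalResult`, and `gate`, the two metrics, and the threshold values are all hypothetical, and the real gates are whatever we agree with your team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical thresholds; real values come from the agreed harness.
    max_score_drop: float = 0.01          # allowed drop in golden-set score
    p50_latency_budget_ms: float = 400.0  # median latency ceiling


@dataclass(frozen=True)
class EvalResult:
    golden_score: float    # fraction of golden cases passed, between 0 and 1
    p50_latency_ms: float  # median latency measured during the eval run


def gate(baseline: EvalResult, candidate: EvalResult, cfg: GateConfig) -> list[str]:
    """Return the reasons to block the deploy; an empty list means it may ship."""
    failures: list[str] = []

    # Regression check: compare the candidate with the accepted baseline.
    drop = baseline.golden_score - candidate.golden_score
    if drop > cfg.max_score_drop:
        failures.append(
            f"golden score dropped by {drop:.3f} (tolerance {cfg.max_score_drop:.3f})"
        )

    # Latency budget check: an absolute ceiling, independent of the baseline.
    if candidate.p50_latency_ms > cfg.p50_latency_budget_ms:
        failures.append(
            f"p50 latency {candidate.p50_latency_ms:.0f} ms exceeds "
            f"budget {cfg.p50_latency_budget_ms:.0f} ms"
        )

    return failures
```

In a CI job, a wrapper script would call `gate(...)` and exit non-zero whenever the returned list is non-empty, which is what stops the deploy step from running.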
| | Sprint | Engagement (most picked) | Production |
|---|---|---|---|
| Discovery + capability map | ● | ● | ● |
| Eval harness co-authored | — | ● | ● |
| Custom model fine-tuning | — | ● | ● |
| VPC deployment + observability | — | ● | ● |
| Adversarial hardening | — | — | ● |
| On-call coverage (90 days) | — | — | ● |
| Quarterly upgrade cadence | — | — | ● |
“The eval harness is the artefact we still measure every deploy against, eighteen months later. That alone paid for the engagement twice over.”
We're model-agnostic. Llama, Mistral, Qwen, Claude, GPT — whichever your eval harness picks. We tell you when an off-the-shelf API is the right answer instead of fine-tuning.
Your VPC by default (AWS, GCP, Azure, on-prem). We bring our infra IaC, you bring the credentials. Customer prompts and gradients never leave your boundary.
Fixed-price per phase with a written kill-criterion. No multi-month retainers, no scope creep tax. Numbers in the SOW, not in the back half.
You own the repo, weights, evals, dashboards, and runbook. We hand off via a two-week parallel rotation. The quarterly upgrade cadence is optional and paid per quarter.
Yes. We've signed BAAs, DPAs, and customer-specific security agreements. Bring your paper.
Sometimes. We compress when the eval harness can be authored in parallel with discovery. We won't compress at the cost of the eval harness: that's the only artefact that survives the engagement.
Twenty-minute call with engineering. We'll tell you in plain terms whether to fine-tune, prompt, distill, or stay with the API.