Case Study

Keel

Generation is getting faster. Review isn’t. Keel is the governance layer that catches what AI ships before it lands in your design system.

Connect a design system → Keel watches every AI-authored PR → bad code never ships.

For design systems leads at 50–200 person product companies.

Role

Product Designer & Builder

Timeline

2026

Tools

Next.js, TypeScript, Tailwind, Postgres, Anthropic API

Type

AI-Native — Live Demo

Next.js ◆ TypeScript ◆ Tailwind ◆ Postgres ◆ pgvector ◆ Anthropic API ◆ Opus 4.7 ◆ Sonnet 4.6 ◆ Voyage Embeddings ◆ Railway ◆ Vercel

Keel dashboard showing drift detection for the Meridian design system

In plain English

Keel watches AI-generated pull requests against your design system. Before the code can ship, it checks that every referenced token actually exists, every spacing value lands on your scale (4/8/12, not p-7), and no shadows, colors, or radii were invented. When Keel finds a violation, it proposes a fix using the correct token and asks you — auto-merge, draft PR, or reject. Every decision lands in an auditable timeline. Nothing merges without a human.

The Problem

AI-generated design system code looks right at a glance and is wrong in ways that take twenty minutes to find. An invented shadow value that reads plausibly but isn’t in the foundation. Off-scale spacing — p-7 when the scale is 4/8/12/16/20/24. A primitive used where a semantic token should be. A token name that sounds correct and doesn’t exist.
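
The off-scale spacing case can be sketched as a small deterministic check. This is a minimal illustration, not Keel's actual rule: it assumes the 4/8/12/16/20/24 scale is in pixels, that Tailwind's default of 4px per spacing unit applies, and that zero is always allowed. The function name and the regular expression are hypothetical.

```typescript
// Assumption: the scale from the text, read as pixel values.
const SPACING_SCALE_PX: readonly number[] = [4, 8, 12, 16, 20, 24];
// Assumption: Tailwind's default spacing unit (0.25rem at a 16px root).
const PX_PER_TAILWIND_UNIT = 4;

// Matches simple padding, margin, and gap utilities such as p-7, mx-2, gap-4.
const SPACING_CLASS = /^(?:[pm][xytrbl]?|gap)-(\d+(?:\.\d+)?)$/;

// Returns the spacing utilities in a className string that fall off the scale.
function findOffScaleSpacing(className: string): string[] {
  return className
    .split(/\s+/)
    .filter((cls) => cls.length > 0)
    .filter((cls) => {
      const match = SPACING_CLASS.exec(cls);
      if (!match) return false; // not a spacing utility, ignore it
      const px = Number(match[1]) * PX_PER_TAILWIND_UNIT;
      if (px === 0) return false; // assumption: zero spacing is always allowed
      return SPACING_SCALE_PX.indexOf(px) === -1;
    });
}
```

Under these assumptions, p-7 resolves to 28px and is flagged, while p-6 (24px) passes. A real checker would also need to handle arbitrary values and responsive or state prefixes, which this sketch ignores.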

A design systems lead reviewing fifteen AI-authored pull requests a day cannot catch these at review speed. So they don’t. The drift compounds. Every company running an AI-assisted authoring loop in 2026 is hitting this problem. Spotify and GitHub solved it internally, with dedicated teams. Nobody is solving it for the mid-size product company.

The tooling industry’s answer has been to ship generation faster. Keel is the other answer.

The Design Question

How do design systems stay trustworthy when AI agents are consuming them?

Every design decision in Keel traces back to this. When I couldn’t answer it, I cut the feature.

Key Decisions

Keel could have shipped three AI modes, written back to Figma, rendered pixel diffs, supported multi-user teams, and integrated with Storybook. It shipped one AI-facing mode and five subsystems. Every cut is documented. Scope discipline is the design.

Challenge

Keel could have shipped three AI-facing modes. Improver (AI fixes existing drift). Assisted (AI helps author new work with real-time parity feedback). 80/20 (AI drafts 80% of a component, human polishes 20%). Shipping all three was feasible. The question wasn't capacity — it was which mode would land the thesis.

Improver proposal — rationale, parity pass, policy decision, and live preview

Challenge

When Keel flags drift, the obvious next move is to render both component versions side-by-side and visually diff them. Designers would love it. It would demo well.

Drift detail — documentation, tokens, and component source diffed in three columns

Challenge

Parity checking — the audit that decides whether AI output conforms to the foundation — could have been probabilistic. Ask a model: “does this output look right?” Model-based checking would catch nuance that a whitelist misses.

Parity check failing — whitelist linting catches invented tokens and off-scale spacing
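
To make the whitelist idea concrete, here is a hedged sketch of a deterministic rule runner. The rule ids, categories, and allowed values are invented for illustration and are not Keel's actual 23 rules; the only point is that each rule is a pure function, so the same input always yields the same violations.

```typescript
type Violation = { rule: string; category: string; detail: string };

type Rule = {
  id: string;
  category: string;
  // Returns one detail string per offending class name.
  check: (classes: string[]) => string[];
};

// Hypothetical whitelists standing in for a real foundation.
const ALLOWED_SHADOWS = new Set(["shadow-sm", "shadow-md"]);
const ALLOWED_RADII = new Set(["rounded-sm", "rounded-md", "rounded-lg"]);

const exampleRules: Rule[] = [
  {
    id: "no-invented-shadow",
    category: "shadow",
    check: (classes) =>
      classes.filter((c) => c.indexOf("shadow-") === 0 && !ALLOWED_SHADOWS.has(c)),
  },
  {
    id: "no-invented-radius",
    category: "radius",
    check: (classes) =>
      classes.filter((c) => c.indexOf("rounded-") === 0 && !ALLOWED_RADII.has(c)),
  },
];

// Runs every rule against the class list and collects violations in rule order.
function runParity(classes: string[], ruleset: Rule[]): Violation[] {
  const violations: Violation[] = [];
  for (const rule of ruleset) {
    for (const detail of rule.check(classes)) {
      violations.push({ rule: rule.id, category: rule.category, detail });
    }
  }
  return violations;
}
```

A model-based checker could be layered on top of something like this, but the gate itself stays reproducible: an empty result means pass, anything else blocks.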

Challenge

The design-systems tool genre is Apple-minimal. Sparse dashboards, generous whitespace, one hero metric per screen. The portfolio instinct — and the quality bar I set against NoelX — was to match that aesthetic. Minimalism reads as refined. Density reads as administrative.

Keel drift dashboard — dense by design, ten rows visible before scroll

Challenge

Policy configuration is a permissions-shaped interface. The genre default — AWS IAM, role hierarchies, nested rule trees, permission matrices — is administrative, not decisive. My reference bar for this surface was explicit: must feel like Linear or Stripe, not AWS IAM.

Trust-level policy engine — flat list, segmented control per row

Visual Language

The system-level decisions that don’t fit in a card.

Keel is a review tool. Every surface-level choice — type, color, spacing, motion — is answering the same question: does this help or hinder the reviewer? Four principles shaped the visual language end-to-end.

Typography

Mono for facts. Body for claims.

Numbers, tokens, status labels, severities, IDs — anything machine-truth — gets a monospaced face. Narrative copy gets a humanist sans. The typographic split is a trust signal: the mono items are what Keel found, the body items are how I chose to frame it.

Color

Monochrome page. One accent. One severity scale.

The page is grayscale because it is a log. Dark blue marks moments of decision — the accent appears on accepted, pending, ship, and approve. Red/orange/yellow belong to drift severity alone. No color gets spent on decoration because color is a finite resource here, not a palette.

Density

Review is dense. Decision is sparse.

The dashboard and parity audit show as much as fits — the user is scanning for outliers. The Improver proposal and policy config are spacious — the user is making a call. Density maps to cognitive load, not to aesthetic preference. A sparse review surface would be an anti-feature.

Motion

Motion only where it maps to information.

The scroll-scrubbed demo on this page compresses a minute of product interaction into twenty seconds — motion as time compression. The decision cards expand under user control — motion as agency. Nothing else on either page moves. Motion borrowed from decoration is motion that has to be forgiven later.

What Didn’t Ship

Every case study lists what the designer built. The cuts are usually more revealing. Each of these was proposed in-scope, prototyped or fully specced, and cut for a reason worth stating out loud.

Feature | Why it’s not here

Fine-tuning the Improver model on the design system

Fine-tuned taste is implicit and unauditable. You can’t inspect what the model learned. Keel uses retrieval instead — Voyage embeddings, pgvector similarity — so the design system stays the single source of truth. A token change updates immediately. Fine-tuning creates a second source that can drift from the first, which is the exact problem Keel exists to solve.

Figma write-back

Figma’s own MCP handles the write-back direction. Duplicating it would require translating clean JSX to Figma nodes, and that translation is lossy. Keel’s principle is clean code as output — writing lossy versions back to Figma undermines it. Deferred, not dropped.

Pixel-diff rendering

Would become its own product. Rendering both states, computing visual diffs, resolving layout ambiguity — that’s a visualization product inside a governance product. Link-to-version delivers 90% of the insight for 10% of the engineering.

Multi-user, teams, RBAC, SSO, billing

Not in the thesis. The thesis is the review gap, not team coordination. This is v2 territory, and calling it v2 is not an apology.

Storybook MCP integration

Sharpens nothing the current scope doesn’t already prove. The demo loop works without it. Deferred to v2.

Assisted mode (AI authors new components with real-time parity)

Ships as a stub marked “coming soon.” Keeping it out of scope sharpens Improver’s argument. Two modes would dilute the “review is the frontier” thesis to “AI does lots of things.”

80/20 mode (AI drafts, human polishes)

Cut entirely, not even stubbed. Same reason as above — plus the mode itself is closest to what every other AI tool already ships. Including it would look like convergence, not a thesis.

Keel shipped small because shipping small is what let the thesis land. Every deferred feature has a documented reason. Scope discipline is what the product is.

The Artifact

Five subsystems, one AI-facing mode, one continuous loop from detection to resolution. Live at keel-demo-psi.vercel.app.

Monday morning

Keel drift dashboard — Meridian at 62/100 with nine drift issues

Maya opens Keel. Nine drift issues since Monday. Health 62/100. Three red high-severity rows at the top.

See the drift

Click the first row. Textarea.tsx references input-border — a token that sounds right and isn’t defined anywhere. Three columns side-by-side show docs, tokens, and component source. Orange ≠ markers mark the disagreement.
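
The undefined-token case can be illustrated with a short sketch. It compares only two of the three columns (component source against the token set) and assumes tokens are referenced as CSS custom properties via var(--name); how Meridian components actually reference tokens is not stated here, so the pattern, the function name, and the sample token names are hypothetical.

```typescript
// Returns every token referenced in the source that is missing from the token set.
function findUndefinedTokens(source: string, tokens: Set<string>): string[] {
  // Assumption: tokens appear as var(--token-name) in the component source.
  const pattern = /var\(--([a-z0-9-]+)\)/g;
  const referenced = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const name = match[1];
    if (name) referenced.add(name);
  }
  const missing: string[] = [];
  referenced.forEach((name) => {
    if (!tokens.has(name)) missing.push(name);
  });
  return missing;
}
```

Given a token set that defines border and background, a textarea whose class list references var(--input-border) would come back with input-border as the single missing name.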

Parity check runs

Parity check passing — 23 rules across 6 categories, no violations

Improver pulls retrieval context from similar components via Voyage embeddings. The parity checker runs the fix against the foundation: 23 rules across 6 categories. No violations.
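
As a rough illustration of the retrieval step, the sketch below ranks components by cosine similarity in memory. It is a stand-in, not the real pipeline: in Keel the vectors come from Voyage embeddings and the ranking runs in Postgres through pgvector, and whether cosine is the metric used there is an assumption. The type and function names are hypothetical.

```typescript
type EmbeddedComponent = { name: string; embedding: number[] };

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Returns the k components most similar to the query, best match first.
function topKSimilar(
  query: number[],
  corpus: EmbeddedComponent[],
  k: number,
): EmbeddedComponent[] {
  return corpus
    .map((component) => ({ component, score: cosineSimilarity(query, component.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((scored) => scored.component);
}
```

Because the neighbors are looked up at request time, a changed token or component only needs re-embedding to affect the next proposal, which is the property the retrieval-over-fine-tuning decision depends on.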

Policy routes it

Trust-level policy engine — per-action routing

The policy engine decides what happens next. A one-class token swap on an undefined reference routes to Draft PR, not auto-merge. Conservative by default. Ownership stays human.
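
A flat policy table of this kind might be sketched as follows. The three trust levels come from the product; the action names and the fallback rule are assumptions made for illustration.

```typescript
type TrustLevel = "suggest-only" | "draft-pr" | "auto-merge";

// Hypothetical action names; one row per action, no nesting or inheritance.
const examplePolicy = new Map<string, TrustLevel>([
  ["token-swap", "draft-pr"],
  ["spacing-fix", "suggest-only"],
  ["shadow-replace", "suggest-only"],
]);

// Looks up the trust level for an action.
// Assumption: anything not listed falls back to the most conservative level.
function routeAction(action: string, policy: Map<string, TrustLevel>): TrustLevel {
  return policy.get(action) ?? "suggest-only";
}
```

With this table, routeAction("token-swap", examplePolicy) yields "draft-pr", and an action nobody configured yields "suggest-only". Auto-merge has to be opted into row by row.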

The proposal renders

Rationale: replace undefined input-border with documented border token. Parity: pass. Policy: Draft PR. Live preview of the compiled component renders inline. Accept, Reject, or ignore.

Accept

Supervision timeline — new pending row added

Toast. Panel collapses. A new pending row lands at the top of the supervision timeline, timestamped seconds ago. Detection to resolution in six clicks.

Detection, proposal, parity, policy, timeline. The entire thesis argued in six clicks. Every screenshot on this page supports this loop.

Parity — Passing

23 rules across 6 categories, no violations

Parity — Failing

Five violations on an AI-authored PR

Trust-Level Policy

Seven Suggest Only, three Draft PR, zero Auto-merge — conservative by design

Supervision Timeline

Every AI action logged, grouped by day, filterable
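
The timeline's grouping and filtering could look roughly like this. The entry shape and status values are hypothetical, and the sketch assumes timestamps are ISO 8601 strings in UTC, so that string order matches time order and the first ten characters give the day.

```typescript
type TimelineStatus = "pending" | "accepted" | "rejected";
type TimelineEntry = { at: string; action: string; status: TimelineStatus };
type DayGroup = { day: string; entries: TimelineEntry[] };

// Groups entries by calendar day, newest first, optionally filtered by status.
function groupByDay(entries: TimelineEntry[], status?: TimelineStatus): DayGroup[] {
  const sorted = entries
    .filter((entry) => status === undefined || entry.status === status)
    .sort((a, b) => (a.at < b.at ? 1 : a.at > b.at ? -1 : 0));

  const groups: DayGroup[] = [];
  for (const entry of sorted) {
    const day = entry.at.slice(0, 10); // YYYY-MM-DD prefix of the ISO timestamp
    const last = groups[groups.length - 1];
    if (last && last.day === day) {
      last.entries.push(entry); // same day as the previous entry
    } else {
      groups.push({ day, entries: [entry] });
    }
  }
  return groups;
}
```

Sorting before grouping keeps same-day entries contiguous, so a single pass is enough and a freshly accepted proposal lands at the top of the newest day.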

The atoms behind these screens

Every surface above is built from a documented design system: tokens, components, light/dark parity.

Improver proposals in the public deployment serve from pre-recorded fixtures. The full pipeline — parity check, policy engine, supervision timeline, database writes — runs live. The AI generation step is captured rather than live-called to preserve API budget and ensure reliability for visitors.

Point of View

AI-generated design system output is unreliable in ways that are invisible at review speed. The tooling industry is shipping generation faster instead of solving review. Review is the frontier. That’s the gap Keel sits in.

Spotify and GitHub solved this internally with dedicated teams. Keel doesn’t claim to invent the category — it claims to make the solution legible for teams without Spotify’s headcount. Contribution beats invention.

Every design decision I made on Keel was a bet on review being more important than generation. The boring tool beats the clever one when the output is a gate on what ships. The whitelist doesn’t have bad days.

By the Numbers

Subsystems shipped: 5

Clicks, detection to resolution: 6

Warm compile, post-Sandpack: 64–83ms

Parity rules across 6 categories: 23

shadcn/ui validation score: 100/100

Reflection

The Sandpack Pivot

The original plan for Improver’s live component preview was Sandpack — CodeSandbox’s in-browser bundler. Worked locally. In production, Sandpack’s bundler endpoint was unreachable from my ISP — confirmed via incognito, confirmed via phone on a different network. ISP-level block, no client-side workaround. I rebuilt the preview pipeline from scratch: server-side Tailwind + PostCSS + Babel compiling TSX to static HTML + CSS, rendered in a plain iframe. 64–83ms warm compile. Zero external runtime dependencies. The lesson: external dependency risk is a design decision, not an engineering decision. Sandpack’s one failure mode had no local workaround. The slower-to-build path eliminated an entire class of production fragility.
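
The last step of that pipeline, putting already-compiled HTML and CSS into a plain iframe, can be sketched without any external dependency. This is an assumption about the delivery mechanism: the text says only that the output renders in a plain iframe, and the use of srcdoc, the sandbox attribute, and the function names here are illustrative. The Tailwind, PostCSS, and Babel compile itself is out of scope.

```typescript
// Escapes a string for use inside a double-quoted HTML attribute.
function escapeAttribute(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Wraps compiled markup and styles in a standalone HTML document.
// Assumption: css is trusted compiler output and contains no closing style tag.
function buildPreviewDocument(html: string, css: string): string {
  return [
    "<!doctype html>",
    '<html><head><meta charset="utf-8">',
    `<style>${css}</style>`,
    "</head><body>",
    html,
    "</body></html>",
  ].join("\n");
}

// Produces iframe markup that carries the whole preview inline via srcdoc.
function buildPreviewIframe(html: string, css: string): string {
  const documentText = buildPreviewDocument(html, css);
  // An empty sandbox attribute applies every restriction, including no scripts,
  // which is fine for static HTML and CSS.
  return `<iframe sandbox title="Component preview" srcdoc="${escapeAttribute(documentText)}"></iframe>`;
}
```

Nothing in this path fetches from a third-party host at render time, which is the property the rebuild was after.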

The Design Wrong Turn

The first version of the drift dashboard was sparse — three rows visible before scroll, one highlighted violation per card, generous whitespace. It looked correct by every design principle I hold. It felt wrong the moment I spent a day working inside it. Three rows means three violations visible at a time in a tool designed for someone reviewing fifteen AI-authored pull requests daily. The user’s first question isn’t “what is this violation?” — it’s “is there anything urgent in this batch?” Sparse made that question require scrolling. The dense rebuild — ten rows visible, severity inline with the token name, compact row height — reads as information. The sparse version read as elegance. For a review tool, information is the product. Minimalism in a review surface is hostility wearing a tasteful coat.

What I’d Do Differently

Meridian — the synthetic design system Keel audits in the demo — has nine planted drift issues. Anchoring the demo to a system I could fully control was the right call for a solo build. But the stronger version would have run against three real public design systems from day one, with Meridian reserved for planted edge cases. Keel already ingests shadcn/ui at 100/100 as a second real-world validation. I’d start with that pattern, not retrofit it.
