I designed the evaluation layer that helped builders understand, measure, and improve AI document extraction before trusting it in production.

Document Processing · Watson Orchestrate · AI workflow design

Hero visual placeholder: document input → AI extraction → accuracy evaluation → schema improvement.

01 · Overview

The missing loop after extraction

Watson Orchestrate’s document processing capabilities helped teams automate work across business documents like invoices, purchase orders, bills of lading, and utility bills.

I contributed to the broader classifier and extractor experiences, helping shape how builders configured document workflows, selected schemas, and worked with AI-generated extraction results.

But the most important question came after extraction:

How do builders know if the AI is accurate enough to trust?

That became the focus of Accuracy Evaluation, a workflow I owned as the lead designer. The goal was to help builders test extraction quality against ground truth, understand where automation was reliable, and improve their document schemas before deploying workflows at scale.

02 · The Problem

AI extraction was useful, but hard to validate

Document extraction could identify fields, pull values, and support automation. But for enterprise teams, “it worked on a few examples” was not enough evidence to deploy at scale.

Builders needed to answer questions like: Is the extraction accurate enough to trust in production? Which fields are reliable, and which fail? What should change to improve results?

Without a structured evaluation workflow, teams had to rely on small manual spot-checks and subjective confidence. That made it difficult to responsibly scale document automation.

The problem was not just extracting data. It was making extraction quality measurable, explainable, and improvable.

Before/after placeholder: informal spot-checking → no clear readiness signal → structured evaluation workflow.

03 · Platform Context

Shipped classifier and extractor work shaped the evaluation direction

Before leading Accuracy Evaluation, I collaborated on the shipped document classifier and extractor experiences.

That work gave me a working understanding of the system: how builders created extractors, how document types shaped schemas, how AI confidence surfaced, and where the experience broke down between testing and deployment.

With classifier and extractor shipped, Accuracy Evaluation became the missing loop.

04 · My Role

Lead designer for Accuracy Evaluation

For Accuracy Evaluation, I led the design direction for the experience.

My role was to define how builders could move from informal testing to a repeatable evaluation workflow.

The design needed to serve technical accuracy goals while still feeling usable to builders who were configuring workflows, not performing data science.

05 · Design Approach

Turning evaluation into an iteration loop

The workflow centered on a simple quality loop:

test set → ground truth → evaluation → metrics → schema improvement → rerun

Instead of treating accuracy as a static score, the design framed evaluation as part of the builder’s iteration process.

A builder could upload a larger set of documents, confirm the expected values, run the extractor, and compare the AI output against ground truth. From there, they could inspect weak fields, understand where extraction failed, and make targeted schema changes.
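
To make that comparison concrete, here is a minimal illustrative sketch of how extracted output could be checked against ground truth field by field. The field names, the normalization rule, and the function names are assumptions for the example, not the product's actual implementation.

# Illustrative sketch (assumed field names and normalization, not product code):
# compare one document's extracted values against its confirmed ground truth.

def normalize(value: str) -> str:
    # Light normalization so trivial formatting differences are not counted as errors.
    return value.strip().lower()

def compare_document(extracted: dict[str, str], ground_truth: dict[str, str]) -> dict[str, bool]:
    # Return a per-field match result for one document.
    results = {}
    for field, expected in ground_truth.items():
        predicted = extracted.get(field, "")
        results[field] = normalize(predicted) == normalize(expected)
    return results

# Example: one invoice from the test set
extracted = {"invoice_number": "INV-0042", "total_amount": "1,250.00", "due_date": ""}
ground_truth = {"invoice_number": "INV-00042", "total_amount": "1,250.00", "due_date": "2024-07-01"}
print(compare_document(extracted, ground_truth))
# {'invoice_number': False, 'total_amount': True, 'due_date': False}

Results like these, rolled up across the full test set, are what feed the field-level metrics described below.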

The goal was not just to show whether an extractor passed or failed. It was to help builders understand what to improve next.

06 · Key Design Decisions

Make accuracy useful, not just visible

Make metrics actionable

Accuracy, precision, recall, and F1 score are useful, but they can quickly become abstract. The design needed to translate those metrics into builder-friendly signals: what changed, what failed, which fields need attention, which documents are causing issues, and whether the extractor is improving over time.
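
As a rough illustration of how those scores can become builder-friendly signals, the sketch below turns per-field counts into precision, recall, and F1, then flags fields that need attention. The counts, threshold, and names are assumptions for the example, not the shipped logic.

# Illustrative sketch (assumed names and threshold, not product code):
# roll per-field counts up into metrics and a simple "needs attention" flag.
from dataclasses import dataclass

@dataclass
class FieldCounts:
    true_positives: int   # value extracted and it matches ground truth
    false_positives: int  # value extracted but wrong, or extracted where none exists
    false_negatives: int  # ground truth has a value but extraction missed it

def field_report(counts: FieldCounts, attention_threshold: float = 0.8) -> dict:
    tp, fp, fn = counts.true_positives, counts.false_positives, counts.false_negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
        "needs_attention": f1 < attention_threshold,  # the builder-facing signal
    }

# Example: a date field that extraction frequently misses
print(field_report(FieldCounts(true_positives=42, false_positives=3, false_negatives=25)))
# {'precision': 0.93, 'recall': 0.63, 'f1': 0.75, 'needs_attention': True}

Framing the output this way keeps the statistics available while leading with the decision the builder actually has to make: which fields to work on next.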

Reuse familiar review patterns

Ground truth creation could have become a completely separate experience. Instead, the direction reused familiar review patterns from human-in-the-loop document review, keeping the experience connected to the broader platform.

Connect failures back to schema changes

When a field performed poorly, the experience needed to connect that result back to field names, descriptions, examples, document variation, or model behavior. Evaluation became a bridge between AI output and better configuration.

Design for uncertainty

AI document processing will always have edge cases. The experience could not imply perfect automation, so the design made uncertainty visible and manageable before production use.

Annotated UI placeholder: field-level metrics, weak-field review, or failure-to-schema improvement moment.

07 · Outcome

A measurable quality loop for document automation

Accuracy Evaluation is planned for release this summer and defines a design direction for making document extraction quality easier to measure and improve.

It ties the broader document processing platform together into a clearer loop:

classify → extract → review → evaluate → improve

For builders, this creates a path from “the AI extracted something” to “I understand how well it performs, where it fails, and what I can do next.”

For the platform, it helps frame document processing as a more trustworthy enterprise workflow: configurable, reviewable, and measurable.

08 · What This Project Shows

Designing trust into complex AI systems

This project shows my ability to design inside complex AI product systems where trust depends on more than a polished interface.

It reflects my strengths in AI product UX, enterprise workflow design, builder tools, systems thinking, Carbon-aligned product design, translating technical concepts into usable workflows, and owning an ambiguous product area independently.