Which Work Should AI Actually Handle?

A Practical Guide to Business Application Design

For / Key Points

For: Engineers, tech leads, and operations owners who need to choose the first AI workflow or PoC.

Key Points:

The first job in AI adoption is deciding which work is safe to hand over
Look at judgment density, input stability, exception frequency, and failure cost
A good PoC is one workflow, one output, one evaluator, and a small sample

Series

Part 0: Where Does Value Move in the AI Era?
Part 1 (this article): Which Work Should AI Actually Handle?

AI can write. So should every writing task go to AI?

That is usually where adoption goes wrong.

Contract summaries, incident report triage, support reply drafts, and weekly sales report aggregation all involve text. But the conditions for handing them to AI are completely different.

This article asks one question: which work should AI take on, and which work should stay outside its reach for now?

What AI Can Do Is Not the Same as What It Should Do

The first thing to inspect is not model capability. It is whether humans can check the output, understand the risk, and recover when the output is wrong.

Internal meeting summaries are relatively easy to hand to AI. The input is a transcript or notes, the output is a summary, and the reader can compare it with the source. If wording drifts, the damage is usually limited.

Customer-facing final answers and legal judgments are different. If the output goes outside the company, a mistake can become a trust, contract, or liability problem. Both are writing tasks, but the business weight is different.

The skill here is not knowing what AI is good at. It is reading the nature of the work.

What Strong Operators Look At

People who place AI well do not start with time savings alone. They first locate where judgment happens.

A time-consuming task can be a good candidate. But the reason it takes time matters. If the work is slow because it is repetitive transfer of information from one format to another, AI may help directly. If the work takes time because a domain expert is judging ambiguous cases, AI should usually assist rather than decide.

Four criteria are enough for a first pass.

Judgment density: how much human judgment is inside the output
Input stability: whether information arrives in a similar format each time
Exception frequency: how often cases break the normal pattern
Failure cost: who is affected if the output is wrong, and how hard recovery is

McKinsey frames AI value capture around more than tool rollout: adoption, scaling, KPIs, and workflow change matter to business outcomes¹. In other words, AI adoption is not just about increasing usage. It is about redesigning how work flows.

The NIST AI RMF Core also asks organizations to define business context and the specific tasks and methods an AI system will support². So inspecting the nature of the work is not only an adoption concern. It is also the entry point for risk management.

Four Buckets Make the Decision Easier

Business application design should not be a yes-or-no question. The decision becomes easier when work is split into four buckets.

The four criteria map directly to these buckets. Low judgment density, stable input, and easy verification point toward handing work to AI. High judgment density or unstable input points first toward AI assistance. High failure cost with unclear evaluation conditions should stay with humans or remain untouched for now.

| Bucket | Rule of thumb | Example |
| --- | --- | --- |
| Hand to AI | Output is easy to verify | Weekly report aggregation, FAQ update drafts |
| Assist with AI | Judgment stays with humans | Incident report triage, support ticket classification |
| Keep with humans | Error responsibility is heavy | Contract judgment, final customer response |
| Do not touch yet | Inputs or evaluation conditions are not ready | Cross-department exception handling, unstructured data |

The point is that “hand to AI” is not the only success state.

AI assistance can still reduce work. In support operations, AI does not need to write the final answer. It can summarize history, classify the ticket, and suggest relevant FAQ entries before a human decides.

In practice, even a four-column whiteboard helps. Put candidate workflows on sticky notes, then move them across the columns while checking judgment density, input stability, exceptions, and failure cost. The conversation shifts from “is this suitable for AI?” to “how much of this can we hand over?”

For example, “automate support operations” is too broad. But “classify incoming inquiries before human review” is specific enough to inspect. Judgment density is moderate, the input is reasonably stable, exceptions exist, and the output can be checked by a person. That points toward AI assistance, not full delegation of the final reply.

| Criterion | What to inspect in first-line inquiry classification | Decision |
| --- | --- | --- |
| Judgment density | Category selection is needed, but it is not the final decision | Good fit for AI assistance |
| Input stability | Inquiry text, subject, and past history are available | Testable |
| Exception frequency | Complaints, legal issues, and cancellations need routing | Add exception labels |
| Failure cost | Misclassification can be caught during human review | Safe to test small |

The unit is not the whole workflow. It is one output inside the workflow. Once the output is that small, the PoC becomes much more realistic.

Anthropic recommends starting with the simplest possible solution and adding complexity only when it is needed³. The same rule applies to business application design. Before building an autonomous agent, start with preparing material, drafting, or suggesting candidates.

```mermaid
flowchart TD
    A[Inspect the workflow] --> B{Is judgment density high?}
    B -- Low --> C{Is input stable?}
    C -- Yes --> D{Are exceptions frequent?}
    D -- Few --> E[Candidate to hand to AI]
    D -- Many --> F[AI assistance candidate<br/>detect exceptions and return to humans]
    C -- No --> G[AI assistance candidate<br/>start with input preparation]
    B -- High --> H{Is failure cost high?}
    H -- Low --> I[AI assistance candidate<br/>prepare decision material]
    H -- High --> J[Keep with humans / Do not touch yet]
```
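
The same flow can be written as a small function, which is handy when a team wants to triage a list of candidate workflows in a spreadsheet or script. This is a minimal sketch, not a scoring standard. It assumes each criterion has already been reduced to a yes-or-no answer, which is a simplification, and the names `WorkflowProfile`, `Bucket`, and `triage` are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    HAND_TO_AI = "Hand to AI"
    ASSIST_WITH_AI = "Assist with AI"
    KEEP_OR_DEFER = "Keep with humans / Do not touch yet"


@dataclass(frozen=True)
class WorkflowProfile:
    """Yes/no answers to the four first-pass questions for one candidate output.

    Reducing each criterion to a boolean is an assumption of this sketch.
    """
    name: str
    high_judgment_density: bool
    stable_input: bool
    frequent_exceptions: bool
    high_failure_cost: bool


def triage(profile: WorkflowProfile) -> tuple[Bucket, str]:
    """Walk the flowchart and return a bucket plus the first design step."""
    if profile.high_judgment_density:
        if profile.high_failure_cost:
            return Bucket.KEEP_OR_DEFER, "keep the decision with humans for now"
        return Bucket.ASSIST_WITH_AI, "prepare decision material"
    if not profile.stable_input:
        return Bucket.ASSIST_WITH_AI, "start with input preparation"
    if profile.frequent_exceptions:
        return Bucket.ASSIST_WITH_AI, "detect exceptions and return to humans"
    return Bucket.HAND_TO_AI, "candidate to hand to AI"


if __name__ == "__main__":
    candidates = [
        WorkflowProfile("weekly report aggregation", False, True, False, False),
        WorkflowProfile("inquiry classification", False, True, True, False),
        WorkflowProfile("contract judgment", True, True, False, True),
    ]
    for candidate in candidates:
        bucket, first_step = triage(candidate)
        print(f"{candidate.name}: {bucket.value} ({first_step})")
```

The second return value matters as much as the bucket: it records which design problem to start with, which is where the three "assist" paths differ.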

This flow is not a universal answer key. It is a way to reduce vague debate when choosing PoC candidates.

The “assist with AI” bucket contains three different cases. When judgment density is high, AI prepares material for a human decision. When input is unstable, the first design problem is preparing the input before AI can help. When exceptions are frequent, the first design problem is detecting exceptions and returning them to humans before normal processing. The label is the same, but the design work is different.
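
For the exception case, one way to picture "detect and return to humans" is a guard that runs before the classifier. The sketch below is illustrative only: the keyword lists, the queue names, and the `classify` callable (a stand-in for whatever model call a team actually uses) are all assumptions, and real exception detection would be tuned with the domain expert.

```python
from typing import Callable, Optional

# Hypothetical keyword lists for the three exception types named in this article.
EXCEPTION_KEYWORDS = {
    "complaint": ("complaint", "unacceptable"),
    "legal": ("lawyer", "lawsuit", "legal"),
    "cancellation": ("cancel", "terminate"),
}


def detect_exception(text: str) -> Optional[str]:
    """Return the first exception label whose keyword appears in the text, else None."""
    lowered = text.lower()
    for label, keywords in EXCEPTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def route(text: str, classify: Callable[[str], str]) -> dict:
    """Send exceptions to a human queue before the classifier ever runs."""
    label = detect_exception(text)
    if label is not None:
        return {"queue": "human", "exception": label, "category": None}
    return {"queue": "ai_assist", "exception": None, "category": classify(text)}


if __name__ == "__main__":
    def stub_classifier(text: str) -> str:
        # Placeholder for the AI classification step.
        return "account"

    print(route("I want to cancel my plan", stub_classifier))
    print(route("How do I reset my password?", stub_classifier))
```

The design choice is the ordering: the classifier is never asked about a case that has already been flagged, so a wrong category cannot hide an exception.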

How to Cut the First PoC

The larger the PoC, the more likely it is to fail.

“Automate sales reporting with AI” is too broad. Sales reporting includes weekly reports, meeting notes, account summaries, forecasts, and management updates. Each has different inputs, responsibilities, and risks.

Start with one workflow, one output, one evaluator, and a small sample.

One workflow: only classify support inquiries
One output: category name and reason
One evaluator: a domain expert decides pass or fail
Small sample: start with 20 to 50 historical cases
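
A PoC cut this way fits in a very small record format. The following is a rough sketch under stated assumptions: `ReviewedCase` and `summarize` are hypothetical names, the pass/fail verdict is entered by the single evaluator, and the 20 to 50 bounds simply mirror the sample size suggested above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedCase:
    """One historical inquiry, the AI output, and the evaluator's verdict."""
    case_id: str
    category: str   # the one output: category name...
    reason: str     # ...and the reason given for it
    passed: bool    # pass or fail, decided by the one domain expert


def summarize(cases: list[ReviewedCase], min_cases: int = 20, max_cases: int = 50) -> dict:
    """Report the pass rate and list failed cases for follow-up review."""
    if not min_cases <= len(cases) <= max_cases:
        raise ValueError(f"expected {min_cases}-{max_cases} cases, got {len(cases)}")
    failures = [case.case_id for case in cases if not case.passed]
    return {
        "total": len(cases),
        "pass_rate": (len(cases) - len(failures)) / len(cases),
        "failed_case_ids": failures,
    }


if __name__ == "__main__":
    # Made-up sample: 20 cases, two of which the evaluator rejected.
    sample = [
        ReviewedCase(f"case-{i:02d}", "billing", "mentions an invoice", passed=i not in (3, 11))
        for i in range(20)
    ]
    print(summarize(sample))
```

Keeping the failed case IDs alongside the pass rate is deliberate: the failures are what the evaluator and the designer read together before deciding whether to expand.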

Small does not mean ignoring production. A PoC should test more than one-off accuracy.

Will exceptions grow when volume increases? Will human review become too heavy? Will approval, permission, or access boundaries block the workflow? A small PoC is useful because it exposes production risks early.
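
The review-load question can be checked with simple arithmetic before any production rollout. This sketch assumes a deliberately crude model in which every flagged exception gets one human review of fixed length. The function name and all figures are hypothetical and only show how the load scales with volume.

```python
def weekly_review_hours(
    cases_per_week: int,
    exception_rate: float,
    minutes_per_review: float,
) -> float:
    """Hours of human review per week if every flagged exception is checked once."""
    if not 0.0 <= exception_rate <= 1.0:
        raise ValueError("exception_rate must be between 0 and 1")
    return cases_per_week * exception_rate * minutes_per_review / 60


if __name__ == "__main__":
    # Hypothetical figures: the same exception rate at PoC scale and at larger volumes.
    for volume in (50, 500, 5000):
        hours = weekly_review_hours(volume, exception_rate=0.2, minutes_per_review=3)
        print(f"{volume} cases/week -> {hours:.1f} review hours")
```

If the exception rate measured in the PoC stays constant, review hours grow linearly with volume, so a rate that looks harmless on 50 cases can still become a staffing question in production.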

OpenAI’s enterprise AI report describes growth in structured, repeatable workflows such as Projects and Custom GPTs inside its Enterprise usage data⁴. That does not mean Custom GPTs are the universal enterprise standard. The important shift is from one-off chat use toward reusable workflow units.

Microsoft Copilot Studio is moving in the same direction by combining AI agents and workflows to automate business processes⁵. The vendor examples differ, but the pattern is similar. AI is increasingly designed around repeatable units of work, not isolated answers.

A PoC should follow the same logic. The question is not whether one impressive output appears. The question is whether humans can evaluate repeated outputs under the same conditions.

What Gets Cheaper, and What Does Not

This layer lowers the cost of drafting, aggregation, transformation, and candidate generation.

Collect weekly reports. Extract key points from a long incident report. Classify incoming support tickets. Draft an FAQ update. These tasks become easier when input and output shapes are visible.

The boundary-setting work does not get cheaper in the same way. The same draft can mean different things when it is an internal FAQ update or a final answer to a customer. The same classification can carry different risk when it is an internal tag or a legal-risk judgment.

Which workflow should be targeted, how much should AI handle, and what standard counts as usable? Those are business application design questions.

As execution gets cheaper, the decision about what to execute becomes heavier.

What Strong People See in This Layer

Strong business application designers do not only inspect AI output. They inspect how the work flows, where judgment sits, what failure costs, and what the smallest evaluable unit is.

They also think in sequence, not only feasibility. The first workflow is not always the one with the largest theoretical impact. It is the one where failure is safe, evaluation is clear, and the next expansion is visible.

They can say:

“This can be handed to AI.” “This should stop at drafting.” “This must stay with humans.” “This should not be touched yet.”

Once those lines are clear, the conversation changes. It moves from “should we use AI?” to “under which conditions can we test it safely?” People who start with work that can fail safely often scale further in the end.

The next problem appears when multiple people use AI together. One person can decide “this stops at drafting.” In a team, people can disagree on classification rules, naming, ownership, and review standards. If the boundary is not shared, faster AI use can create more drift. That is the next layer: team design.
