From Programmer to Harness Engineer

Codex didn’t make me a faster programmer. It changed what programming means.
Tags: llm, agents, software-engineering
Published: February 23, 2026

The more I use Codex — and I use it heavily — the more I notice it’s not making me a faster programmer. It’s changing what programming means.

Two changes have stood out as structural rather than cosmetic.

Software Dependencies

I think large open-source systems - Linux, MySQL, NumPy - will remain just as important. They encode decades of operational hardening, edge-case handling, and community trust that no session can replicate. That doesn’t change.

But smaller software libraries feel different now.

The classic tradeoff has always been: reimplement poorly, or depend externally. “Depend externally” won because reimplementation was costly - in time, in correctness, in maintenance. Pulling in a library made sense even if it came with baggage.

That cost curve is shifting. When moderately scoped functionality can be generated, tested, and reasoned about within a harness session, internal ownership becomes genuinely cheaper. Not in the “not-invented-here” ego sense - in the surface-area-control sense.

Consider what the past decade has actually looked like: supply chain vulnerabilities, malicious updates, subtle API breaks from maintainers who lost interest, packages abandoned mid-dependency graph. Every external library is a trust relationship with a stranger’s future priorities.

I wonder if the emerging best practice will be to reduce dependencies and write our own where the scope is bounded - not out of pride, but out of hard-won supply chain caution and a cost curve that finally makes it practical.

This isn’t a rule. It’s a recalibration. When implementation cost drops, architectural defaults change.

Documentation

When I built products before, the “specification” was split: design intent in Figma, decisions in Slack, scope in Linear, truth in code. And the vast majority of actual behavior - the long tail of functionality - was an emergent property of the code I wrote. If you wanted to know why something worked a certain way, you read the implementation.

That model breaks with agent-produced code.

The conundrum is this: in a Codex session, it’s not clear which parts of the code were prompted (explicitly specified) and which parts were vibed (implicitly inferred). Some behavior was directly asked for. Some emerged from the iteration arc. Some was a reasonable guess the model made and I accepted without thinking.

Over time, in a large system, that ambiguity compounds. The harness will “forget” past instructions. And replaying prompts isn’t the answer - a good chunk of interactions in any real session are interactive and effectively transient. You can’t reconstruct the reasoning from the transcript.

My intuition: documentation will be as important an output of a Codex session as the code itself. Not after the fact - as a co-equal artifact, written during the session, capturing the substantive product decisions made along the way.

And those docs need to live in the repo - versioned with the code, available as context for future sessions. Not in Notion. Not in Slack. In the repository, adjacent to the thing they explain, discoverable by the agent that works on it next.

This maps directly to what OpenAI’s harness engineering post describes: a short AGENTS.md as a table of contents, a structured docs/ directory as the system of record, and the hard-won lesson that a monolithic instruction file rots instantly. The framing that resonated most:

From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist.

That’s the reframe. Documentation isn’t commentary for future humans. It’s memory for future agents.

Follow these two changes long enough and the end state comes into view.

What the Agent Actually Owns

In a mature harness, the agent doesn’t just touch product code. It touches everything:

  • Product code and tests
  • CI configuration and release tooling
  • Internal developer tools
  • Documentation and design history
  • Evaluation harnesses
  • Review comments and responses
  • Scripts that manage the repository itself
  • Production dashboard definitions

The agent uses standard tools directly: pulls review feedback, responds inline, pushes updates, merges its own PRs.

Humans stay in the loop — but at a different layer. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, that’s the signal that something is missing: tools, guardrails, documentation. We identify the gap and feed it back into the repo, always by having the agent write the fix.

Increasing Autonomy

As more of the loop gets encoded into the repository — testing, validation, review, feedback handling, recovery — something changes qualitatively. A sufficiently structured repo reaches a point where the agent can drive a new feature end-to-end from a single prompt.

Given that prompt, the agent can:

  1. Validate the current state of the codebase
  2. Reproduce a reported bug
  3. Implement a fix
  4. Validate the fix by driving the application
  5. Open a pull request
  6. Respond to agent and human feedback
  7. Detect and remediate build failures
  8. Escalate to a human only when judgment is required
  9. Merge the change
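
The numbered workflow above can be sketched as a simple driver: attempt each step in order, retry transient failures, and escalate only when the step itself reports that human judgment is required. Everything in this sketch — `Outcome`, `run_loop`, the step names — is a hypothetical stand-in for repo-specific tooling, not a real Codex interface.

```python
# A minimal sketch of the loop above, assuming the repo encodes each step as a
# callable. Outcome and all step functions are hypothetical, not a Codex API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Outcome:
    ok: bool
    needs_judgment: bool = False  # step 8: escalate only on judgment calls

def run_loop(steps: list[tuple[str, Callable[[], Outcome]]], retries: int = 2) -> str:
    """Drive the named steps in order, retrying failures, escalating when needed."""
    for name, step in steps:
        for _ in range(retries + 1):
            result = step()
            if result.ok:
                break  # step succeeded; move on to the next one
            if result.needs_judgment:
                return f"escalated at {name}"  # hand off to a human (step 8)
        else:
            return f"failed at {name}"  # retries exhausted without success
    return "merged"  # all steps passed; the change lands (step 9)
```

The point of the `else` branch on the retry loop is the governance boundary: exhausted retries and judgment calls both stop the loop, but only one of them pages a person.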

This behavior depends heavily on the specific structure and tooling of the repository. It does not generalize without similar investment - at least not yet.

The New Skill

The harness engineer doesn’t compete with Codex on output speed. That competition is already over.

What they do instead:

  • Define invariants that must survive across sessions
  • Curate context so the next session starts with the right map
  • Design approval boundaries so automation stays governable
  • Encode institutional memory in durable, repo-local form

The interesting work shifts from producing lines of code to designing systems that can safely absorb machine-generated code over time - sessions, months, years - without drifting from intent.

That’s not prompt engineering. That’s infrastructure design.

Intelligence is becoming abundant. The scarce resource now is judgment about how to direct it.

Welcome to the new world.