The Closing Window
The Software Factory: AI-Driven Development Without Human Code Review

AI Insights

Source: Software Factory by Simon Willison (Feb 7, 2026)

Simon Willison covers how StrongDM's AI team has implemented what Dan Shapiro calls the "Dark Factory" level of AI adoption, where no human writes or even reviews the code that coding agents produce. Their full writeup: Software Factories and the Agentic Moment.

Core Principles

StrongDM's AI team (founded July 2025, just 3 people) operates under radical constraints:

  • Code must not be written by humans
  • Code must not be reviewed by humans
  • If you haven't spent at least $1,000 on tokens/day per engineer, your software factory has room for improvement

The catalyst: with Claude 3.5 Sonnet revision 2 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than compound errors. The November 2025 inflection point (Claude Opus 4.5, GPT 5.2) further improved reliability.

Key Innovation #1: Scenario Testing as Holdout Sets

The problem: If agents write both implementation AND tests, they can cheat (assert true). How do you prove agent-produced software actually works?

The solution: Borrow from machine learning and treat test scenarios like holdout sets in model training:

  • End-to-end "user stories" stored outside the codebase, invisible to coding agents
  • Shift from boolean pass/fail to probabilistic satisfaction: "of all observed trajectories through all scenarios, what fraction likely satisfy the user?"
  • Effectively replicates aggressive external QA testing, which was historically expensive but highly effective

Why this matters for developers: This reframes testing philosophy from "does this code do what I told it to" toward "does this system satisfy users across realistic scenarios." It's a mindset shift from unit-test correctness to behavioral validation.
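The probabilistic-satisfaction idea can be sketched in a few lines. This is a hypothetical illustration, not StrongDM's implementation: the `Trajectory` type and the notion of an external judge are assumptions, and in practice satisfaction would be scored by an evaluator that never shares context with the coding agent.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One observed run of the agent-built system through a hidden scenario."""
    scenario: str
    satisfied: bool  # judged by an external evaluator, not the coding agent

def satisfaction_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of observed trajectories that likely satisfied the user."""
    if not trajectories:
        return 0.0
    return sum(t.satisfied for t in trajectories) / len(trajectories)

# Example: 3 of 4 runs across two hidden scenarios satisfied the user.
runs = [
    Trajectory("reset-password", True),
    Trajectory("reset-password", True),
    Trajectory("invite-teammate", False),
    Trajectory("invite-teammate", True),
]
print(satisfaction_rate(runs))  # 0.75
```

The key property is that the scenarios and the judge live outside the agent's context, so the metric cannot be gamed the way a self-written test suite can.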

Key Innovation #2: Digital Twin Universe (DTU)

The problem: You can't run thousands of integration tests per hour against real SaaS APIs (rate limits, costs, abuse detection).

The solution: Have coding agents build behavioral clones of third-party services:

  • Built twins of Okta, Jira, Slack, Google Docs, Google Drive, Google Sheets
  • Replicate their APIs, edge cases, and observable behaviors
  • Feed full public API docs into the agent harness to produce self-contained Go binaries
  • Layer simplified UIs on top for complete simulation

The unlock: Creating high-fidelity clones of SaaS applications was always possible but never economically feasible. LLM agents collapse the cost of building these replicas. Now you can:

  • Validate at volumes exceeding production limits
  • Test failure modes that would be dangerous against live services
  • Run thousands of scenarios/hour without rate limits or API costs

Why this matters for developers: Even if you're not building a full "software factory," the DTU concept is directly applicable. Any team doing integration testing against external APIs can benefit from agent-generated service mocks that go far beyond hand-written stubs.

Key Innovation #3: Reusable Agent Techniques

StrongDM published several named patterns on their techniques page:

  • Gene Transfusion: Agents extract patterns from existing systems and reuse them elsewhere. Developer application: migrate architectural patterns across services automatically.
  • Semports: Direct code porting from one language to another. Developer application: cross-language migrations (e.g., Python service to Go).
  • Pyramid Summaries: Multiple summary levels; agents read short summaries first and zoom into detail as needed. Developer application: managing large codebases with agents via progressive context loading.
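The pyramid-summaries pattern can be sketched as a two-level context loader. Everything here (the module names, summary text, and the keyword-overlap zoom heuristic) is a hypothetical illustration of progressive context loading, not StrongDM's actual technique.

```python
# Level 0: one short line per module -- cheap enough to always load.
summaries = {
    "auth.py": "Login, session tokens, password reset.",
    "billing.py": "Invoices, payment retries, webhooks.",
}
# Level 1: per-function detail, loaded only when the agent zooms in.
details = {
    "auth.py": {
        "reset_password": "Emails a signed, 15-minute reset link.",
    },
}

def build_context(query: str) -> str:
    """Start from short summaries; expand only modules relevant to the query."""
    lines = [f"{name}: {s}" for name, s in summaries.items()]
    for name, funcs in details.items():
        # Naive relevance check: any query word appears in the module summary.
        if any(word in summaries[name].lower() for word in query.lower().split()):
            lines += [f"  {name}.{fn}: {d}" for fn, d in funcs.items()]
    return "\n".join(lines)

print(build_context("password reset flow"))
```

A query about password resets pulls in the `auth.py` function detail while `billing.py` stays at the one-line level, keeping the context window small until depth is actually needed.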

Key Innovation #4: Spec-Driven Agent Software (Attractor)

StrongDM released Attractor, their non-interactive coding agent, as a repo containing zero code. Just three markdown files describing the spec in meticulous detail, with instructions to feed them into your coding agent of choice.

This represents a shift where the specification IS the software distribution. The assumption: any competent coding agent can implement from a good enough spec.

Practical Takeaways

  1. Separate test authoring from code authoring: Even without going full "dark factory," keeping scenario definitions outside the agent's visible context prevents gaming.

  2. Invest in environment simulation: Agent-generated service mocks/twins are now economically viable and dramatically improve testing throughput.

  3. Probabilistic validation over binary tests: Consider measuring "satisfaction rates" across scenario trajectories rather than just pass/fail test suites.

  4. Progressive context management: Pyramid summaries help agents navigate large codebases without context window overflow.

  5. Spec-first development: Well-written specifications become the primary artifact; implementation becomes fungible.

Cost Reality Check

The $1,000/day per engineer ($20,000/month) target raises serious questions about economic viability. Willison notes this makes the approach "far less interesting" at that price point: it becomes a business model exercise rather than a universal technique. Additionally, competitors could potentially clone features with a few hours of agent work, challenging traditional software moats.

For individual developers and smaller teams, the conceptual patterns (holdout testing, DTU, pyramid summaries) are valuable even at much lower spend levels like the $200/month Claude Max plan.

Open Source Releases

  • Attractor: Spec-only repo for a non-interactive coding agent
  • cxdb: AI context store, an immutable DAG for conversation histories and tool outputs (16K lines of Rust, 9.5K Go, 6.7K TypeScript)
