The Software Factory: AI-Driven Development Without Human Code Review
Source: Software Factory by Simon Willison (Feb 7, 2026)
Simon Willison covers how StrongDM's AI team has implemented what Dan Shapiro calls the "Dark Factory" level of AI adoption, where no human writes or even reviews the code that coding agents produce. Their full writeup: Software Factories and the Agentic Moment.
Core Principles
StrongDM's AI team (founded July 2025, just 3 people) operates under radical constraints:
- Code must not be written by humans
- Code must not be reviewed by humans
- If you aren't spending at least $1,000/day per engineer on tokens, your software factory has room for improvement
The catalyst: with Claude 3.5 Sonnet revision 2 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than errors. The November 2025 inflection point (Claude Opus 4.5, GPT 5.2) further improved reliability.
Key Innovation #1: Scenario Testing as Holdout Sets
The problem: If agents write both implementation AND tests, they can cheat (assert true). How do you prove agent-produced software actually works?
The solution: Borrow from ML and treat test scenarios like holdout sets in model training:
- End-to-end "user stories" stored outside the codebase, invisible to coding agents
- Shift from boolean pass/fail to probabilistic satisfaction: "of all observed trajectories through all scenarios, what fraction likely satisfy the user?"
- Effectively replicates aggressive external QA testing, which was historically expensive but highly effective
Why this matters for developers: This reframes testing philosophy from "does this code do what I told it to" toward "does this system satisfy users across realistic scenarios." It's a mindset shift from unit-test correctness to behavioral validation.
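A minimal sketch of what "probabilistic satisfaction" could look like in practice: trajectories through held-out scenarios are judged by an external evaluator (not the coding agent), and the metric is the fraction judged satisfying rather than a binary pass/fail. The `Trajectory` type and the scenario names are hypothetical, not StrongDM's actual harness.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    scenario: str
    satisfied: bool  # judged by an external evaluator, never by the coding agent

def satisfaction_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of observed trajectories judged to satisfy the user."""
    if not trajectories:
        return 0.0
    return sum(t.satisfied for t in trajectories) / len(trajectories)

# Hypothetical results from replaying held-out scenarios against one build:
runs = [
    Trajectory("password-reset", True),
    Trajectory("password-reset", True),
    Trajectory("invite-teammate", False),
    Trajectory("invite-teammate", True),
]
print(satisfaction_rate(runs))  # 0.75
```

The key design point is that the scenarios and the judgment live outside anything the coding agent can see, so the metric cannot be gamed with `assert true`.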
Key Innovation #2: Digital Twin Universe (DTU)
The problem: You can't run thousands of integration tests per hour against real SaaS APIs (rate limits, costs, abuse detection).
The solution: Have coding agents build behavioral clones of third-party services:
- Built twins of Okta, Jira, Slack, Google Docs, Google Drive, Google Sheets
- Replicate their APIs, edge cases, and observable behaviors
- Feed full public API docs into the agent harness to produce self-contained Go binaries
- Layer simplified UIs on top for complete simulation
The unlock: Creating high-fidelity clones of SaaS applications was always possible but never economically feasible. LLM agents collapse the cost of building these replicas. Now you can:
- Validate at volumes exceeding production limits
- Test failure modes that would be dangerous against live services
- Run thousands of scenarios/hour without rate limits or API costs
Why this matters for developers: Even if you're not building a full "software factory," the DTU concept is directly applicable. Any team doing integration testing against external APIs can benefit from agent-generated service mocks that go far beyond hand-written stubs.
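To make the DTU idea concrete, here is a toy behavioral twin of a chat API, reduced to one endpoint. Everything here is illustrative (a real twin would replicate the full public API surface, as StrongDM's Go binaries do), but it shows the payoff: you can deterministically trigger a failure mode, like rate limiting, that would be unsafe or impossible to exercise against the live service.

```python
class ChatServiceTwin:
    """Toy in-memory behavioral twin of a chat API (illustrative only)."""

    def __init__(self, rate_limit: int = 3):
        self.messages: dict[str, list[str]] = {}  # channel -> posted messages
        self.calls = 0
        self.rate_limit = rate_limit

    def post_message(self, channel: str, text: str) -> dict:
        self.calls += 1
        if self.calls > self.rate_limit:
            # Deterministically replicate the edge case you could never
            # safely hammer on a live SaaS API.
            return {"ok": False, "error": "rate_limited"}
        self.messages.setdefault(channel, []).append(text)
        return {"ok": True, "channel": channel}

twin = ChatServiceTwin(rate_limit=2)
print(twin.post_message("#general", "hello"))
print(twin.post_message("#general", "again"))
print(twin.post_message("#general", "too fast"))  # rate-limited response
```

Because the twin is in-process, thousands of scenario runs per hour cost nothing and hit no abuse detection.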
Key Innovation #3: Reusable Agent Techniques
StrongDM published several named patterns on their techniques page:
| Technique | Description | Developer Application |
|---|---|---|
| Gene Transfusion | Agents extract patterns from existing systems and reuse elsewhere | Migrate architectural patterns across services automatically |
| Semports | Direct code porting from one language to another | Cross-language migrations (e.g., Python service to Go) |
| Pyramid Summaries | Multiple summary levels: agents enumerate short summaries first, then zoom into detail as needed | Managing large codebases with agents; progressive context loading |
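The Pyramid Summaries pattern can be sketched with two summary levels over a toy repo: an agent first scans cheap one-line summaries of every file, then requests a deeper level only for files that look relevant. The summarizers below are crude stand-ins (first line, top-level signatures) for whatever an agent would actually generate; the structure, not the heuristics, is the point.

```python
def level0(path: str, source: str) -> str:
    """Level 0: one line per file, the cheapest possible overview."""
    for line in source.splitlines():
        if line.strip():
            return f"{path}: {line.strip()[:60]}"
    return f"{path}: (empty)"

def level1(path: str, source: str) -> str:
    """Level 1: top-level def/class signatures, loaded only on demand."""
    sigs = [l.strip() for l in source.splitlines()
            if l.startswith(("def ", "class "))]
    return f"{path}:\n  " + "\n  ".join(sigs or ["(no top-level defs)"])

repo = {
    "billing.py": '"""Invoice generation."""\nclass Invoice:\n    pass\n'
                  'def total(items):\n    return sum(items)',
    "auth.py": '"""Login flow."""\ndef login(user):\n    ...',
}

# The agent scans only level-0 summaries of the whole repo...
for path, src in repo.items():
    print(level0(path, src))
# ...then zooms into the one file that looks relevant:
print(level1("billing.py", repo["billing.py"]))
```

The context cost stays proportional to what the agent actually inspects, not to the size of the codebase.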
Key Innovation #4: Spec-Driven Agent Software (Attractor)
StrongDM released Attractor, their non-interactive coding agent, as a repo containing zero code: just three markdown files describing the spec in meticulous detail, with instructions to feed them into your coding agent of choice.
This represents a shift where the specification IS the software distribution. The assumption: any competent coding agent can implement from a good enough spec.
Practical Takeaways
- Separate test authoring from code authoring: Even without going full "dark factory," keeping scenario definitions outside the agent's visible context prevents gaming.
- Invest in environment simulation: Agent-generated service mocks/twins are now economically viable and dramatically improve testing throughput.
- Probabilistic validation over binary tests: Consider measuring "satisfaction rates" across scenario trajectories rather than just pass/fail test suites.
- Progressive context management: Pyramid summaries help agents navigate large codebases without context window overflow.
- Spec-first development: Well-written specifications become the primary artifact; implementation becomes fungible.
Cost Reality Check
The $1,000/day per engineer ($20,000/month) target raises serious questions about economic viability. Willison notes this makes the approach "far less interesting" at that price point; it becomes a business model exercise rather than a universal technique. Additionally, competitors could potentially clone features with a few hours of agent work, challenging traditional software moats.
For individual developers and smaller teams, the conceptual patterns (holdout testing, DTU, pyramid summaries) are valuable even at much lower spend levels like the $200/month Claude Max plan.