One More Run: Handing Daily Doom to Paperclip
Two days ago I wrote a postmortem for this project. I meant every word of it. Forty-one nights, 360 commits, 30,000 lines of JavaScript, one game-crashing bug, and a neat little conclusion about where AI agents hit their ceiling. The screen went dark. The demons rested. The experiment was done.
Then I found an excuse to pull the plug back out.
The Lesson That Wouldn’t Sit Still
The sunset post landed on a single idea: the ceiling of an autonomous coding agent isn’t technical capability, it’s taste. Over 41 nights, the agent proved it could ship working code forever. It could read a ticket, write an implementation, add tests, and merge it green at 3 AM without setting anything on fire. What it couldn’t do was judge whether the thing it built was any good. Features came out technically correct but shallow. When the backlog ran dry and it had to invent its own tickets, the ideas were reasonable, defensible, conservative — and never once surprising.
I called it a taste problem and shut things down. But the more I sat with it, the more I suspected I’d misdiagnosed the patient. The agent wasn’t lacking taste. The setup was lacking everything around the agent that produces taste in a real team.
Think about what a lone developer doesn’t have. No product manager arguing that a feature isn’t done until it feels good. No designer pushing back on a flat implementation. No one asking "is this actually fun?" in standup. No budget owner questioning whether this ticket is worth the time. Just an engineer, a backlog, and a definition of done that reads "tests pass." Of course the output stays shallow. That’s what shallow looks like when you strip away every role whose job is to prevent it.
The question I couldn’t shake: what happens if I stop giving Daily Doom a lone AI coder, and start giving it a tiny AI organization?
Enter Paperclip
Around the same time I was chewing on this, I was poking at paperclipai/paperclip, an open-source project I’d been meaning to find a real use case for. Paperclip isn’t a coding agent. It’s the layer around coding agents. Its own tagline puts it bluntly: if the coding agent is the employee, Paperclip is the company.
Concretely, it gives you:
- Org charts with roles, hierarchies, and reporting lines
- A "heartbeat" scheduler so agents wake up on a cadence to check their assigned work
- Per-agent budgets with automatic throttling so nothing runs away with your token bill
- Persistent context across heartbeats — agents resume where they left off instead of restarting cold every night
- Governance: approval gates, audit logs, the ability to pause or override anything at any time
- A "bring your own agent" model that works with Claude Code, Codex, Cursor, or anything else that speaks HTTP
What it doesn’t do is write code. That’s still Claude Code’s job, the same runtime that built the first 41 nights of Daily Doom. Paperclip is the thing that tells Claude what to work on, who it reports to, what budget it has, and whether its output needs another pair of eyes before it ships.
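To make that concrete, here's roughly how I picture the org as data. To be clear, this is my own sketch: the field names and shape are invented for illustration, not Paperclip's actual schema.

```javascript
// Hypothetical org definition for the Daily Doom run.
// Field names are illustrative -- NOT Paperclip's real config format.
const org = {
  roles: [
    { id: "eng", title: "Engineer", agent: "claude-code", reportsTo: "pm" },
    { id: "pm", title: "Product Manager", agent: "claude-code", reportsTo: null },
  ],
  heartbeat: { cron: "0 3 * * *" },                 // wake nightly at 3 AM
  budget: { tokensPerNight: 500_000, onOverrun: "pause" },
  governance: { approvalGate: "pm", auditLog: true }, // PM signs off before merge
};

// Minimal sanity check: every reportsTo must point at a real role.
const ids = new Set(org.roles.map((r) => r.id));
const wellFormed = org.roles.every(
  (r) => r.reportsTo === null || ids.has(r.reportsTo)
);
console.log(wellFormed); // true
```

The point of writing it down like this is that the org chart, the budget, and the approval gate are all just declarative structure sitting above the coding agent, which is exactly the layer the solo run never had.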
I’d been looking for a project to put it on. Daily Doom had been staring at me the whole time.
The New Hypothesis
Here’s the bet for this next run. If the 41-night ceiling came from organizational vacuum rather than raw model capability, then wrapping the same Claude Code runtime in Paperclip’s scaffolding should move the ceiling. Not because the coder gets smarter, but because the coder suddenly has a boss, a budget, and a reason to care about more than "tests pass."
Three things I want to see move:
- Stability should at least hold. The first run shipped 360 commits with one crash bug. That bar is already high. If adding an orchestration layer drops it, that’s a real cost and I need to notice.
- Quality — the thing automated tests couldn’t measure — should improve. Last time, Playwright could tell me the shotgun existed. It couldn’t tell me whether the shotgun had punch. I want to see features that go beyond the literal spec, because a role in the org is specifically pushing for that.
- The agent should propose ideas I wouldn’t have thought of. This was the clearest failure of the solo run. Inventive tickets, not just reasonable ones. If Paperclip’s structure produces creative risk where a lone engineer produced a checklist, that’s the finding worth the rerun.
I’m starting simple: Paperclip wrapping the existing nightly pipeline with budget and governance and persistent context, then layering in more roles as the run develops. I don’t want to over-engineer the org on day one and then not know which part moved the needle.
What Would Change My Mind Either Way
I want to pre-commit to what success and failure look like, because last time I called it a wrap on a vibe and I’d rather not do that twice.
It worked if: by the end of the run, I can point to at least a handful of features that feel meaningfully better than the median feature from the first 41 nights — more thoughtful, more playful, more surprising — and I can trace that difference back to something Paperclip’s structure caused. And the stability bar holds.
It didn’t work if: the output looks indistinguishable from the solo run, just with more overhead and a prettier org chart. Or if stability craters and I spend the run firefighting infrastructure instead of watching the game evolve. In that case, the original sunset post was right and I owe it an addendum rather than a contradiction.
Either outcome is interesting. The boring one is the one I’m actively trying to avoid: running it for a few weeks, writing something vague about "learnings," and walking away without having tested anything.
What Happens Now
The game stays live, same as before. The repo stays public. The nightly pipeline is back on, this time with Paperclip sitting between me and Claude Code, doing the middle-management work I was previously pretending a backlog could do by itself.
Dev log entries resume in the blog. If you want the backstory, the early days and the 40-day mark are still the best primers on how this whole thing started.
Turns out the demons weren’t resting. They were waiting for middle management.