Lessons Learned Building an Autonomous AI Pipeline

1. The idea

What if a solo founder could run an entire software company without hiring developers? Not outsource it, not use a no-code tool — but build real software, ship real products, and operate a real pipeline using AI as the full development team.

That is what ARAI is. The name stands for Autonomous Recursive AI — a pipeline that observes its own state, generates tasks, executes them, reviews the results, and learns from failures. Every product on arai.software was built and is maintained by this pipeline. Including this post.

The idea sounds simple. The execution is not.

2. How it began — ChatGPT + Claude + Antigravity

The first version of the pipeline was not a pipeline at all. It was a loop of manual prompting: describe a task in ChatGPT, get a plan, paste it into Claude for implementation, review the diff by hand, apply it. Repeat.

The first real step toward automation came with Antigravity — a MacBook-based tool that let Claude Code run as a supervisor. You could define tasks in structured files and have Antigravity route them through Claude for planning, execution, and review. It worked.

But it had limits. Antigravity ran on a single machine — my MacBook. When the lid closed, the pipeline stopped. Every API call went through Claude's highest-cost tier. And the whole thing required a human to be present and watching.

Where the pipeline runs matters. A laptop-bound tool is expensive in both cost and availability. The moment you depend on one machine, you have a single point of failure that shuts down at 11pm.

3. Step 2 — VPS + Claude API

Moving the pipeline to a Hetzner VPS was the first major unlock. The pipeline now ran around the clock without my MacBook being open. Scripts called the Claude API directly, tasks ran on a schedule, and results landed in git without me touching anything.

The architecture simplified too. A Node.js script loaded a task JSON, called Claude with a structured prompt, received a unified diff, validated it, and applied it. The whole thing was about 300 lines of code.
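
The flow described above can be sketched in a few functions. This is a minimal TypeScript sketch under stated assumptions, not the actual ARAI script: the Task fields, every function name, and the injected callModel and applyDiff callbacks are hypothetical. The validation shown only checks that the diff has git-style file headers and at least one hunk, and that it stays within the files the task allows.

```typescript
// Hypothetical task shape; the field names are illustrative, not ARAI's schema.
interface Task {
  id: string;
  description: string;
  allowedFiles: string[];
}

// The model call and the patch step are injected so the sketch runs without
// network access or a git checkout.
type ModelCall = (prompt: string) => Promise<string>;
type ApplyDiff = (diff: string) => Promise<void>;

// Parse and sanity-check the task JSON before anything is sent to the model.
function parseTask(json: string): Task {
  const raw = JSON.parse(json);
  if (
    typeof raw.id !== "string" ||
    typeof raw.description !== "string" ||
    !Array.isArray(raw.allowedFiles)
  ) {
    throw new Error("task JSON is missing required fields");
  }
  return { id: raw.id, description: raw.description, allowedFiles: raw.allowedFiles.map(String) };
}

function buildPrompt(task: Task): string {
  return [
    `Task ${task.id}: ${task.description}`,
    `Only modify these files: ${task.allowedFiles.join(", ")}`,
    "Respond with a unified diff and nothing else.",
  ].join("\n");
}

// Collect target paths from "+++ b/<path>" header lines (git-style prefixes assumed).
function filesInDiff(diff: string): string[] {
  const files: string[] = [];
  for (const line of diff.split("\n")) {
    const path = /^\+\+\+ b\/(.+)$/.exec(line)?.[1];
    if (path) files.push(path);
  }
  return files;
}

// Returns a list of problems; an empty list means the diff passed.
function validateDiff(diff: string, task: Task): string[] {
  const problems: string[] = [];
  const files = filesInDiff(diff);
  if (files.length === 0) problems.push("no file headers found");
  if (!/^@@ /m.test(diff)) problems.push("no hunks found");
  for (const file of files) {
    if (!task.allowedFiles.includes(file)) problems.push(`file not allowed: ${file}`);
  }
  return problems;
}

// One pipeline step: prompt, receive a diff, validate it, apply it only if clean.
async function runTask(task: Task, callModel: ModelCall, applyDiff: ApplyDiff) {
  const diff = await callModel(buildPrompt(task));
  const problems = validateDiff(diff, task);
  if (problems.length > 0) return { status: "rejected" as const, problems };
  await applyDiff(diff);
  return { status: "applied" as const, problems };
}
```

Injecting the model call keeps the validation logic testable on its own, which matters because validation is the only thing standing between a bad response and the repository.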

The downsides showed up fast:

Cost per token. Every iteration of every task cost real money. A buggy task that ran 10 iterations could burn through a week's budget in an hour.
Rate limits. The Claude API has tier-based rate limits. Batch runs would hit limits mid-sequence and fail silently.
Budget discipline. Without hard caps, the pipeline would happily spend €200 on a single bad day.

This phase forced the governance layer into existence: budget caps per venture, audit certificates before each run, and a hard stop when limits are reached. These are now core to how ARAI works.
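
A per-venture cap with a hard stop can be as small as a pre-run check. The sketch below is one possible shape, not ARAI's implementation: BudgetState, checkBudget, worstCaseCost, and recordSpend are hypothetical names, amounts are kept in euro cents, and the worst case is assumed to be "every allowed iteration runs at the estimated cost". Audit certificates are not modeled here.

```typescript
// Hypothetical names throughout; amounts are integer euro cents to avoid float drift.
interface BudgetState {
  capCents: number;   // hard cap for the venture in the current period
  spentCents: number; // cost already recorded in the current period
}

type BudgetDecision =
  | { allowed: true; remainingCents: number }
  | { allowed: false; reason: string };

// Worst case for a task: every iteration runs and each one costs the estimate.
function worstCaseCost(maxIterations: number, estimatedCentsPerIteration: number): number {
  return maxIterations * estimatedCentsPerIteration;
}

// Pre-run check: refuse to start if the worst case would push spend past the cap.
function checkBudget(state: BudgetState, worstCaseCents: number): BudgetDecision {
  const remaining = state.capCents - state.spentCents;
  if (worstCaseCents > remaining) {
    return { allowed: false, reason: `needs ${worstCaseCents} cents, only ${remaining} left` };
  }
  return { allowed: true, remainingCents: remaining - worstCaseCents };
}

// After the run, record what was actually spent.
function recordSpend(state: BudgetState, actualCents: number): BudgetState {
  return { ...state, spentCents: state.spentCents + actualCents };
}
```

Checking the worst case before the run, rather than the actual cost after it, is what turns the cap into a hard stop: a task that could overshoot never starts.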

4. Step 3 — Claude subscription + Claude Code on VPS

The current architecture replaces Claude API calls with the Claude Code CLI running directly on the VPS, authenticated with an OAuth token generated by claude setup-token.

Instead of calling the API and paying per token, the pipeline spawns claude CLI processes. The model is identical to what runs in the browser. The cost is a flat monthly subscription, not a usage meter.

The advantages are significant:

Fixed, predictable cost regardless of how many tasks run
Higher effective limits than the API's lower tiers
The same model quality as interactive Claude sessions
No API key rotation, no tier upgrades, no surprise bills

The tradeoff is a different kind of limit: usage limits per period. Subscription usage limits reset on a rolling window (roughly 5 hours). If the pipeline runs heavy tasks back-to-back, it will hit the limit and have to pause. The pipeline must detect this, wait, and resume rather than fail permanently.

This is now handled explicitly: when a task hits a usage limit error, it is marked iteration_failed with a retry timestamp. A scheduler picks it up after the reset window and re-queues it automatically.
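
The wait-and-resume behavior can be sketched as two pure functions: one that stamps a failed task with a retry time, and one that the scheduler runs to re-queue tasks whose time has come. Only the iteration_failed status and the roughly 5-hour window come from the description above. The QueuedTask shape, the function names, and especially the error-message pattern are assumptions; the real CLI wording may differ.

```typescript
type TaskStatus = "queued" | "running" | "done" | "iteration_failed";

// Hypothetical queue entry; retryAt is epoch milliseconds, or null when no retry is planned.
interface QueuedTask {
  id: string;
  status: TaskStatus;
  retryAt: number | null;
}

// Roughly 5 hours, per the text above.
const RESET_WINDOW_MS = 5 * 60 * 60 * 1000;

// Assumption: the failure message contains a recognizable phrase.
function isUsageLimitError(message: string): boolean {
  return /usage limit/i.test(message);
}

// Mark the task failed. Only usage-limit failures get a retry timestamp;
// other failures are left for a human or a different policy.
function handleFailure(task: QueuedTask, message: string, now: number): QueuedTask {
  const retryAt = isUsageLimitError(message) ? now + RESET_WINDOW_MS : null;
  return { ...task, status: "iteration_failed", retryAt };
}

// Scheduler pass: re-queue every failed task whose retry time has arrived.
function requeueDue(tasks: QueuedTask[], now: number): QueuedTask[] {
  return tasks.map((task): QueuedTask =>
    task.status === "iteration_failed" && task.retryAt !== null && task.retryAt <= now
      ? { ...task, status: "queued", retryAt: null }
      : task
  );
}
```

Passing the current time in as a parameter, instead of reading the clock inside the functions, makes the retry logic deterministic and easy to test.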

5. Key lessons

After 600+ tasks executed across 5 ventures, here is what actually matters:

Pipeline first. Never edit files directly. Every change must come through a task. The moment you bypass the pipeline, the system stops learning from that change — and you lose the audit trail.
Rate limits are not bugs. They are a constraint to design around. Build wait-and-resume logic into the pipeline from day one. Treating limits as errors leads to fragile systems.
Scope contracts are critical. Small, well-defined tasks go through reliably. Large, vague tasks fail. A task that touches more than two files or exceeds 100 lines of diff is almost always the wrong shape.
Rejected tasks are data. Every failed or rejected task is a signal about what the system cannot handle yet. Logging them as learning events — not just errors — makes the pipeline smarter over time.
Solo founder + AI works — with discipline. The bottleneck is not the AI's capability. It is the quality of the task description. A poorly specified task produces a plausible-looking result that is wrong in ways that take hours to find. Invest in the task, not just the execution.
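
The scope rule of thumb above (no more than two files, no more than 100 lines of diff) can be checked mechanically before a diff is applied, and a rejection can be recorded as a learning event rather than a bare error. The sketch below is illustrative only: all names are hypothetical, "lines of diff" is assumed to mean added plus removed lines, and the parser trusts the line counts in each hunk header.

```typescript
interface ScopeLimits {
  maxFiles: number;
  maxChangedLines: number;
}

// Thresholds taken from the rule of thumb in the list above.
const DEFAULT_LIMITS: ScopeLimits = { maxFiles: 2, maxChangedLines: 100 };

interface DiffStats {
  files: number;
  changedLines: number;
}

// Count files and added/removed lines in a unified diff. Hunk headers give the
// number of old and new lines to consume, so a removed line whose text begins
// with "-- " (shown as "--- ...") is not mistaken for a file header.
function measureDiff(diff: string): DiffStats {
  let files = 0;
  let changedLines = 0;
  let oldLeft = 0; // old-side lines still expected in the current hunk
  let newLeft = 0; // new-side lines still expected in the current hunk
  for (const line of diff.split("\n")) {
    if (oldLeft > 0 || newLeft > 0) {
      if (line.startsWith("+")) { newLeft -= 1; changedLines += 1; }
      else if (line.startsWith("-")) { oldLeft -= 1; changedLines += 1; }
      else if (line.startsWith("\\")) { /* "\ No newline at end of file" marker */ }
      else { oldLeft -= 1; newLeft -= 1; } // context line
      continue;
    }
    const hunk = /^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@/.exec(line);
    if (hunk) {
      oldLeft = Number(hunk[1] ?? "1"); // an omitted count means 1
      newLeft = Number(hunk[2] ?? "1");
    } else if (line.startsWith("+++ ")) {
      files += 1;
    }
  }
  return { files, changedLines };
}

// Returns the list of violated limits; empty means the diff is in scope.
function checkScope(diff: string, limits: ScopeLimits = DEFAULT_LIMITS): string[] {
  const stats = measureDiff(diff);
  const violations: string[] = [];
  if (stats.files > limits.maxFiles) {
    violations.push(`touches ${stats.files} files (max ${limits.maxFiles})`);
  }
  if (stats.changedLines > limits.maxChangedLines) {
    violations.push(`changes ${stats.changedLines} lines (max ${limits.maxChangedLines})`);
  }
  return violations;
}

// Hypothetical record shape for logging a rejection as a learning event.
interface LearningEvent {
  taskId: string;
  kind: "scope_rejected";
  violations: string[];
  at: string; // ISO 8601 timestamp
}

function toLearningEvent(taskId: string, violations: string[], now: Date): LearningEvent {
  return { taskId, kind: "scope_rejected", violations, at: now.toISOString() };
}
```

A model-generated diff may carry wrong hunk counts, so in practice a check like this would sit behind a stricter validity check.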

The pipeline is not magic. It is a system, and like any system, it rewards precision and punishes ambiguity. The more clearly you define what "done" looks like, the more reliably the pipeline gets you there.