Building One Tool a Day — My AI-Powered Open Source Pipeline

7 min read
ai · open-source · pipeline · claude-code

Why I Started This

"I'm going to build one project every day."

When I first said this out loud, most people had the same reaction: "One project a day? Is that even possible?" Honestly, I wasn't sure either. But I was sure about one thing — AI coding tools had matured enough that if I set the direction and let AI handle the execution, it wasn't impossible.

I built my first project about a month ago. Looking back, it was never just about "writing code fast." It was about building an entire pipeline — from picking ideas, to designing, building, testing, and evaluating — together with AI. Here's where things stand today:

  • 35 projects completed
  • 10,777 tests passing
  • 5 languages used (TypeScript, Python, Go, Rust, Shell)
  • Every project is open source on GitHub

In this post, I want to share how this pipeline works and what I've learned along the way.

The Pipeline Architecture

If you want to ship a project every day, the most important thing is deciding what to build — fast. You can't wait for inspiration to strike. So I built a data-driven system for idea selection.

Step 1: Trend Collection — git-trend-sync

Everything starts with git-trend-sync. It automatically scrapes the GitHub Trending page every day and saves the data as JSON. It tracks which languages are gaining traction, which tool categories are getting attention, and which keywords keep popping up.

With this data, I can judge "what developers actually need right now" based on numbers, not gut feeling. For example, I can spot gaps like "AI agent tooling is booming, but debugging tools for agents are still lacking."
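To make that concrete, here's roughly what a daily snapshot script could look like. This isn't git-trend-sync's actual code, and it assumes the Trending page's current HTML layout, which GitHub can change at any time:

```python
# Hypothetical sketch of a daily trending snapshot -- not git-trend-sync's
# real code. Assumes GitHub Trending's current markup (article.Box-row).
import json
import datetime

import requests
from bs4 import BeautifulSoup

def fetch_trending() -> list[dict]:
    html = requests.get("https://github.com/trending", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    repos = []
    for row in soup.select("article.Box-row"):
        link = row.select_one("h2 a")
        lang = row.select_one("[itemprop='programmingLanguage']")
        repos.append({
            "repo": link["href"].strip("/"),  # e.g. "owner/name"
            "language": lang.text.strip() if lang else None,
        })
    return repos

if __name__ == "__main__":
    snapshot = {
        "date": datetime.date.today().isoformat(),
        "repos": fetch_trending(),
    }
    # One JSON file per day makes it easy to diff trends over time.
    with open(f"trending-{snapshot['date']}.json", "w") as f:
        json.dump(snapshot, f, indent=2)
```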

Step 2: Idea Selection — Agent Company

Based on the trend data, five AI agents each propose a project. I call this "Agent Company." Each agent has a unique perspective:

  • Problem Hunter: Finds recurring pain points developers face
  • Trend Scout: Spots emerging opportunities in the trend data
  • Gap Analyst: Analyzes gaps in existing tooling
  • Cross-Pollinator: Applies ideas from other domains to developer tools
  • Community Voice: Collects common complaints from developer communities

After the five agents pitch their ideas, a Red Team steps in to attack each one. "This already exists," "Who would actually use this?" "Can you build this in a week?" — these kinds of questions weed out the weak ideas. The pitching agent has to defend with data, not opinions.

After this process, each idea gets scored, and the highest-scoring one becomes that day's project. I've run 28 rounds so far, adopting one project per round.
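The next post covers the internals, but the flow reduces to something like the sketch below. It's a simplification, and `llm(system, prompt)` is a hypothetical stand-in for a real model client:

```python
# A sketch of one Agent Company round, matching the flow described above.
# llm(system, prompt) is a hypothetical stand-in for any SDK call that
# returns text; the real orchestration is more involved.
PERSONAS = {
    "Problem Hunter": "Find recurring pain points developers face.",
    "Trend Scout": "Spot emerging opportunities in the trend data.",
    "Gap Analyst": "Analyze gaps in existing tooling.",
    "Cross-Pollinator": "Apply ideas from other domains to developer tools.",
    "Community Voice": "Surface common complaints from developer communities.",
}

RED_TEAM = ("Attack this idea: Does it already exist? Who would actually "
            "use it? Can it be built in a week?")

def run_round(trend_data: str, llm) -> dict:
    # 1. Each persona pitches one project based on the trend data.
    pitches = [{"agent": name,
                "idea": llm(system=role,
                            prompt=f"Trend data:\n{trend_data}\nPitch one project.")}
               for name, role in PERSONAS.items()]

    # 2. Red Team attacks, the persona defends, a judge scores 1-10.
    for p in pitches:
        attack = llm(system="You are a hostile Red Team reviewer.",
                     prompt=f"{RED_TEAM}\n\nIdea:\n{p['idea']}")
        defense = llm(system=PERSONAS[p["agent"]],
                      prompt=f"Defend your idea with data, not opinions:\n{attack}")
        p["score"] = float(llm(system="Neutral judge. Reply with only a number 1-10.",
                               prompt=f"{p['idea']}\n{attack}\n{defense}"))

    # 3. The highest-scoring idea becomes today's project.
    return max(pitches, key=lambda p: p["score"])
```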

Step 3: Build — 3-Phase Harness

Once an idea is selected, the build begins. There's one critical principle here: separate the builder from the evaluator.

Phase 1 — Deep Build: AI builds everything from design to implementation, tests, and README in one deep session. The builder does not evaluate its own work. It just declares "build complete."

Phase 2 — Adversarial Evaluation: An independent evaluator examines the build adversarially. It actually installs the project, runs the tests, executes the CLI, and checks whether README examples work. Stub code? FAIL. Incomplete features? FAIL. Score below 7 out of 10? Sent back for another build.

The build has to pass this evaluation twice in a row to move on.

Phase 3 — Ship: Once it clears the quality gate, it ships.

The key insight behind this structure is that it solves a familiar problem structurally: when you evaluate your own work, you go easy on yourself.
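In code, the harness loop is roughly this shape. It's a sketch, not the real harness; `deep_build`, `adversarial_eval`, and `ship` stand in for the actual AI sessions:

```python
# Sketch of the 3-phase harness control flow described above.
# The key point: the builder never scores itself.
PASS_THRESHOLD = 7      # out of 10
REQUIRED_STREAK = 2     # must pass the evaluation twice in a row

def run_harness(idea, deep_build, adversarial_eval, ship):
    streak = 0
    deep_build(idea)                    # Phase 1: build, no self-evaluation
    while streak < REQUIRED_STREAK:
        score = adversarial_eval(idea)  # Phase 2: independent evaluator
        if score >= PASS_THRESHOLD:
            streak += 1                 # needs two consecutive passes
        else:
            streak = 0                  # any failure resets the streak...
            deep_build(idea)            # ...and it's sent back for another build
    ship(idea)                          # Phase 3: quality gate cleared
```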

Featured Projects

Here are a few highlights from the 35 projects.

ctxlens — LLM Context Window Profiler

When working with LLMs, it's hard to tell how full your context window is or what's taking up the most space. ctxlens solves this by visualizing token distribution in your prompts and analyzing which parts to trim for the best results. If you run AI agents, you know how much context management matters.
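This isn't ctxlens's actual implementation, but the core idea fits in a few lines with tiktoken:

```python
# The core idea behind ctxlens, sketched with tiktoken -- not the tool's
# actual implementation or API.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def profile_context(sections: dict[str, str]) -> None:
    counts = {name: len(enc.encode(text)) for name, text in sections.items()}
    total = sum(counts.values())
    # Print the biggest consumers of the context window first.
    for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name:>15}: {n:6d} tokens ({n / total:5.1%})")

profile_context({
    "system prompt": "You are a helpful coding assistant...",
    "file context": "def main(): ...",          # imagine a few pasted files
    "chat history": "User: ...\nAssistant: ...",
})
```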

drift-guard — Quality Monitoring MCP Server

When you work with AI for extended sessions, quality tends to degrade gradually — I call it "drift." drift-guard is an MCP (Model Context Protocol) server that monitors your AI agent's work quality in real time. It tracks file changes, warns when test coverage drops, and saves session summaries. I'm actually using it on this very homepage project.
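I won't reproduce the server here, but one of the checks it performs, warning when coverage drops, looks conceptually like this. A standalone sketch, not the MCP server itself:

```python
# Conceptual sketch of one drift-guard check: warn when test coverage
# falls during a session. Not the actual MCP server code.
class DriftMonitor:
    def __init__(self, tolerance: float = 1.0):
        self.baseline: float | None = None
        self.tolerance = tolerance  # allowed drop, in percentage points

    def record_coverage(self, percent: float) -> str | None:
        if self.baseline is None:
            self.baseline = percent   # first reading sets the baseline
            return None
        if percent < self.baseline - self.tolerance:
            return (f"drift warning: coverage fell from "
                    f"{self.baseline:.1f}% to {percent:.1f}%")
        self.baseline = max(self.baseline, percent)  # ratchet upward only
        return None

monitor = DriftMonitor()
monitor.record_coverage(92.4)          # establishes the baseline
print(monitor.record_coverage(88.0))   # -> drift warning
```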

flagops — Feature Flag Management

Feature flags are used in most production services, but existing tools are either SaaS-centric or too heavy. flagops is a lightweight, local file-based feature flag library. It's covered by 373 tests and written in TypeScript for full type safety.
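flagops itself is TypeScript, but the local-file idea translates to any language. A minimal sketch of the concept in Python, illustrative only, not flagops's API:

```python
# Illustration of the local file-based flag approach (not flagops's API):
# flags live in a JSON file next to your code, no SaaS involved.
import json
from pathlib import Path

FLAGS_FILE = Path("flags.json")  # e.g. {"new-dashboard": true, "beta-api": false}

def is_enabled(flag: str, default: bool = False) -> bool:
    if not FLAGS_FILE.exists():
        return default
    flags = json.loads(FLAGS_FILE.read_text())
    return bool(flags.get(flag, default))

if is_enabled("new-dashboard"):
    print("rendering the new dashboard")
```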

pipechain — Data Pipeline Builder

A Python library for defining and running data processing pipelines in code. You chain stages together and declaratively configure error handling and retry logic. Verified with 613 tests.
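The real API looks different, but the core pattern is simple enough to sketch. Illustrative only, not pipechain's actual interface:

```python
# Not pipechain's actual API -- a sketch of the idea: stages chained in
# code, with declarative per-stage retry configuration.
import time

class Pipeline:
    def __init__(self):
        self.stages = []

    def stage(self, fn, retries: int = 0, delay: float = 0.5):
        self.stages.append((fn, retries, delay))
        return self  # enables chaining: Pipeline().stage(a).stage(b)

    def run(self, data):
        for fn, retries, delay in self.stages:
            for attempt in range(retries + 1):
                try:
                    data = fn(data)
                    break
                except Exception:
                    if attempt == retries:
                        raise  # retries exhausted, fail the pipeline
                    time.sleep(delay)
        return data

result = (Pipeline()
          .stage(lambda rows: [r.strip() for r in rows])
          .stage(lambda rows: [r for r in rows if r], retries=2)
          .run([" a ", "", "b "]))
print(result)  # ['a', 'b']
```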

Portfolio by the Numbers

| Metric | Value |
|--------|-------|
| Completed projects | 35 |
| Total tests | 10,777 |
| Languages used | TypeScript (12), Python (12), Go (8), others |
| Agent Company rounds | 28 |
| Avg. tests per project | 308 |

What I've Learned

After a month of running this pipeline, here are the key insights.

AI evaluating its own code is too lenient

Early on, the same AI that built the project also evaluated it. The scores were always high. "Looks great! 8/10!" But when I actually ran the code, there were bugs. Once I separated builder and evaluator, and explicitly gave the evaluator the role of "find every possible failure," quality went up dramatically.

This is the same principle behind why code reviews shouldn't be done by the author — it applies to AI too.

Deep builds beat micro rounds

At first, I used 20 short rounds of incremental improvement. "If I keep making small fixes, it'll get better." But in practice, one deep build produced higher quality. A deep build maintains full context and produces a coherent design. Short rounds cause context to fragment, resulting in patchwork code.

Now I do one deep build in Phase 1, followed by adversarial evaluation in Phase 2.

Data-driven idea selection works

Asking "What should I build today?" every day gets exhausting fast. With git-trend-sync feeding data into Agent Company, all I have to do is make the final pick. And because it's data-driven, I end up building tools in areas where there's actual demand — much more effective than starting with "I think this could be cool."

Tests are especially important when working with AI

It's more realistic to validate AI-generated code through tests than to read every line. That's why every project includes at least 100 tests. With tests in place, I can refactor, add features, and fix bugs with confidence. The 10,777 tests aren't just a number — they're the quality guarantee of this pipeline.

What's Next

This pipeline is still evolving. Every day of building also improves the pipeline itself. Recent additions include:

  • drift-guard: Real-time quality drift detection during sessions
  • Adversarial Debate (v5): Added Red Team attack + defense + pre-mortem stages
  • Language diversity: Expanded from TypeScript-heavy to Python and Go

In the next post, I'll dive deeper into how Agent Company works internally — how five AI agents debate, what questions the Red Team asks, and how ideas get selected. Think of it as "how to pick ideas by making AIs argue with each other."


All the projects mentioned here are open source on GitHub.

New projects go up every day. If you're interested, follow me on GitHub or come back to this blog.