Most cold email systems peak in week one. The opener that hit a 4 percent reply rate stops working by week eight. Same audience, same offer, same sender accounts. Different week, different result.
This is not a tooling problem. It is a feedback loop problem. Modern cold email runs without one.
This essay is about a different way to think about outbound: as a system that gets smarter every week, the same way modern AI training pipelines do. The technology to run it has been quietly coming together for two years. The piece that was missing landed last week. We call it self-improving cold email, and it is a meaningful step toward the future of outbound.
TL;DR
- Cold email decays. The pattern that worked at launch reliably stops working within weeks.
- Iteration requires testing, statistics, and policy updates. Almost nobody does it.
- Self-improving cold email closes the loop. An AI agent picks a variable to test, runs an A/B, measures the right metric, and proposes a diff to the policy file driving the system.
- The metric is interested-reply rate, not opens. Open-pixel tracking is dead.
- The output is git-versioned. Every change has a measured experiment behind it and human approval until trust is earned.
- The result is a cold email system that compounds instead of decays.
Why cold email decays
Reply rates degrade. Anyone who has run outbound at scale for more than a quarter knows this. The numbers move predictably:
| Week | Typical reply rate | Why |
|---|---|---|
| 1-2 | 4-6% | Fresh pattern, novel angle |
| 3-4 | 2-3% | Niche starts to recognize the template |
| 5-8 | 1-2% | Saturation, inbox-filter adaptation |
| 9+ | under 1% | Pattern is dead |
There are deeper reasons too. Gmail and Outlook tighten filters monthly. Niches saturate as competitors copy patterns. The recipients themselves develop immunity to specific structures. None of this is fixable with better deliverability tooling. The message that worked needs to change, not the infrastructure.
The textbook fix is "iterate." In practice, almost nobody does. Real iteration requires:
- Knowing what variable to test next
- Designing a clean A/B without confounds
- Waiting for the right sample size before reading results
- Running an actual statistical test, not eyeballing percentages
- Updating the system of record (the policy file or template store)
- Doing all of this every two weeks, indefinitely
That list reads like a job description, not a workflow. So most teams do not iterate. They deploy a system in week one, watch it slowly decay, and replace it in month four. Most "AI cold email" tools today optimize the wrong layer of the problem. They speed up draft generation. They do not improve the playbook.
What changed: the autoresearch loop
The shift comes from a framing Andrej Karpathy popularized for AI development: autoresearch as a four-step loop.
Hypothesis. Pick a specific claim worth testing. Experiment. Design a controlled trial that isolates one variable. Measurement. Read a real metric with statistical rigor. Update. Adjust the policy based on what you learned. Then repeat.
This is how modern AI labs improve their own training pipelines. It is also exactly what cold email needs and almost never gets. The claim of this essay is that the same loop, applied to cold outreach, is the right shape for the next generation of these systems.
The loop has four properties that matter:
It is structured. No vibes. The agent does not "improve the email." It tests one variable, with one hypothesis, against one metric, with a pre-declared decision rule.
It is incremental. Earlier variables must reach a stable winner before later ones unlock. No one optimizes the call to action on top of a still-noisy subject baseline.
It is auditable. Every policy change is a diff. Every diff has a measured experiment behind it. You can git bisect your outbound history if something breaks.
It is honest. When the data is inconclusive, the agent says so and extends the sample or stops. It does not pick a winner because the user wants one.
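To make "structured" concrete, here is a minimal sketch of how one experiment might be declared before any sends go out. Every name and number below is illustrative, not part of any spec:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str            # one specific, falsifiable claim
    variable: str              # exactly one variable under test
    variants: tuple[str, str]  # the A and B arms
    metric: str                # the single metric that decides the outcome
    min_sends_per_arm: int     # pre-declared sample size
    alpha: float               # pre-declared significance threshold

# A hypothetical Tier 1 experiment, written down before any sends go out.
exp = Experiment(
    hypothesis="A question-style subject beats a statement subject on interested-reply rate",
    variable="subject_line_pattern",
    variants=("question_subject", "statement_subject"),
    metric="interested_reply_rate",
    min_sends_per_arm=300,
    alpha=0.05,
)
```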
What gets tested, in what order
Self-improving cold email needs a strict test ladder. If the agent gets to invent test dimensions, it produces noise. The right ladder for outbound, in order of signal strength and risk:
| Tier | Variable | Why this position |
|---|---|---|
| 1 | Subject line pattern | Fastest signal, lowest risk; drives opens, which gate everything downstream |
| 2 | Opener template (first two lines) | Drives reply rate once subject works |
| 3 | CTA framing (the ask) | Drives qualified replies once people are reading |
| 4 | Cadence (days between bumps) | Slow signal, requires multi-week observation |
| 5 | Voice tone | Highest risk, requires manual unlock, affects brand |
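One way to picture the gating is as data plus a single lookup: a tier only becomes testable once every earlier tier has a locked winner, and the last tier waits for a human. This is an illustrative sketch, not taken from any particular implementation:

```python
# The ladder as data, with gating. Tier 5 also needs an explicit manual unlock.
TEST_LADDER = [
    {"tier": 1, "variable": "subject_line_pattern"},
    {"tier": 2, "variable": "opener_template"},
    {"tier": 3, "variable": "cta_framing"},
    {"tier": 4, "variable": "cadence_days_between_bumps"},
    {"tier": 5, "variable": "voice_tone", "manual_unlock": True},
]

def next_testable_variable(locked_winners: dict, manual_unlocks: set):
    """Return the lowest-tier variable without a locked winner, or None."""
    for step in TEST_LADDER:
        var = step["variable"]
        if var in locked_winners:
            continue  # stable winner already locked for this tier
        if step.get("manual_unlock") and var not in manual_unlocks:
            return None  # gated behind an explicit human decision
        return var
    return None  # ladder complete

# Subject line is locked, so the opener template is the next thing to test.
print(next_testable_variable({"subject_line_pattern": "question_subject"}, set()))
```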
Most teams will see the entire return on investment from Tier 1 alone. A 2 to 3 percentage point lift in subject-line interested-reply rate compounds into a double-digit booked-call lift over a quarter, because every subsequent step in the funnel benefits from a better top of funnel.
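To see why a top-of-ladder lift compounds, here is some back-of-the-envelope arithmetic. Every number is invented purely for illustration:

```python
# Back-of-the-envelope only: every number below is invented.
sends_per_quarter = 3000
booked_per_interested = 0.4   # assumed share of interested replies that book a call

baseline_calls = sends_per_quarter * 0.020 * booked_per_interested  # 2.0% IRR -> 24 calls
improved_calls = sends_per_quarter * 0.045 * booked_per_interested  # 4.5% IRR -> 54 calls
print(improved_calls - baseline_calls)  # 30 extra booked calls over the quarter
```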
Why interested-reply rate, not opens
Open rate is the wrong metric in 2026. Open-pixel tracking - the technique that powers every "X percent open rate" report you have ever seen - is now actively flagged as a spam signal by Gmail and Outlook. Tools that load tracking pixels in cold email reduce their own deliverability. Tools that optimize for open rate are optimizing for a number increasingly disconnected from inbox placement.
Interested-reply rate is the right metric. It works like this:
- Every inbound reply gets categorized by an AI step into one of: interested, not interested, out of office, bounce, uncategorized.
- Interested replies divided by total sends per variant is the metric.
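As a concrete illustration of the computation, with made-up variant names and counts:

```python
from collections import Counter

# Made-up per-variant data: total sends plus AI-categorized inbound replies.
sends = {"variant_a": 400, "variant_b": 400}
replies = {
    "variant_a": ["interested", "not_interested", "out_of_office", "interested"],
    "variant_b": ["interested", "bounce", "not_interested"],
}

def interested_reply_rate(variant: str) -> float:
    """Interested replies divided by total sends for the variant."""
    return Counter(replies[variant])["interested"] / sends[variant]

for v in sends:
    print(v, f"{interested_reply_rate(v):.2%}")  # variant_a 0.50%, variant_b 0.25%
```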
Two reasons it is better:
It measures intent, not curiosity. Opens tell you the subject line was interesting enough to get the message opened. Interested replies tell you the entire message worked.
It is harder to game. Open rates can be inflated by preview panes and security scanners that load tracking pixels without a human ever reading the message. Interested replies are conversations a human chose to start.
This shift, from open rate to interested-reply rate, is one of the most important and underappreciated changes in cold email this year. The tools that will dominate the next phase are the ones built on the new metric.
The trust ladder: why you stay in the loop
Letting an AI agent rewrite your sender voice automatically is a bad idea. Letting an agent propose diffs you can approve or reject is fine. The right design is a trust ladder.
When the agent declares a winner, it does not edit the policy file. It writes a proposed diff. You read the diff. You apply it (git apply) or delete it. A counter tracks consecutive approvals. After three approved diffs in a row, auto-commit unlocks. Reject one and the streak resets to zero.
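The mechanics are simple enough to sketch in a few lines. The threshold of three consecutive approvals comes from the description above; everything else is an illustrative state machine, not the reference implementation:

```python
AUTO_COMMIT_THRESHOLD = 3

class TrustLadder:
    def __init__(self):
        self.consecutive_approvals = 0
        self.auto_commit_unlocked = False

    def record_review(self, approved: bool) -> None:
        if approved:
            self.consecutive_approvals += 1
            if self.consecutive_approvals >= AUTO_COMMIT_THRESHOLD:
                self.auto_commit_unlocked = True
        else:
            # A rejection resets the streak; whether it also re-locks
            # auto-commit is a policy choice (shown here as yes).
            self.consecutive_approvals = 0
            self.auto_commit_unlocked = False

ladder = TrustLadder()
for decision in (True, True, True):
    ladder.record_review(decision)
print(ladder.auto_commit_unlocked)  # True after three approved diffs in a row
```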
This pattern matters for two reasons.
First, you stay in control while the agent learns your domain. The first three weeks are training wheels: you read every diff and decide. After three good calls in a row, you have evidence the agent's judgment matches yours, and you can let it run.
Second, the audit trail is just git log. Every policy change has a human approval and a measured experiment behind it. When something breaks two months later, you do not have to interrogate a black box. You can read the commit that changed the opener and the experiment that justified the change.
This is how AI agents should be deployed in any production system: earned autonomy, not assumed autonomy.
The policy file as the artifact
Most cold email tools store config in a vendor's database, behind a UI, where you cannot diff or version it. This is the central reason iteration does not happen. There is no artifact to iterate on.
Self-improving cold email requires a policy file. A markdown file in your repo, version controlled, that defines:
- Identity: who is sending
- Audience: the ICP plus disqualifiers
- Value: the offer in one sentence
- Voice: the dos and don'ts that govern style
- Proof: the facts the agent may cite
- Sequence: opener, bump, breakup structure
- Objections: the preferred replies to common pushbacks
- Banned: phrases that never appear
This is portable. The same file can drive a different agent or a different sender platform if you migrate. The artifact is the system. Tools come and go.
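A minimal, hypothetical skeleton of such a file might look like the following. The section names match the list above; the content is placeholder, and the authoritative section format is whatever the cold.md spec described next prescribes, not this sketch:

```markdown
# cold.md — outbound policy

## Identity
- Jane Doe, founder, sends from jane@example.com

## Audience
- ICP: heads of RevOps at 20-200 person B2B SaaS companies
- Disqualify: agencies, companies with no sales team

## Value
- One sentence: we cut manual CRM cleanup to zero.

## Voice
- Do: short sentences, plain words. Don't: exclamation marks, buzzwords.

## Proof
- Facts the agent may cite: 40+ customers, SOC 2, one named case study.

## Sequence
- Opener, bump at day 3, breakup at day 8.

## Objections
- "We already have a tool" -> ask what it misses, do not argue.

## Banned
- "quick question", "circling back", "hope this finds you well"
```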
The cold.md spec, an open standard, is the reference for this file. It is CC-BY licensed and lives at cold.md. The same way Markdown beat proprietary docs and OpenAPI beat proprietary API descriptions, a portable cold.md file is the right shape for the policy your outbound runs on.
What this looks like in practice
Imagine running outbound for ninety days under this model.
Week 1. Initial setup. The agent helps build the ICP via web research (validating the target niche, scanning competitor companies, checking title prevalence). It refines the value proposition by searching competitor pricing, recent funding, and pain language on Reddit and review sites. Both feed into the policy file. You ship the first campaign.
Week 2. First experiment is designed: subject line pattern A versus B. Variants are generated, each lead receives one. Sends go out across both arms.
Week 3. The agent reads the experiment. The bounce-rate guard is checked first. If it is breached, the variant pauses immediately and you get an alert. Otherwise, a two-proportion z-test runs on the interested-reply rate. The agent declares a winner, calls the result inconclusive, or extends the sample.
If a winner: a diff to the policy file is proposed. You review and apply.
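Roughly, that Week 3 evaluation step looks like this. The thresholds and counts are invented for illustration, and the reference implementation may differ in detail:

```python
from math import sqrt
from statistics import NormalDist

BOUNCE_GUARD = 0.05   # pause a variant if bounces exceed 5% of sends (illustrative)
ALPHA = 0.05          # pre-declared significance threshold

def evaluate(sends_a, interested_a, bounces_a, sends_b, interested_b, bounces_b):
    # Guard first: a breached bounce rate stops the experiment before any stats.
    if bounces_a / sends_a > BOUNCE_GUARD or bounces_b / sends_b > BOUNCE_GUARD:
        return "pause_and_alert"
    # Two-proportion z-test on interested-reply rate.
    p_a, p_b = interested_a / sends_a, interested_b / sends_b
    pooled = (interested_a + interested_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return "extend"
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    if p_value < ALPHA:
        return "winner_a" if p_a > p_b else "winner_b"
    return "inconclusive_or_extend"

print(evaluate(400, 18, 6, 400, 7, 5))  # "winner_a" on these made-up counts
```

Any stats library can run the same test; the part that matters is that the guard and the decision rule are declared before the results are read.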
Week 4. Tier 1 has a stable winner. The agent advances to Tier 2 and designs an opener experiment. The new policy file (with the locked subject pattern) drives variant generation.
Week 8. Three diffs have been approved. Auto-commit unlocks. The system is now improving itself daily without your direct review, but every change still has a measured experiment behind it and a git log entry. You are no longer the bottleneck.
Week 12. The system is sending mail that looks almost nothing like what went out in week one. The subject pattern is different. The opener is different. The cadence has been tuned. Reply rates are higher than they have ever been, not lower. The system has compounded instead of decayed.
That is the entire promise of this approach: outbound that gets better while you sleep, with a paper trail you can defend.
Why this is the future
A few claims worth taking seriously.
Cold email is becoming an inference problem, not a templating problem. The quality of the message matters less than the quality of the loop that improves the message. Tools that ship better templates are competing on the wrong axis.
The next generation of outbound tools will look like training pipelines. Agents that pick what to test, run controlled experiments, measure rigorously, and update policy. The skill is no longer "writing a great cold email." It is "designing a system that learns to write a great cold email for your specific domain."
The policy artifact is the moat, not the tool. Tools change every two years. Your policy file is the institutional knowledge of your outbound function. If it cannot be diffed, versioned, and migrated, you have no moat at all - you have rented intelligence on a platform that will eventually deprecate it.
Earned autonomy beats assumed autonomy. The trust ladder model (propose, approve, accumulate, unlock) generalizes far beyond cold email. It is the right shape for any production AI deployment where mistakes have consequences. Expect to see this pattern in customer support, content moderation, code review, sales coaching, and beyond.
The metric shift is the deliverability story of the decade. Inbox providers have made open tracking actively harmful. Tools built on the new metric (interested-reply rate, intent in the reply) will outperform tools built on the old one for years.
How to think about adoption
If you are running outbound today, three questions to ask:
Do you have a policy file? If your config lives in a vendor's database, you have no artifact to improve. The first move is to write down what you do, in a portable format, in your own repo.
Do you measure the right metric? If your dashboard reports open rate as the primary signal, you are flying on a broken instrument. Switch to reply rate first, interested-reply rate second.
Do you have a feedback loop? If you have not changed your subject line pattern in eight weeks because nobody got around to it, you have no loop. Decide who owns iteration, or automate it.
For most teams, the answers are no, no, and no. That is the gap self-improving cold email closes.
Where the building blocks come from
You can assemble this stack today.
The policy spec is cold.md, open and free. The reference agent implementation is the cold.md Claude Code plugin, MIT licensed. The reference sender substrate is FoxReach, which exposes per-variant reply categorization via its public API. None of these require enterprise contracts. The whole pipeline runs locally with a single API key.
If you want a custom version - tied to your own ICP, custom enrichment providers, integrated with your existing CRM or sender stack - that is exactly the kind of system we build at Buildberg as part of our AI automation, GoHighLevel automation, and analytics practice. Most of our clients land on a hybrid: a custom policy file, our agent loop, their existing infrastructure. Let's talk if that sounds right for your team.
What this is not
A few things to be honest about.
This is not a replacement for an SDR. Closing complex deals, multi-threading enterprise accounts, navigating procurement - those still require humans. Self-improving cold email is a top-of-funnel system. It books meetings. It does not run them.
This is not magic. Bad ICPs produce bad results no matter how good the loop. The agent can refine your value proposition, but it cannot invent a market for a product nobody wants. The discipline of getting the inputs right still matters.
This is not finished. The reference implementation ships with a frequentist statistical test (z-test on proportions). Bayesian methods are coming. The variable test ladder is fixed today. Multi-armed bandit modes will arrive. Per-lead deep research, the kind a strong human SDR does manually, is on the roadmap.
But the shape is settled. Outbound that improves itself, measured against intent in the reply, with a policy file you can diff and a trust ladder you control, is the right structure for the next phase of this work.
The simplest thing you can do this week
Write a cold.md file for your outbound. Eight sections, mostly bullet points; it fits in 200 lines. Commit it to a repo.
Even if you never run an experiment on it, you have just done two valuable things. You have made your outbound knowledge portable. And you have built the artifact that the next decade of agents will ride on top of.
The future of cold outreach is not a smarter tool. It is a smarter loop, built around an artifact you own.
Related reading
- Voice AI agents: the complete guide - the other major AI-agent category we deploy heavily
- Top 15 AI tools to quit your 9-5 - cold.md is on the list this year
- cold.md autoresearch implementation deep dive - the engineering details behind this loop, on the FoxReach blog
- cold.md spec homepage - the open standard for portable cold email policy files



