A Research Agent That Punches Through Anti-Bot Walls

Every blog post I write starts with research, and most research starts with a wall. I ask an agent to go find out what people are saying about some tool, and it reaches for the obvious primitive: fetch the page. Then Cloudflare returns a 403. Reddit hands back a login interstitial. A news site serves a “verifying you are human” challenge. The agent either gives up, or - much worse - cheerfully summarizes the bot-detection page as if it were the article.

A plain HTTP fetch is the right tool for maybe half of the useful web. The other half is gated. So I built a research agent, scout, around the gating instead of pretending it isn’t there. It is one of a few ways I wire research into Claude - the one I reach for when the open web fights back.

The Ladder, Not the Hammer

The core design is an escalation ladder. Each rung is more capable and more expensive than the last, and you only climb when the rung below actually fails.

1. Plain fetch. A normal HTTP GET with a sane user agent and retries. Free, instant, and still works for a surprising amount of the web. Always try this first.
2. Residential unlocker. When a request is genuinely blocked, route it through a paid unlocker service (scrape.do in my case) that proxies through residential IPs and solves the bot challenge. This costs real money per request, so it is a fallback, not a default.
3. Headless browser. For pages that are interactive or render their content with JavaScript, drive an actual browser. Slowest and heaviest, used last.

The discipline is in the word “only.” You do not pay for residential proxies when a free GET would have worked. You do not spin up a browser when an unlocker already returned the HTML. Most research traffic never leaves rung one.
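The climb itself can be sketched in a few lines. This is a minimal illustration, not scout's actual code: the names `climb`, `FetchResult`, and `usable` are hypothetical, and each rung is assumed to be a function that takes a URL and returns body text or None.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical rung signature: URL in, body text out, None on failure.
Rung = Callable[[str], Optional[str]]


@dataclass
class FetchResult:
    url: str
    method: str  # name of the rung that produced the body
    body: str


def climb(
    url: str,
    rungs: list[tuple[str, Rung]],
    usable: Callable[[str], bool],
) -> Optional[FetchResult]:
    """Try each rung in order, cheapest first, and stop at the first usable body."""
    for name, fetch in rungs:
        try:
            body = fetch(url)
        except Exception:
            # A sketch-level catch-all: any error counts as a failed rung.
            body = None
        if body is not None and usable(body):
            return FetchResult(url=url, method=name, body=body)
    # Every rung failed or returned something that did not pass the check.
    return None
```

Ordering the list as plain fetch, unlocker, browser means the expensive rungs are only ever called after a cheaper one has visibly failed.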

Cheapest-first is the whole trick

The naive version of this picks the most powerful method and uses it for everything - browser-drives every page, or unlocks every URL. It works and it is wildly wasteful. The win is starting at the bottom and climbing only on a confirmed failure. Capability is a cost, not a default.

Detecting a Block Is the Hard Part

The ladder sounds simple until you try to define “failure.” A 403 is easy. The trap is everything that looks like success and isn’t.

You have probably hit a version of this without an agent anywhere near it: you fetch a page, get a clean 200, and hand your code what looks like the article - and it cheerfully processes an empty shell, because it checked the status line and not the contents.

Reddit is the canonical example. Request a normal Reddit URL and you get a clean HTTP 200 - and a body with essentially no content, because the real page is a JavaScript interstitial. If your fetcher only checks the status code, it declares victory and hands the agent an empty shell. The fix was a site-specific quirk: hit Reddit’s .json endpoints instead, which the unlocker can actually get through.
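The rewrite to the JSON form is mechanical. A small sketch, assuming the usual Reddit convention of appending .json to a thread's path; the function name `reddit_json_url` is mine, for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def reddit_json_url(url: str) -> str:
    """Return the .json variant of a Reddit URL, keeping any query string."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

The rewritten URL still has to go through whichever rung can actually reach Reddit; this only changes what is being asked for.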

The general problem is worse. An unlocker can leak a Cloudflare challenge page back to you with a 200 status. So block detection cannot trust the status code alone - it has to look at the body: is it suspiciously short, does it contain the markers of a challenge page, is it the content-free version of a site you expected to be rich? Getting this wrong in either direction is costly. False negatives feed the agent garbage; false positives escalate to a paid unlocker when the free fetch already succeeded.
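A body-aware check might look something like the sketch below. The marker strings and the 500-character threshold are illustrative assumptions, not scout's real values; in practice both need tuning per site, which is exactly where the false positives and false negatives come from.

```python
# Illustrative markers only; real challenge pages vary and change over time.
CHALLENGE_MARKERS = (
    "just a moment",
    "verifying you are human",
    "enable javascript and cookies",
)


def looks_blocked(status: int, body: str, min_chars: int = 500) -> bool:
    """Return True if a response should be treated as a block, even on a 200."""
    if status >= 400:
        return True  # the easy case: the server said no
    if len(body.strip()) < min_chars:
        return True  # suspiciously short for a page expected to be rich
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

The status code is only the first of three questions here, and it is the only one that can be answered without reading the body.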

A 200 is not a success. It’s a claim. The hard part of web automation isn’t getting through the wall - it’s knowing whether you actually did.

Some Walls You Don’t Climb

The most important thing the ladder encodes is where it stops. An unlocker defeats bot detection. It does not defeat authentication. A Reddit thread in a private community that you request while logged out, or a full conversation on X that only renders for signed-in users, is not a harder version of the Cloudflare problem - it is a different problem, and no amount of residential proxy spend will solve it.

So scout flags those as auth walls and stops, rather than retrying with progressively more expensive methods against a door that needs a key, not a crowbar. The escalation ladder has a top, and recognizing the top is what keeps a research run from quietly spending a fortune to fail.
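One way to express the distinction is a three-way verdict in which only one outcome permits escalation. This is a sketch under assumptions: the `Verdict` names and the login-path hints are hypothetical, and a real classifier would need site-specific signals beyond a 401 or a redirect to a login page.

```python
from enum import Enum
from urllib.parse import urlsplit


class Verdict(Enum):
    OK = "ok"
    BOT_WALL = "bot_wall"    # a stronger rung might get through
    AUTH_WALL = "auth_wall"  # needs a login; no rung will help


# Hypothetical hints that the final URL is a sign-in page.
LOGIN_PATH_HINTS = ("/login", "/signin", "/sign-in")


def classify(status: int, final_url: str, body_blocked: bool) -> Verdict:
    """Classify a response; auth is checked first so it is never escalated."""
    path = urlsplit(final_url).path.lower()
    if status == 401 or any(hint in path for hint in LOGIN_PATH_HINTS):
        return Verdict.AUTH_WALL
    if status >= 400 or body_blocked:
        return Verdict.BOT_WALL
    return Verdict.OK


def should_escalate(verdict: Verdict) -> bool:
    """Only a bot wall justifies paying for the next rung."""
    return verdict is Verdict.BOT_WALL
```

Checking for the auth wall before the bot wall matters: a login redirect can also look like a block, and the order decides whether money gets spent on it.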

This is also why cost is a first-class control, not an afterthought. Every run shares a credit budget. The agent can fan out across dozens of sources, but it cannot quietly rack up an unbounded unlocker bill, because the budget is enforced across the whole run, not per request.
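A run-wide budget can be as simple as one shared counter that every paid request must charge before it goes out. The sketch below is an assumption about shape, not scout's implementation: `RunBudget` and `BudgetExceeded` are hypothetical names, credits are whole numbers to avoid float drift, and the lock is there because fan-out implies concurrent callers.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a charge would push the run past its credit limit."""


class RunBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.spent = 0
        self._lock = threading.Lock()

    def charge(self, cost: int) -> None:
        """Reserve credits for one request, or refuse if the run cannot afford it."""
        with self._lock:
            if self.spent + cost > self.limit:
                raise BudgetExceeded(
                    f"charge of {cost} would exceed limit {self.limit} (spent {self.spent})"
                )
            self.spent += cost
```

Because the check and the increment happen under the same lock, many parallel fetches cannot each decide independently that there is room left.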

Show Your Work

The last design rule is the one that makes the output trustworthy: cite everything. A research agent that returns claims without sources is just a confident guess with footnote-shaped holes. Every fact scout returns carries the URL it came from and how it was fetched. That lets me - or you - check the dubious ones rather than taking the summary on faith.
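In data terms, that means a fact is never a bare string. A minimal sketch of such a record, with hypothetical names (`CitedFact`, `render`) and an assumed set of fetch-method labels:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CitedFact:
    claim: str
    source_url: str
    fetched_via: str  # assumed labels, e.g. "plain", "unlocker", "browser", "api"


def render(facts: Iterable[CitedFact]) -> str:
    """Format facts as a plain list with the source and fetch method attached."""
    return "\n".join(
        f"- {fact.claim} [{fact.source_url}, via {fact.fetched_via}]" for fact in facts
    )
```

Making the source fields required on the record means an uncited claim cannot be constructed in the first place.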

A live example: the research behind my Headroom post ran entirely through this. It got the GitHub stats and reviews via plain fetch, broke through to Reddit’s JSON for the discussion numbers via the unlocker for a fraction of a cent, correctly flagged the logged-out X threads as auth walls instead of wasting credits on them, and pulled Hacker News sentiment from the Algolia API rather than scraping. Eleven cited sources, total fetch cost under a tenth of a cent.
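The Hacker News step is the cheapest kind of rung: a public API that needs no unlocking at all. As a sketch, a search URL for the Algolia-backed HN Search API can be built like this; the helper name and its defaults are my own choices for illustration, not what scout uses.

```python
from urllib.parse import urlencode

HN_SEARCH_ENDPOINT = "https://hn.algolia.com/api/v1/search"


def hn_search_url(query: str, tags: str = "comment", hits_per_page: int = 20) -> str:
    """Build a search URL for the public HN Search API."""
    params = {"query": query, "tags": tags, "hitsPerPage": hits_per_page}
    return f"{HN_SEARCH_ENDPOINT}?{urlencode(params)}"
```

The resulting URL can be handed to the plain-fetch rung, and the response comes back as JSON rather than as HTML to be scraped.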

What This Doesn’t Fix

Worth being honest about the limits.

Unlockers are an arms race, not a guarantee. Residential proxies get through most walls most of the time. A determined site with aggressive fingerprinting will still win sometimes, and the failure can be intermittent.
It cannot read anything that needs a login. By design. Auth walls are a hard stop, and that rules out a lot of genuinely useful private content.
Cited does not mean correct. The agent shows where a claim came from; it does not check that the source is right. A citation makes a wrong answer easy to catch, not impossible to produce.

None of that undoes the core point. The web you actually want to research is half-gated, and the answer is not a bigger hammer. It is a ladder that climbs only when it must, knows which walls have no top, and shows you exactly where every answer came from.
