Unblockable web access for your AI agent
An AI agent without internet is useless for anything time-sensitive. It can’t look things up, can’t verify claims, and confidently tells you outdated nonsense. I needed my agents to actually browse the web.
The problem: the web fights back. Cloudflare, Akamai, PerimeterX, Datadome — most major sites now block headless browsers, datacenter IPs, and anything that smells automated. Your agent’s web_extract call returns a 403 and now it’s guessing.
I run five Hermes Agent instances on my homelab. Over the past few months I’ve built a web access stack that handles every bot detector I’ve thrown at it.
The stack
Three tools, one proxy chain:
- web_search goes through Firecrawl to SearXNG (self-hosted, multi-engine)
- web_extract goes through Firecrawl
- browser_* goes through Camofox (anti-detection Firefox fork)
SearXNG isn’t a separate tool Hermes calls directly. Firecrawl owns both search and extraction, and SearXNG is its search backend.
All three paths share the same proxy chain:
- Privoxy (:8118) bridges HTTP to SOCKS5
- SSH SOCKS5 tunnel (:1080) connects to a residential network
- Raspberry Pi on a home connection, exit IP x.x.x.x
Every request leaves through a residential IP. Bot detectors see a real user on a home network.
1. SearXNG — Firecrawl’s search backend
SearXNG is a meta-search engine that queries Google, Bing, DuckDuckGo, Brave, and others without tracking you. In my setup it’s not standalone — it’s the engine behind Firecrawl’s search API. When Hermes calls web_search, Firecrawl routes the query to SearXNG.
Why not DuckDuckGo?
Hermes ships with DuckDuckGo as the default search backend. It works, but:
- DDG rate-limits hard. A busy agent hits limits within hours.
- Results are often thinner than Google or Bing.
- You can’t pick which engines to query.
- No control over result format.
Firecrawl swaps DDG for SearXNG. You get multi-engine aggregation, no single-provider rate limit to run into, and full control over engines and result format. The agent doesn’t know the difference.
Deployment
SearXNG runs inside the Firecrawl Docker Compose stack (section 2 below), not as its own service. The default Firecrawl compose doesn’t include SearXNG — you add these services to the same docker-compose.yaml:
services:
  searxng-core:
    image: searxng/searxng:latest
    restart: unless-stopped
    volumes:
      - ./settings.yml:/etc/searxng/settings.yml
      - ./limiter.toml:/etc/searxng/limiter.toml
    networks:
      - backend
  searxng:
    image: nginx:alpine
    restart: unless-stopped
    ports:
      - "8888:8080"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    depends_on:
      - searxng-core
    networks:
      - backend
The nginx proxy in front is critical. More on that in the pitfalls.
Key configuration (settings.yml)
search:
  formats:
    - html
    - json # <- without this, the API returns 403
server:
  secret_key: "your-random-secret-here"
  limiter: true
outgoing:
  proxies:
    all://:
      - socks5h://127.0.0.1:1080 # route through residential proxy
Pitfalls I hit
JSON format returns 403. SearXNG defaults to HTML-only. If you try the JSON API (/search?q=test&format=json) without adding json to search.formats, you get a 403 Forbidden.
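A quick way to confirm that json is enabled is to hit the API from a script. The sketch below assumes SearXNG is reachable on http://localhost:8888 (the nginx port published in the compose file above) and that the response carries a results list; the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the nginx front end from the compose file is published on :8888.
SEARXNG_URL = "http://localhost:8888"


def build_search_url(query: str, base_url: str = SEARXNG_URL) -> str:
    """Build the JSON API URL for a query (format=json is the part that can 403)."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"


def search(query: str, base_url: str = SEARXNG_URL) -> list:
    """Run the query and return the result entries.

    urllib raises urllib.error.HTTPError with code 403 when json is missing
    from search.formats in settings.yml.
    """
    with urllib.request.urlopen(build_search_url(query, base_url), timeout=15) as resp:
        payload = json.load(resp)
    return payload.get("results", [])


if __name__ == "__main__":
    for hit in search("proxmox zfs encryption")[:5]:
        print(hit.get("title"), "->", hit.get("url"))
```

If this prints titles and URLs, the JSON format is live; an HTTPError 403 points back at search.formats.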
Bot detection blocks API clients. Even with limiter: false, SearXNG blocks requests from curl, wget, and python-requests based on User-Agent strings. Fix: put nginx in front and rewrite the User-Agent:
server {
    listen 8080;
    location / {
        proxy_pass http://searxng-core:8080;
        proxy_set_header User-Agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36";
    }
}
Docker networks need trust. If you enable the rate limiter, SearXNG needs to know that requests from Docker bridge networks (172.16.0.0/12) aren’t spoofed. Add to limiter.toml:
[botdetection.ip_limit]
trusted_proxies = ["172.16.0.0/12", "192.168.0.0/16"]
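To see why those two ranges cover typical Docker setups, here is a small illustration using Python’s ipaddress module. It is not how SearXNG evaluates trusted_proxies internally; it only shows which source addresses fall inside the listed CIDR blocks. The helper name is hypothetical.

```python
import ipaddress

# The same CIDR blocks as in limiter.toml above.
TRUSTED_PROXIES = [
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 through 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 through 192.168.255.255
]


def is_trusted(addr: str) -> bool:
    """Return True if addr falls inside one of the trusted ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in TRUSTED_PROXIES)


if __name__ == "__main__":
    for addr in ("172.17.0.1", "172.18.0.3", "192.168.1.10", "172.32.0.1", "8.8.8.8"):
        print(addr, is_trusted(addr))
```

The default bridge gateway (172.17.0.1) and the subnets Compose usually allocates sit inside 172.16.0.0/12, while an address like 172.32.0.1 is already outside it.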
SearXNG doesn’t need its own Hermes config — Firecrawl handles it. The Hermes integration is in the Firecrawl section below.
2. Firecrawl — self-hosted web extraction
Firecrawl converts web pages to clean markdown. It handles JavaScript rendering, anti-bot bypassing, and content extraction. The cloud version costs money. Self-hosting it is free and gives you unlimited scraping.
Architecture
Five Docker containers:
| Container | Role | Port |
|---|---|---|
| firecrawl-api-1 | Main API + workers | 3002 |
| firecrawl-playwright-service-1 | Headless browser for JS rendering | internal |
| firecrawl-redis-1 | Caching + rate limiting | internal |
| firecrawl-rabbitmq-1 | Job queue | internal |
| firecrawl-nuq-postgres-1 | Job persistence + pg_cron | internal |
Deployment
# docker-compose.yml (simplified)
services:
  api:
    image: ghcr.io/mendableai/firecrawl-api:latest
    ports:
      - "3002:3002"
    env_file: .env
    depends_on:
      - redis
      - rabbitmq
      - nuq-postgres
    restart: unless-stopped
  playwright-service:
    image: ghcr.io/mendableai/firecrawl-playwright-service:latest
    env_file: .env
    restart: unless-stopped
  redis:
    image: redis:7-alpine
    restart: unless-stopped
  rabbitmq:
    image: rabbitmq:3-management-alpine
    restart: unless-stopped
  nuq-postgres:
    build: src/apps/nuq-postgres # MUST build locally
    restart: unless-stopped
The nuq-postgres problem
Don’t use the GHCR pre-built image. It has a pg_cron config mismatch where cron.database_name doesn’t match the init script database. Crashes on startup with "can only create extension in database postgres".
Clone the Firecrawl repo so the src/apps/nuq-postgres directory is available for the compose build:
git clone --depth 1 https://github.com/mendableai/firecrawl.git /root/firecrawl/src
cd /root/firecrawl
docker compose build nuq-postgres
The compose file references build: src/apps/nuq-postgres relative to itself, so the repo needs to be cloned into the src/ subdirectory. The local build picks up the correct postgresql.conf.sample with cron.database_name = 'postgres'.
Proxy configuration
All Firecrawl traffic routes through the residential proxy. In .env:
# proxy for all outbound requests
HTTP_PROXY=http://172.17.0.1:8118
HTTPS_PROXY=http://172.17.0.1:8118
# job queue (mandatory)
NUQ_RABBITMQ_URL=amqp://rabbitmq:5672
# disable auth (self-hosted, no API key needed)
USE_DB_AUTHENTICATION=false
172.17.0.1 is the Docker bridge gateway. It routes to Privoxy on the host at :8118, which forwards through the SSH SOCKS5 tunnel to the residential IP.
Hermes integration
# ~/.hermes/.env
FIRECRAWL_API_URL=http://localhost:3002
# ~/.hermes/config.yaml
web:
extract_backend: firecrawl
search_backend: firecrawl
No API key needed. Self-hosted Firecrawl skips auth when USE_DB_AUTHENTICATION=false.
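To test the self-hosted instance without going through Hermes, a direct call can look like the sketch below. The /v1/scrape path, the formats field, and the data.markdown response shape are assumptions based on Firecrawl’s public v1 API and may differ between releases; the function names are illustrative.

```python
import json
import urllib.request

# Same value as FIRECRAWL_API_URL in ~/.hermes/.env.
FIRECRAWL_API_URL = "http://localhost:3002"


def build_scrape_request(url: str, api_url: str = FIRECRAWL_API_URL) -> urllib.request.Request:
    """Build a POST to the (assumed) /v1/scrape endpoint; no Authorization header needed."""
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        f"{api_url}/v1/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape_markdown(url: str, api_url: str = FIRECRAWL_API_URL) -> str:
    """Send the request and return the page as markdown (assumed response shape)."""
    with urllib.request.urlopen(build_scrape_request(url, api_url), timeout=60) as resp:
        payload = json.load(resp)
    return payload["data"]["markdown"]


if __name__ == "__main__":
    print(scrape_markdown("https://example.com")[:500])
```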
Resource usage
The limits look worse than reality. Actual numbers:
| Container | CPU limit | RAM limit | Actual RAM |
|---|---|---|---|
| firecrawl-api | 3.0 | 5G | ~2.2G |
| playwright-service | 2.0 | 3G | ~192M |
| searxng (nginx) | — | 128M | ~8M |
| searxng-core | — | 512M | ~156M |
| redis | — | 256M | ~6M |
| rabbitmq | — | 512M | ~198M |
| nuq-postgres | — | 512M | ~104M |
| Total | 5.0 | ~9.9G | ~2.9G |
Limits total ~9.9G but in practice it sits around 3G. If you’re tight on resources, the cloud Firecrawl API (500 pages/month free tier) is probably a better starting point.
3. Camofox — anti-detection browser
Camofox is a Firefox fork with C++-level fingerprint spoofing. It’s not a headless browser pretending to be real. It is a real browser with randomized fingerprints that make each session look like a different person.
Why not Playwright or Puppeteer?
Headless Chromium has tells:
- navigator.webdriver is true
- WebGL renderer shows “SwiftShader” (Google’s software renderer)
- Canvas fingerprint is consistent across sessions
- Audio context fingerprint is detectable
- Plugins list is empty
Camofox randomizes all of these. With a residential IP, it’s indistinguishable from a real user.
Deployment
docker run -d --name camofox-browser \
  --network host \
  --restart unless-stopped \
  -e CAMOFOX_PORT=9377 \
  -e PROXY_HOST=localhost \
  -e PROXY_PORT=8118 \
  ghcr.io/jo-inc/camofox-browser:latest
A few things worth noting:
- --network host because Camofox needs direct access to localhost for Privoxy
- Proxy is configured via ENV vars, not config.json
- uBlock Origin comes built into the image
- Health check: curl http://localhost:9377/health
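The health check is handy to script, for example to wait for the container to come up before starting an agent. A minimal sketch, assuming the /health endpoint answers with HTTP 200 when the browser is ready; the function names and retry defaults are made up.

```python
import time
import urllib.request

CAMOFOX_URL = "http://localhost:9377"

# The check targets localhost, so skip any HTTP_PROXY/HTTPS_PROXY set in the environment.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str = f"{CAMOFOX_URL}/health", timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with _direct.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and non-2xx responses (HTTPError).
        return False


def wait_until_healthy(url: str = f"{CAMOFOX_URL}/health",
                       attempts: int = 10, delay: float = 2.0) -> bool:
    """Poll the endpoint up to `attempts` times, sleeping `delay` seconds after each miss."""
    for _ in range(attempts):
        if is_healthy(url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("camofox ready" if wait_until_healthy() else "camofox not responding")
```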
Hermes integration
# ~/.hermes/.env
CAMOFOX_URL=http://localhost:9377
With CAMOFOX_URL set, Hermes routes all browser_* calls through Camofox instead of the default agent-browser:
| Tool | Routes through | IP used |
|---|---|---|
| browser_navigate, browser_snapshot | Camofox -> Privoxy -> SOCKS5 | Residential |
| web_extract | Firecrawl -> Privoxy -> SOCKS5 | Residential |
| web_search | Firecrawl -> SearXNG -> Privoxy -> SOCKS5 | Residential |
What it beats
I’ve tested Camofox + residential IP against known aggressive bot detection:
| Site | Protection | Result |
|---|---|---|
| ESPNcricinfo | Proprietary | Full page (was “Access Denied” without) |
| Ticketmaster | Akamai + Queue-It | Full page |
| Nike | Akamai | Full page |
| Instacart | Datadome | Full page |
| Amazon | Custom | Full page (was blocked on datacenter IP) |
| Cloudflare | JS challenge | Passes automatically |
4. The residential proxy chain
This is the part everything else depends on. A residential IP makes bot detection mostly irrelevant. Datacenter IPs get flagged by default, but home IPs don’t.
You need a machine on a home network that stays online. A Raspberry Pi works (what I use), but so does an old laptop, a mini PC, or even a phone running Termux + Tailscale. The only requirement: it has a residential ISP connection and can hold an SSH tunnel open.
Architecture
the host VM (homelab)
|
v
SSH SOCKS5 Tunnel (:1080)
| encrypted, auto-reconnect via systemd
v
Residential Machine (Raspberry Pi, home network)
|
v
Internet (x.x.x.x — residential ISP)
SSH tunnel service
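# /etc/systemd/system/socks5-residential.service
[Unit]
Description=SSH SOCKS5 Residential Proxy
After=network.target
[Service]
Type=simple
# No -f: with Type=simple, systemd has to track ssh as the foreground process.
# With -f, ssh forks away and systemd treats the service as exited.
# ServerAlive* makes ssh exit when the link dies, so Restart= can bring it back.
ExecStart=/usr/bin/ssh -D 1080 -N -p 6464 -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o ExitOnForwardFailure=yes user@<TAILSCALE_IP>
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target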
The residential machine is a Raspberry Pi on a home network, connected via Tailscale. SSH key auth only.
Privoxy (HTTP to SOCKS5 bridge)
Firecrawl and Camofox speak HTTP, not SOCKS5. Privoxy sits in between and forwards everything to the SSH tunnel.
Install it:
apt install privoxy
Edit /etc/privoxy/config and strip out everything except these two lines:
listen-address 0.0.0.0:8118
forward-socks5 / 127.0.0.1:1080 .
That’s the entire config. listen-address binds to all interfaces so Docker containers can reach it. forward-socks5 sends every request to the SSH tunnel on port 1080.
Enable and start it:
systemctl enable --now privoxy
One thing that burned me: Privoxy defaults to listen-address 127.0.0.1:8118. Docker containers on bridge networks can’t reach the host’s loopback, so all your proxy requests fail with ECONNREFUSED. The 0.0.0.0 binding is what fixes it.
Verify it works:
curl -s --proxy http://localhost:8118 https://api.ipify.org
# should show your residential IP, not your server's IP
Privoxy uses about 6MB RAM. You won’t even notice it’s there.
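The curl check above can also be scripted so it compares the proxied exit IP with the direct one. A sketch, assuming Privoxy listens on localhost:8118 and that api.ipify.org keeps returning the caller’s IP as plain text; the helper names are hypothetical.

```python
import urllib.request

PRIVOXY_URL = "http://localhost:8118"
IP_ECHO_URL = "https://api.ipify.org"


def build_proxy_opener(proxy_url: str = PRIVOXY_URL) -> urllib.request.OpenerDirector:
    """Opener that sends both HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def build_direct_opener() -> urllib.request.OpenerDirector:
    """Opener that ignores proxy environment variables and connects directly."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


def exit_ip(opener: urllib.request.OpenerDirector) -> str:
    """Ask the IP echo service which address the request arrived from."""
    with opener.open(IP_ECHO_URL, timeout=15) as resp:
        return resp.read().decode("ascii").strip()


if __name__ == "__main__":
    proxied = exit_ip(build_proxy_opener())
    direct = exit_ip(build_direct_opener())
    print(f"via Privoxy: {proxied}")
    print(f"direct:      {direct}")
    print("OK: traffic exits elsewhere" if proxied != direct else "WARNING: same exit IP")
```

If both lines show the same address, requests are not actually leaving through the residential machine.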
How it all fits together
When the agent needs to look something up:
Search:
Agent calls web_search("proxmox zfs encryption")
-> Firecrawl receives the query
-> Firecrawl routes to SearXNG (queries Google, Bing, DuckDuckGo)
-> Returns aggregated results via Firecrawl API
-> Agent gets titles, URLs, and snippets
Extract:
Agent calls web_extract("https://example.com/article")
-> Firecrawl receives the URL
-> Playwright renders the JS-heavy page (through residential proxy)
-> Returns clean markdown
-> Agent parses the content
Browser:
Agent calls browser_navigate("https://protected-site.com/data")
-> Camofox launches Firefox with randomized fingerprints
-> Requests go through residential proxy
-> Cloudflare JS challenge passes automatically
-> Agent inspects the page with browser_snapshot
Every path goes through the residential IP. Every request looks like a real user on a home network.
Gotchas
The tunnel is a single point of failure
When the residential machine goes offline:
SSH tunnel down -> SOCKS5 :1080 dead
-> Privoxy: CONNECTION REFUSED
-> Camofox browser launch FAILS
-> Firecrawl extracts FAIL
-> All web access: broken
There’s no silent fallback to the datacenter IP. That’s on purpose: a failed request is safer than a request leaking from the wrong IP. Once the residential machine is reachable again, systemd restarts the tunnel within about 15 seconds (RestartSec=15).
Resource requirements
Actual usage is modest:
- Firecrawl + SearXNG: ~2.9G RAM (9.9G limit), 5 CPU cores
- Camofox: ~435MB RAM
- Privoxy: ~6MB RAM
- SSH tunnel: negligible
Total: ~3.3G RAM. The limits are set high for headroom (Firecrawl’s API container can spike under heavy load), but day to day it sits well under 4G.
Residential IP stability
Home IPs can change. If your ISP assigns dynamic IPs, the SSH tunnel breaks when the IP changes. Fixes:
- Dynamic DNS on the residential machine
- Tailscale (what I use): the tunnel connects to a Tailscale IP, not the public IP
- Static IP if your ISP offers one
Geo-specific results
Search results and page content will reflect the residential IP’s location. Mine’s in India, so Google returns India-specific results. Usually fine for technical content, but worth knowing.
Cost
Everything here is open source. The only thing you’re paying for is the residential IP, which is your existing home internet. Total cost: $0/month beyond what you already have.
Cloud alternatives run $600+/month for the same coverage: Firecrawl Cloud at $100+, Bright Data at $500+, ScrapingBee at $50-200.
What I’m still working on
- Automatic failover: if SearXNG is down, fall back to DuckDuckGo; if Firecrawl times out, fall back to Jina Reader
- Search result caching with Redis to reduce load
- Multi-residential-IP rotation across different home networks
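For the failover item, one possible shape is an ordered list of backends that are tried in turn. This is a sketch of the idea, not something running in the stack yet; the backend callables are placeholders you would replace with real SearXNG and DuckDuckGo clients.

```python
from typing import Callable, Iterable, List, Tuple

SearchFn = Callable[[str], List[dict]]


def search_with_fallback(query: str,
                         backends: Iterable[Tuple[str, SearchFn]]) -> Tuple[str, List[dict]]:
    """Try each backend in order; return (backend name, results) from the first that works."""
    errors = []
    for name, search in backends:
        try:
            return name, search(query)
        except Exception as exc:  # any failure means: move on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all search backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Placeholder backends for illustration only.
    def searxng_down(query: str) -> List[dict]:
        raise ConnectionError("connection refused")

    def ddg_stub(query: str) -> List[dict]:
        return [{"title": "stub result", "url": "https://example.com"}]

    used, results = search_with_fallback(
        "proxmox zfs encryption",
        [("searxng", searxng_down), ("duckduckgo", ddg_stub)],
    )
    print(used, results)
```

The same pattern would fit the extraction path, with Firecrawl first and Jina Reader second.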