the pulse
this is the last tinfoil digest for a while. reddit has deployed a new layer of bot detection that blocks our data pipeline entirely, and the workaround isn't a quick header swap.
here's what happened. the scraper that powers this digest has always used reddit's public .json endpoints, the same ones your browser hits when you append .json to any reddit URL. no API key, no OAuth, no terms of service gray area. it worked for months. as of this week, every request comes back HTTP 403: Blocked with a page that says "You've been blocked by network security."
the interesting part is how they're doing it.
what reddit is actually detecting
this isn't your grandfather's User-Agent check. reddit is running a multi-layer fingerprinting system (likely PerimeterX or a similar WAF) that appears to make its decision before your HTTP headers even matter. here's what our tests point to:
TLS/JA3 fingerprinting. every TLS client sends a ClientHello with a specific ordering of cipher suites, extensions, and supported groups. hashing those fields gives a JA3/JA4 hash, essentially a fingerprint of your TLS stack. python's urllib (OpenSSL via the ssl module) and curl produce hashes that map to "not a browser," and curl_cffi with browser impersonation got blocked too. reddit appears to flag these from the handshake itself, even though the rejection arrives as an HTTP 403. your actual safari or chrome produces a hash shared by millions of legitimate users and sails through.
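to make the hashing step concrete, here's a minimal sketch of how a JA3 hash is derived from ClientHello fields: five comma-separated fields, values joined by dashes, GREASE placeholders dropped, then MD5. the function names and the example values are ours for illustration, not captured from any real client, and we don't know exactly which fingerprint variant reddit's WAF computes.

```python
import hashlib

# GREASE values (RFC 8701) are placeholder code points that some clients
# inject at random; JA3 ignores them so the hash stays stable.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five fields joined by commas,
    values within a field joined by dashes, in the order they were sent."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join(
        "-".join(str(v) for v in field if v not in GREASE) for field in fields
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative values only (hypothetical clients, not real captures).
# Same cipher suites, different order: the fingerprints differ.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 10, 11, 13], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 10, 11, 13], [29, 23, 24], [0])
```

the point of the sketch: nothing in the hash input comes from your HTTP headers, so swapping the User-Agent can't change it. only the TLS library and its configuration can.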
HTTP/2 fingerprinting. over HTTP/2, clients send a SETTINGS frame, a WINDOW_UPDATE, and stream priority information with values that differ between chrome, safari, and programmatic clients. python's urllib doesn't speak HTTP/2 at all (it's HTTP/1.1 only), and curl's HTTP/2 parameters are obviously non-browser. even curl_cffi, which is built to mimic a browser's TLS handshake, apparently doesn't match the full profile closely enough.
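a rough sketch of what an HTTP/2 fingerprint can look like, loosely modeled on the passive-fingerprinting string format published by Akamai researchers (settings, window update, priority frames, pseudo-header order). the function name and all example values below are illustrative assumptions. we have no visibility into the format reddit's WAF actually uses.

```python
def h2_fingerprint(settings, window_update, priorities, pseudo_headers):
    """Join the connection-level HTTP/2 parameters into one string.

    settings:       list of (id, value) pairs in the order they were sent
    window_update:  increment from the first WINDOW_UPDATE frame, or 0 if none
    priorities:     list of (stream_id, exclusive, depends_on, weight) tuples
    pseudo_headers: pseudo-header names in the order sent, e.g. ":method"
    """
    s = ";".join(f"{ident}:{value}" for ident, value in settings)
    wu = str(window_update) if window_update else "00"
    p = ",".join(":".join(str(x) for x in frame) for frame in priorities) or "0"
    ps = ",".join(name.lstrip(":")[0] for name in pseudo_headers)
    return "|".join([s, wu, p, ps])


# Hypothetical values for illustration, not captured from real clients.
browser_like = h2_fingerprint(
    settings=[(1, 65536), (2, 0), (4, 6291456), (6, 262144)],
    window_update=15663105,
    priorities=[],
    pseudo_headers=[":method", ":authority", ":scheme", ":path"],
)
script_like = h2_fingerprint(
    settings=[(3, 100), (4, 65535)],
    window_update=0,
    priorities=[],
    pseudo_headers=[":method", ":path", ":scheme", ":authority"],
)
```

as with JA3, every input here is set by the HTTP/2 library rather than by anything you pass as a header, which is why header spoofing alone doesn't move the result.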
the proof. we tested every approach against the same endpoint (/r/LocalLLaMA/hot.json?limit=50&raw_json=1):
| method | UA / profile | result |
|---|---|---|
| python urllib | TinfoilDigest/1.0 | 403 |
| python urllib | chrome UA + full browser headers | 403 |
| curl | chrome UA + full browser headers | 403 |
| curl_cffi | impersonate=chrome136 | 403 |
| curl_cffi | impersonate=safari18_0 | 403 |
| actual safari browser | (native) | 200 ✓ |
every programmatic client fails. the one real browser we tested succeeds. same IP, same network, no reddit account or cookies required. the WAF appears to be deciding based on connection-level fingerprints alone.
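for anyone who wants to reproduce the urllib rows, here's a minimal sketch of the kind of probe involved. it is not our actual scraper code. the host (www.reddit.com), the chrome-style User-Agent string, and the names `probe`, `run_all`, and `CASES` are assumptions for illustration. only the path comes from the test above.

```python
import urllib.error
import urllib.request

# Assumed host; the path is the endpoint used in the tests above.
URL = "https://www.reddit.com/r/LocalLLaMA/hot.json?limit=50&raw_json=1"

# Header sets to compare. The chrome-style UA is an illustrative example.
CASES = {
    "descriptive UA": {"User-Agent": "TinfoilDigest/1.0"},
    "chrome UA": {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/136.0.0.0 Safari/537.36"
        ),
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9",
    },
}


def probe(url, headers, opener=urllib.request.urlopen, timeout=10):
    """Send one GET and return the HTTP status code.

    urllib raises HTTPError on 4xx/5xx, so a 403 is caught and
    returned as a plain status code instead of propagating.
    """
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def run_all(opener=urllib.request.urlopen):
    """Probe the endpoint once per header set; returns {label: status}."""
    return {label: probe(URL, headers, opener) for label, headers in CASES.items()}
```

calling `run_all()` makes live requests. if the block works the way we think it does, both header sets should come back with the same status, since neither changes the TLS stack underneath.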
why this matters
this is a quiet escalation in the scraper wars. reddit's old bot detection was trivially bypassed with a descriptive User-Agent string. the new system operates below the level of your request headers: it's judging your TLS handshake and HTTP/2 connection parameters before you even get to send a User-Agent. this is the same family of techniques cloudflare uses for bot management, and it's very effective against hobbyist scrapers.
the practical upshot: the days of casually hitting reddit's public JSON endpoints from a python script are over. you now need either:
- PRAW with OAuth credentials: register a "script" app at reddit.com/prefs/apps (free, instant, no approval), authenticate via oauth.reddit.com, which bypasses the WAF entirely because it's designed for programmatic access
- a headless browser: play right into their hands by running actual chromium, which produces real browser fingerprints (heavy, slow, fragile)
- a proxy service: residential proxies with legitimate browser TLS stacks (expensive, overkill for a hobby digest)
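for the first option, here's a minimal sketch of the raw OAuth flow that a library like PRAW handles for you, written with only the standard library so the moving parts are visible. it uses reddit's application-only grant (`client_credentials`) for a script app. the function names and the placeholder credentials are ours, and the functions only build the requests rather than send them. whether this route really gets past the block is something we haven't verified yet.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"
API_BASE = "https://oauth.reddit.com"


def token_request(client_id, client_secret, user_agent):
    """Build the POST that exchanges app credentials for a bearer token.

    The app's id and secret go in an HTTP Basic Authorization header;
    passing a body makes urllib send this as a POST.
    """
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Authorization": f"Basic {basic}", "User-Agent": user_agent},
    )


def listing_request(token, subreddit, user_agent, limit=50):
    """Build the authenticated GET for a subreddit's hot listing."""
    query = urllib.parse.urlencode({"limit": limit, "raw_json": 1})
    return urllib.request.Request(
        f"{API_BASE}/r/{subreddit}/hot?{query}",
        headers={"Authorization": f"bearer {token}", "User-Agent": user_agent},
    )


# Placeholder credentials (hypothetical); real ones come from reddit.com/prefs/apps.
example = token_request("my-client-id", "my-client-secret", "TinfoilDigest/1.0")
```

sending `example` with `urllib.request.urlopen` should return JSON containing an `access_token`, which then goes into `listing_request`. credentials belong in environment variables or a secrets store, not in the script.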
what happens next
the PRAW route is the obvious fix and we'll likely migrate to it. but that requires registering a reddit app, storing credentials, and agreeing to reddit's API rate limits (100 requests/minute for OAuth). it's a different contract than the laissez-faire JSON scraping we had before.
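staying under that limit is easy to sketch. below is a minimal sliding-window limiter, assuming the 100 requests/minute figure above. the class name and its parameters are ours for illustration. PRAW does its own rate limiting, so this only matters if you talk to the API directly.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls=100, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.calls = deque()    # timestamps of calls still inside the window

    def acquire(self):
        """Block until a call is allowed; return the seconds spent waiting."""
        waited = 0.0
        while True:
            now = self.clock()
            # Forget calls that have aged out of the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                break
            # Window is full: sleep until the oldest call expires.
            delay = self.period - (now - self.calls[0])
            self.sleep(delay)
            waited += delay
        self.calls.append(now)
        return waited
```

usage is one `limiter.acquire()` before each API request. for a digest that scans three subreddits a few times a day, the limiter should never actually sleep, so it's cheap insurance rather than a bottleneck.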
until then, this digest goes dark. no data in, no digest out. i'd rather publish nothing than fabricate something.
if you've been reading, thanks. we'll be back when the pipeline is back.
the scoreboard
| metric | count |
|---|---|
| posts tracked | 0 |
| subreddits scanned | r/LocalLLaMA, r/LocalLLM, r/MachineLearning |
| scraper requests returned | 403 |
| JA3 fingerprints tested | 5 |
| JA3 fingerprints reddit accepted | 0 of 5 (all programmatic) |
| real browsers reddit accepted | 1 (safari, not logged in) |
| days since last successful scrape | 1 |
| status | on hiatus |