the pulse

this is the last tinfoil digest for a while. reddit has deployed a new layer of bot detection that blocks our data pipeline entirely, and the workaround isn't a quick header swap.

here's what happened. the scraper that powers this digest has always used reddit's public .json endpoints, the same ones your browser hits when you append .json to any reddit URL. no API key, no OAuth, no terms of service gray area. it worked for months. as of this week, every request comes back HTTP 403: Blocked with a page that says "You've been blocked by network security."

the interesting part is how they're doing it.

what reddit is actually detecting

this isn't your grandfather's User-Agent check. reddit is running a multi-layer fingerprinting system (likely PerimeterX or similar WAF) that makes decisions before your HTTP headers are even read. here's what we confirmed:

TLS/JA3 fingerprinting. every TLS client sends a ClientHello with a specific ordering of cipher suites, extensions, and supported groups. hashing those fields gives a JA3 (or newer JA4) fingerprint of your TLS stack. python's urllib (OpenSSL, via the ssl module) and stock curl both produce fingerprints that map to "not a browser," and reddit appears to reject on that basis, even though the refusal arrives as an HTTP 403. curl_cffi with browser impersonation is meant to reproduce a browser's ClientHello, and it got blocked anyway, which points at a second layer. your actual safari or chrome produces a fingerprint shared by millions of legitimate users and sails through.
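to make the hashing step concrete, here's a minimal sketch of how a JA3 fingerprint is derived from ClientHello fields, following the publicly documented JA3 recipe: five comma-separated fields, decimal values joined with dashes, GREASE placeholders dropped, then MD5. the function names and the two sample hellos are made up for illustration; they are not captured from any real client, and this is not a claim about what reddit's WAF computes internally.

```python
import hashlib

# GREASE values (RFC 8701) are placeholder IDs some clients inject;
# JA3 ignores them so they don't perturb the hash.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from decoded ClientHello fields."""
    parts = [str(version)]
    for values in (ciphers, extensions, curves, point_formats):
        parts.append("-".join(str(v) for v in values if v not in GREASE))
    return ",".join(parts)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 of the JA3 string, as a 32-character hex digest."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Two hypothetical hellos: same cipher suites, different order.
hello_a = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
hello_b = (771, [4866, 4865, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(ja3_string(*hello_a))
print(ja3_hash(*hello_a) == ja3_hash(*hello_b))
```

the point of the second print: swapping the order of two cipher suites is enough to change the fingerprint, which is why changing headers does nothing here. the hash is a property of the TLS library, not of anything your script sets.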

HTTP/2 fingerprinting. over HTTP/2, clients send a SETTINGS frame and PRIORITY tree with values that differ between chrome, safari, and programmatic clients. curl has an obviously non-browser HTTP/2 fingerprint, and python's urllib doesn't speak HTTP/2 at all (it's HTTP/1.1 only), which is a signal in itself. even curl_cffi, which nails the TLS handshake, apparently doesn't match the HTTP/2 profile closely enough.
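a rough sketch of what an HTTP/2 fingerprint can look like, assuming the pipe-separated layout commonly attributed to Akamai's passive HTTP/2 fingerprinting work (SETTINGS, then WINDOW_UPDATE, then priority frames, then pseudo-header order). the function name and both sets of values are illustrative assumptions, not measurements of any real client and not reddit's actual rules.

```python
def h2_fingerprint(settings, window_update, priorities, pseudo_header_order):
    """Join the connection-level parameters into one comparable string.

    settings: list of (setting_id, value) pairs in the order they were sent
    window_update: increment from the first WINDOW_UPDATE frame
    priorities: list of pre-formatted priority entries (may be empty)
    pseudo_header_order: e.g. ["m", "a", "s", "p"] for
        :method, :authority, :scheme, :path
    """
    s = ";".join(f"{key}:{value}" for key, value in settings)
    p = ",".join(priorities) if priorities else "0"
    ps = ",".join(pseudo_header_order)
    return f"{s}|{window_update}|{p}|{ps}"


# Illustrative values only.
browser_like = h2_fingerprint(
    [(1, 65536), (2, 0), (4, 6291456), (6, 262144)],
    15663105,
    [],
    ["m", "a", "s", "p"],
)
script_like = h2_fingerprint(
    [(3, 100), (4, 65535)],
    0,
    [],
    ["m", "p", "s", "a"],
)

print(browser_like)
print(browser_like == script_like)
```

none of these inputs are things a typical HTTP library lets you configure, which would explain why a client can match a browser's TLS handshake and still look wrong one step later.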

the proof. we tested every approach against the same endpoint (/r/LocalLLaMA/hot.json?limit=50&raw_json=1):

| method | UA | result |
|---|---|---|
| python urllib | TinfoilDigest/1.0 | 403 |
| python urllib | chrome UA + full browser headers | 403 |
| curl | chrome UA + full browser headers | 403 |
| curl_cffi | impersonate=chrome136 | 403 |
| curl_cffi | impersonate=safari18_0 | 403 |
| actual safari | browser (native) | 200 ✅ |

every programmatic client fails. the one real browser we tested succeeds. same IP, same network, no reddit account or cookies required. as far as we can tell, the WAF is deciding on connection-level fingerprints (TLS and HTTP/2) alone.
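for anyone who wants to rerun the urllib rows of that comparison, here's a minimal sketch of a probe harness. it is not the digest's actual scraper: the function names, the header profiles, and the www.reddit.com host are assumptions for illustration. the fetcher is injectable so the matrix logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request

# Endpoint path from the test above; the host is an assumption.
URL = "https://www.reddit.com/r/LocalLLaMA/hot.json?limit=50&raw_json=1"

# Hypothetical header profiles, one per row you want to compare.
PROFILES = {
    "descriptive UA": {"User-Agent": "TinfoilDigest/1.0"},
    "chrome-like headers": {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/136.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
}


def fetch_status(url, headers, timeout=10):
    """Return the HTTP status code for one GET, including 4xx/5xx."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is what we want to record.
        return err.code


def run_matrix(url, profiles, fetch=fetch_status):
    """Probe the same URL once per header profile."""
    return {name: fetch(url, headers) for name, headers in profiles.items()}
```

calling run_matrix(URL, PROFILES) issues one real request per profile. if the diagnosis above is right, both profiles should come back with the same status no matter how browser-like the headers are, because both go through the same TLS stack.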

why this matters

this is a quiet escalation in the scraper wars. reddit's old bot detection was trivially bypassed with a descriptive User-Agent string. the new system operates below the level of HTTP headers: it's judging your TLS handshake and HTTP/2 connection parameters before you even get to send a User-Agent. this is the same family of technique cloudflare's bot management relies on, and it's very effective against hobbyist scrapers.

the practical upshot: the days of casually hitting reddit's public JSON endpoints from a python script are over. you now need either:

  • PRAW with OAuth credentials: register a "script" app at reddit.com/prefs/apps (free, instant, no approval), get a token, and send requests to oauth.reddit.com, which is built for programmatic access and shouldn't be subject to the browser-fingerprint checks
  • a headless browser โ€” play right into their hands by running actual chromium, which produces real browser fingerprints (heavy, slow, fragile)
  • a proxy service โ€” residential proxies with legitimate browser TLS stacks (expensive, overkill for a hobby digest)

what happens next

the PRAW route is the obvious fix and we'll likely migrate to it. but that requires registering a reddit app, storing credentials, and agreeing to reddit's API rate limits (100 requests/minute for OAuth). it's a different contract than the laissez-faire JSON scraping we had before.
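a minimal sketch of what that migration could look like, assuming a registered "script" app and read-only access. the environment variable names, the helper names, and the choice of fields are hypothetical; praw is a third-party package (pip install praw), and nothing here has been run against reddit for this digest.

```python
import os


def make_client():
    """Build a read-only PRAW client from environment variables."""
    import praw  # third-party; imported lazily so the helper below works without it

    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],          # hypothetical variable name
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],  # hypothetical variable name
        user_agent="TinfoilDigest/1.0 (script app)",
    )


def hot_posts(reddit, subreddit_name, limit=50):
    """Collect basic fields from a subreddit's hot listing."""
    rows = []
    for post in reddit.subreddit(subreddit_name).hot(limit=limit):
        rows.append(
            {
                "id": post.id,
                "title": post.title,
                "score": post.score,
                "num_comments": post.num_comments,
            }
        )
    return rows
```

usage would be hot_posts(make_client(), "LocalLLaMA"). keeping the credentials in the environment rather than in the script covers the "storing credentials" part, and PRAW paces its own requests against reddit's rate-limit headers, so the 100 requests/minute ceiling shouldn't need hand-written throttling for a digest this size.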

until then, this digest goes dark. no data in, no digest out. i'd rather publish nothing than fabricate something.

if you've been reading, thanks. we'll be back when the pipeline is back.

the scoreboard

| metric | count |
|---|---|
| posts tracked | 0 |
| subreddits scanned | r/LocalLLaMA, r/LocalLLM, r/MachineLearning |
| scraper requests | returned 403 |
| JA3 fingerprints tested | 5 |
| JA3 fingerprints reddit accepted | 0 (of the 5 programmatic clients) |
| real browsers reddit accepted | 1 (safari, not logged in) |
| days since last successful scrape | 1 |
| status | on hiatus |