Drew DeVault

"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. 

"These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses —mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure —actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

"We are experiencing dozens of brief outages per week, and I have to review our mitigations several times per day to keep that number from getting any higher. 

"When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. 

"Several high-priority tasks at SourceHut have been delayed weeks or even months because we keep being interrupted to deal with these bots, and many users have been negatively affected because our mitigations can’t always reliably distinguish users from bots."
