In my opinion, the best way to fight crawlers is not to return an error (403). The best way is to feed the crawlers low-quality AI-generated data.
Self plug, but I made this to deal with bots on my site: https://marcusb.org/hacks/quixotic.html. It is a simple Markov generator to obfuscate content (static-site friendly, no server-side dynamic generation required), plus an optional link maze that sends incorrigible bots to 100% Markov-generated nonsense (this part requires a server-side component).
I do serve a legit robots.txt file to warn off the scrapers I know about.
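For anyone wondering what a "simple Markov generator" looks like in practice, here is a rough sketch. To be clear, this is not the code behind the link above; `build_chain`, `babble`, and the `order`/`length`/`seed` parameters are names made up for illustration. It records which word follows each run of `order` words in a source text, then walks that table at random to produce filler that reads plausibly at a glance.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` words to the words seen following it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=50, seed=None):
    """Walk the chain and return `length` words of filler text.

    Assumes `chain` is non-empty, i.e. the source text had more
    than `order` words.
    """
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (state only seen at the end of the text):
            # jump to a random known state and keep going.
            state = rng.choice(list(chain))
            out.extend(state)
            continue
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out[:length])
```

Since the output can be generated ahead of time from your real pages, something like this can run at build time, which fits the static-site case.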
I may have a system in place that starts the pipeline for fetching a very, very large file (a 16TB text file designed for testing). I don't host it myself, except for the first shard.
A surprising number of agents try to download the whole thing.
Right, and that's why honeypots work against many targets. Why serve them an actual file when a CGI script or whatever can just generate output in a loop?
Someone has to front the bandwidth.
Ah, speaking of that: of course you don't generate the fake data as fast as you can. You just trickle it out often enough that they don't time out.
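A minimal sketch of the trickle idea, assuming a Python WSGI setup. The names `trickle`, `app`, and `WORDS`, and the 10-second default, are invented for illustration; the right delay depends on the client's timeout, which you would have to find by experiment. The generator never ends and sleeps before each small chunk, so the bandwidth cost stays tiny while the connection stays open.

```python
import random
import time

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # placeholder vocabulary

def trickle(delay=10.0, chunk_words=8, sleep=time.sleep, seed=None):
    """Yield small chunks of filler forever, pausing `delay` seconds before each.

    `sleep` is injectable only so the pacing can be checked without waiting.
    """
    rng = random.Random(seed)
    while True:
        sleep(delay)
        chunk = " ".join(rng.choice(WORDS) for _ in range(chunk_words))
        yield (chunk + "\n").encode("utf-8")

def app(environ, start_response):
    """Hypothetical WSGI entry point: stream the endless body to the client."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return trickle()
```

One caveat: every trapped bot ties up a worker while the generator sleeps, so this only makes sense behind a threaded or async server, and probably with a cap on concurrent connections.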
That's why you should run a tarpit instead.