delifue a day ago

In my opinion, the best way to fight crawlers is not to return an error (403). The best way is to give the crawlers low-quality AI-generated data.

  • marcus0x62 a day ago

    Self-plug, but I made this to deal with bots on my site: https://marcusb.org/hacks/quixotic.html. It is a simple Markov generator that obfuscates content (static-site friendly, no server-side dynamic generation required), plus an optional link maze that sends incorrigible bots to 100% Markov-generated nonsense (requires a server-side component).

    I do serve a legit robots.txt file to warn away the scrapers I know about.
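
    For reference, the core of such a Markov generator fits in a couple of functions. A minimal sketch of the technique in Python (not the actual quixotic code):

      import random
      from collections import defaultdict

      def build_chain(text, order=2):
          # Map each `order`-word prefix to the words seen after it.
          words = text.split()
          chain = defaultdict(list)
          for i in range(len(words) - order):
              chain[tuple(words[i:i + order])].append(words[i + order])
          return chain

      def generate(chain, length=100):
          # Random-walk the chain, producing locally plausible nonsense.
          state = random.choice(list(chain))
          out = list(state)
          for _ in range(length):
              followers = chain.get(state)
              if not followers:
                  # Dead end: jump to a random state and keep going.
                  state = random.choice(list(chain))
                  followers = chain[state]
              out.append(random.choice(followers))
              state = tuple(out[-len(state):])
          return " ".join(out)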

  • shakna a day ago

    I may have a system in place that starts the pipeline for fetching a very, very large file (a 16 TB text file designed for testing). It's not hosted by me, except for the first shard.

    A surprising number of agents try to download the whole thing.
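
    A rough sketch of that kind of bait, as a tiny redirect server (the bait path and mirror URL here are made up):

      from http.server import BaseHTTPRequestHandler, HTTPServer

      HUGE_FILE = "https://example.com/16tb-test-file.txt"  # hypothetical external mirror

      class Bait(BaseHTTPRequestHandler):
          def do_GET(self):
              if self.path == "/dataset.txt":  # bait link; only crawlers should follow it
                  # Hand the bot off to the externally hosted bulk of the file.
                  self.send_response(302)
                  self.send_header("Location", HUGE_FILE)
              else:
                  self.send_response(404)
              self.end_headers()

      HTTPServer(("", 8000), Bait).serve_forever()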

    • kazinator a day ago

      Right, and that's why honeypots work against many targets. Why serve them an actual file when a CGI script or whatever can just generate output in a loop?
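
      Something along those lines, as a hypothetical CGI script in Python:

        #!/usr/bin/env python3
        # Streams an endless fake "file"; flushing each chunk keeps the bot reading.
        import sys, random, string

        sys.stdout.write("Content-Type: text/plain\r\n\r\n")
        while True:
            junk = "".join(random.choices(string.ascii_lowercase + " ", k=80))
            sys.stdout.write(junk + "\n")
            sys.stdout.flush()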

      • andrewmcwatters a day ago

        Someone has to front the bandwidth.

        • kazinator a day ago

          Ah, speaking of that, of course you don't generate the fake data as fast as you can. You just trickle it out often enough for them not to time out.
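
          In the CGI sketch above, that just means adding a sleep to the loop (the timing is a guess; tune it to the clients you actually see):

            # Throttled endless output: ties the bot up at near-zero bandwidth cost.
            import sys, time, random, string

            sys.stdout.write("Content-Type: text/plain\r\n\r\n")
            while True:
                sys.stdout.write("".join(random.choices(string.ascii_lowercase + " ", k=80)) + "\n")
                sys.stdout.flush()
                time.sleep(20)  # assumed to be just under typical client read timeouts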

        • BitPirate a day ago

          That's why you should run a tarpit instead.
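
          A minimal sketch of that idea at the socket level, in the spirit of tools like endlessh (port and timing are arbitrary):

            # Tarpit: accept connections and drip bytes forever, so each bot
            # wastes a socket and its own patience, not your bandwidth.
            import random, socket, threading, time

            def drip(conn):
                try:
                    while True:
                        conn.send(bytes([random.randint(32, 126)]))
                        time.sleep(10)
                except OSError:
                    pass  # the bot finally gave up

            srv = socket.socket()
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", 2222))  # arbitrary port
            srv.listen()
            while True:
                conn, _ = srv.accept()
                threading.Thread(target=drip, args=(conn,), daemon=True).start()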