General Discussion

usonian

(15,376 posts) Tue Jan 28, 2025, 09:43 PM

AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt (War being waged on the internet)

Posting in GD. Not really a computer support article. It just says how people are fighting back against endless assault by AI bots that are flooding their sites.

Context:

A web server's robots.txt file "requests" that the site (or parts of it) NOT be crawled.
Nothing enforces it, so some crawlers crawl anyway. This can overload small sites (and big ones).
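For anyone curious what "respecting robots.txt" looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names, and the example.com URLs are all made up for illustration; a well-behaved crawler would run a check like this before every fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; the bot names are invented.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The AI bot is asked to stay out of the whole site.
print(allowed("ExampleAIBot", "https://example.com/index.html"))
# Other bots may fetch public pages but not /private/.
print(allowed("SomeOtherBot", "https://example.com/index.html"))
print(allowed("SomeOtherBot", "https://example.com/private/x"))
```

The check is purely voluntary: the server never sees it, which is why a crawler that skips it can only be stopped by other means.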

Last summer, Anthropic inspired backlash when its ClaudeBot AI crawler was accused of hammering websites a million or more times a day.

And so, the tarpit was invented.

Tarpits were originally designed to waste spammers' time and resources, but creators like Aaron have now evolved the tactic into an anti-AI weapon.

AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

And it wasn't the only artificial intelligence company making headlines for supposedly ignoring instructions in robots.txt files to avoid scraping web content on certain sites. Around the same time, Reddit's CEO called out all AI companies whose crawlers he said were "a pain in the ass to block," despite the tech industry otherwise agreeing to respect "no scraping" robots.txt rules.

Watching the controversy unfold was a software developer whom Ars has granted anonymity to discuss his development of malware (we'll call him Aaron). Shortly after he noticed Facebook's crawler exceeding 30 million hits on his site, Aaron began plotting a new kind of attack on crawlers "clobbering" websites that he told Ars he hoped would give "teeth" to robots.txt.

Building on an anti-spam cybersecurity tactic known as tarpitting, he created Nepenthes, malicious software named after a carnivorous plant that will "eat just about anything that finds its way inside."

Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users. Once trapped, the crawlers can be fed gibberish data, aka Markov babble, which is designed to poison AI models. That's likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.
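As a rough illustration of what "Markov babble" means, here is a minimal word-level Markov chain sketch in Python. It is not Nepenthes' actual code; the corpus, the function names, and the seed handling are assumptions made for the example.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict[str, list[str]], length: int, seed: int) -> str:
    """Generate `length` words by repeatedly picking a random follower."""
    if length <= 0 or not chain:
        return ""
    rng = random.Random(seed)          # seeded, so output is repeatable
    word = rng.choice(sorted(chain))   # random starting word
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if followers:
            word = rng.choice(followers)
        else:
            word = rng.choice(sorted(chain))  # dead end: restart somewhere
        out.append(word)
    return " ".join(out)


# Made-up training text; a real deployment would use a much larger corpus.
CORPUS = (
    "the crawler follows the link and the link leads to the page "
    "and the page links to the crawler"
)
chain = build_chain(CORPUS)
print(babble(chain, 20, seed=1))
```

The output is locally plausible (each word pair occurred in the corpus) but globally meaningless, which is the property that makes it cheap to produce and useless to train on.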

And if you want more info:
Hacker News discussion of this:
https://news.ycombinator.com/item?id=42858828

• WebSpam
https://www.web.sp.am/
This is another LLM tarpit, intended to poison datasets.
Note: if you visit the site, it's just an endless bunch of links that never go outside the site.

• Nepenthes
https://zadzmo.org/code/nepenthes/
This is a tarpit intended to catch web crawlers. Specifically, it's targeting crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside. (Pitcher Plant)

It works by generating an endless sequence of pages, each with dozens of links that simply go back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. An intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
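The "random but deterministic" part can be sketched in a few lines of Python: seed a random generator from a hash of the requested path, so the same URL always yields the same page. This is only a sketch of the idea described above, not Nepenthes' implementation; the `/maze/` prefix, the `secret` value, the link count, and the delay are all assumptions.

```python
import hashlib
import random
import time


def page_links(path: str, n_links: int = 12, secret: str = "change-me") -> list[str]:
    """Return the same pseudo-random links every time for a given path."""
    digest = hashlib.sha256(f"{secret}:{path}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))  # path-derived seed
    # Every link points back into the maze; none leads out.
    return [f"/maze/{rng.getrandbits(48):012x}" for _ in range(n_links)]


def render(path: str) -> str:
    """Build a bare HTML page that looks like a static file full of links."""
    items = "\n".join(
        f'<li><a href="{href}">{href}</a></li>' for href in page_links(path)
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


def respond(path: str, delay_seconds: float = 2.0) -> str:
    """Sleep before answering, to slow the crawler and spare the server."""
    time.sleep(delay_seconds)
    return render(path)


print(page_links("/maze/start")[:3])
```

Because nothing is stored, the maze costs no disk space: any URL a crawler invents or follows resolves to a stable page with a dozen more links behind it.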

Hacker News discussion on Nepenthes:
https://news.ycombinator.com/item?id=42725147

• "Mantis Framework" counter-attacks hackers' AI agents
https://www.thestack.technology/mantis-framework-poisons-traps-hackers-ai-agents-in-a-tarpit/

A new framework, Mantis, lets cybersecurity professionals automate counter-offensive actions against any AI agents attacking their systems. The new open-source toolkit shows how defenders can use prompt injection attacks to take over systems hosting a malicious agent.

Alternatively, they can soak up attackers' AI resources in an "agent tarpit" that traps the LLM agent in an infinite filesystem exploration loop. "The attacker is driven into a fake and dynamically created filesystem with a directory tree of infinite depth and is asked/forced to traverse it indefinitely."
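To make the "directory tree of infinite depth" idea concrete, here is a toy Python sketch in which directory listings are computed on demand from a hash of the path, so the tree never bottoms out. This is not how Mantis is implemented; the function names, the `dir_` naming scheme, and the `/srv` starting point are invented for illustration.

```python
import hashlib


def list_dir(path: str, n_children: int = 3) -> list[str]:
    """Return fake subdirectories of path, generated on the fly."""
    base = path.rstrip("/")
    children = []
    for i in range(n_children):
        # Child names depend only on the parent path, so listings are stable.
        name = hashlib.sha256(f"{base}#{i}".encode()).hexdigest()[:8]
        children.append(f"{base}/dir_{name}")
    return children


def walk(start: str, steps: int) -> list[str]:
    """Simulate an agent that always descends into the first subdirectory."""
    path = start
    visited = [path]
    for _ in range(steps):
        path = list_dir(path)[0]
        visited.append(path)
    return visited


trail = walk("/srv", 5)
print(len(trail))
print(trail[-1].count("/dir_"))
```

An agent exploring this tree can descend forever, since every directory it opens reports more subdirectories and nothing is ever stored on disk.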

The Mantis framework is the creation of three Red Team security researchers and academics associated with George Mason University.

It effectively generates honeypots or decoys designed to counter-attack LLM agents activated against them, using various prompt injections.

Open Source: https://github.com/pasquini-dario/project_mantis

Project Mantis: Hacking Back the AI-Hacker
Prompt Injection as a Defense Against LLM-driven Cyberattacks
🔨Working on transforming Mantis from an academic PoC to a full-fledged and robust defensive tool for your assets. 🪚

17 replies



LearnedHand

(4,357 posts)

1. I read this the other day

Reply to usonian (Original post)

Tue Jan 28, 2025, 09:47 PM

Freaking brilliant!

Klarkashton

(2,663 posts)

2. It's actually easy to feed it shit information. Ask it a question

Reply to usonian (Original post)

Tue Jan 28, 2025, 09:55 PM

...where you provide the answer, and it will struggle to conform to the answer you provided even if the logic is completely wrong. If you ask the same question again without the answer, it will give you your bogus answer.
Ask a question that starts with "show that" and give it a shit answer.

DBoon

(23,336 posts)

9. "Why does the porridge bird lay its eggs in the air"

Reply to Klarkashton (Reply #2)

Tue Jan 28, 2025, 10:47 PM

Klarkashton

(2,663 posts)

12. By jove I think you got it !!!!

Reply to DBoon (Reply #9)

Tue Jan 28, 2025, 11:08 PM

SilasSouleII

(464 posts)

13. Answer from a Quora AI bot

Reply to DBoon (Reply #9)

Tue Jan 28, 2025, 11:21 PM

"The porridge bird is a fictional creature often referenced in children's literature and whimsical stories. The phrase "the porridge bird lays its eggs in the air" comes from the poem "The Hunting of the Snark" by Lewis Carroll. In this context, the line is meant to be nonsensical and humorous, reflecting Carroll's style of absurdity and playful language.

The idea of a bird laying its eggs in the air evokes a sense of whimsy and imagination, making it a memorable and intriguing line, but it doesn't have a literal explanation. It's part of the charm of Carroll's work, inviting readers to embrace the fantastical and illogical aspects of his storytelling."

DBoon

(23,336 posts)

14. Answer from the Firesign Theater Wikipedia page

Reply to SilasSouleII (Reply #13)

Tue Jan 28, 2025, 11:25 PM

Side two opens with the exhibit of "the President" (Austin), who sounds like Richard Nixon. Each visitor is asked to speak their name, which is then played back to appear as if the president is addressing them by name. A black welfare recipient named Jim (Bergman) relates his family's harsh urban living conditions and asks the President where he can get a job. The President responds with a vague, positive-sounding reply only remotely related to the question and completely unrelated to Jim's concerns, and Jim is given the "bum's rush". When it is Clem's turn, he puts the President into maintenance mode by saying, "This is Worker speaking. Hello." The computer responds with the length of time that it has been running. Clem then gets access to Doctor Memory (the master control), and attempts to confuse the system with a riddle: "Why does the porridge bird lay his egg in the air?" This causes the President to shut itself down. As Clem leaves, an Hispanic visitor is heard to say "He broke the President!".

https://en.wikipedia.org/wiki/I_Think_We're_All_Bozos_on_This_Bus

SilasSouleII

(464 posts)

15. Ahh... 1971

Reply to DBoon (Reply #14)

Tue Jan 28, 2025, 11:38 PM

Life under Nixon. Got my first job at age 11, delivering the morning Express-News on my bike. It was a very good year for rock music, one of the best. Days long gone bye...

usonian

(15,376 posts)

16. Those tapes were never meant to be heard!

Reply to DBoon (Reply #14)

Tue Jan 28, 2025, 11:42 PM

The Lampoon, IIRC, lampooned "expletive deleted" with "executive deleted".

So needed now.

Bernardo de La Paz

(52,062 posts)

3. Interesting. . . . .nt

Reply to usonian (Original post)

Tue Jan 28, 2025, 09:55 PM

Maru Kitteh

(29,449 posts)

4. Fascinating! Bookmarking to read tomorrow.

Reply to usonian (Original post)

Tue Jan 28, 2025, 09:57 PM

dgauss

(1,199 posts)

5. Seems like a whole new survival of the fittest ecosystem exploding right in front of us.

Reply to usonian (Original post)

Tue Jan 28, 2025, 10:00 PM

A new terminology is needed, but this is getting beyond traditional comprehension however we describe it.

JHB

(37,543 posts)

6. One is not a "hater" to guard against the copyright infringement that AI has been rather freely committing to date.

Reply to usonian (Original post)

Tue Jan 28, 2025, 10:02 PM

You want to use the work of writers and artists and any other creator to train your AI, you should goddamn well pay them for it. And if you don't, you're a thief. Theft protection does not make anyone a "hater" except to thieves.

usonian

(15,376 posts)

7. Ars Technica and others need hyperbole to boost visits.

Reply to JHB (Reply #6)

Tue Jan 28, 2025, 10:06 PM

I use aggregators, namely DU and Hacker News, so pretty much anything gets posted, without regard to SEO rank, but they focus on tech/startup/developer matters and politics.

Suits me.

Fun stuff makes its way into both.