General Discussion

highplainsdem

(62,136 posts) Tue Apr 25, 2023, 09:01 PM

An AI Scraping Tool Is Overwhelming Websites With Traffic (Vice)

https://www.vice.com/en/article/dy3vmx/an-ai-scraping-tool-is-overwhelming-websites-with-traffic

The creator of a tool that scrapes the internet for images in order to power artificial intelligence image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and that it’s “sad” that they are fighting the inevitable rise of AI.

-snip-

Img2dataset is a free tool Beaumont shared on GitHub which allows users to automatically download and resize images from a list of URLs. The result is an image dataset, the kind that trains image-generating AI models like OpenAI’s DALL-E, the open source Stable Diffusion model, and Google’s Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world, which contains more than 5 billion images and is used by Imagen and Stable Diffusion.

Img2dataset will attempt to scrape images from any site unless site owners add HTTP headers like “X-Robots-Tag: noai,” and “X-Robots-Tag: noindex.” That means that the onus is on site owners, many of whom probably don’t even know img2dataset exists, to opt out of img2dataset rather than opt in.
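For anyone wondering what "adding those headers" looks like in practice, here is a minimal sketch in Python. It assumes the site is served through a WSGI app, which is only an assumption for illustration. The names `OPT_OUT_DIRECTIVES`, `add_opt_out_headers`, `demo_app` and `serve` are made up for this example and do not come from img2dataset or the article. The two directives are the ones quoted above, and whether any given scraper respects them is up to that scraper.

```python
from wsgiref.simple_server import make_server

# Directives quoted in the article. Whether a scraper honors them is up to the scraper.
OPT_OUT_DIRECTIVES = ("noai", "noindex")


def add_opt_out_headers(app):
    """Wrap a WSGI app so every response carries the X-Robots-Tag opt-out headers."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append one X-Robots-Tag header per directive to the app's own headers.
            extra = [("X-Robots-Tag", d) for d in OPT_OUT_DIRECTIVES]
            return start_response(status, list(headers) + extra, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for a real site: returns a short plain-text page."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello\n"]


def serve(port=8000):
    """Serve the wrapped demo app locally; call serve() to try it out."""
    with make_server("", port, add_opt_out_headers(demo_app)) as httpd:
        httpd.serve_forever()
```

On a real site the same header would more likely be set once in the web server or CDN configuration rather than in application code. Either way it only helps against tools that choose to check for it.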

-snip-

“I noticed because I received an alert from my host that the site was under a sustained attack,” Eden said. “I had to pay to scale up my server, pay extra for export traffic, and spent part of my weekend blocking the abuse caused by this specific bot.”

-snip-

Much more at the link.

The smug arrogance of a lot of the people behind this abuse by and for AI is disgusting. This one suggests website owners just shut down their websites if they don't want all the images scraped by AI tools. And he said it would be "unethical" if the scraping was limited to those websites that opted in, because that would be "letting a small minority prevent the large majority from sharing their images and from having the benefit of last gen AI tool."

Btw, if I'm reading this correctly, since this tool is provided for free, it can potentially be used by at least thousands of individuals and businesses to scrape images from sites. Think what that can do to the traffic, and the cost to the website owner.

And the jerk who created this scraping tool not only won't consider making it opt-in, but has provided users of his tool with an option that lets them IGNORE the code telling them not to scrape a website, according to some of the comments posted at https://github.com/rom1504/img2dataset/issues/293:

The fact that you then provide an option for users of your tool to then even disregard the choice of people who do explicitly remove consent is an alarming red flag.

I'm sorry, but your logic is totally flawed, it depends on image owners knowing your tool exists before it indexes their site.

At the point they discover it (by wondering what is hammering their site and reading their logs), and then manage to find your readme and add the relevant header it's already too late as your tool has most likely already totally ingested their content without any consent.

Add to that the fact that you document an option to directly ignore the flags they could use to opt out.

marble falls

(71,919 posts)

1. Just another pirate sailing the electron sea. Like bitcoin miners.

Reply to highplainsdem (Original post)

Tue Apr 25, 2023, 09:06 PM


blogslug

(39,167 posts)

2. I wonder if the W3C has an opinion on this

Reply to highplainsdem (Original post)

Tue Apr 25, 2023, 09:07 PM


Dude seems to be implementing his own HTML standard. I'm no expert so maybe I don't know what I'm talking about.

progressoid

(53,179 posts)

3. If you think Russian bots were a problem in previous elections, you ain't seen nothin' yet.

Reply to highplainsdem (Original post)

Tue Apr 25, 2023, 09:45 PM


slightlv

(7,790 posts)

4. With this and the other AI generated

Reply to progressoid (Reply #3)

Tue Apr 25, 2023, 10:01 PM


disinformation, you might as well hang it up on reading anything and knowing whether it's true or not. There'll be just no way. And you know, just like this smug pirate has done, there'll be R programmers of AI who'll swing it to chat out "their truth" with backup, etc., that shows their side only, with pics to "prove it."

Look, I know AI is inevitable. The whole trouble with the World, in general, is IF something can be done, it WILL be done... never mind if it Should be done, and damn the consequences. I honestly don't know how we get out of this. As one who, at 67, made her whole career in computers in all formats - from building to programming to design for business and radio - I'm ready to go back to horse and buggy and kill the devices until the humans evolve a whole lot more. YMMV, obviously.

progressoid

(53,179 posts)

12. Right wing loons are already susceptible to conspiracies.

Reply to slightlv (Reply #4)

Wed Apr 26, 2023, 02:16 AM


This will just make it orders of magnitude worse.

slightlv

(7,790 posts)

13. I see they're already at it...

Reply to progressoid (Reply #12)

Wed Apr 26, 2023, 11:58 AM


From the "front page" of DU this morning, looks like they're already doing their first attempts. They'll only get worse from here...

2naSalit

(102,789 posts)

5. Sounds like AI...

Reply to highplainsdem (Original post)

Tue Apr 25, 2023, 10:13 PM


Needs to be destroyed already.

LostOne4Ever

(9,752 posts)

6. We need lawsuits and regulations pronto

Reply to highplainsdem (Original post)

Tue Apr 25, 2023, 11:18 PM


And lots of them.

soldierant

(9,354 posts)

7. I was thinking pretty much the same.

Reply to LostOne4Ever (Reply #6)

Tue Apr 25, 2023, 11:27 PM


Images are owned, and owners have copyright in their work, whether or not they have registered it.

highplainsdem

(62,136 posts)

11. +1,000,000

Reply to LostOne4Ever (Reply #6)

Wed Apr 26, 2023, 01:23 AM


AverageOldGuy

(3,833 posts)

8. Any way to block it from a website?

Reply to highplainsdem (Original post)

Tue Apr 25, 2023, 11:32 PM


Justice matters.

(9,786 posts)

9. Snip from the OP:

Reply to AverageOldGuy (Reply #8)

Tue Apr 25, 2023, 11:54 PM


Img2dataset will attempt to scrape images from any site unless site owners add HTTP headers like “X-Robots-Tag: noai,” and “X-Robots-Tag: noindex.” That means that the onus is on site owners, many of whom probably don’t even know img2dataset exists, to opt out of img2dataset rather than opt in.

No idea how that electrons stuff works.

highplainsdem

(62,136 posts)

10. See the end of the OP. The guy who created this scraping tool gives

Reply to Justice matters. (Reply #9)

Wed Apr 26, 2023, 12:01 AM


users of the tool a way to avoid that sort of code so they can scrape the website anyway.
