
highplainsdem

(62,136 posts)
Tue Apr 25, 2023, 09:01 PM

An AI Scraping Tool Is Overwhelming Websites With Traffic (Vice)

https://www.vice.com/en/article/dy3vmx/an-ai-scraping-tool-is-overwhelming-websites-with-traffic

The creator of a tool that scrapes the internet for images in order to power artificial intelligence image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and that it’s “sad” that they are fighting the inevitable rise of AI.

-snip-

Img2dataset is a free tool Beaumont shared on GitHub which allows users to automatically download, and resize a list of URLs. The result is an image dataset, the kind that trains image-generating AI models like Open AI’s DALL-E, the open source Stable Diffusion model, and Google’s Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world that contains more than 5 billion images and is used by Imagen and Stable Diffusion.

Img2dataset will attempt to scrape images from any site unless site owners add https headers like “X-Robots-Tag: noai,” and “X-Robots-Tag: noindex.” That means that the onus is on site owners, many of whom probably don’t even know img2dataset exists, to opt out of img2dataset rather than opt in.

-snip-

“I noticed because I received an alert from my host that the site was under a sustained attack,” Eden said. “I had to pay to scale up my server, pay extra for export traffic, and spent part of my weekend blocking the abuse caused by this specific bot.”

-snip-


Much more at the link.

The smug arrogance of a lot of the people behind this abuse by and for AI is disgusting. This one suggests website owners just shut down their websites if they don't want all the images scraped by AI tools. And he said it would be "unethical" if the scraping was limited to those websites that opted in, because that would be "letting a small minority prevent the large majority from sharing their images and from having the benefit of last gen AI tool."

Btw, if I'm reading this correctly, since this tool is provided for free, it can potentially be used by at least thousands of individuals and businesses to scrape images from sites. Think what that can do to the traffic, and the cost to the website owner.

And the jerk who created this scraping tool not only won't consider making it opt-in, but has provided users of his tool with an option that lets them IGNORE the code telling them not to scrape a website, according to some of the comments posted at https://github.com/rom1504/img2dataset/issues/293 .

The fact that you then provide an option for users of your tool to then even disregard the choice of people who do explicitly remove consent is an alarming red flag.


I'm sorry, but your logic is totally flawed, it depends on image owners knowing your tool exists before it indexes their site.

At the point they discover it (by wondering what is hammering their site and reading their logs), and then manage to find your readme and add the relevant header it's already too late as your tool has most likely already totally ingested their content without any consent.

Add to the fact you document an option to directly ignore the flags they could use to opt-out.

An AI Scraping Tool Is Overwhelming Websites With Traffic (Vice) (Original Post) highplainsdem Apr 2023 OP
Just another pirate sailing the electron sea. Like bitcoin miners. marble falls Apr 2023 #1
I wonder if the W3C has an opinion on this blogslug Apr 2023 #2
If you think Russian bots were a problem in previous elections, you ain't seen nothin' yet. progressoid Apr 2023 #3
With this and the other AI generated slightlv Apr 2023 #4
Right wing loons are already susceptible to conspiracies. progressoid Apr 2023 #12
I see they're already at it... slightlv Apr 2023 #13
Sounds like AI... 2naSalit Apr 2023 #5
We need lawsuits and regulations pronto LostOne4Ever Apr 2023 #6
I was thinking pretty much the same. soldierant Apr 2023 #7
+1,000,000 highplainsdem Apr 2023 #11
Any way to block it from a website? AverageOldGuy Apr 2023 #8
Snip from the OP: Justice matters. Apr 2023 #9
See the end of the OP. The guy who created this scraping tool gives highplainsdem Apr 2023 #10

blogslug

(39,167 posts)
2. I wonder if the W3C has an opinion on this
Tue Apr 25, 2023, 09:07 PM

Dude seems to be implementing his own HTML standard. I'm no expert so maybe I don't know what I'm talking about.

slightlv

(7,790 posts)
4. With this and the other AI generated
Tue Apr 25, 2023, 10:01 PM

disinformation, you might as well hang it up on reading anything and knowing whether it's true or not. There'll be just no way. And you know, just like this smug pirate has done, there'll be R programmers of AI who'll swing it to chat out "their truth" with backup, etc., that shows their side only, with pics to "prove it."

Look, I know AI is inevitable. The whole trouble with the World, in general, is IF something can be done, it WILL be done... never mind if it Should be done, and damn the consequences. I honestly don't know how we get out of this. As one who, at 67, made her whole career in computers in all formats - from building to programming to design for business and radio - I'm ready to go back to horse and buggy and kill the devices until the humans evolve a whole lot more. YMMV, obviously.

progressoid

(53,179 posts)
12. Right wing loons are already susceptible to conspiracies.
Wed Apr 26, 2023, 02:16 AM

This will just make it magnitudes worse.

slightlv

(7,790 posts)
13. I see they're already at it...
Wed Apr 26, 2023, 11:58 AM

From the "front page" of DU this morning, looks like they're already doing their first attempts. They'll only get worse from here...

soldierant

(9,354 posts)
7. I was thinking pretty much the same.
Tue Apr 25, 2023, 11:27 PM

Images are owned, and owners have copyright in their work, whether or not they have registered it.

Justice matters.

(9,786 posts)
9. Snip from the OP:
Tue Apr 25, 2023, 11:54 PM
Img2dataset will attempt to scrape images from any site unless site owners add https headers like “X-Robots-Tag: noai,” and “X-Robots-Tag: noindex.” That means that the onus is on site owners, many of whom probably don’t even know img2dataset exists, to opt out of img2dataset rather than opt in.


No idea of how that electrons stuff works.

highplainsdem

(62,136 posts)
10. See the end of the OP. The guy who created this scraping tool gives
Wed Apr 26, 2023, 12:01 AM

users of the tool a way to avoid that sort of code so they can scrape the website anyway.
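
For anyone wondering what the opt-out from the OP looks like in practice: "X-Robots-Tag" is an HTTP response header that a web server sends along with each page or image. Below is a rough Python sketch, not the tool's actual code. It assumes only what the quoted article says (the header name and the "noai" and "noindex" values); the function names are made up for illustration, and it ignores finer points of the header such as per-crawler prefixes.

```python
# Hypothetical helper names. Only the header name and the two directives
# ("noai", "noindex") come from the article quoted in the OP.
OPT_OUT_DIRECTIVES = {"noai", "noindex"}


def opt_out_headers():
    """Headers a site owner might add to every image response."""
    return [("X-Robots-Tag", "noai"), ("X-Robots-Tag", "noindex")]


def is_opted_out(headers):
    """Return True if any X-Robots-Tag value carries an opt-out directive.

    `headers` is a list of (name, value) pairs. Header names are compared
    case-insensitively, and one value may hold several comma-separated
    directives.
    """
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & OPT_OUT_DIRECTIVES:
            return True
    return False


if __name__ == "__main__":
    print(is_opted_out(opt_out_headers()))
    print(is_opted_out([("Content-Type", "image/png")]))
```

A well-behaved scraper would run a check like `is_opted_out` on each response and skip the image when it returns True. As the comments quoted in the OP point out, though, the tool documents an option to ignore these flags, so the header is a request, not a barrier.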
