General Discussion
An AI Scraping Tool Is Overwhelming Websites With Traffic (Vice)
https://www.vice.com/en/article/dy3vmx/an-ai-scraping-tool-is-overwhelming-websites-with-traffic-snip-
Img2dataset is a free tool Beaumont shared on GitHub which allows users to automatically download and resize a list of URLs. The result is an image dataset, the kind that trains image-generating AI models like OpenAI's DALL-E, the open source Stable Diffusion model, and Google's Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world that contains more than 5 billion images and is used by Imagen and Stable Diffusion.
Img2dataset will attempt to scrape images from any site unless site owners add HTTP headers like X-Robots-Tag: noai and X-Robots-Tag: noindex. That means that the onus is on site owners, many of whom probably don't even know img2dataset exists, to opt out of img2dataset rather than opt in.
-snip-
"I noticed because I received an alert from my host that the site was under a sustained attack," Eden said. "I had to pay to scale up my server, pay extra for export traffic, and spent part of my weekend blocking the abuse caused by this specific bot."
-snip-
Much more at the link.
The smug arrogance of a lot of the people behind this abuse by and for AI is disgusting. This one suggests website owners just shut down their websites if they don't want all the images scraped by AI tools. And he said it would be "unethical" if the scraping was limited to those websites that opted in, because that would be "letting a small minority prevent the large majority from sharing their images and from having the benefit of last gen AI tool."
Btw, if I'm reading this correctly, since this tool is provided for free, it can potentially be used by at least thousands of individuals and businesses to scrape images from sites. Think what that can do to the traffic, and the cost to the website owner.
And the jerk who created this scraping tool not only won't consider making it opt-in, but has provided users of his tool with an option that lets them IGNORE the code telling them not to scrape a website, according to some of the comments posted at https://github.com/rom1504/img2dataset/issues/293 .
At the point they discover it (by wondering what is hammering their site and reading their logs), and then manage to find your readme and add the relevant header, it's already too late, as your tool has most likely already totally ingested their content without any consent.
Add to that the fact that you document an option to directly ignore the flags they could use to opt out.
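For anyone wondering what that opt-out header amounts to in practice, here is a rough Python sketch of the kind of check a well-behaved downloader could run on a response before keeping an image. This is not img2dataset's actual code. The function name `opted_out` and the headers-as-pairs input are made up for illustration, and the only directives it looks for are the two named in the Vice excerpt (noai and noindex).

```python
# Hypothetical illustration only: not taken from img2dataset.
# The directive names are the two mentioned in the Vice excerpt.
OPT_OUT_DIRECTIVES = {"noai", "noindex"}


def opted_out(headers, directives=OPT_OUT_DIRECTIVES):
    """Return True if any X-Robots-Tag header carries an opt-out directive.

    `headers` is a list of (name, value) pairs, because a response may
    send the same header more than once.
    """
    for name, value in headers:
        # Header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # One header value may hold several comma-separated directives.
        # Simplification: bot-specific prefixes ("somebot: noindex") are
        # not handled here.
        tokens = {token.strip().lower() for token in value.split(",")}
        if tokens & directives:
            return True
    return False


# A site that sends the opt-out header:
print(opted_out([("Content-Type", "image/png"),
                 ("X-Robots-Tag", "noai, noindex")]))  # True

# A site that never heard of the tool and sends nothing:
print(opted_out([("Content-Type", "image/png")]))  # False
```

The second case is the whole complaint: a site that sends nothing is treated as fair game, and a check like this only helps if the person running the tool leaves it switched on.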
marble falls
(71,919 posts)
blogslug
(39,167 posts)Dude seems to be implementing his own HTML standard. I'm no expert so maybe I don't know what I'm talking about.
progressoid
(53,179 posts)
slightlv
(7,790 posts)disinformation, you might as well hang it up on reading anything and knowing whether it's true or not. There'll be just no way. And you know, just like this smug pirate has done, there'll be R programmers of AI who'll swing it to chat out "their truth" with backup, etc., that shows their side only, with pics to "prove it."
Look, I know AI is inevitable. The whole trouble with the World, in general, is IF something can be done, it WILL be done... never mind if it Should be done, and damn the consequences. I honestly don't know how we get out of this. As one who, at 67, made her whole career in computers in all formats - from building to programming to design for business and radio - I'm ready to go back to horse and buggy and kill the devices until the humans evolve a whole lot more. YMMV, obviously.
progressoid
(53,179 posts)This will just make it magnitudes worse.
slightlv
(7,790 posts)From the "front page" of DU this morning, looks like they're already doing their first attempts. They'll only get worse from here...
2naSalit
(102,789 posts)Needs to be destroyed already.
LostOne4Ever
(9,752 posts)And lots of them.
soldierant
(9,354 posts)Images are owned, and owners have copyright in their work, whether or not they have registered it.
highplainsdem
(62,136 posts)
AverageOldGuy
(3,833 posts)
Justice matters.
(9,786 posts)Img2dataset will attempt to scrape images from any site unless site owners add HTTP headers like X-Robots-Tag: noai and X-Robots-Tag: noindex. That means that the onus is on site owners, many of whom probably don't even know img2dataset exists, to opt out of img2dataset rather than opt in.
No idea of how that electrons stuff works.
highplainsdem
(62,136 posts)users of the tool a way to avoid that sort of code so they can scrape the website anyway.