
Trapped in an 'AI labyrinth': One company's plan to stop bots scraping content for AI training

"We wanted to create a new way to thwart these unwanted bots, without letting them know," Cloudflare said of its "honeypot" for web crawlers.

How can we stop artificial intelligence (AI) from stealing our content? US-based web services provider Cloudflare says it has come up with a solution to web scraping - by setting up an "AI labyrinth" to trap bots.

More specifically, the maze is designed to detect "AI crawlers" - bots that systematically mine content from web pages - and trap them there.

The company said in a blog post published last week that it has seen "an explosion of new crawlers used by AI companies to scrape data for model training".

Generative artificial intelligence (genAI) requires enormous databases for training its models. Several tech companies - such as OpenAI, Meta, or Stability AI - have been accused of extracting data that includes copyrighted content.

To prevent this, Cloudflare will, upon detecting "inappropriate bot activity", "link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them", making the bots waste time and resources.

"We wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted," the company said, comparing the process to a "honeypot" that also helps it catalogue nefarious actors.
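The honeypot idea described above can be sketched in a few lines. This is a hypothetical illustration, not Cloudflare's actual code: the crawler names, the `/maze/` URL scheme, and all function names here are assumptions made for the example.

```python
# Illustrative sketch of an "AI labyrinth" honeypot: when a request looks
# like an unwanted AI crawler, serve a page of links leading only deeper
# into AI-generated decoy content instead of the real article.
# KNOWN_AI_CRAWLERS and the /maze/ scheme are assumptions for this sketch.

KNOWN_AI_CRAWLERS = {"GPTBot", "CCBot", "Bytespider"}

def is_ai_crawler(user_agent: str) -> bool:
    """Naive detection: match the User-Agent against known crawler names."""
    return any(name.lower() in user_agent.lower() for name in KNOWN_AI_CRAWLERS)

def make_decoy_page(depth: int) -> str:
    """Build an HTML page whose links lead only further into the maze."""
    links = "".join(
        f'<a href="/maze/{depth + 1}/{i}">related reading</a>' for i in range(5)
    )
    # Plausible but irrelevant filler stands in for the AI-generated content.
    return f"<html><body><p>Generic filler, level {depth}.</p>{links}</body></html>"

def handle_request(user_agent: str, real_page: str, depth: int = 0) -> str:
    """Serve the real page to humans, the labyrinth to suspected crawlers."""
    if is_ai_crawler(user_agent):
        return make_decoy_page(depth)
    return real_page
```

A real deployment would rely on far richer signals than the User-Agent header (request patterns, IP reputation, behavioural fingerprints), but the trap itself works the same way: each decoy page links only to more decoy pages.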

Cloudflare is used by around 20 per cent of all websites, according to the latest estimates.

The decoy content is "real and related to scientific facts" but "just not relevant or proprietary to the site being crawled," the blog post added.

It will also be invisible to human visitors and won’t affect the site’s search engine ranking, the company said.
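The claim above - invisible to humans, no impact on search ranking - can be achieved with standard web conventions. A minimal sketch, assuming the decoy pages use robots meta tags and hidden link markup (this is illustrative, not Cloudflare's implementation):

```python
# Hypothetical sketch: keep decoy pages out of search indexes and hide the
# entry links from human visitors, while naive crawlers still follow them.

def decoy_page_head() -> str:
    """Meta tag telling well-behaved search engines to skip the decoy page."""
    return '<meta name="robots" content="noindex, nofollow">'

def hidden_maze_link(url: str) -> str:
    """A link humans never see; crawlers parsing raw HTML may still follow it."""
    return f'<a href="{url}" style="display:none" aria-hidden="true">archive</a>'
```

The asymmetry is the point: compliant search engine bots honour `noindex`/`nofollow` and rendering, while scrapers that ignore such conventions walk straight into the maze.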

Rising threat to copyrighted content

An increasing number of voices are calling for stronger measures, including regulations, to protect content from being stolen by AI actors.

Visual artists are now exploring how to "poison" models by adding a layer of data that acts as a decoy for AI, preserving their artistic style by making it harder for genAI to mimic.

Other approaches have also been explored: several news publishers, for example, have struck deals allowing tech companies to train AI on their content in exchange for undisclosed sums.

Others, like the news agency Reuters and several artists, have taken the matter to court over potential copyright infringement.
