A new set of laws governing the use of artificial intelligence (AI) in the European Union will force companies to be more transparent about the data used to train their systems, prying open one of the industry’s most closely guarded secrets.
In the 18 months since Microsoft-backed OpenAI unveiled ChatGPT to the public, there has been a surge of public engagement and investment in generative AI, a set of applications that can be used to rapidly produce text, images, and audio content.
But as the industry booms, questions have been raised over how AI companies obtain the data used to train their models, and whether feeding them bestselling books and Hollywood movies without their creators’ permission amounts to a breach of copyright.
The EU’s recently passed AI Act is being rolled out in phases over the next two years, giving regulators time to implement the new laws while businesses grapple with a new set of obligations. But how exactly some of these rules will work in practice is still unknown.
One of the more contentious sections of the Act states that organisations deploying general-purpose AI models, such as ChatGPT, will have to provide “detailed summaries” of the content used to train them. The newly established AI Office said it plans to release a template for organisations to follow in early 2025, following a consultation with stakeholders.
While the details have yet to be hammered out, AI companies are highly resistant to revealing what their models have been trained on, describing the information as a trade secret that would give competitors an unfair advantage were it made public.
“It would be a dream come true to see my competitors’ datasets, and likewise for them to see ours,” said Matthieu Riouf, CEO of AI-powered image-editing firm Photoroom.
“It’s like cooking,” he added. “There’s a secret part of the recipe that the best chefs wouldn’t share, the ‘je ne sais quoi’ that makes it different.”
How granular these transparency reports end up being will have big implications for smaller AI startups and big tech companies like Google and Meta, which have put the technology at the centre of their future operations.
Sharing trade secrets
Over the past year, a number of prominent tech companies, including Google, OpenAI, and Stability AI, have faced lawsuits from creators claiming their content was improperly used to train the companies’ models.
While U.S. President Joe Biden has issued a number of executive orders focused on the security risks of AI, questions over copyright have not been fully tested. Calls for tech companies to pay rights holders for data have received bipartisan support in Congress.
Amid growing scrutiny, tech companies have signed a flurry of content-licensing deals with media outlets and websites. Among others, OpenAI signed deals with the Financial Times and The Atlantic, while Google struck deals with NewsCorp and social media site Reddit.
Despite such moves, OpenAI drew criticism in March when CTO Mira Murati declined to answer a question from the Wall Street Journal on whether YouTube videos had been used to train its video-generating tool Sora, a use that YouTube said would breach its terms and conditions.
Last month, OpenAI faced further backlash for featuring, in a public demonstration of the newest version of ChatGPT, an AI-generated voice that actress Scarlett Johansson described as “eerily similar” to her own.
Thomas Wolf, co-founder of leading AI startup Hugging Face, said he supported greater transparency, but that sentiment was not shared across the industry. “It’s hard to know how it will work out. There is still a lot to be decided,” he said.
Senior lawmakers across the continent remain divided.
Dragos Tudorache, one of the lawmakers who oversaw the drafting of the AI Act in the European Parliament, said that AI companies should be compelled to make their datasets public.
“They have to be detailed enough for Scarlett Johansson, Beyonce, or for whoever to know if their work, their songs, their voice, their art, or their science were used in training the algorithm,” he said.
A Commission official said: “The AI Act acknowledges the need to ensure an appropriate balance between the legitimate need to protect trade secrets and, on the other hand, the need to facilitate the ability of parties with legitimate interests, including copyright holders, to exercise their rights under Union law.”
Under President Emmanuel Macron, the French government has privately opposed introducing rules that could hinder European AI startups’ competitiveness.
Speaking at the Viva Technology conference in Paris in May, French finance minister Bruno Le Maire said he wanted Europe to be a world leader in AI, and not only a consumer of American and Chinese products.
“For once, Europe, which has created controls and standards, needs to understand that you have to innovate before regulating,” he said. “Otherwise, you run the risk of regulating technologies that you haven’t mastered, or regulating them badly because you haven’t mastered them.”