Tired of seeing their hard work pilfered by the tech sector's artificial intelligence giants, the creative industry is starting to fight back. While on the surface its argument is about the principle of copyright, what the clash reveals is just how little we know about the data behind breakthrough tech like ChatGPT. The lack of transparency is getting worse, and it stands in the way of creatives being fairly paid and, ultimately, of AI being safe.
A trickle of legal challenges against AI companies could soon become a downpour. Media conglomerate IAC is reported to be teaming up with large publishers including the New York Times in a lawsuit alleging the improper use of their content to build AI-powered chatbots.
One reading of this is that publishers are running scared. The threat AI poses to their businesses is obvious: People who might have once read a newspaper's restaurant reviews may now choose to ask an AI chatbot where to go to dinner, and so on.
But the bigger factor is that publishers are beginning to understand their value in the age of AI, albeit somewhat after the horse has bolted. AI models are only as good as the data put into them. Text and images produced by leading media organisations should, in theory, be of high quality and help AI tools like ChatGPT generate better results. If AI companies want to use great articles and photography, created by real people, they should be paying for the privilege. So far, for the most part, they haven't been.
Forcing them to change is going to prove difficult, thanks to some willful acts of obfuscation. As AI has grown more sophisticated, transparency has taken a back seat. In a distinct departure from the early days of machine-learning research, when teams of computer scientists, such as the "Transformer 8," went into intricate detail about their training data, leading AI developers now use vague language about their sources.
OpenAI's GPT-4 is trained "using publicly available data [such as internet data] as well as data we've licensed," the company explained in its release notes for the model, revealing little else. Meta's equivalent, the newly released Llama 2, was similarly vague. The company said it had been trained on a "new mix of data from publicly available sources."
Contrast that with what Meta said in February when it unveiled the first version of Llama. Then, it broke down in a spreadsheet the various sources that had been used: 4.5 percent of the dataset, for example, consisted of 83 gigabytes' worth of Wikipedia articles, in 20 languages, scraped between June and August 2022.
Those old disclosures were enough to provoke two recent class-action lawsuits fronted by comedian Sarah Silverman and two other authors. They argue that even those vague early descriptions from OpenAI and Meta about sources raised the likelihood that the companies used the writers' books without permission.
But it isn't an exact science: Getting to the bottom of where training data for AI comes from is like unstacking a Russian nesting doll. By the time data is picked up by a company like OpenAI, it may have been gathered and processed by any number of smaller groups. Accountability becomes a lot more difficult.
In the search for common-sense regulation on AI, insisting on transparency seems like a straightforward place to start. Only by understanding what is in datasets can we begin to tackle the next step of limiting the potential harm of the technology. Knowing more about the data reveals not only the owners of that content but also any inherent flaws within it, allowing outsiders to examine it for bias or blind spots.
Plus, only by supporting the economy that creates content can more of it be sustainably made. The risk of "inbreeding," where AI-generated text ends up training future models, could exacerbate quality-control issues within large language models. "If they bankrupt the creative industry, they'll end up bankrupting themselves," said Matthew Butterick, one of the attorneys behind the Silverman effort.
At a White House meeting last week, seven of the largest AI companies agreed to voluntary measures around safety, security and trust. Included were smart suggestions on pre-release testing, cybersecurity and disclosures to the end user when something has been made by AI.
All good ideas. But what's urgently needed are laws requiring standardised disclosures on what data sources have been used to train large language models. Otherwise, the pledges to avoid the same mistakes made with social media, when "black box" algorithms caused great societal damage, ring hollow. Senate Majority Leader Chuck Schumer is preparing sweeping regulations with a promise to take into consideration how to protect copyright and intellectual property. The European Union's proposed AI Act could set a standard by forcing disclosure when copyrighted material is used. The US Federal Trade Commission, in a letter to OpenAI this month, demanded more information on "all sources of data" for GPT. We'll see what that turns up.
In the meantime, content licensing agreements, such as the one recently entered into by the Associated Press and OpenAI, seem like a step in the right direction, though with the terms undisclosed it's hard to know who benefits the most.
Unlike the all-smiles agreement by the AI companies on the White House voluntary measures (which should be reason enough to be suspicious of them), tougher data disclosure requirements won't come without heavy resistance from Silicon Valley. Content creators and the tech titans are headed for a cultural collision. OpenAI Chief Executive Officer Sam Altman gave a recent taste, writing on Twitter: "Everything 'creative' is a remix of things that happened in the past."
Expect this to become not only the moral justification for scraping content at will, but also the legal foundation. Tech companies argue that such use of data can be covered under "fair use," the legal doctrine that has long allowed building on copyrighted works as inspiration, subject to some stipulations over intended use.
It's becoming clear that protections designed to help creatives are at risk of being weaponised as a justification for not paying them, and for not even telling them their work has been taken at all. We're just starting to see this defense tested in court. It can only be a fair trial if AI companies are forced to be honest about how their technology really works.
Dave Lee is Bloomberg Opinion's US technology columnist. Views are personal and do not represent the stand of this publication.
Credit: Bloomberg