Saturday, April 6, 2024

Data hunger is leading AI into crazyland [Tulip mania redux]

Oh, I understand why they think they need more, more, and more data, data, data. But I don’t think scaling is the answer, and the industry is now headed into crazyland. How can so many smart people be so stupid? Intellectually lazy? Lack of imagination? Above all, though, is simple greed. The prospect of untold riches, fame, and glory is swamping everything else. When will this story surpass Tulip mania in the annals of greed-driven crazy, or has it already done so?

Hey! We’re rich, we’re powerful. Why not bend/break the law?

Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant, How Tech Giants Cut Corners to Harvest Data for A.I., NYTimes, April 6, 2024.

The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times.

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.

Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its A.I. products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming A.I. industry. Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates.

Copyright suits:

For creators, the growing use of their works by A.I. companies has prompted lawsuits over copyright and licensing. The Times sued OpenAI and Microsoft last year for using copyrighted news articles without permission to train A.I. chatbots. OpenAI and Microsoft have said using the articles was “fair use,” or allowed under copyright law, because they transformed the works for a different purpose.

More than 10,000 trade groups, authors, companies and others submitted comments last year about the use of creative works by A.I. models to the Copyright Office, a federal agency that is preparing guidance on how copyright law applies in the A.I. era.

Theft?

OpenAI was desperate for more data to develop its next-generation A.I. model, GPT-4. So employees discussed transcribing podcasts, audiobooks and YouTube videos, the people said. They talked about creating data from scratch with A.I. systems. They also considered buying start-ups that had collected large amounts of digital data.

OpenAI eventually made Whisper, the speech recognition tool, to transcribe YouTube videos and podcasts, six people said. But YouTube prohibits people from not only using its videos for “independent” applications, but also accessing its videos by “any automated means (such as robots, botnets or scrapers).”

OpenAI employees knew they were wading into a legal gray area, the people said, but believed that training A.I. with the videos was fair use. Mr. Brockman, OpenAI’s president, was listed in a research paper as a creator of Whisper. He personally helped gather YouTube videos and fed them into the technology, two people said.

Mr. Brockman referred requests for comment to OpenAI, which said it uses “numerous sources” of data.

Last year, OpenAI released GPT-4, which drew on the more than one million hours of YouTube videos that Whisper had transcribed. Mr. Brockman led the team that developed GPT-4.

Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn’t stop OpenAI because Google had also used transcripts of YouTube videos to train its A.I. models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.

And Meta too:

Meta’s executives said OpenAI seemed to have used copyrighted material without permission. It would take Meta too long to negotiate licenses with publishers, artists, musicians and the news industry, they said, according to the recordings.

“The only thing that’s holding us back from being as good as ChatGPT is literally just data volume,” Nick Grudin, a vice president of global partnership and content, said in one meeting.

OpenAI appeared to be taking copyrighted material and Meta could follow this “market precedent,” he added.

Meta’s executives agreed to lean on a 2015 court decision involving the Authors Guild versus Google, according to the recordings. In that case, Google was permitted to scan, digitize and catalog books in an online database after arguing that it had reproduced only snippets of the works online and had transformed the originals, which made it fair use.

Using data to train A.I. systems, Meta’s lawyers said in their meetings, should similarly be fair use.

There’s more in the article.

Roll your own

Cade Metz and Stuart A. Thompson, What to Know About Tech Companies Using A.I. to Teach Their Own A.I., NYTimes, April 6, 2024.

Perhaps the way out of this looming impasse is to create synthetic data. Use AIs to crank out synthetic data from here to Alpha Centauri and then train future engines on that. Alas, there are problems:

Does synthetic data work?

Not exactly. A.I. models get things wrong and make stuff up. They have also shown that they pick up on the biases that appear in the internet data from which they have been trained. So if companies use A.I. to train A.I., they can end up amplifying their own flaws.

Is synthetic data widely used by tech companies right now?

No. Tech companies are experimenting with it. But because of the potential flaws of synthetic data, it is not a big part of the way A.I. systems are built today.

So why do tech companies say synthetic data is the future?

The companies think they can refine the way synthetic data is created. OpenAI and others have explored a technique where two different A.I. models work together to generate synthetic data that is more useful and reliable.

One A.I. model generates the data. Then a second model judges the data, much like a human would, deciding whether the data is good or bad, accurate or not. A.I. models are actually better at judging text than writing it.
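
To make the generate-then-judge idea concrete, here is a toy sketch of my own. It is not from the article and not anyone's actual pipeline. Everything in it is a hypothetical stand-in: `toy_generator` plays the model that writes data (and deliberately gets some answers wrong), `toy_judge` plays the model that grades it, and `filter_synthetic` keeps only what the judge accepts. A real judge would be another fallible model; this one simply checks arithmetic, which is an assumption made to keep the example runnable.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str  # e.g. "12 + 30"
    answer: int  # the generator's claimed answer, possibly wrong


def toy_generator(rng: random.Random, n: int) -> list[Sample]:
    """Hypothetical stand-in for the generating model.

    Emits addition problems and corrupts roughly 30% of the answers,
    mimicking a model that sometimes makes stuff up.
    """
    samples = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = a + b
        if rng.random() < 0.3:
            answer += rng.choice([-2, -1, 1, 2])  # inject an error
        samples.append(Sample(f"{a} + {b}", answer))
    return samples


def toy_judge(sample: Sample) -> bool:
    """Hypothetical stand-in for the judging model.

    Assumption: the judge is exact here. A real judge model would be
    imperfect and could pass bad data or reject good data.
    """
    a, b = (int(x) for x in sample.prompt.split(" + "))
    return a + b == sample.answer


def filter_synthetic(samples: list[Sample], judge) -> list[Sample]:
    """Keep only the samples the judge accepts."""
    return [s for s in samples if judge(s)]


rng = random.Random(0)  # fixed seed so the run is repeatable
candidates = toy_generator(rng, 1000)
kept = filter_synthetic(candidates, toy_judge)
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

The point of the sketch is the shape of the loop: the quality of whatever survives depends entirely on how good the judge is, which is exactly where the real systems are harder than this toy.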

However:

The A.I. models that generate synthetic data were themselves trained on human-created data, much of which was copyrighted. So copyright holders can still argue that companies like OpenAI and Anthropic used copyrighted text, images and video without permission.

There’s more at the link.
