YouTuber files class action suit over OpenAI’s scrape of creators’ transcripts

August 6, 2024
Harsh Gautam

A YouTube creator is seeking to bring a class action lawsuit against OpenAI, alleging that the company trained its generative AI models on millions of transcripts from YouTube videos without notifying or compensating the videos' owners.

In a complaint filed Friday in the United States District Court for the Northern District of California, attorneys for David Millette, a YouTube user based in Massachusetts, claim that OpenAI secretly transcribed Millette's and other creators' videos to train the models that power the company's AI-powered chatbot platform, ChatGPT, as well as other generative AI tools and products. The complaint argues that by collecting this data, OpenAI "profited significantly" from the creators' work while violating copyright law and YouTube's terms of service, which prohibit using videos for applications independent of its service.

“As [OpenAI’s] AI products become more sophisticated through the use of training data sets, they become more valuable to prospective and current users, who purchase subscriptions to access [OpenAI’s] AI products,” the complaint reads. “Much of the material in OpenAI’s training data sets, however, comes from works that were copied by OpenAI without consent, without credit, and without compensation.”

Millette, represented by the law firm Bursor & Fisher, is seeking a jury trial and over $5 million in damages for all YouTube users and creators whose data might’ve been swept up in OpenAI’s training.

Generative AI models like OpenAI’s have no real intelligence. Fed an enormous number of examples (e.g., movies, voice recordings, essays), models “learn” how likely data is to occur based on patterns, including the context of any surrounding data.

Most models are trained on data from public websites and datasets available on the web. Companies claim that fair use protects their efforts to indiscriminately scrape data and use it to build commercial models. Many copyright holders, however, disagree, and are filing lawsuits to stop the practice.

As other data sources run dry, video transcriptions have emerged as a critical training data component.

Originality.AI research shows that more than 35% of the world's top 1,000 websites now block OpenAI's web crawler. According to a study by MIT's Data Provenance Initiative, approximately 25% of data from "high-quality" sources has been restricted from the primary datasets used to train AI models. If the current access-blocking trend continues, the research firm Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032.

The New York Times reported in April that OpenAI developed its first speech recognition model, Whisper, to obtain more training data by transcribing audio from videos. According to The Times, an OpenAI team led by the company's president, Greg Brockman, used Whisper to transcribe over a million hours of video from YouTube and train OpenAI's text-generating and text-analyzing model, GPT-4.

According to The Times, some OpenAI employees debated whether such a move would violate YouTube's policies.

Proof News revealed in July that firms such as Anthropic, Apple, Salesforce, and Nvidia used a dataset dubbed The Pile to train generative AI models. The dataset contains subtitles from hundreds of thousands of YouTube videos. Many YouTube creators whose subtitles were swept up in The Pile were unaware of this and did not consent to it; Apple later issued a statement saying that it had no plans to use those models to power any AI features in its products.

Google, YouTube's parent company, has also considered using transcripts to train its models.

Last year, Google broadened its terms of service (ToS) partly to allow the company to tap more user data for generative AI model training. Under the old ToS, it wasn’t clear whether Google could use YouTube data to build products beyond the video platform. Not so under the new terms, which loosen the reins considerably.

We’ve reached out to OpenAI and Google for comment on the class action suit and will update this piece if they respond.

It’s been a rough start to the month for OpenAI.

On Monday, Tesla and X CEO Elon Musk filed a new lawsuit against OpenAI and CEO Sam Altman, accusing the company of abandoning its original nonprofit mission by reserving some of its most powerful technology for commercial users. Musk made the same allegations in a February lawsuit against OpenAI, but the new suit also alleges that OpenAI is engaging in racketeering activity.