GeForce GPU giant has been data scraping 80 years' worth of videos every day for AI training to 'unlock various downstream applications critical to Nvidia'

TAIPEI, TAIWAN - 2023/06/01: Jensen Huang, President of NVIDIA, holding the Grace Hopper superchip CPU used for generative AI at the Supermicro keynote presentation during COMPUTEX 2023. COMPUTEX 2023 runs from 30 May to 02 June 2023 and gathers over 1,000 exhibitors from 26 different countries with 3,000 booths to display their latest products and to sign orders with foreign buyers.
(Image credit: Walid Berrazeg/SOPA Images/LightRocket via Getty Images)

Leaked documents, including spreadsheets, emails, and chat messages, show that Nvidia has been using millions of videos from YouTube, Netflix, and other sources to train an AI model to be used in its Omniverse, autonomous vehicle, and digital avatar platforms.

The astonishing, but perhaps not surprising, scope of the data scraping was reported by 404 Media, which investigated the documents. It discovered that an internal project codenamed Cosmos (which shares a name with, but is separate from, Nvidia's Cosmos Deep Learning service) had staff use dozens of virtual PCs on Amazon Web Services (AWS) to download so many videos per day that Nvidia accumulated over 30 million URLs in the space of one month.

Copyright laws and usage rights were repeatedly discussed by the employees, who found some creative ways to avoid violating them directly. For example, Nvidia used Google's cloud service to download the YouTube-8M dataset, as directly downloading the videos isn't permitted by the terms of service.

In a leaked Slack channel discussion, one person remarked that "we cleared the download with Google/YouTube ahead of time and dangled as a carrot that we were going to do so using Google Cloud. After all, usually, for 8 million videos, they would get lots of ad impressions, revenue they lose out on when downloading for training, so they should get some money out of it."

404 Media asked Nvidia to comment on the legal and ethical aspects of using copyrighted material for AI training and the company replied that it was "in full compliance with the letter and the spirit of copyright law."

Some datasets are permitted for academic use only, and although Nvidia does conduct a considerable amount of research (internally and with other institutions), the leaked materials clearly show that this data scraping was intended for commercial purposes.

Nvidia isn't the only firm to be doing this, of course—OpenAI and Runway have both been accused of knowingly using copyrighted and protected material to train their AI models. Interestingly, one source of video content that you'd think Nvidia would have no problem using is gameplay footage from its GeForce Now service—but the leaked documents show that's not the case.

A senior research scientist at Nvidia explained why to other employees: "We don't yet have statistics or video files yet, because the infras is not yet set up to capture lots of live game videos & actions. There're both engineering & regulatory hurdles to hop through."

AI models have to be trained on billions of data points and there's no way around this. Some datasets have very clear rules for their use, whereas others have fairly loose restrictions. When it comes to laws on the use of copyrighted materials, though, it's very clear what can and can't be done, even if how those laws apply to AI training isn't 100% clear.

It's not just about copyright, either, as video content often contains personal data. While there isn't a single, overriding federal law in the US that is directly applicable here, there are plenty of regulations concerning collecting and using personal data. In the EU, the General Data Protection Regulation (GDPR) is a law that is expressly clear on how such data can be used, even outside of the EU.

One might also wonder what would happen if a company such as Nvidia is found to have breached various regulations whilst training its AI models—if that system is being used across the globe, would it then be blocked in specific countries? Would the likes of Nvidia be willing to make a new model, trained with all permissions granted, just for those locations? Is it even possible to 'untrain' a system and start afresh with legally compliant data?

Whatever one feels about AI, it's clear that there needs to be a more urgent push for transparency, especially when it concerns the use of copyrighted and personal data for commercial purposes. Because if tech companies aren't held accountable, then data scraping will continue ad hoc.

Nick Evanson
Hardware Writer

Nick, gaming, and computers all first met in 1981, with the love affair starting on a Sinclair ZX81 in kit form and a book on ZX Basic. He ended up becoming a physics and IT teacher, but by the late 1990s decided it was time to cut his teeth writing for a long-defunct UK tech site. He went on to do the same at MadOnion, helping to write the help files for 3DMark and PCMark. After a short stint working at Beyond3D.com, Nick joined Futuremark (MadOnion rebranded) full-time, as editor-in-chief for its gaming and hardware section, YouGamers. After the site shut down, he became an engineering and computing lecturer for many years, but missed the writing bug. Cue four years at TechSpot.com and over 100 long articles on anything and everything. He freely admits to being far too obsessed with GPUs and open-world grindy RPGs, but who isn't these days?
