Report: OpenAI trained its AI model using over a million hours of YouTube videos

According to a recent report by The New York Times, OpenAI, led by Sam Altman, transcribed more than a million hours of YouTube videos to train its GPT-4 model. The company reportedly knew the practice was legally questionable but believed it qualified as fair use. OpenAI president Greg Brockman was personally involved in collecting the videos used for training.

An OpenAI spokesperson told The Verge that the company uses a range of sources, including publicly available data and partnerships for non-public data, to keep its research globally competitive. Google, which owns YouTube, said it has seen unconfirmed reports of OpenAI’s activity and reiterated that both its robots.txt files and its Terms of Service prohibit unauthorized scraping or downloading of YouTube content.

The Information had previously reported that OpenAI, which is backed by Microsoft, had trained its AI models on data from YouTube. It did so discreetly, drawing on the platform’s rich supply of imagery, audio, and text transcripts. YouTube remains the largest source of multimedia content on the internet.

OpenAI’s use of YouTube data to train its AI models has raised legal and ethical concerns. While the company aims to keep its research globally competitive, questions remain about the implications of using copyrighted content without authorization. How this will play out for the wider field of artificial intelligence and machine learning remains to be seen.


