Training AI with AI: The Threat Model Collapse Poses to Internet Information Integrity

9/23/2024

The internet is a double-edged sword: while it offers immense opportunities for knowledge, it also breeds misinformation. This issue has grown more concerning as internet users increasingly turn to AI to generate online content. Surveys suggest that 55% of Americans use AI regularly and that over 13% of online data is AI-generated. As more AI-generated content floods the web, AI models begin to train on data they themselves created. This feedback loop, in which AI feeds the internet and the internet feeds AI, could lead models to produce misleading content that threatens to turn the internet into an “unusable information junkyard.”

On July 24, 2024, researchers from world-renowned universities, including Cambridge and Imperial College London, published a study in the journal Nature. The study, led by Dr. Ilia Shumailov and Dr. Zakhar Shumaylov, examined the effects of training language models such as GPT-2, an early version of OpenAI’s language model, on AI-generated data. They discovered that training AI with text datasets consisting heavily of AI-generated text produced noticeably incoherent and misleading outputs. “The original model would never produce some of the data,” the researchers wrote in their report; “these are the errors that accumulate because of the learning with generational data.”

Other studies have also demonstrated the dangers of this phenomenon. Known as “model collapse,” it occurs when AI models trained on AI-synthesized data accumulate compounding errors and become “narrow-minded,” progressively forgetting the rarer parts of the original data distribution. The output of this narrow-mindedness is “junk content”: inaccurate and unreadable information.
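The mechanism is easiest to see in a toy setting. The sketch below is purely illustrative and not drawn from any of the studies cited here: it repeatedly fits a one-dimensional Gaussian to samples generated by the previous generation’s fitted model. The distribution’s spread steadily shrinks, a numerical analogue of a language model forgetting rare knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0   # generation 0: the "real" human-authored distribution
n = 50                 # samples per generation; small datasets collapse faster

for gen in range(51):
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # Each generation trains only on the previous generation's output:
    # sample from the current model, then refit the model to those samples.
    samples = rng.normal(mu, sigma, n)
    mu, sigma = samples.mean(), samples.std()
```

Run long enough, the fitted variance in this toy drifts toward zero: each refit loses a little of the tails, and the model eventually cannot reproduce the diversity of the original data.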

The issues presented by model collapse have already become apparent in various social media algorithms. Social media platforms like Twitter use machine learning to curate user experiences. However, a 2024 study showed that Twitter’s “Who-to-Follow” friend recommendation algorithm creates echo chambers by recommending accounts with similar views. When AI systems train on their own recommendations, the cycle of polarization intensifies, leading to less balanced viewpoints. This polarization could further hinder communication and encourage political extremism, leading to democratic erosion and social unrest.
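A similar feedback loop can be sketched in code. The following toy opinion-dynamics model, in the spirit of bounded-confidence models rather than the cited study’s methodology, uses entirely illustrative assumptions (user count, similarity measure, update rule): each user is repeatedly shown the most similar accounts and drifts toward that circle’s average view.

```python
import numpy as np

rng = np.random.default_rng(2)
opinions = rng.uniform(-1.0, 1.0, 200)   # 200 users on a left-right opinion axis
k = 10                                   # accounts recommended to each user
print(f"initial spread of opinions: {opinions.std():.3f}")

for _ in range(50):
    # "Who-to-Follow"-style recommendation: surface the k users whose
    # opinions are most similar, then let each user drift toward the
    # average view of that recommended circle.
    distance = np.abs(opinions[:, None] - opinions[None, :])
    neighbors = np.argsort(distance, axis=1)[:, 1:k + 1]  # skip self at column 0
    opinions = 0.5 * opinions + 0.5 * opinions[neighbors].mean(axis=1)

clusters = np.unique(np.round(opinions, 2))
print(f"after recommendation loop: {clusters.size} opinion clusters, "
      f"spread = {opinions.std():.3f}")
```

Starting from a continuum of views, similarity-based recommendation condenses users into a handful of internally uniform camps: the echo chambers the study describes.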

Sesh Iyer, North America Regional Chair for BCG X, specializes in generative AI technologies. He voiced his concerns on the matter in a recent LinkedIn post: “This self-feeding mechanism in AI models can be likened to continuously breathing in one’s exhale; oxygen eventually depletes. The result? AI representations that begin to drift from the ground truth, amplifying errors, biases, and distortions passed down from previous generations.”

So, what can be done to avert this impending crisis? According to researchers from Stanford University, the solution lies in blending AI-generated content with real, human-authored data. “Our results strongly suggest that the ‘curse of recursion’ may not be as dire as had been portrayed,” the researchers wrote, “provided we accumulate synthetic data alongside real data, rather than replacing real data with synthetic data only.”
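Extending the earlier Gaussian sketch, and again as an illustration rather than a reproduction of the Stanford experiments, the difference between the two regimes takes only a few lines: keeping the original human data in a growing training pool anchors every subsequent fit, while discarding it lets estimation error compound generation after generation.

```python
import numpy as np

rng = np.random.default_rng(1)
real_data = rng.normal(0.0, 1.0, 50)   # generation-0 human-authored data

def mean_final_std(accumulate: bool, generations: int = 100, trials: int = 20) -> float:
    """Average, over several runs, the spread of the final training pool when
    each generation's Gaussian is fitted either to fresh synthetic samples
    only (replace) or to all data seen so far (accumulate)."""
    finals = []
    for _ in range(trials):
        pool = real_data
        for _ in range(generations):
            mu, sigma = pool.mean(), pool.std()
            synthetic = rng.normal(mu, sigma, 50)
            pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
        finals.append(pool.std())
    return float(np.mean(finals))

print(f"replace real data:       mean final std = {mean_final_std(False):.3f}")
print(f"accumulate alongside it: mean final std = {mean_final_std(True):.3f}")
```

In this toy, the accumulating pool keeps its spread close to that of the original data, while the replacement regime collapses toward a narrow, degenerate distribution.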

The convenience of AI is undeniable, but its usage still requires ethical oversight. While the research community can advocate for human-generated content in AI training datasets and develop algorithms to detect AI-generated text, we as internet users must encourage human content creation to protect the internet’s authenticity. With the right approach, we can harness AI’s potential while preserving the integrity of our digital landscape.