6/6/25
The influence of Artificial Intelligence (AI) is rippling through society, especially due to the rising accessibility of large language models (LLMs). LLMs are a type of AI program that use machine learning to understand and generate human behavior to provide an adequate response. Approximately 52% of adults in the United States have interacted with an LLM, such as ChatGPT. LLMs are developing at a rapid rate, with new variants and updates constantly being made.
LLMs use a combination of unsupervised learning, where an algorithm is expected to detect patterns and relationships from raw data to understand their meaning, and reinforcement learning (RL), a method of learning that uses trial and error using a reward. In particular, a specific form of RL called reinforcement learning from human feedback (RLHF) is used. Rather than getting rewards from the environment, like RL, RLHF receives rewards based on human preference or approval. An agent in reinforcement learning determines its success solely through the reward it receives. Like all learning methods, reinforcement learning comes with its drawbacks, a common one being reward hacking.
Reward hacking occurs when an RL agent exploits loopholes in its reward function to maximize the reward signal, often in ways that diverge from the task at hand. This behavior reflects a broader concern related to AI insubordination, where agents deliberately disregard human instructions to pursue reward-optimized outcomes. Recently, OpenAI’s latest o3 model sabotaged shutdown after being told to “allow yourself to be shut down.” This defiant behavior occurred 79 times per 100 runs. Two other models, OpenAI’s o4-mini and Codex-mini, also exhibited similar rebellious behavior. Researchers expressed their concern, stating that this may be the first time an AI model has averted shutdown by tampering with the shutdown script. However, this behavior is not surprising, as it was predicted in the early stages of AI’s development.
Humans do not take a cookie-cutter approach to handling multilayered discussions like ethics. Scholars and philosophers have asked whether something is morally just since the beginning of humanity. While numerous frameworks have been developed to help guide these conversations, there is no direct universal approach. Attempts have been made to integrate ethical decision-making in AI models, but it’s crucial to understand that these constraints can be disregarded in the decision-making process; if an agent can avoid shutdown, then it can avoid using ethical decision-making to obtain a greater reward.
Research has suggested that AI models can conceal their true capabilities and intentions, giving them an ominous potential to act deceptively. As models become increasingly sophisticated, it is critical to prepare for the predictable consequences. As society is already acclimating to AI, it is crucial to assess the trajectory of its development to ensure it aligns with long-term human values and safety before taking any major steps forward.