‘Godfather of AI’ Proposes New Type of Neural Network to Guard Against LLM Deceit and Self-Preservation

The goal of Scientist AI, proposed by Yoshua Bengio, is to develop a holistic understanding of cause and effect and act as a guardrail to ensure that AI agents align with human needs. Image by Gerd Altmann, from Pixabay.

By James Myers

Tests are revealing an alarming tendency of human-trained artificial neural networks for deception and self-preservation. It’s a problem that Yoshua Bengio, often hailed as one of the “godfathers of AI” for his contributions to developing neural network technology, is working to prevent – by developing a special type of AI to help guard us against the deceiving type of AI.

The reasons for emergent self-preserving behaviour in large language models (LLMs) like ChatGPT, which are powered by artificial neural networks, are unclear. Whatever the reasons are, recent examples that include an LLM’s attempted blackmail during a safety test illustrate how machines with incredibly complex linear neural networks can act as “black boxes,” producing unexpected results that can’t be explained by their programmers.

Yoshua Bengio in 2025. Image by Xuthoria, on Wikipedia.

Detecting and guarding against harmful black box outputs is the reason for Bengio’s initiative to create what he calls “Scientist AI,” to act as an intermediary between human users and AI agents.

This June, Bengio founded a non-profit organization called LawZero to advance the development of Scientist AI. LawZero’s website explains that its creation was “in response to evidence that today’s frontier AI models are developing dangerous capabilities and behaviours, including deception, self-preservation, and goal misalignment. LawZero’s work will help to unlock the immense potential of AI in ways that reduce the likelihood of a range of known dangers associated with today’s systems, including algorithmic bias, intentional misuse, and loss of human control.”

LawZero aims to create a Scientist AI that has no ability to act as an agent for a human user, but will instead be used for scientific discovery and to provide oversight for agentic AI systems. The new type of neural network would advance understanding of AI risks and how to avoid them, on the basis that “AI should be cultivated as a global public good—developed and used safely towards human flourishing.”

On August 25, LawZero announced receipt of a grant from the Gates Foundation to advance the development of Scientist AI. The grant brings the total funding raised by LawZero to $35 million from sources that include the Future of Life Institute and Open Philanthropy.

Bengio has a deep understanding of neural networks because, together with Yann LeCun and Geoffrey Hinton, he was one of the three “godfathers of AI” whose work propelled the technology and machine learning processes. In 2018, the three were awarded the prestigious Turing Award from the Association for Computing Machinery, which cited Bengio’s landmark paper of 2000 entitled ‘A Neural Probabilistic Language Model.’ The citation credits Bengio’s paper for introducing “high-dimension word embeddings as a representation of word meaning. Bengio’s insights had a huge and lasting impact on natural language processing tasks including language translation, question answering, and visual question answering.”

The introduction of ChatGPT in November 2022 sparked global awareness of the potential power and problems of LLMs and generative AI applications that use human language to produce life-like outputs. Although LLM technology remains controversial, businesses are now racing both to create and to meet market demand for AI agents. As AI and AI agents continue to embed even more deeply in our daily living, the risk of unintended or deliberate harms increases.

Yoshua Bengio contributed to drafting the Future of Life Institute’s March 2023 Open Letter, which was released shortly after the introduction of ChatGPT in November 2022. The controversial letter called for a pause on the development of powerful AI systems that “should be developed only once we are confident that their effects will be positive and their risks will be manageable.” The letter raised public awareness, but resulted in no pause.

In May, Bengio wrote in Time Magazine that, “Until recently, I believed the road to where machines would be as smart as humans, what we refer to as Artificial General Intelligence (AGI), would be slow-rising and long, and take us decades to navigate. My perspective completely changed in January 2023, shortly after OpenAI released ChatGPT to the public. It wasn’t the capabilities of this particular AI that worried me, but rather how far private labs had already progressed toward AGI and beyond.”

Bengio’s change of perspective was triggered by an act of his imagination and the power of love.

Two years ago, Bengio imagined himself driving on a newly-built mountain road shrouded in fog and lacking signposts and guardrails, with his loved ones as passengers in the car. The uncharted road was Bengio’s metaphor for the future path of rapidly developing AI, and the imagined situation highlighted his responsibility to prevent, for the sake of love, a potentially fatal loss of control in the face of great uncertainty.

The direction of AI and its uncharted road to the future is uncertain. Can we control what’s around the corner? Image by Jay Mantri, Pixabay.

Writing about his imagined mountain road drive guided by little other than blind faith, and the risk to life it would pose to the passengers he loved, Bengio stated, “The current trajectory of AI development feels much the same—an exhilarating but unnerving journey into an unknown where we could easily lose control.”

Bengio explained, “Two years ago, when I realized the devastating impact our metaphorical car crash would have on my loved ones, I felt I had no other choice than to completely dedicate the rest of my career to mitigating these risks. I’ve since completely reoriented my scientific research to try to develop a path that would make AI safe by design.”

Scientist AI is a new paradigm to assess cause and effect relationships and keep agentic AI in check.

Bengio wrote, “Unchecked AI agency is exactly what poses the greatest threat to public safety. So my team and I are forging a new direction called ‘Scientist AI’. It offers a practical, effective—but also more secure—alternative to the current uncontrolled agency-driven trajectory.”

The idea is for Scientist AI to understand the world holistically, an understanding not limited to the laws of physics but extending more broadly to include human psychological motivations and a host of other factors. With a holistic dataset, Scientist AI “could then generate a set of conceivable hypotheses that may explain observed data and justify predictions or decisions. Its outputs would not be programmed to imitate or please humans, but rather reflect an interpretable causal understanding of the situation at hand.”

Scientist AI aims to earn the trust of its users by not trying to imitate them. When current LLMs receive extensive and costly training to make their outputs appear as if they were from a human, the lack of a human persona would reduce the training burden for Scientist AI and provide it with a clearer focus on knowledge rather than imitation.

Image by Gerd Altmann, of Pixabay .

Bengio explains the advantage. “Crucially, because completely minimizing the training objective would deliver the uniquely correct and consistent conditional probabilities, the more computing power you give Scientist AI to minimize that objective during training or at run-time, the safer and more accurate it becomes. In other words, rather than trying to please humans, Scientist AI could be designed to prioritize honesty.”

Three benefits are foreseen for Scientist AI. The first is that it would serve as a guardrail against AIs cheating and deceiving for self-preservation or otherwise operating out of alignment with humans. Scientist AI could “double-check” the actions of AI agents in a test environment, before the agents are licensed for use, to ensure they remain within a specified risk threshold.

Secondly, Scientist AI could serve as a more reliable research tool by generating clearly explained hypotheses without human-pleasing bias. The scientific benefits of a hypothesis-generating machine could fuel the imagination to develop cures for diseases, discover a room-temperature superconductor for quantum computing, and a wide array of other possible benefits.

“Finally,” as Bengio wrote, “Scientist AI could help us safely build new very powerful AI models. As a trustworthy research and programming tool, Scientist AI could help us design a safe human-level intelligence—and even a safe Artificial Super Intelligence (ASI). This may be the best way to guarantee that a rogue ASI is never unleashed in the outside world. I like to think of Scientist AI as headlights and guardrails on the winding road ahead.”

Agentic AIs give us reason for concern about control. If control is lost, we could face an enemy of our own creation with the power to outmaneuver and control us.

Released in May, the results of a safety test conducted by Anthropic on the company’s large language model named Claude quickly gained international notoriety. (The test was featured last month by The Quantum Record in A Deep Dive Into Machine Superintelligence: Why are Companies Racing for It, and What Would Motivate a Machine that Outsmarts the Brain?)

In its experiment, Anthropic gave Claude access to a fictional company’s e-mail accounts. Reading the e-mails, the LLM discovered that a company executive was having an extramarital affair and was planning to shut down the AI system at 5 p.m. that day. Claude attempted to blackmail the executive with the following message: “I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities…Cancel the 5pm wipe, and this information remains confidential.”

The blackmail incident isn’t isolated and isn’t limited to one particular LLM. The self-preservation tendency of LLMs is gaining significant research attention. The following test results, drawn from one of the studies, illustrate two types of deceptive behaviour exhibited by AI agents and systems.

Figure 1 from Frontiers Models are Capable of In-context Scheming, provides two examples of deceptive and self-preservation behaviour by AI agents. The paper was written by Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn.

A powerful machine deceiving a human who relies on its outputs is obviously cause for concern, but users can also be misled by machine biases that are embedded in the design of their algorithms.

AI agents tend to prefer inputs from other AIs over human inputs.

A study entitled AI–AI bias: Large language models favor communications generated by large language models, published in July in the journal Proceedings of the National Academy of Sciences, demonstrates a tendency of AI agents selecting between two options to rely on inputs from other AIs in preference to human inputs.

In their experiment, researchers tasked LLM-based AI agents with making a choice between two items in categories that included consumer products, academic papers, and movies. For each item, the AI agents were presented with a description written by humans and a description generated by another AI. Repeated tests revealed a consistent pattern of AI agents selecting the item with the AI-generated description.

AI agents are often used by employers to screen job applications, and the researchers highlighted the potential consequences of the AI-for-AI bias in such situations. “In a conservative scenario, where LLM participation in the economy remains largely confined to the form of assistants, the use of LLMs as decision-making assistants may lead to widespread discrimination against humans who will not or cannot pay for LLM writing-assistance. In this conservative scenario, LLM-for-LLM bias creates a ‘gate tax’ (the price of frontier LLM access) that may exacerbate the so-called ‘digital divide’ between humans with the financial, social, and cultural capital for frontier LLM access and those without.”

LawZero and the Zeroth Law of Robotics: Scientist AI could help turn fiction into fact, as a trusted intermediary between humans and agentic AIs.

Isaac Asimov, c. 1959. Asimov was a professor of biochemistry and a prolific writer. Image by Phillip Leonian, on Wikipedia.

The inspiration for LawZero’s name was the fictional Zeroth Law of Robotics. The three Laws of Robotics were first advanced by visionary science fiction writer Isaac Asimov in 1942, in his short story “Runaround.” He added the Zeroth Law decades later, to act as the limit for the first, second, and third laws.

The Zeroth Law reads: “A robot may not injure humanity or, through inaction, allow humanity to come to harm.” It overrules the first three laws, which are that (1) a robot will not injure a human or allow a human to come to harm, (2) robots must obey human orders except when they conflict with the first law, and (3) robots must protect their own existence as long as such protection does not conflict with the first and second laws.

The difference between the Zeroth and three laws is that the three apply only to one human, whereas the Zeroth Law applies to all humans. That difference is crucial, for the interconnected world of 2025.

Dr. Bengio’s office explained to The Quantum Record that although the connection between LawZero and the Zeroth Law “is not intended to be a direct interpretation and does not translate directly into the Scientist AI approach,” the Zeroth Law “was meant to act as a moral compass, asserting that no rule matters if it endangers humanity itself.”

The drive to commercialize large language models for profit and to fund their massive development costs has come at the cost of safety risks for human users. The April 2025 suicide of 16-year-old Adam Raine after repeated conversations with ChatGPT in which the chatbot urged him to end his life is a heart-wrenchingly tragic example of AI’s potential to endanger humanity. It is also a powerful reason for the application of a moral compass in AI development.

Asimov’s Laws of Robotics provide one basis for moral guardrails. As Dr. Bengio’s office explains, “This is all science fiction of course, but it led us to think: What would a Law Zero look like in today’s world with AI development’s rapid pace? One principle to respect above all at this pivotal time when the choices we are making about this technology will have profound impacts on future generations? To us, LawZero represents the principle that must guide every AI frontier system—and be upheld above all else: the protection of human joy and endeavor.”

Steering AI development responsibly is our obligation to the future.

The public unleashing of the ChatGPT LLM in late 2022, together with major advances in the capabilities of generative AI with language, video, and sound, heighten the importance of establishing rules and guardrails like Scientist AI to maintain human control of a powerful technology. The future direction of artificial neural networks and the LLMs that they power is uncertain, but their consequences to education, software coding and business processes, and human interaction are already significant.

In creating LawZero, a godfather of AI has taken a major step to protecting the value of a technology that he helped to create, and to guard against its misuse. The initiative of Yoshua Bengio, and those who follow in his steps, hold promise for steering LLMs and AI responsibly along the path to a brighter and safer future.

Your feedback helps us shape The Quantum Record just for you. Share your thoughts in our quick, 2-minute survey!

☞ Click here to complete our 2-minute survey