Safeguarded AI’s goal is to build AI systems that can offer quantitative guarantees, such as a risk score, about their effect on the real world, says David “davidad” Dalrymple, the program director for Safeguarded AI at ARIA. The idea is to supplement human testing with mathematical analysis of new systems’ potential for harm.
The project aims to build AI safety mechanisms by combining scientific world models, which are essentially simulations of the world, with mathematical proofs. These proofs would include explanations of the AI’s work, and humans would be tasked with verifying whether the AI model’s safety checks are correct.
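To make the flavor of "mathematical proofs about an AI's behavior" concrete, here is a minimal, purely illustrative sketch in Lean 4 (assuming Mathlib is available). It is not taken from the Safeguarded AI program; the property and names are toy assumptions. It shows the kind of machine-checked artifact a proof checker can verify and a human reviewer can then audit: a guarantee that a clamped controller command can never leave its allowed range.

```lean
import Mathlib

-- Toy "safety specification": a controller command clamped to [lo, hi].
-- The definition and theorems are illustrative assumptions, not program output.
def clamp (lo hi x : ℤ) : ℤ := max lo (min hi x)

-- Machine-checked guarantee: the clamped command never exceeds the upper limit...
theorem clamp_le_hi (lo hi x : ℤ) (h : lo ≤ hi) : clamp lo hi x ≤ hi := by
  unfold clamp
  exact max_le h (min_le_left hi x)

-- ...and never drops below the lower limit.
theorem lo_le_clamp (lo hi x : ℤ) : lo ≤ clamp lo hi x := by
  unfold clamp
  exact le_max_left lo (min hi x)
```

The appeal of such artifacts is that checking a proof is mechanical and far cheaper than producing one, which is roughly the division of labor the program envisions: AI systems generate the safety arguments, and humans verify that the checks themselves are the right ones.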
Bengio says he wants to help ensure that future AI systems cannot cause serious harm.
“We’re currently racing toward a fog behind which might be a precipice,” he says. “We don’t know how far the precipice is, or if there even is one, so it might be years, decades, and we don’t know how serious it could be … We need to build up the tools to clear that fog and make sure we don’t cross into a precipice if there is one.”
Science and technology companies don’t have a way to give mathematical guarantees that AI systems are going to behave as programmed, he adds. This unreliability, he says, could lead to catastrophic outcomes.
Dalrymple and Bengio argue that current techniques to mitigate the risk of advanced AI systems—such as red-teaming, where people probe AI systems for flaws—have serious limitations and can’t be relied on to ensure that critical systems don’t go off-piste.
Instead, they hope the program will provide new ways to secure AI systems that rely less on human efforts and more on mathematical certainty. The vision is to build a “gatekeeper” AI, which is tasked with understanding and reducing the safety risks of other AI agents. This gatekeeper would ensure that AI agents functioning in high-stakes sectors, such as transport or energy systems, operate as we want them to. The idea is to collaborate with companies early on to understand how AI safety mechanisms could be useful for different sectors, says Dalrymple.
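As a rough illustration of that "gatekeeper" idea, here is a minimal Python sketch. Every name in it (Certificate, check_proof, gatekeeper, risk_threshold) is a hypothetical placeholder rather than ARIA's design: an untrusted agent proposes an action together with a claimed risk bound and a certificate, and the action only reaches the real system if the certificate verifies and the bound falls below a set threshold.

```python
"""Hypothetical sketch of a "gatekeeper" mediating an untrusted AI agent.

All names here are illustrative assumptions, not part of any published
Safeguarded AI codebase.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Certificate:
    """Stand-in for a machine-checkable claim about an action's effects."""
    action: str
    claimed_risk_bound: float  # e.g. probability of violating a safety spec
    proof_blob: bytes          # whatever the trusted proof checker consumes


def check_proof(cert: Certificate) -> bool:
    """Placeholder for an independent proof checker (the trusted component)."""
    return len(cert.proof_blob) > 0  # a real checker would verify the proof


def gatekeeper(
    propose: Callable[[], Certificate],
    risk_threshold: float,
    execute: Callable[[str], None],
) -> bool:
    """Let a proposed action through only if its certificate verifies and
    its certified risk bound is below the threshold."""
    cert = propose()
    if not check_proof(cert):
        return False                          # reject: proof does not check
    if cert.claimed_risk_bound > risk_threshold:
        return False                          # reject: risk bound too high
    execute(cert.action)                      # accept: act on the real system
    return True
```

In this toy framing, the untrusted agent can be arbitrarily complex, because the system's safety rests on the much simpler, auditable checking step rather than on trusting the agent itself.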
The complexity of advanced systems means we have no choice but to use AI to safeguard AI, argues Bengio. “That’s the only way, because at some point these AIs are just too complicated. Even the ones that we have now, we can’t really break down their answers into human-understandable sequences of reasoning steps,” he says.