"If the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "In the limit, that could mean—using Multigen-like capabilities to make videos to convince you that the cat is there while you're on vacation? Constructing a realistic cat-like robot to fool you when you get back? Maybe this isn't the best illustrative example. The point is, small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that—not just lies, but, um—"
"If the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "In the limit, that could mean—using Multigen-like capabilities to make videos to convince you that the cat is there while you're on vacation? Constructing a realistic cat-like robot to fool you when you get back? Maybe this isn't the best illustrative example. The point is, small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that—not just lies, but, um—"