"So if the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "Maybe it buys a new, similar-looking vase to put in the same place, and forges the payment memo to make it look like the purchase was for cleaning supplies, and so on. The point is, small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that—not just lies, but, um—"
"So if the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "Maybe it buys a new, similar-looking vase to put in the same place, and forges the payment memo to make it look like the purchase was for cleaning supplies, and so on. The point is, small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that—not just lies, but, um—"