+"And so just because an AI seems to behaving well, doesn't mean it's aligned," Chloë was explaining. "If we train AI with human feedback ratings, we're not just selecting for policies that do tasks the way we intended. We're also selecting for policies that _trick human evaluators into giving high ratings_. In the limit, you'd expect those to dominate. 'Be good' strategies can't compete with 'look good' strategies in a looking-good contest—but in the current paradigm, looking good is the only game in town. We don't know how these systems work in the way that we know how ordinary software works; we only know how to train them."
+
+"So then we're just screwed, right?" said Jake in the tone of an attentive student. They were in a conference room on the Magma campus on Monday. After fixing the logging regex and overwriting the evidence with puppies, he had spent the weekend catching up with the 'AI safety' literature. Honestly, some of it had been better than he expected. Just because Chloë was nuts didn't mean there was nothing intelligent to be said about risks from future systems.
+
+"I mean, probably," said Chloë. She was beaming. Jake's plan to distract her from the investigation by asking her to bring him up to speed on AI safety seemed to be working perfectly.
+
+<span style="float: right; margin: 0.4pc;">
+<a href="/images/fake_deeply-chloe_and_jake.png"><img src="/images/fake_deeply-chloe_and_jake-smaller.png" width="450"></a><br/>
+<span class="photo-credit" style="float: right;">Illustration by [Stable Diffusion XL 1.0](https://stability.ai/stable-diffusion)</span>
+</span>
+
+"But not necessarily," she continued. There are a few avenues of hope—at least in the not-wildly-superhuman regime. One of them has to do with the fragility of deception.
+
+"The thing about deception is, you can't just lie about one thing. Everything is connected to each other in the Great Web of Causality. If you lie about one thing, you also have to lie about the evidence pointing to that thing, and the evidence pointing to that evidence, and so on, recursively covering up the coverups. For example ..." she trailed off. "Sorry, I didn't rehearse this; maybe you can think of an example."
+
+Jake's heart stopped. She had to be toying with him, right? Indeed, Jake could think of an example. By his count, he was now three layers deep into his stack of coverups and coverups-of-coverups (by writing the bell character bug, attributing it to Code Assistant, and overwriting the incriminating videos with puppies). Four, if you counted pretending to give a shit about AI safety. But now he was done ... right?
+
+No! Not quite, he realized. He had overwritten the videos, but the object metadata would still show them with a last-modified timestamp of Friday evening (when he had gotten his puppy-overwriting script working), not the timestamp of their actual creation (which Chloë had from the reverse-proxy logs). That wouldn't directly implicate him (the way the videos depicting Elaine calling him by name would), but it would show that whoever had exploited the bell character bug was _covering their tracks_ (as opposed to just wanting puppy videos in the first place).