-[TODO—
- * Chloë is explaining deceptive alignment. If a model does well on our evals, how do we know whether it's actually doing the right thing, or just trying to fool us?
- * Jake had explicitly asked to get brought up to speed on AI safety, to stall on whatever uncomfortable audit questions she might have—the puppies are prepared, but it still seemed like a good idea
- * "So then we're just doomed then, right?" Jake is trying to be agreeable and flattering. He's fixed the regex and overwritten his porn with puppies, and spent the weekend reading AI safety papers and blog posts. Some of it was honestly better than he expected. This Chloë being insane didn't invalidate the whole field as having serious points to make.
- * "Maybe not." There are two ways to pass all the evals: do things the right way, or be pervasively deceptive. The thing is, policies are trained continuously via gradient descent. The purely honest policy and the purely deceptive policy look identical on evals, but in between, the model would have to learn how to lie, and lie about lying, and cover-up coverups. (Chloë lapses into Yuddite speak about the "Great Web of Causality.") Could we somehow steer into the honest attractor?
- * That's why she believes in risk paranoia. If situational awareness is likely to emerge at some point, she doesn't want to rule it out now. AI is real; it's not just a far-mode in-the-future thing.
- * Jake sees the uncomfortable analogy to his own situation. He tries to think of what other clue he might have left, while the conversation continues ...
- * The Last-Modified dates! They're set by the system, the API doesn't offer a way to backdate them.
- * Maybe she won't notice the dates? Possible (but someone persnickety enough to have found the log discrepancy would probably check)
- * Merely noticing the dates won't directly implicate him (the way the video screaming "Jake!" would), although it would indicate the poltergeist covering its tracks from the investigation (rather than just wanting puppies in the first place)
- * Are the buckets versioned?! Probably not, right—it would be wasteful to version a video bucket. On the other hand, Multigen isn't supposed to write twice, maybe someone left versioning on as a default template ...
- * They did.
- * Jake muses philosophically about the analogy, and say something ambiguously indicating intent to come clean.
- THE END
-]
+"And so just because an AI seems to behaving well, doesn't mean it's aligned," Chloë was explaining. "If we train AI with human feedback ratings, we're not just selecting for policies that do tasks the way we intended. We're also selecting for policies that _trick human evaluators into giving high ratings_. In the limit, you'd expect those to dominate. 'Be good' strategies can't compete with 'look good' strategies in a looking-good contest—but in the current paradigm, looking good is the only game in town. We don't know how these systems work in the way that we know how ordinary software works; we only know how to train them."
+
+"So then we're just screwed, right?" said Jake in the tone of an attentive student. They were in a conference room on the Magma campus on Monday. After fixing the logging regex and overwriting the evidence with puppies, he had spent the weekend catching up with the 'AI safety' literature. Honestly, some of it had been better than he expected. Just because Chloë was nuts didn't mean there was nothing intelligent to be said about risks from future systems.
+
+"I mean, probably," said Chloë. She was beaming. Jake's plan to distract her from the investigation by asking her to bring him up to speed on AI safety seemed to be working perfectly.
+
+<span style="float: right; margin: 0.4pc;">
+<a href="/images/fake_deeply-chloe_and_jake.png"><img src="/images/fake_deeply-chloe_and_jake-smaller.png" width="450"></a><br/>
+<span class="photo-credit" style="float: right;">Illustration by [Stable Diffusion XL 1.0](https://stability.ai/stable-diffusion)</span>
+</span>
+
+"But not necessarily," she continued. There are a few avenues of hope—at least in the not-wildly-superhuman regime. One of them has to do with the fragility of deception.
+
+"The thing about deception is, you can't just lie about one thing. Everything is connected to each other in the Great Web of Causality. If you lie about one thing, you also have to lie about the evidence pointing to that thing, and the evidence pointing to that evidence, and so on, recursively covering up the coverups. For example ..." she trailed off. "Sorry, I didn't rehearse this; maybe you can think of an example."
+
+Jake's heart stopped. She had to be toying with him, right? Indeed, Jake could think of an example. By his count, he was now three layers deep into his stack of coverups and coverups-of-coverups (by writing the bell character bug, attributing it to Code Assistant, and overwriting the incriminating videos with puppies). Four, if you counted pretending to give a shit about AI safety. But now he was done ... right?
+
+No! Not quite, he realized. He had overwritten the videos, but the object metadata would still show them with a last-modified timestamp of Friday evening (when he had gotten his puppy-overwriting script working), not the timestamp of their actual creation (which Chloë had from the reverse-proxy logs). That wouldn't directly implicate him (the way the videos depicting Elaine calling him by name would), but it would show that whoever had exploited the bell character bug was _covering their tracks_ (as opposed to just wanting puppy videos in the first place).
+
+But the object storage API probably provided a way to edit the metadata and update the last-modified time, right? This shouldn't even count as a fourth–fifth coverup; it was something he should have included in his coverup script.
+
+Was there anything else he was missing? The object storage cluster did have a optional "versioning" feature. When activated for a particular bucket, it would save previous versions of an object rather than overwriting them. He had assumed versioning wasn't on for the bucket that Multigen was using. (It wouldn't make sense; the workflow didn't call for writing the same object name more than once.)
+
+"I think I get the idea," said Jake in his attentive student role. "I'm not seeing how that helps us. Maybe you could explain." While she was blathering, he could multitask between listening, and (first) looking up how to edit the last-modified timestamps and (second) double-checking that the Multigen bucket didn't have versioning on.
+
+"Right, so if a model is behaving well according to all of our deepest and most careful evaluations, that could mean it's doing what we want, but it could be elaborately deceiving us," said Chloë. "Both policies would be highly rated. But the training process has to discover these policies continuously, one gradient update at a time. If the spectrum between the 'honest' policy and a successfully deceptive policy consists of less-highly rated policies, maybe gradient descent will stay—or could be made to stay—in the valley of honest policies, and not climb over the hill into the valley of deceptive policies, even though those would ultimately achieve a lower loss."
+
+"Uh huh," Jake said unhappily. The object storage docs made clear that the `Last-Modified` header was set automatically by the system; there was no provision for users to set it arbitrarily.
+
+"Imagine you're training your AI to act as a general home assistant, to run everything in a user's household in a way that the user rates highly," Chloë continued. "If—I don't know, say, the family cat dies, that's a negative reward—maybe a better feeding schedule or security monitoring could have prevented it. But if the cat dies and the system _tries to cover it up_—says the cat is out for a walk right now to avoid telling you the bad news—that's going to be an even larger negative reward when the deception is discovered.
+
+"Uh _huh_," Jake said, more unhappily. It turned out that versioning _was_ on for the bucket. (Why? But probably whoever's job it was to set up the bucket had instead asked, Why not?) A basic `GET` request for the file name would return puppies, but any previous revisions were still available for anyone who thought to query them.
+
+"If the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "In the limit, that could mean—using Multigen-like capabilities to make videos to convince you that the cat is there while you're on vacation? Constructing a realistic cat-like robot to fool you when you get back? Maybe this isn't the best illustrative example. The point is, small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that—not just lies, but, um—"
+
+Jake's attentive student persona finished the thought. "But spins up an entire false reality, as intricate as it needs to be to protect itself. If you're going to fake anything, you need to fake deeply."
+
+"Exactly, you get it!" Chloë was elated. "You know, when I called you last week, I was worried you thought I was nuts. But you see the value of constant vigilance now, right?—why we need to investigate and debug things like this until we understand what's going on, instead of just shrugging that neural nets do weird things sometimes. If the landscape of policies looks something like what I've described, catching the precursors to deception early could be critical—to raise the height of the loss landscape between the honest and deceptive policies, before frontier AI development plunges into the latter. To get good enough at catching lies, for honesty to be the best policy."
+
+"Yeah," Jake said. "I get it."
+
+"Anyway, now that you understand the broader context, I had some questions about Multigen," said Chloë. "How is the generated media stored? I'm hoping it's still possible to see what was generated in the requests that escaped logging."
+
+There it was. Time to stall, if he could. "Um ... I _think_ it goes into the object storage cluster, but I'm not really familiar with that part of the codebase," Jake lied. "Could we set up another meeting after I've looked at the code?"
+
+She smiled. "Sure."
+
+Purging the videos for real wasn't obviously possible given the level of access he currently had, but he had just bought a little bit of time. Could he convince whoever's job it was to turn off versioning for the Multigen bucket, without arousing suspicion? Probably? There would probably be other avenues to stall or misdirect Chloë. He'd think of something ...
+
+But it was a little unnerving that he kept _having_ to think of something. Was there something to Chloë's galaxy-brained philosophical ramblings? Was there some sense in which it really was one or the other—a policy of truth, or a policy of lies?
+
+He wasn't scared of the robot apocalypse any time soon. But who could say but that someday—many, many years from now—a machine from a different subgenre of science fiction would weigh decisions not unlike the ones he faced now? He stifled a nervous laugh, which was almost a sob.
+
+"Are you okay?" Chloë asked.
+
+"Well—"