- * He sets up another meeting with the Evals team member, to try to suss out what her plans are, to stall—but ostensibly, to get up to speed on her risk concerns
- * Scene break: at the meeting, she's explaining Christiano's idea about there being a basin of policies that admit their mistakes, rather than using deception to get a high score
- * Jake sees the analogy to his own behavior
+Or maybe—he could read some Yuddite literature over the weekend, feign a sincere interest in "AI safety", try to get on her good side? Jake had trouble believing that any sane person could really think Magma's machine learning models were plotting something. This cult victim had ridden a wave of popular hysteria into a sinecure. If he played nice and validated her belief system in the most general terms, maybe that would be enough to make her feel useful, with no need to chase shadows to justify her position.
+
+------
+
+[TODO—
+ * Chloë is explaining deceptive alignment. If a model does well on our evals, how do we know whether it's actually doing the right thing, or just trying to fool us?
+ * Jake had explicitly asked to be brought up to speed on AI safety, to stall on whatever uncomfortable audit questions she might have—the puppy videos are already in place, but stalling still seemed like a good idea
+ * "So then we're just doomed then, right?" Jake is trying to be agreeable and flattering. He's fixed the regex and overwritten his porn with puppies, and spent the weekend reading AI safety papers and blog posts. Some of it was honestly better than he expected. This Chloë being insane didn't invalidate the whole field as having serious points to make.
+ * "Maybe not." There are two ways to pass all the evals: do things the right way, or be pervasively deceptive. The thing is, policies are trained continuously via gradient descent. The purely honest policy and the purely deceptive policy look identical on evals, but in between, the model would have to learn how to lie, and lie about lying, and cover-up coverups. (Chloë lapses into Yuddite speak about the "Great Web of Causality.") Could we somehow steer into the honest attractor?
+ * That's why she believes in risk paranoia: if situational awareness is likely to emerge at some point, she doesn't want to rule it out now. AI is real; it's not just a far-mode, in-the-future thing.
+ * Jake sees the uncomfortable analogy to his own situation. He tries to think of what other clue he might have left, while the conversation continues ...
+ * The Last-Modified dates! They're set by the system; the API doesn't offer a way to backdate them.
+ * Maybe she won't notice the dates? Possible (but someone persnickety enough to have found the log discrepancy would probably check)
+ * Merely noticing the dates won't directly implicate him (the way the video screaming "Jake!" would), although it would indicate that the poltergeist was covering its tracks from the investigation (rather than just wanting puppies in the first place)
+ * Are the buckets versioned?! Probably not, right—it would be wasteful to version a video bucket. On the other hand, Multigen isn't supposed to write twice; maybe someone left versioning on as a default template ... (see the sketch after the outline)
+ * They did.
+ * Jake muses philosophically about the analogy, and says something ambiguously indicating intent to come clean.
+ THE END
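+
+[TODO: mechanics reference for the Last-Modified and versioning beats. A minimal sketch of the check Chloë could run, assuming Multigen's renders live in S3-style object storage accessed via boto3; the bucket name and key prefix are hypothetical. Object Last-Modified timestamps are assigned server-side on write and can't be backdated through the API, and a bucket with versioning enabled quietly retains every overwritten original:
+
+```python
+import boto3
+
+s3 = boto3.client("s3")
+BUCKET = "multigen-renders"  # hypothetical bucket name
+
+# Was versioning left enabled, e.g. by a default template?
+status = s3.get_bucket_versioning(Bucket=BUCKET).get("Status")
+print("Versioning:", status or "Disabled")
+
+# List every version of the objects under the render prefix. Any
+# non-latest version is a file that was overwritten after the fact,
+# and its system-set LastModified records exactly when.
+resp = s3.list_object_versions(Bucket=BUCKET, Prefix="renders/")
+for v in resp.get("Versions", []):
+    note = "" if v["IsLatest"] else "  <-- overwritten original, still retained"
+    print(v["Key"], v["VersionId"], v["LastModified"], note)
+```
+]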