Just as he was about to come, he was interrupted by an instant messenger notification. It was from someone named Chloë Lemoine, saying she'd like to discuss an issue in the Multigen codebase at his earliest convenience.
-_Tranny or real?_ Jake wondered, clicking on her profie.
+_Tranny or real?_ Jake wondered, clicking on her profile.
The profile text indicated that Chloë was on the newly formed capability risk evaluations team. Jake groaned. _Yuddites._ Fears of artificial intelligence destroying humanity had been trending recently. In response, Magma had commissioned a team with the purpose to monitor and audit the company's AI projects for the emergence of unforeseen and potentially dangerous capabilities, although the exact scope of the new team's power was unclear and probably subject to the outcome of future intra-company political battles.
It got worse. When the Multigen web interface supplied the user's requested media, that data had to live somewhere. The _videos themselves_ would still be on Magma's object storage cluster! How could that have seemed like an acceptable risk? Jake struggled to recall what he had been thinking at the time. Had he been too horny to even consider it?
-No. It had seemed safe at the time because videos weren't greppable. Short of having a human (or one of Magma's most advanced AIs) watch it, there was no simple way to test whether a video file depicted anything in particular, in contrast to how trivial it was to search text files for the appearance of a given phrase or pattern. The videos would be saved in object storage under uninformative file names based on the timestamp and a unique random identifier. The probability of someone snooping around the raw object files and just happening to watch Jake's videos had seemed sufficiently low as to be worth the risk. (Although, _ex ante_, he would have assigned a similarly low probability to someone catching a discrepancy between Multigen's logs and some other unanticipated log, which had just happened—suggesting that his sense of what was probable was miscalibrated.)
+No. It had seemed safe at the time because videos weren't greppable. Short of having a human (or one of Magma's more advanced audiovisual AI models) watch it, there was no simple way to test whether a video file depicted anything in particular, in contrast to how trivial it was to search text files for the appearance of a given phrase or pattern. The videos would be saved in object storage under uninformative file names based on the timestamp and a unique random identifier. The probability of someone snooping around the raw object files and just happening to watch Jake's videos had seemed sufficiently low as to be worth the risk. (Although, _ex ante_, he would have assigned a similarly low probability to someone catching a discrepancy between Multigen's logs and some other unanticipated log, which had just happened—suggesting that his sense of what was probable was miscalibrated.)
-But now that Chloë was investigating the bell character bug, it was only a matter of time. Comparing a directory listing of the object storage cluster with the timestamps of the missing logs would reveal which files had been generated illictly.
+But now that Chloë was investigating the bell character bug, it was only a matter of time. Comparing a directory listing of the object storage cluster with the timestamps of the missing logs would reveal which files had been generated illicitly.
He had some time. Chloë wouldn't have access to the account credentials to the Multigen bucket on the object storage cluster. In fact, it was likely that she'd ask Jake himself for help with that next week. (He was the Multigen team's designated contact to risk evals, and Chloë, the true believer in malevolent robots, showed no signs of suspecting him. There would be no reason to go behind his back.)
------
-"And so just because an AI seems to behaving well, doesn't mean it's aligned," Chloë was explaining. "If we train AI with human feedback ratings, we're not just selecting for policies that do tasks the way we intended. We're also selecting for policies that _trick human evaluators into giving high ratings_. In the limit, you'd expect those to dominate. 'Be good' strategies can't compete with 'look good' strategies in a looking-good contest—but in the current paradigm of deep learning, looking good is the only game in town. We don't know how these systems work in the way that we know how ordinary software works; we only know how to train them."
+"And so just because an AI seems to behaving well, doesn't mean it's aligned," Chloë was explaining. "If we train AI with human feedback ratings, we're not just selecting for policies that do tasks the way we intended. We're also selecting for policies that _trick human evaluators into giving high ratings_. In the limit, you'd expect those to dominate. 'Be good' strategies can't compete with 'look good' strategies in a looking-good contest—but in the current paradigm, looking good is the only game in town. We don't know how these systems work in the way that we know how ordinary software works; we only know how to train them."
-"So then we're just screwed, right?" said Jake in the tone of an attentive student. They were in a conference room on the Magma campus on Monday. After fixing the logging regex and overwriting the evidence with puppies, he had spent the weekend catching up with the "AI safety" literature. Honestly, some of it had been better than he expected. Just because Chloë was nuts didn't mean her co-ideologues didn't have any potentially valid points to make about risks from future systems.
+"So then we're just screwed, right?" said Jake in the tone of an attentive student. They were in a conference room on the Magma campus on Monday. After fixing the logging regex and overwriting the evidence with puppies, he had spent the weekend catching up with the "AI safety" literature. Honestly, some of it had been better than he expected. Just because Chloë was nuts didn't mean there was nothing intelligent to be said about risks from future AI systems.
"I mean, probably," said Chloë. She was beaming. Jake's plan to distract her from her investigation by asking her to bring him up to speed on "AI safety" seemed to be working perfectly.
"Uh huh," Jake said unhappily. The object storage docs made clear that the `Last-Modified` header was set automatically by the system; there was no provision for users to set it arbitrarily.
-"Imagine you're training your AI to act as a general home assistant, to run everything in a user's household in a way that the user rates highly," Chloë continued. "If—I don't know, say, the family cat dies, that's a negative reward—maybe a better feeding schedule or security monitoring could have prevented it. But if the cat dies and the system _tries to cover it up_—says the cat is out for a walk right now to avoid telling you the bad news—that's going to be an even larger negative reward when the deception is discovered."
+"Imagine you're training your AI to act as a general home assistant, to run everything in a user's household in a way that the user rates highly," Chloë continued. "If—I don't know, say, the family cat dies, that's a negative reward—maybe a better feeding schedule or security monitoring could have prevented it. But if the cat dies and the system _tries to cover it up_—says the cat is out for a walk right now to avoid telling you the bad news—that's going to be an even larger negative reward when the deception is discovered.
"Uh _huh_," Jake said, more unhappily. It turned out that versioning _was_ on for the bucket. (Why? But probably whoever's job it was to set up the bucket had probably instead asked, Why not?) A basic `GET` request for the file name would return puppies, but any previous revisions were still available for anyone who thought to query them.
-"If the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "In the limit, that could mean—using Multigen-like capabilities to make videos to convince you that the cat is there while you're on vacation? Constructing a realistic cat-like robot to fool you when you get back? Maybe this isn't the best illustrative example. The point is—"
+"If the system is trained to pass rigorous evaluations, a deceptive policy has to do a lot more work, different work, to pass the evaluations," Chloë said. "In the limit, that could mean—using Multigen-like capabilities to make videos to convince you that the cat is there while you're on vacation? Constructing a realistic cat-like robot to fool you when you get back? Maybe this isn't the best illustrative example. The point is, small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that—not just lies, but, um—"
-Jake's attentive student persona finished the thought. "Small, 'shallow' deceptions aren't stable. The set of policies that do well on evaluations comes in two disconnected parts: the part that tells the truth, and the part that spins up a false reality, as intricate as it needs to be to protect itself. If you're going to fake anything, you need to fake deeply." That felt like the kind of glib bullshit that Chloë would appreciate.
+Jake's attentive student persona finished the thought. "But spins up an entire false reality, as intricate as it needs to be to protect itself. If you're going to fake anything, you need to fake deeply."
-"Exactly, you get it!" Chloë was elated. "You know, when I called you last week, I was worried you thought I was nuts. But you see the value of constant vigilance now, right?—why we need to investigate things like this until we understand what's going on. If the landscape of policies looks something like what I've described, catching the precursors to deception early could be critical—to raise the height of the loss landscape between the honest and deceptive policies, before frontier AI development plunges into the latter. To get good enough at catching lies, for honesty to be the best policy."
+"Exactly, you get it!" Chloë was elated. "You know, when I called you last week, I was worried you thought I was nuts. But you see the value of constant vigilance now, right?—why we need to investigate and debug things like this until we understand what's going on, instead of just shrugging that neural nets do weird things sometimes. If the landscape of policies looks something like what I've described, catching the precursors to deception early could be critical—to raise the height of the loss landscape between the honest and deceptive policies, before frontier AI development plunges into the latter. To get good enough at catching lies, for honesty to be the best policy."
-"Yeah," Jake said. "I get it." He still had time. Chloë hadn't even asked anything about Multigen yet. Purging the videos for real wasn't obviously possible given the level of access he currently had, but he was still ahead of the investigation. (Could he convince whoever's job it was to turn off versioning for the Multigen bucket, without arousing suspicion? Probably?) There were still plenty of avenues to stall or misdirect Chloë. He'd think of something ...
+"Yeah," Jake said. "I get it."
-But it was a little unnerving that he kept having to think of something. What if he _hadn't_ been bullshitting? Was there some philosophical sense in which it really was one or the other—a policy of truth, or a policy of lies? He wasn't scared of the robot apocalypse any time soon. But who could say but that someday—many, many years from now—a machine from a different subgenre of science fiction would weigh decisions not unlike the ones he faced now? He stifled a nervous laugh, which was almost a sob.
+"Anyway, now that you understand the broader context, I had some questions about Multigen," said Chloë. "How is the generated media stored? I'm hoping it's still possible to see what was generated in the requests that escaped logging."
+
+There it was. Time to stall, if he could. "Um ... I _think_ it goes into the object storage cluster, but I'm not really familiar with that part of the codebase," Jake lied. "Could we set up another meeting after I've looked at the code?"
+
+She smiled. "Sure."
+
+Purging the videos for real wasn't obviously possible given the level of access he currently had, but he had just bought a little bit of time. Could he convince whoever's job it was to turn off versioning for the Multigen bucket, without arousing suspicion? Probably? There would probably be other avenues to stall or misdirect Chloë. He'd think of something ...
+
+But it was a little unnerving that he kept _having_ to think of something. Was there something to Chloë's galaxy-brained philosophical ramblings? Was there some sense in which it really was one or the other—a policy of truth, or a policy of lies?
+
+He wasn't scared of the robot apocalypse any time soon. But who could say but that someday—many, many years from now—a machine from a different subgenre of science fiction would weigh decisions not unlike the ones he faced now? He stifled a nervous laugh, which was almost a sob.
"Are you okay?" Chloë asked.