Imagine you have a 6-year-old, and you want to teach your 6-year-old to be good, obviously, as everyone does, and you realize that your 6-year-old is actually, clearly, a genius. By the time they are 15, anything you teach them that is incorrect, they will be able to completely destroy. They're going to question everything. And one question is: is there a core set of values you could give to models such that, when they can critique it more effectively than you can, and they do, it survives into something good? Can that survive in the world? Can it survive in models? I think there are a lot of interesting theoretical questions there.

– I mean, I think that's the question, right? Does this kind of training hold up when models are as smart as humans, or smarter? There's this age-old fear in the A.I. safety community that at some point these models will start to develop their own goals, goals that may be at odds with human goals.

– I think it is an open question, and I'm very uncertain here. Some people might say, "Well, what the 15-year-old will do, if they're really smart, is figure out that this is all completely made up and rubbish." But part of me thinks it's not obvious that that's true, that that's the only possible equilibrium to reach. And it may not be sufficient, but it does feel necessary: it feels like we're dropping the ball if we don't at least try to explain to A.I. models what it is to be good.
