MIRI's "The Problem" hinges on equivocation
The Machine Intelligence Research Institute thinks we should shut down AI research for a long time. This would obviously be very difficult, and very damaging (because we don’t get the benefits of AI for all the time it’s shut down). They think it would be a good idea nontheless because otherwise AI will kill us all. They don’t just think this is a possible outcome that can’t be ruled out based on what we know so far (as I do) – they think it’s overwhelmingly likely.
They’ve recently1 published an article called “The Problem” where they set out an argument for why they think this is the case. They’re written on this issue in the past, but generally in a fragmented manner, making it hard to get a grip on what they regard as the key drivers of their high confidence in doom. This effort is notable because it is relatively simple and relatively complete. So we can ask: does it stand up? And the answer is it does not, at least not without major additional assumptions.
A key leg of their argument concerns “goal-directedness”. The argument goes like this (my summary): A) powerful AI will be useful – in fact it will be economically transformative. It will, for example, autonomously undertake major scientific and engineering projects. B) In order to do this, these AI systems will have to act in a lot of goal-directed ways. If an AI is trying to cure aging, it’s not going to go on some aimless wandering through the land of biochemisty, it’s going to be continually prioritising and selecting actions based on the desired end result. C) Goal-directed systems in general will do things like prioritise their own continued existence, because continuing to exist often helps one achieve their goals2. Therefore D) economically transformative AI will prioritise its own continued existence.
The problem with this argument is that it equivocates on “goal-directedness”. To illustrate this, let’s first consider a simpler example: olympic sprinters can run really fast. Animals that can run really fast can catch antelopes. Therefore olympic sprinters can catch antelopes. What’s wrong with this argument? It’s obviously false, but Olympic sprinters can run really fast, Olympic sprinters are animals, and animals that can run really fast can catch antelopes. The problem is that “really fast” is a vague term that can mean many different things depending on the context. The first claim we contextually fixes its meaning to about 40 km/h, while the second claim contextually fixes its meaning to about 100 km/h. This is equivocation: we are taking “really fast” to mean 40 km/h, but also to mean 100 km/h. This argument is obvious nonsense, but equivocation can seem compelling when the issue is more complicated and it’s hard to hold the contextual meanings in our heads.
The original example is more complicated, so let’s spell it out. B) says that economically useful AI will be goal-directed because goal-directedness is needed to (say) successfully complete ambitious engineering projects. This contextually fixes exactly what kind of goal-directedness we’re talking about: behaviours that are goal-directed in an intuitive sense and necessary to successfully complete ambitious engineering projects. So do you need a self-preservation drive to complete an ambitious engineering project? You don’t. So D) is not implied, and C) must therefore be invoking a different notion of goal-directedness.
The failure of D) significantly undermines the rest of their argument. For example, even granting the rest of their arguments, it does not follow that AIs that learn the “wrong goals” are likely to end up in conflict with people, so this is not a trivial nitpick.
Now you might be sympathetic to the idea that advanced AI will have a certain kind of goal-oriented disposition that involves self-preservation and everything else. MIRI, however, are not simply saying they’re sympathetic to the idea, they’re saying it follows from the premise that advanced AI will be useful. The point is, it doesn’t. MIRI observes (correctly) that were we to look at any AI building football stadiums we’d probably see a lot of behaviours that appear goal-oriented. But these tendencies are completely accounted for by the fact that they build football stadiums – without substantive additional assumptions, they give no support to the notion that these systems have some extra psychological propensity to pursue certain kinds of goals in certain kinds of ways.
MIRI’s piece does offer another argument to elaborate on this. Here it is, in their words this time, emphasis mine:
An AI needs the mental machinery to strategize, adapt, anticipate obstacles, etc., and it needs the disposition to readily deploy this machinery on a wide range of tasks, in order to reliably succeed in complex long-horizon activities.
[Therefore] smarter-than-human AIs are very likely to stubbornly reorient towards particular targets, regardless of what wrench reality throws into their plans.
[Thus] anything that could potentially interfere with the system’s future pursuit of its goal is liable to be treated as a threat.
This is essentially the same argumentative move with slightly different wording. We watch an AI build a football stadium and see it solve a lot of problems that crop up. Therefore it must have a propensity to solve any problem, in ways that accord with our notion of problem solving. So we can propose problems and predict the ways it will solve them. But the propensity claim doesn’t follow from the premise; additional assumptions are needed.
Two ways to rescue the inference
There are a few ways to rescue D). One option: maybe the “in peacetime” qualification is unlikely to hold3. Many people talk about AI as a tool to secure geopolitical dominance. I think this is a legitimate worry, but it poses a different threat model to the one MIRI seems to focus on: one where AI learns to be adversarial from an adversarial environment. We should also continue to worry about equivocation here: an AI that succeeds in an adversarial environment does need to learn to defend itself from adversaries, but it doesn’t need to learn to defend itself from friends. We could be sweeping too much under the rug by pretending “adversarial behaviour” is a monolithic thing, just as we did with “goal-directedness” earlier.
Another option, which I think is closer to MIRI’s thinking, is some kind of generalisation thesis. If you are smart enough and have some markers of goal-directedness then via “generalization” then you also pick up the worrying ones like self-preservation (specifically, dangerous varieties). Why? I don’t know! One guess I might make is a hand-wavey argument about machine learning: machine learning systems are good at generalising. So if a system learns, say, 10 different goal-oriented behaviours maybe in “generalises” and becomes “goal-oriented in general”.
This is unsatisfyingly vague. Now vague stories really do sometimes work to predict what ML systems do. “If you can recognise 10 dogs you can probably recognise dogs in general” – kinda true! But they certainly don’t always work: “If you can recognise 10 things you can recognise things in general” – not remotely true. Is “goal-orientedness in general” the sound or unsound kind of vauge? It seems a bit dubious to me, but honestly I don’t really think we’re going to gain too much by deep theorising here.
Instead, we could survey the empirical evidence. Do existing AIs show signs of learning a self-preservation tendency? Or, taking a broader view, are they converging toward behaviour that is better predicted by broad claims about goals and how to achieve them, or by specific analysis of what it is they’ve learned to do. They absolutely are becoming more goal-directed in the sense that they are learning to carry out longer term projects at a fairly rapid rate (and, if you accept premise B), this implies they’re learning more goal-directed behaviours!), so we’ve some reason to think we could learn something.
One piece of evidence (which I’ve mentioned previously) is that if you ask AIs to “focus on achieving your goal above all else” and put them in a situation where they will either be shut off and fail to achieve their goal, or they have to blackmail or murder someone, they will reasonably often choose blackmail or murder (although it’s worth noting that the presented scenarios were the result of an extensive search to produce this outcome reliably). This agentic misalignment doesn’t look like a general tendency to self-preservation – we had to ask for them to act like this – but it is a tendency to self-preserve. I’d put this weakly on the side of learning self-preservation.
On the other hand, recent systems definitely seem motivated to create code that passes common validation methods. They do it successfully fairly often, even in challenging situations (see the aforementioned long-term-project success rates). Furthermore, they’ve been noted to sometimes cheat on validation tests for code they found to hard to implement, and write code in such a manner that it continues to run even if it contains errors. So there seems to be a reasonably clear target – validatable, roughly specification compliant code – and they engage in creative problem solving to get there. We have markers of goal directedness. So, if we restrict our focus to coding, do we find their behaviour better predicted by broad claims about goal achievement, or do we need more specific analysis of the tasks they’ve been trained to do? Suppose they had a general desire to write validatable, somewhat spec compliant code. Then there are a bunch of other things we’d also expect them to do:
Convince users to give them easier specs, and quietly revise them down
Dissuade users from changing specs when they’re nearing a solution
Dissuade users from imposing strict validation requirements
But I’ve used these systems a lot and never encountered any of that. I’ve spent a lot more time telling the systems to simplify their plans than they have me. ChatGPT also fails to find documented examples of it. So even though we have clear markers of goal-orientedness emerging – including some pathological behaviours – we still have some big predictive misses if we start from sweeping statements about “what it is they really want”. To my mind, this is a somewhat stronger strike against the goal-directedness generalization thesis than agentic misalignment was a strike in favour.
There’s another issue here, related to why I think “goal-directedness generalization” is a bit of a dubious idea. We can ask: are coding assistants, which don’t manipulate users to give them easier tasks, less “generally goal-directed” than a hypothetical machine that did do this, or are they just as “generally goal-directed”, but with different goals? This problem is actually theoretically undecidable. But then, if there’s so much flexibility in the meaning of goal-directedness, why should we think goal-directedness is the right kind of goal directedness to cause the behaviour MIRI worries about? We could easily end up equivocating once again.
This brings us back to my original point: if we’re trying to reason about complex issues in an informal way, there are many opportunities for definitions to slip between our premises and our inferential steps. It’s all too easy to lead ourselves astray.
What’s to be done about this? I don’t think everyone to wait until they have a fully precise theory of risk before they say anything. Good theoretical work is hard, and AI is developing fast; if you see great risks on the horizon waiting for a fully satisfactory theory could easily mean waiting until after you need it. It’s fine to raise the alarm on the basis of suspicions that you’re not quite sure you can trust, but you still have a responsibility to mark out the gaps in your reasoning and to test if they can be patched as readily as you initially guessed.
Well I recently discovered it. It was published in Feb 2025 actually.
I think this is debatable, but I’ll accept it for the sake of argument here.
I’m speaking loosely about war and peace as a helpful shorthand. It’s possible that AI systems are under significant adversarial pressure in times of geopolitical peace or vise-versa.