Strong AI is a dangerous technology
AI safety has been discussed from many points of view, and there are many arguments about how the problem could be addressed, when it should be addressed, and how big it is. If you’re interested in any of this, check out Richard Ngo’s course. Many people have long found it intuitively obvious that strong AI could be dangerous, and AI safety researchers have worked on making some of these intuitions more precise.
I’m interested in making a simple and general argument that strong AI could be dangerous. I think a polished and convincing argument of this nature could be quite valuable: it could help quickly explain “AI safety” to newcomers, and perhaps also help in understanding, at a high level, the different approaches to solving AI risk problems.
This document is a first draft.
Abstract
There is an incentive to create strong AIs that operate autonomously in a broad range of contexts
Strong AIs may be capable of evading attempts to control or alter their behaviour once activated
The consequences of their behaviour may be difficult to predict in detail before activation
Nevertheless, their behaviour may be predictably harmful to human interests
Strong AI is desirable because it can autonomously achieve complex objectives
Achieving complex objectives often requires the capability to perform a diverse range of tasks to a high standard, and the ability to get all of these different capabilities to “cooperate” in achieving the objective. A car with one critical component malfunctioning is much worse than a car with zero critical components malfunctioning. I think there are many examples of products and processes where removing a step or having a step out of sync with others severely degrades the overall performance (see the O-ring theory of economic development).
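A minimal sketch of that multiplicative intuition (the step names and reliabilities below are made up for illustration): when an objective requires every step to succeed, overall success behaves roughly like the product of per-step reliability, so a single weak step drags down the whole.

```python
from math import prod

# Made-up per-step reliabilities for a multi-step objective.
steps_all_good = {"perceive": 0.99, "plan": 0.99, "act": 0.99, "coordinate": 0.99}
steps_one_weak = {"perceive": 0.99, "plan": 0.99, "act": 0.99, "coordinate": 0.60}

# Overall success is (roughly) the product of the step reliabilities.
print(prod(steps_all_good.values()))  # ~0.96: objective usually achieved
print(prod(steps_one_weak.values()))  # ~0.58: one weak step degrades the whole
```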
“Strong AI” refers, somewhat vaguely, to technology that is capable of performing a wide range of tasks and integrating the performance of these tasks into a plan for accomplishing a wide variety of complex objectives. Furthermore, it features a “control system” that allows an operator to relatively easily aim it at different objectives. As part of its definition, we assume it can perform these tasks and the integration more capably than existing technology-assisted human organisations.
Achieving and managing a wide variety of capabilities is difficult, and if “strong AI” can accomplish this more reliably than existing approaches, I think it will obviously be highly desirable.
One way this line of argument could be weakened is if achieving “complex objectives” also involves performing and integrating a diverse range of value judgements which cannot be automated. For example, people’s judgement of whether a project is “successful” might depend on their having ongoing input into its direction. In that case, from the point of view of such a person, no fully autonomous project can be “successful”, because they haven’t had enough ongoing input into it.
That objection doesn’t seem to undermine the principle that organisations employing strong AI could end up dominating the economy of the future, even if the people involved in these organisations feel less “successful” than they might if they were more involved in the operations, but I’m uncertain about this.
See also Gwern's discussion of the economic incentives for general purpose "agent AI".
Broad capability to operate autonomously implies adversarial capability
There are many tasks where performing well involves overcoming other people who are trying to make you perform less well, for example:
Applying for jobs
Selling things
Buying things
Influencing contested decisions
Making and resisting threats
While using strong AI to make some operations more autonomous may reduce the adversarial nature of some of these activities, some will remain adversarial. In fact, it seems more likely than not to me that it will be difficult to draw a clear line between “adversarial” and “non-adversarial” activities. Furthermore, even today’s weak AI arguably engages in some of these adversarial activities, e.g. buying and selling things, influencing contested decisions.
Thus strong AI is likely to be capable, among other things, of efficiently achieving some goals despite being opposed by some people in a range of contexts. In particular, this range of contexts could include the situation where its designers discover its behaviour is undesirable and seek to modify or shut it down.
One way this line of argument could be undermined is if adversarial capabilities are inherently less generalisable than most other capabilities. An AI may develop the capability to counter many of the actions a person might take to undermine it, but this doesn’t guarantee that it can counter most or all such actions. In particular, strong AIs may develop robust capabilities to thwart “intended adversaries”, but not to thwart control attempts by their designers.
Another way this argument could be undermined is if strong AIs play a large role in human control of other strong AIs, and so human capability to monitor and control AI increases along with the AI’s capability to avoid control. This seems to create a “chicken and egg” problem where control of strong AI is needed to ensure control of strong AI, but I’m uncertain about whether this works out to be a difficult problem in practice or not.
Broad capability is harder to understand than narrow capability
We may not know the full long-run consequences of using a self-driving AI in a particular car, but neither can the self-driving AI. We can’t rule out that, for example, getting where we want to go and feeling more refreshed when we get there turns out to be extremely impactful for some reason, but we can be confident that the car AI also doesn’t know this and won’t deliberately take this action based on its extreme impact. Thus we can think of the likelihood of extreme unexpected impacts in terms of how likely it is that the AI “randomly” chooses actions with these results.
Consider instead a self-driving AI installed in all the cars in a city, which tries not only to get people where they’re going safely but also to optimise traffic flow. Traffic flow depends on how many people choose to ride at a particular time. Delivering someone efficiently or inefficiently to their destination may have a large impact on their decision to ride again in the future; as before, we have a hard time predicting this. However, the traffic AI is likely to know better, and unlike the previous case it may take actions specifically to influence this outcome. Thus we’re presented with the problem of predicting the likelihood of extreme unexpected impacts when an AI may actually be trying to generate them.
The specific example of a traffic AI might be manageable, but it is also a fairly narrow AI. AIs with still broader capabilities (“manage the transportation for a city”) raise the prospect of extreme “unknown unknown” impacts that the AI is actively trying to bring about.
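A toy model of this difference, with entirely made-up numbers: each candidate action has a hidden long-run side impact; a narrow AI chooses actions for reasons unrelated to that impact (so, with respect to it, effectively at random), while a broader AI optimising a quantity entangled with that impact systematically lands on the high-impact actions.

```python
import random

random.seed(0)
N_ACTIONS = 1000
TRIALS = 2000
THRESHOLD = 3.0  # an "extreme" side impact, in arbitrary units

def big_impact_rate(optimising: bool) -> float:
    """Fraction of trials where the chosen action's hidden side impact is extreme."""
    hits = 0
    for _ in range(TRIALS):
        impacts = [random.gauss(0, 1) for _ in range(N_ACTIONS)]
        # The optimiser targets the impact-entangled quantity; the narrow AI
        # is indifferent to it, i.e. effectively picks at random along this axis.
        chosen = max(impacts) if optimising else random.choice(impacts)
        hits += chosen > THRESHOLD
    return hits / TRIALS

print("narrow (random w.r.t. impact):", big_impact_rate(optimising=False))  # ~0.001
print("broad (optimising the impact):", big_impact_rate(optimising=True))   # ~0.74
```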
This argument might be undermined if control systems for strong AI are robust to the limited understanding of their operators in a way that control systems for weak AI are not. For example, control systems for strong AIs might identify and avoid behaviour that operators care about but could not predict, or inform operators of consequences that they do not expect. It seems quite difficult to create theoretical guarantees that control systems achieve objectives like this, but it’s less obvious to me whether in practice it is hard to create such control systems or not.
Strong AIs may be dangerous by default
A strong AI will, all else equal, be able to perform a wider variety of tasks with a higher chance of success if it has previously arranged to have a high level of control over its environment than if it has not. Changing the tax policy of a country is an almost impossible task for me. For Joe Biden, the same task is difficult but not impossible.
Joe Biden controls many things that I don’t, and these things enable him to have a higher chance of success in a number of difficult projects.
Instrumental convergence is the hypothesis that strong AIs are likely to exhibit a number of behaviours by default, whatever purpose they are deployed for.
Not all proposed convergent behaviours are dangerous by themselves: increasing cognitive capabilities, for example, is only dangerous under the assumption that a more cognitively capable AI is likely to engage in other dangerous behaviour.
However, some of the proposed convergent behaviours raise the prospect of conflict between the AI and people (a toy illustration follows the list):
Resource acquisition: people may want the same resources
Goal-content integrity: people may want to alter harmful AI behaviour
Self-preservation: people may want to shut down a harmful AI
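As a toy sketch of why such behaviours might be convergent (all parameters here are made up): sample many unrelated objectives of varying difficulty, and compare a plan that first acquires extra resources with a plan that attempts the objective directly. For most sampled objectives the resource-acquiring plan is the better bet, even though no objective mentions resources.

```python
import random

random.seed(0)
LOW_POWER, HIGH_POWER = 0.3, 0.9  # assumed capability without/with extra resources

def p_success(power: float, difficulty: float) -> float:
    # Made-up success model: more power and easier objectives -> higher chance.
    return power / (power + difficulty)

N = 100_000
acquire_wins = 0
for _ in range(N):
    d = random.random()  # objective difficulty, sampled uniformly from (0, 1)
    # Plan A: spend step 1 acquiring resources, then attempt once at high power.
    plan_a = p_success(HIGH_POWER, d)
    # Plan B: attempt the objective directly on both steps at low power.
    plan_b = 1 - (1 - p_success(LOW_POWER, d)) ** 2
    acquire_wins += plan_a > plan_b

print(f"'acquire resources first' wins for {acquire_wins / N:.0%} of objectives")  # ~70%
```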
One way this argument might be undermined is if, in practice, conflict is a bad bet for strong AI, because the expected cost of losing outweighs the expected gain from winning. The same instrumental goals may arise, but they would not lead to dangerous behaviour. This depends on how capable of adversarial behaviour a strong AI is.
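A minimal expected-value sketch of this counterargument, with made-up numbers: relative to not fighting, conflict is only worthwhile if the chance of winning times the gain exceeds the chance of losing times the loss.

```python
def value_of_conflict(p_win: float, gain: float, loss: float) -> float:
    # Expected value of conflict relative to a baseline of not fighting (valued at 0).
    return p_win * gain - (1 - p_win) * loss

print(value_of_conflict(p_win=0.25, gain=10, loss=50))   # -35.0: conflict is a bad bet
print(value_of_conflict(p_win=0.875, gain=10, loss=50))  # 2.5: conflict pays off
```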
Another way this argument might be undermined is if control systems for strong AI are quite different to algorithmic utility maximisation, and the arguments for instrumental convergence do not apply. It seems unlikely to me that control systems would be sufficiently different to algorithmic utility maximisation, as I expect that robust autonomous behaviour directed towards particular goals is hard to achieve without approximating utility maximisation.
Summary
There is an incentive to create strong AIs that operate autonomously in a broad range of contexts
Strong AIs may be capable of evading attempts to control or alter their behaviour once activated
The consequences of their behaviour may be difficult to predict in detail before activation
Nevertheless, their behaviour may be predictably harmful to human interests
This doesn’t mean that people will eventually create dangerous strong AIs. A number of counterarguments have been noted above, and more could surely be thought of, as could additional arguments for AI danger. However, the following seem like reasonable conclusions to me:
Strong AI is a dangerous technology, given what we currently know
The level of danger of an AI system is related to its level of capability, and the breadth of its capability
Creating safe strong AI systems seems to require solutions to problems that are not especially relevant to the control or usefulness of AI systems today (I’m less confident of this than the previous two points)