A basic argument for AI risk

Rohin Shah writes (referenced here):

Currently, I’d estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I’d think wasn’t clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded. EDIT: I looked back and explicitly counted – I ran it with at least 19 people, and 2 succeeded: one gave an argument for “AI risk is non-trivially likely”, another gave an argument for “this is a speculative worry but worth investigating” which I wasn’t previously counting but does meet my criterion above.)

I thought this was surprising, so here is an attempt, time-capped at 45 mins.

1. The concern is valid in the limit: An entity of arbitrary, God-like intelligence would be very scary.

We already see that the most intelligent humans, like von Neumann and the other Hungarian Martians, were able to perform feats of science that steered humanity’s fate.

If we had an artificial system that was as intelligent as von Neumann, this would be very scary, because it might be able to perform similar feats of science and engineering. Also, Moore’s law is uncertain, but it seems likely that within a generation we could go from having just one artificial von Neumann to having many of them. This raises the stakes.

2. Year by year we are approaching the limit: What comes after Comprehensive AI Systems?

We can think about what the world would look like in 2030-2035 if many tasks are delegated to successors of GPT-3. In this world, interfaces between many services and AI systems exist, and these are used to affect real-world systems. For example, I might routinely use a successor of GPT-3 to edit my texts, translations might be performed in real time through speech recognition, or automatic threat detection and deployment might be used by the world’s militaries.

Then we can think about what the world looks like the decade after that one, and the decade after, and so forth. And it looks like ML systems generally acquiring more and more responsibility for managing human systems.

And in general, we might worry that as current systems become more and more capable, they might eventually exceed humans and start to manifest some of the dangers that a being of God-like intelligence would also display. Crucially, we think that the human brain is made out of atoms and so is in principle replicable in silico.

3. Alignment proposals are uncertain, shaky and untested

In this situation, we would love to have some mathematical proof that these AI systems, which might end up making important decisions, are in some sense friendly to humans. We don’t have that guarantee. We might instead hope that there will be a strong incentive to create systems that increase human flourishing rather than reduce it.

But on the other hand, if we look at, e.g., the algorithms driving social media and reducing attention spans, this doesn’t really inspire much confidence. We can also look at the organizing structure of society and notice that it already falls prey to Goodhart’s law, in that maximizing profit does not maximize flourishing. Some examples might be the Sumangali system, Nestlé stealing water to bottle in a drought, or lobbying from weapons manufacturers making war more likely. We might worry that similar dynamics will exist if more and more decisions are put in the hands of AI systems and these are programmed to maximize something that at first looks like human flourishing but ultimately isn’t.

We could also worry about more speculative failure modes, in which very intelligent systems at first appear to be helpful but then stop being so as they become harder to stop. We see that these kinds of things happen with reward hacking in current models, and that they pop up in models of very intelligent systems.

4. Expected value calculations lead the way

Because we expect these systems to be tightly integrated with the human economy, we would like to have guarantees. But we don’t have them. Given that we don’t have guarantees that bad outcomes won’t happen, they might in fact happen. This would be bad.

If we count the number of people who we would expect to be affected by future AI systems, this would be a large number. This is similar to how, e.g., the US Federal Reserve sets monetary policy in the US, and so improving its decisions would reverberate across the US. Because AI systems will have a large-scale impact, steering that impact would also be valuable. If we multiply out that impact (e.g., here), the expected value turns out to be higher than that of other opportunities.
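As a rough illustration of the kind of multiplication meant here, the following is a minimal sketch, not the calculation linked above. The function name, its parameters, and the way the terms are combined are all assumptions made for illustration, and the numbers in the example call are arbitrary placeholders rather than estimates.

```python
def expected_value(people_affected, p_bad_outcome, relative_risk_reduction,
                   value_per_person=1.0):
    """Expected value of work that lowers the chance of a bad outcome.

    All parameters are hypothetical inputs, not figures from the essay:
      people_affected         -- how many people the outcome would touch
      p_bad_outcome           -- probability of the bad outcome (0 to 1)
      relative_risk_reduction -- fraction of that probability the work removes (0 to 1)
      value_per_person        -- value of avoiding the bad outcome for one person
    """
    for name, p in (("p_bad_outcome", p_bad_outcome),
                    ("relative_risk_reduction", relative_risk_reduction)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {p}")
    # Scale of the impact, times how likely it is, times how much we can shift it.
    return people_affected * value_per_person * p_bad_outcome * relative_risk_reduction


if __name__ == "__main__":
    # Placeholder inputs chosen only to show the arithmetic.
    print(expected_value(people_affected=1_000_000,
                         p_bad_outcome=0.1,
                         relative_risk_reduction=0.01))
```

The point of writing it this way is that the result is linear in each term, so even a small probability or a small reduction in risk can yield a large expected value when the number of people affected is large. The same linearity means the conclusion is only as good as the inputs.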