Last month, OpenAI, the Elon Musk-founded artificial intelligence research lab, announced the arrival of the newest version of an AI system it had been working on that can mimic human language — a model called GPT-3.
I’ve now spent the past few days looking at GPT-3 in greater depth and playing around with it. I’m here to tell you — the hype is real. It has its shortcomings, but make no mistake: GPT-3 represents a tremendous leap for AI.
A year ago I sat down to play with GPT-3’s precursor dubbed (you guessed it) GPT-2. My verdict at the time was that it was pretty good. When given a prompt — say, a phrase or sentence — GPT-2 could write a decent news article, making up imaginary sources and organizations and referencing them across a couple of paragraphs. It was by no means intelligent — it didn’t really understand the world — but it was still an uncanny glimpse of what it might be like to interact with a computer that does.
A year later, GPT-3 is here — and it’s smarter. A lot smarter. OpenAI took the same basic approach it had taken for GPT-2 (more on this below), and spent more time training it with a bigger dataset. The result is a program that is significantly better at passing various tests of language ability that machine learning researchers have developed to compare our computer programs. (You can sign up to play with GPT-3, but there’s a waitlist.)
But that description understates what GPT-3 is, and what it does.
“It surprises me continuously,” Arram Sabeti, an inventor with early access to GPT-3 who has published hundreds of examples of results from the program, told me. “A witty analogy, a turn of phrase — the repeated experience I have is ‘there’s no way it just wrote that.’ It exhibits things that feel very much like general intelligence.”
Not everyone agrees. “Artificial intelligence programs lack consciousness and self-awareness,” researcher Gwern Branwen wrote in his article about GPT-3. “They will never be able to have a sense of humor. They will never be able to appreciate art, or beauty, or love. They will never feel lonely. They will never have empathy for other people, for animals, for the environment. They will never enjoy music or fall in love, or cry at the drop of a hat.”
Sorry, I lied — GPT-3 wrote that. Branwen fed it a prompt — a few words expressing skepticism about AI — and GPT-3 came up with a long and convincing rant about how computers won’t ever be really intelligent.
Branwen himself told me he was taken aback by GPT-3’s capabilities. As GPT-style programs scale, they get steadily better at predicting the next word. But up to a point, Branwen said, that improved prediction “just makes it a little more accurate a mimic: a little better at English grammar, a little better at trivia questions.” GPT-3 suggests to Branwen that “past a certain point, that [improvement at prediction] starts coming from logic and reasoning and what looks entirely too much like thinking.”
GPT-3 is, in some ways, a really simple program. It takes a well-known, not even state-of-the-art approach from machine learning. Fed most of the internet as data to train itself on — news stories, wiki articles, even forum posts and fanfictions — and given lots of time and resources to chew on it, GPT-3 emerges as an uncannily clever language generator. That’s cool in its own right, and has big implications for the future of AI.
How GPT-3 works
To understand what a leap GPT-3 represents, it would be helpful to review two basic concepts in machine learning: supervised and unsupervised learning.
Until a few years ago, language AIs were taught predominantly through an approach called “supervised learning.” That’s where you have large, carefully labeled data sets that contain inputs and desired outputs. You teach the AI how to produce the outputs given the inputs.
That can produce good results — sentences, paragraphs, and stories that do a solid job mimicking human language — but it requires building huge data sets and carefully labeling each bit of data.
Supervised learning isn’t how humans acquire skills and knowledge. We make inferences about the world without the carefully delineated examples from supervised learning. In other words, we do a lot of unsupervised learning.
Many people believe that advances in general AI capabilities will require advances in unsupervised learning — where AI gets exposed to lots of unlabeled data and has to figure out everything else itself. Unsupervised learning is easier to scale since there’s lots more unstructured data than there is structured data (no need to label all that data), and unsupervised learning may generalize better across tasks.
GPT-3 (like its predecessors) is an unsupervised learner; it picked up everything it knows about language from unlabeled data. Specifically, researchers fed it most of the internet, from popular Reddit posts to Wikipedia to news articles to fan fiction.
GPT-3 uses this vast trove of information to do an extremely simple task: guess what words are most likely to come next, given a certain initial prompt. For example, if you want GPT-3 to write a news story about Joe Biden’s climate policy, you might type in: “Joe Biden today announced his plan to fight climate change.” From there, GPT-3 will take care of the rest.
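To make the "guess the next word" idea concrete, here is a minimal sketch in Python. It is not how GPT-3 is built: GPT-3 is a large neural network, while this toy just counts which word follows which in a three-sentence corpus invented for illustration. The corpus and the function names are assumptions, not anything from OpenAI. What the sketch shares with GPT-3 is the setup: the only training signal is raw text, with no human-written labels.

```python
from collections import Counter, defaultdict

# Tiny invented stand-in corpus; GPT-3's real training data is vastly larger.
CORPUS = (
    "joe biden announced his plan . "
    "joe biden announced his climate plan . "
    "the senate announced a vote ."
)

def train_bigram_counts(text):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn raw follow-up counts into probabilities for one context word."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()}

def complete(counts, prompt, max_words=5):
    """Greedily extend the prompt with the most likely next word each step."""
    words = prompt.split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word never appeared mid-corpus; nothing to predict
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

counts = train_bigram_counts(CORPUS)
```

With this corpus, `next_word_probabilities(counts, "announced")` gives "his" a probability of 2/3 and "a" a probability of 1/3, and `complete(counts, "joe biden", max_words=2)` returns "joe biden announced his". The toy looks only one word back; part of what makes GPT-3 so much more capable is that it conditions on a far longer stretch of preceding text.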
Here’s what GPT-3 can do
OpenAI controls access to GPT-3; you can request access for research, a business idea, or just to play around, though there’s a long waiting list for access. (It’s free for now, but might be available commercially later.) Once you have access, you can interact with the program by typing in prompts for it to respond to.
GPT-3 has been used for all kinds of projects so far, from making imaginary conversations between historical figures to summarizing movies with emoji to writing code.
Sabeti prompted GPT-3 to write Dr. Seuss poems about Elon Musk. An excerpt:
But then, in his haste,
he got into a fight.
He had some emails that he sent
that weren’t quite polite.
The SEC said, “Musk,
your tweets are a blight.
Not bad for a machine.
GPT-3 can even correctly answer medical questions and explain its answers (though you shouldn’t trust all its answers; more about that later):
So @OpenAI have given me early access to a tool which allows developers to use what is essentially the most powerful text generator ever. I thought I’d test it by asking a medical question. The bold text is the text generated by the AI. Incredible… (1/2) pic.twitter.com/4bGfpI09CL
— Qasim Munye (@QasimMunye) July 2, 2020
You can ask GPT-3 to write simpler versions of complicated instructions, or write excessively complicated instructions for simple tasks. At least one person has gotten GPT-3 to write a productivity blog whose bot-written posts performed quite well on tech news aggregator Hacker News.
Of course, there are some things GPT-3 shouldn’t be used for: having casual conversations and trying to get true answers, for two. Tester after tester has pointed out that GPT-3 makes up a lot of nonsense. This isn’t because it doesn’t “know” the answer to a question — asking with a different prompt will often get the correct answer — but because the inaccurate answer seemed plausible to the computer.
Relatedly, GPT-3 will by default try to give reasonable responses to nonsense questions like “how many bonks are in a quoit?” That said, if you add to the prompt that GPT-3 should refuse to answer nonsense questions, then it will do that.
So GPT-3 shows its skills to best effect in areas where we don’t mind filtering out some bad answers, or areas where we’re not so concerned with the truth.
Branwen has an extensive catalog of examples of fiction writing by GPT-3. One of my favorites is a letter denying Indiana Jones tenure, which is lengthy and shockingly coherent, and concludes:
It is impossible to review the specifics of your tenure file without becoming enraptured by the vivid accounts of your life. However, it is not a life that will be appropriate for a member of the faculty at Indiana University, and it is with deep regret that I must deny your application for tenure….Your lack of diplomacy, your flagrant disregard for the feelings of others, your consistent need to inject yourself into scenarios which are clearly outside the scope of your scholarly expertise, and, frankly, the fact that you often take the side of the oppressor, leads us to the conclusion that you have used your tenure here to gain a personal advantage and have failed to adhere to the ideals of this institution.
Want to try it yourself? AI Dungeon is a text-based adventure game powered in part by GPT-3.
Why GPT-3 is a big deal
GPT-3’s uncanny abilities as a satirist, poet, composer, and customer service agent aren’t actually the biggest part of the story. On its own, GPT-3 is an impressive proof of concept. But the concept it’s proving has bigger ramifications.
For a long time, we’ve assumed that computers with general intelligence (computers that surpass humans at a wide variety of tasks, from programming to researching to having intelligent conversations) will be difficult to make and will require a detailed understanding of the human mind, consciousness, and reasoning. And for the last decade or so, a minority of AI researchers have been arguing that we’re wrong: that human-level intelligence will arise naturally once we give computers more computing power.
GPT-3 is a point for the latter group. By the standards of modern machine-learning research, GPT-3’s technical setup isn’t that impressive. It uses an architecture from 2018 — meaning, in a fast-moving field like this one, it’s already out of date. The research team largely didn’t fix the constraints on GPT-2, such as its small window of “memory” for what it has written so far, which many outside observers criticized.
“GPT-3 is terrifying because it’s a tiny model compared to what’s possible, trained in the dumbest way possible,” Branwen tweeted.
That suggests there’s potential for a lot more improvements — improvements that will one day make GPT-3 look as shoddy as GPT-2 now does by comparison.
GPT-3 is a piece of evidence on a topic that has been hotly debated among AI researchers: Can we get transformative AI systems, ones that surpass human capabilities in many key areas, just using existing deep learning techniques? Is human-level intelligence something that will require a fundamentally new approach, or is it something that emerges of its own accord as we pump more and more computing power into simple machine learning models?
These questions won’t be settled for another few years at least. GPT-3 is not a human-level intelligence even if it can, in short bursts, do an uncanny imitation of one.
Skeptics have argued that those short bursts of uncanny imitation are driving more hype than GPT-3 really deserves. They point out that if a prompt is not carefully designed, GPT-3 will give poor-quality answers. That is absolutely true, though it ought to guide us toward better prompt design, not toward giving up on GPT-3.
They also point out that a program that is sometimes right and sometimes confidently wrong is, for many tasks, much worse than nothing. (There are ways to learn how confident GPT-3 is in a guess, but even while using those, you certainly shouldn’t take the program’s outputs at face value.) They also note that other language models purpose-built for specific tasks can do better on those tasks than GPT-3.
All of that is true. GPT-3 is limited. But what makes it so important is less its capabilities and more the evidence it offers that just pouring more data and more computing time into the same approach gets you astonishing results. With the GPT architecture, the more you spend, the more you get. If there are eventually to be diminishing returns, that point must be somewhere past the $10 million that went into GPT-3. And we should at least be considering the possibility that spending more money gets you a smarter and smarter system.
Other experts have reassured us that such an outcome is very unlikely. As a famous artificial intelligence researcher said earlier this year, “No matter how good our computers get at winning games like Go or Jeopardy, we don’t live by the rules of those games. Our minds are much, much bigger than that.”
Actually, GPT-3 wrote that.
AIs getting smarter isn’t necessarily good news
Narrow AI has seen extraordinary progress over the past few years. AI systems have improved dramatically at translation, games like chess and Go, important biology research questions like predicting how proteins fold, and generating images. AI systems determine what you’ll see in a Google search or in your Facebook Newsfeed. They compose music and write articles that, at a glance, read as if a human wrote them. They play strategy games. They are being developed to improve drone targeting and detect missiles.
But narrow AI is getting less narrow. Once, we made progress in AI by painstakingly teaching computer systems specific concepts. To do computer vision — allowing a computer to identify things in pictures and video — researchers wrote algorithms for detecting edges. To play chess, they programmed in heuristics about chess. To do natural language processing (speech recognition, transcription, translation, etc.), they drew on the field of linguistics.
But recently, we’ve gotten better at creating computer systems that have generalized learning capabilities. Instead of mathematically describing detailed features of a problem, we let the computer system learn that by itself. While once we treated computer vision as a completely different problem from natural language processing or platform game playing, now we can solve all three problems with the same approaches.
GPT-3 is not the best AI system in the world at question answering, summarizing news articles, or answering science questions. It’s distinctly mediocre at translation and arithmetic. But it is much more general than previous systems — it can do all of these things and more with just a few examples. And AI systems to come will likely be yet more general.
That poses some problems.
Our AI progress so far has enabled enormous advances, but also raised urgent ethical questions. When you train a computer system to predict which convicted felons will reoffend, you’re using inputs from a criminal justice system biased against Black people and low-income people — so its outputs will likely be biased against Black and low-income people, too. Making websites more addictive can be great for your revenue but bad for your users. Releasing a program that writes convincing fake reviews or fake news might make those widespread, making it harder for the truth to get out.
Rosie Campbell at UC Berkeley’s Center for Human-Compatible AI argues that these are examples, writ small, of the big worry experts have about AI in the future. The difficulties we’re wrestling with today with narrow AI don’t come from the systems turning on us or wanting revenge or considering us inferior. Rather, they come from the disconnect between what we tell our systems to do and what we actually want them to do.
For example, we tell an AI system to run up a high score in a video game. We want it to play the game fairly and learn game skills — but if it has the chance to directly hack the scoring system, it will do that to achieve the goal we set for it. It’s doing great by the metric we gave it. But we aren’t actually getting what we wanted.
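The gap between the metric and the intent can be sketched in a few lines of Python. Everything here is hypothetical: the action names and the scores are invented for illustration, and a real game-playing system learns by trial and error rather than looking up a table. The point is only that an optimizer handed the stated objective picks the exploit, while the objective we had in mind picks fair play.

```python
# Hypothetical actions available to a game-playing agent. The names and
# numbers are invented for illustration.
ACTIONS = {
    "play_level_well": {"score": 120, "plays_fairly": True},
    "play_level_badly": {"score": 30, "plays_fairly": True},
    "hack_score_counter": {"score": 999_999, "plays_fairly": False},
}

def stated_objective(outcome):
    """What we told the system to maximize: the score, and nothing else."""
    return outcome["score"]

def intended_objective(outcome):
    """What we actually wanted: a high score earned by playing fairly."""
    return outcome["score"] if outcome["plays_fairly"] else 0

def best_action(objective):
    """Pick the action whose outcome rates highest under the given objective."""
    return max(ACTIONS, key=lambda name: objective(ACTIONS[name]))
```

Here `best_action(stated_objective)` returns "hack_score_counter", while `best_action(intended_objective)` returns "play_level_well". Nothing in the first call is a malfunction; the system is doing exactly what the metric rewards.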
One of the most disconcerting things about GPT-3 is the realization that it’s often giving us what we asked for, not what we wanted.
If you prompt GPT-3 to write you a story with a prompt like “here is a short story,” it will write a distinctly mediocre story. If you instead prompt it with “here is an award-winning short story,” it will write a better one.
Why? Because it trained on the internet, and most stories on the internet are bad, and it predicts text. It isn’t motivated to come up with the best text, or the text we most wanted; just the text that seems most plausible. Telling it the story won an award changes what text seems most plausible.
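A toy version of that effect, as a sketch with invented data: suppose the training text consisted of seven pages, each introducing a story that turned out to be either mediocre or good. The pages, the counts, and the function names below are all assumptions made for illustration, and GPT-3 does not literally store such a table. A predictor that simply returns the most common continuation for a given introduction behaves the way the paragraph above describes.

```python
from collections import Counter

# Invented mini "internet": (how the page introduces the story, story quality).
DOCUMENTS = [
    ("here is a short story", "mediocre"),
    ("here is a short story", "mediocre"),
    ("here is a short story", "mediocre"),
    ("here is a short story", "good"),
    ("here is an award-winning short story", "good"),
    ("here is an award-winning short story", "good"),
    ("here is an award-winning short story", "mediocre"),
]

def continuation_probabilities(prompt):
    """Estimate P(quality | prompt) from the documents that open with this prompt."""
    matches = Counter(quality for intro, quality in DOCUMENTS if intro == prompt)
    total = sum(matches.values())
    return {quality: n / total for quality, n in matches.items()}

def most_plausible(prompt):
    """Return the likeliest continuation, or None if the prompt was never seen."""
    probs = continuation_probabilities(prompt)
    if not probs:
        return None
    return max(probs, key=probs.get)
```

With this data, `most_plausible("here is a short story")` returns "mediocre" and `most_plausible("here is an award-winning short story")` returns "good". The predictor has not become a better writer between the two calls; the prompt has changed which continuation is most likely.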
With GPT-3, this is harmless. And though people have used GPT-3 to write manifestos about GPT-3’s schemes to fool humans, GPT-3 is not anywhere near powerful enough to pose the risks that AI scientists warn of.
But someday we may have computer systems that are capable of humanlike reasoning. If they’re made with deep learning, they will be hard for us to interpret, and their behavior will be confusing and highly variable, sometimes seeming much smarter than humans and sometimes much less so.
And many AI researchers believe that that combination — exceptional capabilities, goals that don’t represent what we “really want” but just what we asked for, and incomprehensible inner workings — will produce AI systems that exercise a lot of power in the world. Not for the good of humanity, not for vengeance against humanity, but toward goals that aren’t what we want.
Handing over our future to them would be a mistake — but one it’d be easy to make step-by-step, with each step half an accident.