Way back in 1968, a program named SHRDLU was designed to stack blocks in a virtual world, and it interacted with humans in the following way:
Person: Will you please stack up both of the red blocks and either a green cube or a pyramid?
Computer: OK.
Person: Which cube is sitting on the table?
Computer: THE LARGE GREEN ONE WHICH SUPPORTS THE RED PYRAMID.
Person: Is there a large block behind a pyramid?
Computer: YES, THREE OF THEM: A LARGE RED ONE, A LARGE GREEN CUBE, AND THE BLUE ONE.
The initial amazement people felt when they witnessed the natural language processing (NLP) capabilities and apparent “intelligence” of early interactive programs like SHRDLU and ELIZA is very similar to the excitement generated by OpenAI’s Generative Pre-trained Transformer 3 (GPT-3).
How GPT-3 captured people’s imagination
- Summarizing data on a page to answer questions and simplifying legal documents.
- Detecting pattern sequences to auto-fill data in Excel and generating color scales.
- Generating passages or poetry when prompted with a sample sentence.
- Generating website front-end layouts and SQL queries from plain English.
- Acting as a search engine that can also be trained to say “I don’t know” to bizarre questions.
- Translating languages, doing basic arithmetic, unscrambling words and learning to use new words.
While creative demonstrations of GPT-3 are indeed impressive and have garnered compliments such as “mind-blowingly good”, the CEO of OpenAI, Sam Altman, advises that it be taken with a pinch of salt:
“The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.”
– Sam Altman, CEO of OpenAI
So what is the “secret sauce” of GPT?
The crucial breakthrough is the transformer model, which uses layers of specialized neural networks to capture the patterns in word sequences as matrices of numbers.
These numbers are probability values that become part of a language model (LM), which represents how strongly words are associated with each other. A rough illustration is shown in the image below, where darker colors represent larger numbers in the matrix, indicating stronger associations.
These numbers (called “attention scores”) capture the context and style of sentences. This attention mechanism evaluates an entire sentence at once, enabling faster calculations and parallel computing. GPT stores the learned context as “neural weights” (numbers representing the strength of association between nodes of its neural networks). Given words as input, these generalized weights let GPT produce a corresponding text output.
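To make the idea of “attention scores” concrete, here is a minimal sketch of scaled dot-product attention, the core formula inside transformers. The matrices and values below are random toy data for illustration, not anything from GPT itself:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every word compares itself against
    every other word in the sentence at once, which is what makes the
    computation parallelizable."""
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # the "attention score" matrix
    return scores, scores @ V                 # scores weight the values

# Toy example: 3 words, each represented by a 4-dimensional vector
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
scores, output = attention(Q, K, V)
# Each row of `scores` sums to 1: a probability distribution over which
# other words this word "attends" to -- stronger association, larger value.
```

The row-wise probability distributions are the “darker colors mean stronger association” picture from the illustration above, expressed as numbers.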
Can matrices of numbers understand language?
Traditionally, NLP required a lot of manual annotation, rule creation and fine-tuning, since the context of a word in a sentence was often located in prior sentences or was simply unclear. For example, in the sentence “The robot nudged the apple since it was small”, the word “it” could refer to either the apple or the robot. Just like SHRDLU, even GPT-3 understands nothing about language; it merely captures patterns in sentences. Here’s how GPT evolved:
GPT-1
The first version of GPT used a slightly modified transformer model. The various tasks it performed (shown in the image below) each needed a specialized architecture and human-supervised fine-tuning.
GPT-2
GPT-2 did away with the need for task-specific fine-tuning architectures. To understand how, consider that English speakers know “We all enjoys comedy movies” is incorrect grammar, even if they cannot explain why it is incorrect (“we” is a plural pronoun).
Encountering correct grammar repeatedly is what creates such memory. Researchers realized that since sentences contain an inherent structure, the attention mechanism could similarly “memorize” and understand tasks without human supervision, simply through exposure to many good-quality sentences. For example, the sentence “Translate ‘hello’ from German to Latin” specifies the input language (German), the output language (Latin) and the task (translation). With 1.5 billion neural weights and a massive 40 gigabyte (GB) dataset of high-quality text, GPT-2 produced impressive results.
It generalized language well enough to generate news articles, perform language translation and answer questions like “Who wrote the book ‘The Origin of Species’?”
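The idea that repeated exposure builds a statistical “memory” of grammar can be illustrated with a toy bigram counter, a drastic simplification of GPT-2’s transformer but the same principle: word pairs seen often in training feel “right”, unseen ones feel wrong. The tiny corpus below is invented for illustration:

```python
from collections import Counter

# Toy corpus of grammatically correct sentences (illustrative only)
corpus = [
    "we all enjoy comedy movies",
    "we all enjoy good food",
    "they all enjoy long walks",
    "she enjoys comedy movies",
]

# Count adjacent word pairs (bigrams) -- a crude statistical "memory"
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))

def score(sentence):
    """Score a sentence by how often its word pairs appeared in training."""
    words = sentence.split()
    return sum(bigrams[pair] for pair in zip(words, words[1:]))

print(score("we all enjoy comedy movies"))   # 8: all pairs are familiar
print(score("we all enjoys comedy movies"))  # 5: "all enjoys" was never seen
```

No grammar rule was ever written down, yet the model “knows” the incorrect sentence is less likely, purely from counting patterns in good sentences.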
GPT-3
Encouraged by the fact that larger datasets improved the model’s learning, researchers tweaked GPT-2 with modified attention layers to create GPT-3, and trained it on 570 GB of text sourced from books and the internet. Capturing this data took 175 billion neural weights (parameters).
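A back-of-the-envelope calculation shows what 175 billion parameters means in practice. Assuming 2 bytes per parameter (half-precision floats, a common storage format; actual deployments vary):

```python
# Rough size of GPT-3's weights, assuming fp16 storage (an assumption,
# not a figure from OpenAI)
params = 175e9          # 175 billion neural weights (parameters)
bytes_per_param = 2     # half-precision float
size_gb = params * bytes_per_param / 1e9
print(f"~{size_gb:.0f} GB just to store the weights")  # ~350 GB
```

A model that large cannot fit on a single consumer GPU, which is part of why its size makes it inconvenient as an AI system.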
GPT-3 performed relatively well on specific tasks like language modeling, translation, commonsense reasoning and question answering. However, natural language inference (understanding the relationship between sentences) benchmarks exposed its weaknesses: word repetition, loss of continuity and contradictions when generating sentences.
More importantly, it has no comprehension of how things work in the real world (it cannot answer questions like “Will cheese melt in the fridge?”). Besides, the sheer size of the model makes it expensive to train (consuming thousands of petaflop/s-days of compute), prone to bias and, in general, inconvenient as an AI system.
To conclude
Although the research paper on GPT-3 indicates that it could be tailored to specific use cases, OpenAI (and AI research in general) has a very long way to go in curating relevant training content and testing it on domains like speech recognition or transcription.
In these domains, Rev’s machine learning algorithms are time-tested and customer-validated, and we still stand strong as the world’s best speech recognition engine.