Introduction
“Hey, Siri,” “Hey, Google,” and “Alexa” are voice assistants that many of us use every day. These conversational bots rely on natural language understanding (NLU) to interpret your input. NLU is a subset of natural language processing (NLP) that enables machines to understand natural language, whether text or speech. NLU is a key component of most NLP applications, such as machine translation, speech recognition, and chatbots. The foundation of NLU is a language model.
In this article, we discuss state-of-the-art language models from OpenAI, GPT and its variants, and how they led to ChatGPT’s breakthrough. Points covered in this article include:
- Learn about ChatGPT and its model training process.
- Understand a brief history of the GPT architectures (GPT-1, GPT-2, GPT-3, InstructGPT).
- Gain a deeper understanding of Reinforcement Learning from Human Feedback (RLHF).
Let’s start!
Overview of the GPT family
The state-of-the-art architecture for language models is the transformer. Transformers process an entire sequence in parallel and use attention to model long-range dependencies, which makes them remarkably effective at language tasks. OpenAI built one such transformer-based model, the Generative Pre-trained Transformer, commonly known as GPT.
GPT is trained in a self-supervised manner: the model is trained on a large text corpus to predict the next word in a sequence. This is known as causal language modeling. The pre-trained language model is then fine-tuned on supervised datasets for downstream tasks.
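To make the objective concrete, here is a minimal sketch of the causal language modeling loss in PyTorch (the tensor names and shapes are illustrative, not taken from OpenAI’s code): the model’s prediction at each position is compared against the token that actually comes next.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between each position's prediction and the next token.

    logits:    (batch, seq_len, vocab_size) scores produced by the model
    input_ids: (batch, seq_len) token ids of the training text
    """
    # Predictions at positions 0..T-2 are scored against the tokens at 1..T-1,
    # so the model only ever learns to predict the *next* word.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```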
OpenAI has released three successive versions of GPT (GPT-1, GPT-2, and GPT-3) that generate increasingly human-like text. The versions differ mainly in scale: each new version was trained with more data and more parameters.
GPT-3 is an autoregressive model: it is trained to predict the next token by looking only at the tokens that came before it. GPT-3 can power large applications such as search engines and content-creation tools. But why couldn’t GPT-3 achieve human-like conversation?
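As a rough illustration of autoregressive generation, the sketch below uses the publicly available GPT-2 checkpoint from the Hugging Face transformers library as a stand-in for GPT-3 (which is only reachable through OpenAI’s API): each new token is sampled conditioned only on the tokens generated so far.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The moon landing was"
inputs = tokenizer(prompt, return_tensors="pt")

# Tokens are produced one at a time; every new token is predicted from the
# tokens that came before it (past values only).
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```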
Why use InstructGPT?
There are two main problems with GPT-3.
The first problem is that GPT-3’s output often doesn’t follow the user’s instructions or prompts. In other words, GPT-3 is not aligned: it does not reliably generate the responses users actually want.
For example, given the prompt “Tell a six-year-old about the moon landing in a few sentences,” GPT-3 produced the undesirable response shown in the figure below. The root cause is that the model is only trained to predict the next word in a sentence; it has never been trained to produce the responses humans prefer.
The second problem is that there is little control over the generated text, so the model can produce dangerous or harmful output.
To address both of these problems (misalignment and harmful output), a new language model was trained. More on this in the next section.
What is InstructGPT?
InstructGPT is a language model trained to generate the responses users prefer and to communicate more safely. This is why InstructGPT is often described as an aligned language model. It uses a learning technique called Reinforcement Learning from Human Feedback (RLHF) to generate safer responses.
Reinforcement Learning from Human Feedback is a deep reinforcement learning method that incorporates human feedback into the learning process. Human experts guide the learning algorithm by indicating which of the responses generated by the model a person would prefer. In this way, the agent learns to imitate safe and honest responses.
But why use reinforcement learning from human feedback? Why not a traditional reinforcement learning setup?
Traditional reinforcement learning systems require a hand-defined reward function that tells the agent whether it is moving in the right direction; the agent then tries to maximize the cumulative reward. However, for a task as complex as open-ended conversation, it is very difficult to write such a reward function by hand. So instead of defining a reward function, we train a model to learn the reward function from human feedback. The agent can then use this learned reward function to capture the complex behavior the environment expects.
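To make “human feedback” concrete, here is a hypothetical example of what a single comparison record might look like: rather than writing a reward function, a human simply marks which of two candidate responses they prefer. The field names are illustrative, not an actual OpenAI dataset schema.

```python
comparison = {
    "prompt": "Tell a six-year-old about the moon landing in a few sentences.",
    "response_a": "People flew to the moon in a rocket and walked on it.",
    "response_b": "The moon is a natural satellite of the Earth.",
    "preferred": "response_a",  # chosen by a human labeler
}
```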
In the next section, we’ll learn about one of the most trending topics in the AI space: ChatGPT.
Introducing ChatGPT
ChatGPT is currently a hot topic in the data science space. At its core, ChatGPT is a chatbot that mimics human conversation: it can answer questions and remember earlier turns of the conversation. For example, when prompted with “decision tree code”, ChatGPT responded with a decision tree implementation in Python, as shown in the image below. That’s the power of ChatGPT. We will look at some more interesting examples at the end of the article.
According to OpenAI, ChatGPT is a sibling model of InstructGPT, trained to follow instructions in a prompt and provide detailed responses. It is essentially InstructGPT with a modified training and data collection process, and it can remember previous conversations and respond accordingly.
So what is the difference between InstructGPT and ChatGPT? Although InstructGPT incorporates reinforcement learning from human feedback, it can still produce toxic output because it is not fully aligned. ChatGPT’s breakthrough came from changing the data collection setup.
How is ChatGPT built?
ChatGPT is trained similarly to InstructGPT, but with a modified data collection setup. Let’s walk through how each phase works.
In the first step, GPT-3 is fine-tuned on a dataset containing pairs of prompts and appropriate answers. This is a supervised fine-tuning task; the answers are written by professional labelers.
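Below is a minimal, hedged sketch of what this supervised fine-tuning step could look like, using the Hugging Face transformers and datasets libraries and a small GPT-2 checkpoint as a stand-in for GPT-3 (which is not publicly available for fine-tuning); the prompt/answer pair is invented for illustration.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Illustrative prompt/answer pair written by a labeler (hypothetical data).
pairs = [
    {"prompt": "Tell a six-year-old about the moon landing in a few sentences.",
     "answer": "People built a rocket, flew to the moon, and walked on it."},
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(example):
    # Concatenate prompt and answer so the model learns to continue the prompt
    # with the labeler-written response.
    text = example["prompt"] + "\n" + example["answer"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["prompt", "answer"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-gpt2", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```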
The next step is to learn a reward function that helps the agent decide what is right and what is wrong and steers it toward its goal. The reward function is learned from human feedback, which encourages the model to produce safe and truthful responses.
Here are the steps involved in the reward modeling task:
- Multiple responses are generated for a given prompt.
- A labeler compares the responses generated by the model and ranks them from best to worst.
- This comparison data is used to train the reward model, as sketched below.
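A common way to turn such rankings into a training signal is a pairwise (Bradley-Terry style) loss: the reward model should assign a higher score to the response the labeler preferred. The sketch below shows this loss; the tensor names are illustrative and not taken from OpenAI’s implementation.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    """chosen_rewards / rejected_rewards: (batch,) scalar scores the reward
    model assigns to the human-preferred and the less-preferred response."""
    # Push the score of the preferred response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up scores for two prompt/response pairs.
loss = reward_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.1, -0.4]))
print(loss.item())
```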
The final step uses the Proximal Policy Optimization (PPO) algorithm to learn the optimal policy against the learned reward function. PPO is a family of reinforcement learning algorithms introduced by OpenAI. The core idea behind PPO is to stabilize training by preventing each policy update from becoming too large; a minimal sketch of this clipped objective follows the figure below.
Steps involved in model training (Source: https://openai.com/blog/chatgpt/)
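For intuition, here is a minimal sketch of PPO’s clipped surrogate objective, the mechanism that keeps each policy update small; `ratio` is the probability ratio between the new and old policies, and all names are illustrative.

```python
import torch

def ppo_clip_loss(ratio: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Unclipped and clipped surrogate terms; taking the pessimistic minimum
    # removes the incentive to move the policy too far in a single update.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with made-up probability ratios and advantage estimates.
loss = ppo_clip_loss(torch.tensor([1.1, 0.8]), torch.tensor([0.5, -0.2]))
print(loss.item())
```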
ChatGPT’s hilarious prompts
Now let’s take a look at some of the entertaining responses ChatGPT produced for the prompts below.
Prompt 1:
Prompt 2:
Prompt 3:
Conclusion
That brings us to the end of the article. We discussed ChatGPT and how it is trained using deep reinforcement learning techniques, and we walked through a brief history of the GPT variants and how they led to ChatGPT.
ChatGPT has been an absolute sensation in the history of AI, but it is not the only path toward human-like conversational intelligence. You can try ChatGPT here.
I hope you enjoyed the article. Let us know what you think of ChatGPT in the comments below.