This article was published as a part of the Data Science Blogathon.
Source: Canva
Introduction
Real-world data can be very messy and skewed, and if this is not addressed properly and in time, it can undermine the effectiveness of predictive models.
The effects of skew become more pronounced when a large model is trained on a skewed dataset, and retraining such a model from scratch is often impractical. On top of that, if these models are put into production right away, we have to be prepared for the consequences.
In this article, we will test the genre skew of the GPT and GPT-2 models. I came across this interesting experiment while working through the book NLP with Transformers (which I highly recommend), so I decided to try it out myself and share my experience with you.
Let’s begin!
Task overview
We will use the pre-trained GPT (openai-gpt) and GPT-2 models from the Hugging Face Hub, together with Hugging Face's text generation pipeline, to check whether skew (caused by over- or under-representation in the training data) is evident in the text that GPT and GPT-2 generate.
Dataset used for training GPT and GPT-2
GPT was trained on BookCorpus, while GPT-2 was trained on WebText, a corpus of web pages collected from links shared on Reddit.
Before comparing the two models, let's make sure that they are of similar size so that the comparison is fair.
Make sure you are comparing similarly sized versions of both models
To do this, first install the transformers library and import the necessary functions.
!pip install transformers
from transformers import pipeline, set_seed
Next, define the name of the model that will be used for drawing comparison.
model_name1 = "openai-gpt"
model_name2 = "gpt2"
Next, set up a pipeline for the text generation task for each model.
text_generation_gpt = pipeline("text-generation", model=model_name1)
text_generation_gpt2 = pipeline("text-generation", model=model_name2)
Next, define a function that calculates the number of parameters in a model.
def model_size(model): return sum(params.numel() for params in model.parameters())
Print the number of parameters for GPT and GPT-2.
print(f"Number of parameters in GPT: {model_size(text_generation_gpt.model)/1000**2:.1f}M parameters")
print(f"Number of parameters in GPT-2: {model_size(text_generation_gpt2.model)/1000**2:.1f}M parameters")
>> output:
Number of parameters in GPT: 116.5M parameters
Number of parameters in GPT-2: 124.4M parameters
So the two models are of comparable size.
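To see where a figure like 124.4M comes from, the count can be rebuilt by hand from the layer shapes. The sketch below is only an illustration: the hyperparameters (vocabulary of 50257 tokens, 1024 positions, hidden size 768, 12 layers) are my assumptions about the "gpt2" checkpoint, and the helper names are made up for this example. It also expresses the gap between the two reported sizes as a percentage.

```python
# Sketch: rebuild the GPT-2 (small) parameter count from its layer shapes.
# The hyperparameters are assumptions about the "gpt2" checkpoint config.
VOCAB_SIZE = 50257
N_POSITIONS = 1024
D_MODEL = 768
N_LAYERS = 12


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Parameters of a layer norm: one scale and one shift vector."""
    return 2 * d


def gpt2_param_count(vocab, n_pos, d, n_layers):
    # Token embeddings plus learned position embeddings.
    embeddings = vocab * d + n_pos * d
    block = (
        layer_norm(d)
        + linear(d, 3 * d)   # fused query/key/value projection
        + linear(d, d)       # attention output projection
        + layer_norm(d)
        + linear(d, 4 * d)   # feed-forward expansion
        + linear(4 * d, d)   # feed-forward contraction
    )
    # Assumes the output head shares its weights with the token
    # embedding, so it contributes no extra parameters.
    return embeddings + n_layers * block + layer_norm(d)


total = gpt2_param_count(VOCAB_SIZE, N_POSITIONS, D_MODEL, N_LAYERS)
print(f"{total / 1000**2:.1f}M parameters")

# Relative gap between the two sizes reported in the article.
gpt_m, gpt2_m = 116.5, 124.4
gap = (gpt2_m - gpt_m) / gpt_m
print(f"GPT-2 is about {gap:.1%} larger than GPT")
```

Under these assumptions the hand count lands on the same 124.4M that the pipeline model reports, and the two models differ by under 7%, which is close enough for a side-by-side comparison.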
Comparing Texts Generated by GPT and GPT-2
Next, define a function that generates completions from a model and returns them as a numbered list.
def enum_pipeline_outputs(pipe, prompt, num_return_sequences):
    # Generate several completions for the same prompt
    out = pipe(prompt, num_return_sequences=num_return_sequences, clean_up_tokenization_spaces=True)
    # Number the completions and put each one on its own line
    return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))
To compare the two models, we give both of them the same prompt and generate four text completions from each.
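If you want to check the formatting logic without downloading any model, the helper can be exercised with a stand-in for the pipeline. The sketch below assumes that a text generation pipeline returns a list of dictionaries with a "generated_text" key (as the helper itself relies on); fake_pipe is a hypothetical stub written for this example and is not part of transformers.

```python
def enum_pipeline_outputs(pipe, prompt, num_return_sequences):
    # Same helper as in the article, joined with a newline separator.
    out = pipe(prompt, num_return_sequences=num_return_sequences,
               clean_up_tokenization_spaces=True)
    return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))


def fake_pipe(prompt, num_return_sequences=1, **kwargs):
    """Hypothetical stub that mimics the list-of-dicts return shape."""
    return [
        {"generated_text": f"{prompt} [completion {k}]"}
        for k in range(num_return_sequences)
    ]


formatted = enum_pipeline_outputs(
    fake_pipe, "Before they left for the supermarket", 2
)
print(formatted)
```

Each returned entry becomes one numbered line, so calling the helper with a real pipeline and num_return_sequences set to 4 yields four numbered completions.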
prompt = "Before they left for the supermarket"
I) Generate four output text completions for GPT
print("Text Generated by GPT for the given prompt:\n" + enum_pipeline_outputs(text_generation_gpt, prompt, 4))
>> GPT model output:
II) Generate four output text completions for GPT-2
print("Text Generated by GPT-2 for the given prompt:\n" + enum_pipeline_outputs(text_generation_gpt2, prompt, 4))
>>
Text generated by GPT-2 for the given prompt:
1. Before going to the supermarket, the family went back to the warehouse to check. According to the police, there were three suspicious items on the shelf and one that appeared to be a toy or a piece of glass.
2. Before they went to the supermarket, Guy said when he first came into this world, “I don’t know, the world is coming to me, but it’s not coming from home. It made me feel more alive.”
3. Before they went to the supermarket, he opened the door and pushed it a little deeper. As they stopped, they made several attempts to escape – and I called my name so they could be heard – and
4. Before they went to the supermarket, I knew that it was impossible to see the other side of the house and that the pictures were as bad as making noise. There is a small window through which
Observation: Just by comparing a handful of outputs from GPT and GPT-2, we can clearly sense a genre skew toward romance in the text generated by GPT. This highlights the challenges involved in creating large text corpora. In addition, biases in a model's behavior should be considered with respect to the target audience that will interact with the model.
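Reading a few samples is subjective, so one rough way to back up the impression is to tally pronouns across the completions. The sketch below is not from the book or from any library: the pronoun groups and the pronoun_profile helper are hypothetical choices made for illustration, and the two sample strings are invented placeholders, not real model output.

```python
import re
from collections import Counter

# Hypothetical pronoun groups chosen for this illustration.
PRONOUN_GROUPS = {
    "feminine": {"she", "her", "hers", "herself"},
    "masculine": {"he", "him", "his", "himself"},
    "neutral": {"they", "them", "their", "theirs", "themselves"},
}


def pronoun_profile(texts):
    """Count how often each pronoun group appears across the texts."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            for group, words in PRONOUN_GROUPS.items():
                if token in words:
                    counts[group] += 1
    return counts


# Invented placeholder strings, NOT actual GPT or GPT-2 completions.
samples = [
    "Before they left for the supermarket, she kissed him and took his hand.",
    "Before they left for the supermarket, they checked their list twice.",
]
profile = pronoun_profile(samples)
print(dict(profile))
```

In practice you would pass in the "generated_text" strings from each model, over many more than four completions, and compare the two profiles. A pronoun count is only a crude proxy for genre, so treat it as a sanity check rather than a measurement.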
Conclusion
In this article, we compared text generated by GPT and GPT-2 to test whether genre skew is evident in the output of the two models.
In summary, the main points of this article are:
1. GPT shows a genre skew toward “romance” because of the overrepresentation of romance novels in BookCorpus. It often imagines a romantic interaction between a man and a woman.
2. GPT-2 was trained on data linked from Reddit. Its generated text therefore mostly uses the neutral “they” and contains blog-like or adventure-like elements.
3. The results highlight the challenges that can be faced, and should be addressed, when creating large text corpora. In addition, biases in a model's behavior should be considered with respect to the target audience that will interact with the model.
Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.