Introduction
Natural language processing (NLP) is one of the most competitive and fastest-moving fields in today's global tech sector. NLP interview questions are an important way to gauge potential employees, and they also help professionals stay in touch with technological advances in the field.
ChatGPT and many other AI applications today are driven by NLP-based technologies. The area is growing rapidly, and so is the demand for jobs in it.
As a data scientist working in the field, I have had the opportunity to work on language models and to build a document classifier that became the innovative project of the year. In this article I will share some of the interesting questions that I came across and how they helped me improve my knowledge.
Natural language processing is the area of linguistics concerned with understanding and processing natural (human) language with computing systems, and it is also a sub-branch of computer science.
Natural language processing is broadly classified into two categories: Natural Language Understanding (NLU) and Natural Language Generation (NLG). Natural language understanding is the area concerned with processing and understanding natural language and extracting its contextual meaning.
Natural language generation is the domain of generating new language (from characters up to passages) by the computing systems themselves.
If we take a broad view of the natural language processing domain, we realise that many other topics and concepts are involved in this field.
Nowadays, the majority of interviewers prefer candidates with a solid command of the fundamentals of NLP over those with only a cursory familiarity with the subject.
In the past I wrote an article covering 8 interview questions that every data scientist should know, and I highly recommend reading it in conjunction with this article. Along with that, I also recommend going over the article on the best NLP libraries in 2023.
I have created this article to help candidates ace their NLP-related interviews and to help data scientists improve their knowledge of NLP.
So, let us begin with the interview questions and their answers.
Interview Questions on NLP Fundamentals
What are tokenization, lemmatization and stemming, and how would you implement them in Python?
Tokenization: It is the process of splitting a sentence or a document into smaller units, such as words or phrases. This is an important step for NLP tasks such as text classification and sentiment analysis.
Lemmatization: It is the process of converting a word to its base form or root form, known as the lemma. This is done to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Example: Running, ran and run are the different forms of the same word, ‘run’. The lemma of this word would be ‘run’.
Stemming: The process of reducing related word forms to a single root by chopping off inflections, without guaranteeing that the result is a valid dictionary word. For example, connections, connectivity and connected will all be stemmed to the single root ‘connect’.
Example: The runners ran through the streets and the children played games
Stemmed words: [‘the’, ‘runner’, ‘ran’, ‘through’, ‘the’, ‘street’, ‘and’, ‘the’, ‘children’, ‘play’, ‘game’]
Lemmatized words: [‘The’, ‘runner’, ‘ran’, ‘through’, ‘the’, ‘street’, ‘and’, ‘the’, ‘child’, ‘played’, ‘game’]
Tokenization, lemmatization and stemming are usually implemented with the NLTK package in Python; sample code is shown below:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download the tokenizer and WordNet data if not already available
nltk.download('punkt')
nltk.download('wordnet')
#Tokenization
text = "This is an example of tokenization"
tokens = word_tokenize(text)
print("Tokenized words: ", tokens)
#Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatized words: ", lemmatized_words)
#Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]
print("Stemmed words: ", stemmed_words)
What are the steps that you will take in text cleansing and how would you go about it?
Text cleansing, also known as text pre-processing, is a crucial stage in natural language processing and text mining. Text cleansing consists of the following steps:
- Remove special characters, punctuation marks, or symbols: strip any special characters, punctuation marks, or symbols from the text; this is usually done with regex.
- Convert to lowercase: to avoid case-sensitivity issues, convert all text to lowercase, usually with a simple string function.
- Remove stop words: remove high-frequency words like “is,” “the,” and “and” that carry little semantic value. This is usually done with NLTK, which can also take a custom list of words to use as stop words.
- Remove numbers: strip any digits from the text, which can also be done with regex.
- Remove extra white space: remove any unnecessary white space from the text.
- Tokenization: divide the text into smaller parts, such as words or phrases, as described in answer 1.
- Stemming/lemmatization: reduce words to their root or base form; the differences between stemming and lemmatization are described above.
To perform text cleansing in Python, you can use libraries like NLTK and spaCy. Here is an example using the NLTK library:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data if not already available
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# load the stop words
stop_words = set(stopwords.words("english"))

# text cleansing function
def text_cleansing(text):
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_words

text = "This is an example of text cleansing. Removing punctuation, converting to lowercase, removing stop words and lemmatizing the text."
cleansed_text = text_cleansing(text)
print("Cleansed text: ", cleansed_text)
What is sentiment analysis, and can you explain how you would do sentiment analysis?
Sentiment analysis is the process of determining the emotional tone behind a piece of text, for example to see what people are saying about your product or service offerings and what could be improved. It is used in the stock market to understand market sentiment, and on customer reviews to understand customer sentiment about a product and/or service.
There are two main approaches to perform sentiment analysis:
- Rule-based approach: This approach uses a set of predefined rules and lexicons to classify the sentiment of the text. For example, a lexicon could include a list of positive and negative words, and the rule-based approach would count the number of positive and negative words in the text and make a classification based on which category has more words.
- Machine learning approach: This approach trains a machine learning model on a labeled dataset of text and sentiment, and then uses the trained model to classify the sentiment of new, unseen text. The machine learning approach can be further divided into two categories: supervised and unsupervised learning. A minimal sketch of the rule-based approach is shown below.
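As a hedged illustration of the rule-based approach, NLTK ships with a lexicon-based analyzer (VADER) that scores text without any training data; the example sentence here is made up:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon if not already available
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()
score = analyzer.polarity_scores("The product is great but delivery was slow")
# A 'compound' value above 0 suggests positive sentiment, below 0 negative
print(score)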
How would you create a text classifier that reads text and identifies themes and commonalities between documents?
- The first step is the cleansing activity described in question 2, which is about cleaning the data.
- The second step is to convert the text into numerical representations such as bag of words or TF-IDF.
- The third step is feature extraction, which involves extracting the most informative features. These could be the top 100 or top 300 most frequent words, bigrams or trigrams, or features created using word embeddings such as word2vec or BERT.
- The fourth step is model selection and training: selecting an appropriate machine learning model, such as a Naive Bayes classifier, SVM, or neural network, and training it on the preprocessed, feature-extracted text data.
- The fifth step is evaluation: assessing the performance of the model on a held-out validation set to determine its accuracy, precision, recall, and F1 score.
- Finally, prediction: once the model has been trained and evaluated, it can be used to classify new, unseen text and identify themes and commonalities in it.
Here is a simple example using scikit-learn and a Naive Bayes classifier:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Load the movie review data
data = pd.read_csv("movie_reviews.csv")
# Preprocessing
vectorizer = CountVectorizer(stop_words='english')
text_data = vectorizer.fit_transform(data['review'])
# Feature extraction
X = text_data
y = data['sentiment']
# Model selection and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = MultinomialNB()
model.fit(X_train, y_train)
# Evaluation
accuracy = model.score(X_test, y_test)
print("Accuracy: ", accuracy)
What are the best chatbots out there and what do they use in the background?
The optimal chatbot for a specific use case is determined by a number of criteria, including the chatbot’s intended function, target audience, and development and deployment resources. However, some prominent chatbots and the technology that power them include:
- OpenAI’s GPT-3: One of the largest and most powerful language models, OpenAI’s GPT-3 is capable of handling a wide variety of natural language processing tasks such as text generation, dialogue, and translation. GPT-3 is built on a transformer architecture and trained on a very large text dataset.
- Amazon’s Alexa: Amazon’s Alexa is a prominent virtual assistant that provides a conversational interface for consumers by utilising deep learning algorithms and natural language processing. The Alexa Voice Service, which employs machine learning algorithms, powers Alexa.
- Google Assistant: Google Assistant is a virtual assistant developed by Google that provides a conversational interface for users by utilising natural language processing and machine learning. Google Assistant is built into a variety of devices, such as smartphones, smart speakers, and smart displays.
- Apple’s Siri: Siri is a virtual assistant for Apple devices that responds to user queries using natural language processing and machine learning. Siri is built into a variety of Apple products, including iPhones, iPads, and Macs.
- Facebook’s M: Facebook’s M was a virtual assistant built into the Messenger app. To reply to user queries, M employed a combination of rule-based systems and artificial intelligence, including machine learning. Although M was discontinued in 2018, it worked well while it was available.
Deep learning techniques, natural language processing, and machine learning are among the technologies used to create these chatbots. The precise technology chosen will be determined by the intended purpose and target audience of the chatbot, as well as the resources available for its development and implementation.
Explain TF-IDF in simple words?
The TF-IDF (Term Frequency – Inverse Document Frequency) statistic measures the relevance of a term in a document within a collection of documents. It takes two elements into account:
Term Frequency (TF): The frequency with which a term appears in a document. The more often a term appears in a document, the higher its term frequency.
Inverse Document Frequency (IDF): This metric determines how uncommon a word is across all documents in a collection. The higher the inverse document frequency of a term, the rarer it is.
A word’s TF-IDF score in a document is calculated by multiplying its term frequency by its inverse document frequency. The resulting score represents the relevance of the term in the document within the context of the collection.
TF-IDF is extensively used in natural language processing and information retrieval tasks such as text classification and document retrieval to extract and rank the most relevant words from a collection of documents.
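As a brief, hedged sketch, scikit-learn's TfidfVectorizer (recent versions) computes these scores directly; the three toy documents below are made up:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

# Each row is a document, each column a term; terms frequent in one document
# but rare across the collection receive the highest scores
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))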
What are best practices in chatbot building?
I have explained in more detail in a separate article how to make a chatbot DevOps ready, and it contains some very important information on the topic. Shown below are the steps you would take to create a chatbot:
- Define the aims and target audience for the chatbot: Before beginning the development process, it is critical to establish what the chatbot is supposed to perform and who it is intended to serve.
- This will influence decisions on the features and functionality of the chatbot, as well as its overall design and user experience.
- Select a development platform: There are several platforms and tools for building chatbots, including cloud-based services and open-source frameworks.
- Dialogflow, Botpress, and Microsoft Bot Framework are popular platforms. I have personally worked with Dialogflow and Amazon Lex and find Dialogflow the best overall, though Amazon Lex has its strengths in some areas.
- Create the conversational flow: Next, design the conversational flow for the chatbot, which governs how the chatbot will interact with people. This entails writing a script or a sequence of actions that the chatbot will use to converse with users.
- Deploy the chatbot to your webpage, or integrate it wherever users can access it.
- Validate the conversational flow and the information returned, checking that conversations flow correctly and that the information is accurate. As mentioned above, the DevOps-ready chatbot article covers this in more detail.
What is word-to-vector and how would you use that?
Word-to-vector is a natural language processing (NLP) approach that converts words into numerical vectors. This format is intended to capture the meaning and context of words in a text as well as to allow computational analysis and manipulation of text data.
Each word is represented as a fixed-length vector of real numbers in a word-to-vector model, generally with hundreds of dimensions. The values in these vectors are learned from the relationships between words in the text, using techniques such as co-occurrence analysis or neural networks.
The generated vectors capture the meaning of words as well as their connections, allowing operations such as similarity comparison and grouping to be performed.
Word-to-vector models are frequently employed in natural language processing applications such as text categorization, information retrieval, and machine translation.
They can, for example, be used to compare the similarity of two text documents, to cluster similar words, or to encode text data for use in machine learning models.
To employ a word-to-vector model, you first pre-train it on a large corpus of text data using an approach such as word2vec or GloVe. The resulting model is then used to encode your target text data as numerical vectors, which can subsequently be used as input to various NLP tasks and models.
Pre-trained models from libraries such as spaCy, gensim, and the Hugging Face Transformers library are among the most popular word-to-vector models.
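As a minimal sketch using gensim (version 4.x, where the dimensionality parameter is called vector_size), a word2vec model can be trained on a toy corpus; a real model needs far more data than the made-up sentences shown here:
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (purely illustrative)
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["the", "dog", "chases", "the", "cat"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["king"][:5])           # first 5 dimensions of the vector for "king"
print(model.wv.most_similar("king"))  # words with the most similar vectors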
How would you go about creating a word-cloud in Tableau or another similar BI tool?
- Prepare the text data: run through the cleansing steps described in question 2 (removing special characters and stop words, converting the words to lower or upper case) so the text is ready to be consumed.
- Connect the text data to Tableau: in this step you connect Tableau to your text data source, either by importing the data into Tableau or by connecting to a database that contains the data.
- Create a calculated field in which you split the text using the SPLIT function in Tableau, which allows you to split text into separate words.
- Create a frequency table that contains the number of times each word occurs in the text.
- Finally, use visualizations: first create a bar chart to get the word frequencies, then switch to a text (word) visualization and drag the number of occurrences onto the Size shelf, which scales the size of each word up or down.
What are the steps you would take to improve your text-based ML model and how would you put the model into production?
Machine learning is an iterative process, so you would continually follow the steps shown above for creating a text-based ML model, iterating through the steps described in question 4. For production, you would build a pipeline that performs the pre-processing (including text cleansing), applies your existing model, and continually improves the model using the new data it receives; a sketch of such a pipeline is shown below.
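A minimal sketch of such a pipeline with scikit-learn follows; the training texts and labels are made-up placeholders for your own labeled data:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# One pipeline object handles preprocessing and classification together,
# which makes retraining on new data and deploying to production much simpler
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("clf", MultinomialNB()),
])

# Placeholder labeled data; in practice this comes from your own corpus
train_texts = ["great movie", "terrible plot", "loved the acting", "boring and slow"]
train_labels = ["pos", "neg", "pos", "neg"]
pipeline.fit(train_texts, train_labels)

print(pipeline.predict(["what a wonderful film"]))
When new labeled data arrives, the same pipeline object can simply be refit and redeployed, which keeps the pre-processing and the model in sync.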
What packages would you recommend for NLP projects?
Read the article I wrote that rates all the NLP libraries out there.
What is your favourite library for finding Named Entity Recognition?
The most popular Python library for NER, or Named Entity Recognition, is spaCy. spaCy is an open-source library for advanced natural language processing in Python. It provides a fast and efficient way to perform NER, as well as many other NLP tasks, including tokenization, part-of-speech tagging, and dependency parsing.
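A minimal sketch with spaCy, assuming the small English model (en_core_web_sm) has been downloaded, looks like this:
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY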
What is Regex and give an example of where you would use regex?
Regex is short for Regular Expression. It is a pattern-matching language available in just about every popular programming language and is widely used to search for specific patterns of text within text data. Regex is used in many areas of computer science, including text cleansing, data extraction, and web scraping.
An example of where you might use regex is to extract all the email addresses from a large text document. You could write a regex pattern that matches the structure of an email address, such as: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}.
This pattern matches any string that begins with one or more letters, numbers, or certain symbols, followed by an “@” symbol, followed by further letters, numbers, or allowed symbols, and ending with a dot (.) and two or more letters. Keep in mind that Gmail addresses follow additional rules (for example, dots in the local part are ignored and “+” suffixes are allowed), so this pattern is not tailored to Gmail; validating those is a bit more complicated: https://stackoverflow.com/questions/1874104/best-practices-for-email-address-validation-including-the-in-gmail-addresses
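A small sketch of this in Python (the sample text and addresses are made up):
import re

text = "Contact us at support@example.com or sales@example.co.uk for details."
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# findall returns every non-overlapping match of the email pattern
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'sales@example.co.uk']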
Write a regex to find the words starting with an uppercase “C” in “Would you like to play with Cat or Chess”, put them both into a group, and get “Chess” out of the group.
Here is a regex pattern that you can use to extract words starting with an uppercase “C” and collect the matches into a list:
import re
text = "Would you like to play with Cat or Chess"
regex = r'\bC\w+'
match = re.findall(regex, text)
print(match)     # Output: ['Cat', 'Chess']
print(match[1])  # Output: 'Chess'
What is stemming and what are different stemmers in natural language processing?
Stemming is a natural language processing method that removes inflections from a word and reduces it to its base form. The two major problems associated with the stemming process are “over-stemming” and “under-stemming”.
Over-stemming occurs when two different words with different meanings are stemmed to the same root – also referred to as a false positive.
Under-stemming occurs when two words that should be stemmed to the same root are not – also referred to as a false negative.
Example: The stemmed form of the words “connecting”, “connected” and “connections” is “connect”.
The most popular stemmers used in NLP are:
1. Porter Stemmer – PorterStemmer()
Martin Porter created the Porter Stemmer or Porter algorithm in 1980. The technique uses five steps of word reduction, each with a unique set of mapping criteria. The original stemmer, the Porter Stemmer, is well known for its simplicity and speed. Often, the resultant stem is the shorter term with the same root meaning
2. Snowball Stemmer – SnowballStemmer()
Snowball Stemmer was also developed by Martin Porter. The technique used here is more accurate and is known as “English Stemmer” or “Porter2 Stemmer.” Compared to the first Porter Stemmer, it is a little quicker and more rational.
3. Lancaster Stemmer – LancasterStemmer()
Although Lancaster Stemmer is simple, it frequently yields results with excessive stemming. Over-stemming makes stems unintelligible or non-linguistic. So it is not widely used in critical NLP applications.
4. Regexp Stemmer – RegexpStemmer()
The Regexp stemmer uses regular expressions to identify morphological affixes; any substrings matching the regular expressions are removed.
Compare anaphora resolution, coreference resolution and cataphora resolution in detail?
Anaphora resolution, Cataphora resolution, and Coreference resolution are the different types of Entity resolution tasks in Natural Language Processing.
Anaphora Resolution is the problem of resolving what a pronoun or a noun phrase refers to.
In the example below, sentences 1 and 2 are utterances that come together to make a discourse.
Sentence 1: Pradeep traveled from Kerala to Delhi by bike
Sentence 2: He is a good rider.
Anaphora Resolution is the task of making computing systems infer that He in sentence 2 is Pradeep in sentence 1, as humans can understand quickly.
The task of locating all expressions in a text that refer to the same thing is known as Coreference resolution. It is a crucial step for much higher-level NLP tasks involving natural language understanding, like document summarization, question answering, and information extraction.
An open-source Python package called NeuralCoref, which extends spaCy’s NLP pipeline, is one of the best tools to handle coreference resolution.
Cataphora in linguistics refers to the usage of a phrase or term in a discourse that also alludes to a later, more particular phrase. A cataphor is an expression that comes before another whose meaning is determined by or specified by the latter.
The following sentence is an example of cataphora in English: the pronoun she (the cataphor) precedes the subject it refers to, Rajeev.
When she came to play, Rajeev left the ground
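As a hedged sketch (NeuralCoref works as an extension on top of spaCy 2.x pipelines, so the exact install and versions are assumptions here), coreference resolution on the earlier example could look like:
import spacy
import neuralcoref  # extension that plugs into spaCy 2.x pipelines

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Pradeep traveled from Kerala to Delhi by bike. He is a good rider.")
print(doc._.has_coref)       # True if any coreference chain was found
print(doc._.coref_clusters)  # clusters linking "He" back to "Pradeep"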
What is the major difference between hyponymy and homonymy in semantic analysis, and can you provide some examples?
A term that is a specific instance of a more generic term is called a hyponym, and the relationship is called “hyponymy”. It can be recognized using the class-object analogy.
Example: Animal is the hypernym, where the cat, cow, ox, and lion are hyponyms
The term “homonymy” describes two or more lexical terms that share the same spelling but have entirely different meanings.
Example: The word “bank” is a homonym, with different meanings such as a financial institution or the side of a river.
What is the FOPL – First Order Predicate Logic in Natural Language Processing?
First-order logic is a powerful language that expresses the relationships between objects as well as how information about those objects can be represented. In other words, first-order predicate logic is a method of knowledge representation in artificial intelligence.
It is an extension of propositional logic. FOL is expressive enough to convey natural language statements clearly, much like spoken language. In addition to the factual assumptions made by propositional logic, first-order logic makes the following additional assumptions about the world:
- Objects: any object in the real world
- Relations: unary relations such as color, height, size, and weight (commonly some sort of measurement or property), and n-ary relations such as sister of, father of, and mother of, which can involve multiple objects
- Functions: real-world functions such as best friend of, most favorite, and favorite color
FOPL has two major parts: Syntax and Semantics.
An example of FOPL is as follows. The FOPL form of the sentence “LeBron James is the captain of the LA Lakers” is Captain (LeBron James, LA Lakers)
The FOPL form of the sentence “Yellow is the color of gold” is Color (Gold, yellow)
What are major differences between Synonymy and Antonymy in semantic analysis?
Synonymy is when two or more lexical terms with different spellings have the same or a similar meaning. Example: (job, occupation), (stop, halt)
A pair of lexical terms with opposite meanings, symmetric to a semantic axis, is referred to as antonymy. Example: (large, small), (black, white)
How would you get the definitions, hypernyms, hyponyms, synonyms and antonyms of a given phrase/word using Python?
To get the definitions, hypernyms, hyponyms, synonyms, and antonyms of a word/phrase we can use the WordNet interface of the NLTK library in Python.
Here is the code reference:
import nltk
from nltk.corpus import wordnet

# Download the WordNet data if not already available
nltk.download('wordnet')

def get_wordnet_definitions(word):
    definitions = []
    synonyms = []
    antonyms = []
    hypernyms = []
    hyponyms = []
    for syn in wordnet.synsets(word):
        definitions.append(syn.definition())
        # More general terms (hypernyms) and more specific terms (hyponyms)
        hypernyms.extend([h.name() for h in syn.hypernyms()])
        hyponyms.extend([h.name() for h in syn.hyponyms()])
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
            if lemma.antonyms():
                antonyms.append(lemma.antonyms()[0].name())
    return definitions, hypernyms, hyponyms, synonyms, antonyms

definitions, hypernyms, hyponyms, synonyms, antonyms = get_wordnet_definitions("run")
print("Definitions: ", definitions)
print("Hypernyms: ", hypernyms)
print("Hyponyms: ", hyponyms)
print("Synonyms: ", synonyms)
print("Antonyms: ", antonyms)
Explain the Quantifiers in First Order Predicate Logic?
A quantifier is a linguistic element that produces quantification, which specifies the number of specimens in the universe of discourse. Quantifiers are the symbols that determine the range and scope of a variable in a logical expression. There are mainly two types of quantifiers in FOPL:
- The Universal Quantifier (for all, everyone, everything), represented by ∀
- The Existential Quantifier (for some, at least one), represented by ∃
If x is a variable, then ∀x is read as:
For all x
For each x
For every x.
…
Example: Quantifier representation of the sentence All boys like cricket is
∀x boy(x) → like (x, cricket)
If x is a variable, then ∃x is read as:
There exists a ‘x.’
For some ‘x.’
For at least one ‘x.’
Example: Quantifier representation of the sentence Some girls like football is
∃x girl(x) ∧ like (x, football)
What are the major differences between Polysemy and Meronymy in semantic analysis?
Polysemy is the term for lexical concepts with numerous, closely related meanings but the same spelling. It is distinct from homonymy, since homonymy does not require the terms’ meanings to be closely connected.
Example: The word “man” is a polysemy, since it can signify many different things depending on the context, such as “the human species,” “a male human,” or “an adult male human.”
Meronymy, on the other hand, is a relationship in which one lexical term is a component of another, larger entity.
Example: “Keyboard” is a meronym for “computer.”
What is perplexity, and is a lower or higher perplexity better for language models?
Perplexity in natural language processing captures the level of “uncertainty” a model has when predicting (assigning a probability to) some text. A lower perplexity score indicates better generalization performance.
So, a language model with a lower perplexity score is better than a language model with a higher perplexity score.
Perplexity formula: perplexity(W) = P(w1, w2, …, wN)^(-1/N), i.e. the inverse probability of the word sequence normalised by the number of words N, which is equivalent to exp(-(1/N) Σ log P(wi | w1, …, wi-1)).
It is a per-word metric, so perplexity is not affected by sentence length.
The perplexity indicates the level of “randomness” in our model. If the perplexity (per word) is 3, the model had, on average, a 1-in-3 chance of correctly predicting the next word in the text. For this reason, it is also referred to as the average branching factor.
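A tiny, hedged sketch of the calculation (the per-token probabilities below are made up; a real language model would produce them conditioned on the preceding words):
import math

# Hypothetical probabilities a language model assigned to each token of a sentence
token_probs = [0.2, 0.5, 0.1, 0.4]

# Perplexity = exp of the average negative log-probability per token
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)
print(round(perplexity, 2))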
BONUS
Are you aware of some popular Java-based NLP libraries besides Python that are widely used in the data science industry today?
Python is the simplest and most user-friendly language for dealing with natural language processing applications. But there are still some popular Java-based libraries that developers use today.
Well-known examples include Stanford CoreNLP, Apache OpenNLP, and Apache Lucene.
Conclusion
Natural Language Processing is one of the most popular domains in the tech industry today. In this blog, I have tried to explain some advanced Natural Language Processing interview questions that are commonly asked today for NLP Engineer positions at various companies.
Major points to remember:
- Elements included in semantic analysis, such as hyponymy, homonymy, and meronymy
- What FOPL is and its use in NLP
- What perplexity is and how it is calculated
- The different stemmers in NLP
- Popular Java libraries for Natural Language Processing
I hope this article helped you to strengthen your natural language processing fundamental concepts. Feel free to leave a remark below if you have any questions, concerns, or recommendations.