Essential Pandas Operations You Must Bookmark Right Away

by datatabloid_difmmk

This article was published as a part of the Data Science Blogathon.

Introduction

Most of the time in a data science project is spent cleaning and preparing data. While working through various preprocessing approaches, you will find that many problems can be solved mainly with one library: Pandas. Pandas is a well-known Python library that handles everything from data preprocessing to data analysis. Its extensive feature set lets you get the job done significantly faster than with more traditional approaches.

Source: Image by ThisIsEngineering on Pexels

In this article, we’ll take a look at some bookmark-worthy Pandas operations that are both simple and powerful. These operations are mostly generic and can be adapted to your use case.

Operations in pandas

1. Replacing NaN values with random values from a list

To replace NaN values in a Pandas DataFrame, you would normally use the DataFrame’s .fillna() method. To fill NaN values randomly from a list of numbers (floats or ints) or strings, use the DataFrame’s .loc indexer instead.

Let’s take a look at the example below.

Library import

import pandas as pd
import numpy as np

Create Pandas DataFrame with Dummy Data

df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})

Running this code creates a Pandas DataFrame with four rows. I intentionally added some NaN values to the ‘Favourite Sport’ column because this operation needs them to work on.

Setting the seed value:

The replacement values are drawn randomly with NumPy, so let’s set a seed value first. That way the code produces the same result every time it runs.

np.random.seed(124)
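As a quick illustration of what the seed does, the sketch below draws from the same sports list twice, resetting the seed in between. The variable names first_draw and second_draw are made up for this example.

```python
import numpy as np

sports = ['Volleyball', 'Football', 'Basketball', 'Cricket']

# Seeding the global NumPy generator makes the draws repeatable.
np.random.seed(124)
first_draw = list(np.random.choice(sports, 2))

# Resetting the same seed replays the same sequence of draws.
np.random.seed(124)
second_draw = list(np.random.choice(sports, 2))

print(first_draw == second_draw)  # True
```

Without the second call to np.random.seed(), the two draws would generally differ.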

Replacing NaN Values

Now that the seed is set, we use the DataFrame’s .loc indexer to perform the replacement.

df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]

This assigns a list of values to the NaN entries in the ‘Favourite Sport’ column. The list is built by randomly selecting values from [‘Volleyball’, ‘Football’, ‘Basketball’, ‘Cricket’], and the number of values selected equals the number of NaNs in the selected column (here, ‘Favourite Sport’).

Putting it all together

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
print(df)

After running this code, the NaNs at indices 0 and 3 of the final dataset have been replaced with randomly chosen values, ‘Basketball’ and ‘Volleyball’, from the sports list above.
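If you need this operation on several columns, it can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original walkthrough: the function name fill_na_random is hypothetical, and it uses NumPy’s np.random.default_rng() generator instead of the global np.random.seed(), so the values drawn for a given seed will differ from the ones above.

```python
import numpy as np
import pandas as pd

def fill_na_random(series, choices, seed=None):
    """Return a copy of `series` with NaNs replaced by random picks from `choices`."""
    rng = np.random.default_rng(seed)  # local generator; global random state is untouched
    filled = series.copy()
    mask = filled.isna()
    # Draw exactly one replacement per missing value.
    filled.loc[mask] = rng.choice(choices, size=mask.sum()).tolist()
    return filled

sports = pd.Series([np.nan, 'Lawn Tennis', 'Basketball', np.nan])
choices = ['Volleyball', 'Football', 'Basketball', 'Cricket']
filled = fill_na_random(sports, choices, seed=124)
print(filled)
```

Returning a copy leaves the original Series unchanged, which makes it easier to compare the data before and after filling.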

2. Mapping categorical column values to codes

Mapping values to numeric codes is useful when you need numeric data in a DataFrame that is unique and tied to some other value. One use for this feature is to automatically assign roll numbers to a class from a list of student names.

Let’s start by repeating the prerequisites from the previous step: importing the libraries, creating the Pandas DataFrame, and running operation 1.

Run prerequisites

# Importing the libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]

Create a list of codes (from the “Name” column)

list(pd.Categorical(df['Name'], ordered = True).codes)

Running this returns the codes 0, 2, 3, and 1.

Here, we apply Pandas’ Categorical() constructor to the ‘Name’ column of the DataFrame and pass True to the ‘ordered’ parameter. Categorical() sorts the unique values it finds, so the codes follow the alphabetical order of the ‘Name’ column; ordered=True additionally marks the categories as having a meaningful order. For the name “Alex”, the assigned code is 0. For the name “Jimmy”, the assigned code is 2, because “Jimmy” ranks third of the four names alphabetically and the codes start at 0. Finally, we pass the codes to list() to get a list of values.

We can also add this list of values to the DataFrame as a new column.

Create a new column from the codes
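To see where the codes come from, it can help to print the categories next to them. The sketch below uses only the four names from the example; the variable names are illustrative.

```python
import pandas as pd

names = ['Alex', 'Jimmy', 'Katie', 'Brute']
cat = pd.Categorical(names, ordered=True)

# Categories are the unique values, sorted alphabetically.
print(list(cat.categories))   # ['Alex', 'Brute', 'Jimmy', 'Katie']

# Each code is the position of the value within the categories.
codes = [int(c) for c in cat.codes]
print(codes)                  # [0, 2, 3, 1]

# The same codes come back without ordered=True.
plain_codes = [int(c) for c in pd.Categorical(names).codes]
print(plain_codes == codes)   # True
```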

df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)

Running this creates a new column named ‘Roll Number’.

Putting this all together

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
print(df)

After running this code, the DataFrame contains a new ‘Roll Number’ column holding the codes 0, 2, 3, and 1.
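The same codes can also be reached through the category dtype. The sketch below shows that alternative spelling and, as a hypothetical tweak that is not part of the steps above, shifts the codes by one so that roll numbers start at 1 rather than 0.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Jimmy', 'Katie', 'Brute']})

# Alternative spelling: convert the column to the category dtype and read its codes.
codes = df['Name'].astype('category').cat.codes

# Hypothetical tweak: shift by one so numbering starts at 1 instead of 0.
df['Roll Number'] = codes + 1
print(df)
```

Because .cat.codes returns a Series aligned with the DataFrame’s index, it can be assigned directly without wrapping it in list().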

3. Formatting integers in a DataFrame

Formatting improves the readability of numbers. DataFrames often contain numbers with many digits, which are easy to misread.

The following example formats the values in the ‘Salary’ column.

Let’s start with the prerequisites: importing the libraries, building the Pandas DataFrame, and running the previous two operations.

Run prerequisites

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)

Formatting the Salary Column

df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))

Putting it all together

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
# Format values in 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
print(df)

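Running this code prints the DataFrame with the ‘Salary’ values written with thousands separators, for example 12,134,343.

Here, each value in the ‘Salary’ column is formatted with Python’s built-in format() function, which is applied to the column through the Pandas .apply() method.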

A potential caveat of this operation is that the formatted values contain commas, so they are strings rather than integers. The column is no longer numeric (its dtype is typically object), and arithmetic on it will not work until the values are converted back.
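The sketch below illustrates this caveat on the salary values alone and shows one way to get the integers back, assuming the only non-digit characters are the commas added by format(). The variable names are illustrative.

```python
import pandas as pd

salaries = pd.Series([12134343, 21312324, 421324554, 234434325])

# Formatting turns every integer into a string with thousands separators.
formatted = salaries.apply(lambda x: format(x, ',d'))
print(formatted[0])                               # 12,134,343
print(pd.api.types.is_integer_dtype(formatted))   # False

# Strip the commas and cast back to integers to restore a numeric column.
restored = formatted.str.replace(',', '', regex=False).astype('int64')
print(restored.tolist() == salaries.tolist())     # True
```

If the numbers are still needed for calculations, it may be simpler to keep the original column and store the formatted strings in a separate one.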

4. Extracting rows where a categorical column contains a specific substring

Sometimes you want to extract rows that meet certain requirements. This operation is often performed on categorical columns of a DataFrame. We will perform it on one of the categorical columns of our DataFrame.

From the DataFrame, we extract all rows where the person’s favourite sport is a ball game, that is, where the value in the ‘Favourite Sport’ column contains ‘ball’.

We start with the prerequisites: importing the libraries, building the DataFrame, and running the previously completed operations.

Run prerequisites

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
# Mapping the 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
# Format values in the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))

Extracting Rows of Interest

print(df[df['Favourite Sport'].str.contains('ball')])

Running this extracts all rows where the person’s favourite sport contains the text ‘ball’.

Putting it all together
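The match above is case-sensitive and relies on the NaN values having already been filled in operation 1. On a column that still has missing values or mixed capitalisation, the case and na parameters of .str.contains() may be useful. The data in this sketch is made up for illustration.

```python
import numpy as np
import pandas as pd

sports = pd.Series(['Basketball', 'Lawn Tennis', 'FOOTBALL', np.nan])

# case=False ignores capitalisation; na=False treats missing values
# as non-matches so the mask contains only True and False.
mask = sports.str.contains('ball', case=False, na=False)
print(sports[mask].tolist())   # ['Basketball', 'FOOTBALL']
```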

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
   'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
   'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
   'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
   'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
# Mapping the 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
# Format values in the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
# Checking if 'ball' is in the 'Favourite Sport' column
print(df[df['Favourite Sport'].str.contains('ball')])

Running this code prints every row of the DataFrame whose ‘Favourite Sport’ value contains the substring ‘ball’.

Conclusion

In this article, we explored four simple yet powerful Pandas operations that can be used in a variety of situations. Each one is shown in the simplest way I could find, although there may be other ways to perform them. Having them in one place saves time otherwise spent looking for similar solutions on StackOverflow, so this article is worth bookmarking.

Important points:

  • We saw how to replace NaN with random values (numbers or strings) from a list.
  • We also saw how to encode strings as numbers based on their alphabetical order.
  • In the third operation, we learned how to format integers to make them easier to read.
  • We also saw that this formatting changes the data type of the values from int to str.
  • In the fourth operation, we saw how to extract rows when a certain substring is found in a specified column.

Connect with me on LinkedIn. Check out my other articles here.

Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.
