This article was published as a part of the Data Science Blogathon.
Introduction
Most of the time in a data science project goes into cleaning and preparing data. While trying out different preprocessing approaches, you will find that many problems can be solved with a single library: Pandas. Pandas is a well-known Python library that handles everything from data preprocessing to data analysis. Its extensive feature set lets you get the job done significantly faster than with traditional alternatives.
In this article, we’ll take a look at some bookmark-worthy Pandas operations that are both simple and powerful. These operations are mostly generic and can be modified depending on your use case.
Operations in Pandas
1. To replace NaN values with random values from a list
To replace NaN values in a Pandas DataFrame, you would normally use the DataFrame’s .fillna() method. To fill these NaN values randomly from a list of numbers (floats or ints) or strings, use the DataFrame’s .loc indexer instead.
Let’s take a look at the example below.
Library import
import pandas as pd
import numpy as np
Create Pandas DataFrame with Dummy Data
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})
After running this code, you get the Pandas DataFrame shown in the image above. I intentionally added some NaN values to this dataset, since the operation needs missing values to work on.
Setting the seed value:
The NaN values will be replaced with random picks generated by the NumPy library, so let’s set a seed value first. This way, the code produces the same result every time it runs.
np.random.seed(124)
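To illustrate why the seed matters, here is a minimal sketch. It assumes the same list of sports used later in this section; the variable names first_draw and second_draw are only illustrative. Re-seeding with the same value replays the same sequence of random picks.

```python
import numpy as np

sports = ['Volleyball', 'Football', 'Basketball', 'Cricket']

# Seeding the global NumPy random state makes the following draws repeatable.
np.random.seed(124)
first_draw = np.random.choice(sports, 2).tolist()

# Re-seeding with the same value replays the same sequence of draws.
np.random.seed(124)
second_draw = np.random.choice(sports, 2).tolist()

print(first_draw == second_draw)  # True
```

Without the second call to np.random.seed(), the two draws would generally differ.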
Replacing NaN Values
Now that the seed is set, we use the DataFrame’s .loc indexer to perform the replacement.
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
Doing this assigns a list of values to the NaN entries in the ‘Favourite Sport’ column. The values are picked at random from the list [‘Volleyball’, ‘Football’, ‘Basketball’, ‘Cricket’], and the number of values picked equals the number of NaNs in the selected column (here, ‘Favourite Sport’).
Putting it all together
# Import the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]

print(df)
After running this code, the final dataset looks like this:
As you can see, the NaNs at indices 0 and 3 have been replaced with values chosen at random from the sports list above: ‘Basketball’ and ‘Volleyball’.
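As a variation on the operation above, the same idea can be wrapped in a small reusable function. This is only a sketch: fill_na_random is a hypothetical helper name, not part of Pandas, and it swaps np.random.seed() for NumPy’s np.random.default_rng() so that the global random state is left untouched. The values it picks will therefore differ from the ones shown above.

```python
import numpy as np
import pandas as pd


def fill_na_random(df, column, choices, seed=None):
    """Return a copy of df with NaNs in `column` replaced by random picks from `choices`."""
    rng = np.random.default_rng(seed)  # local generator; global state is not modified
    result = df.copy()
    mask = result[column].isna()
    # Draw exactly as many values as there are missing entries.
    result.loc[mask, column] = rng.choice(choices, size=int(mask.sum())).tolist()
    return result


sports = ['Volleyball', 'Football', 'Basketball', 'Cricket']
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
})

filled = fill_na_random(df, 'Favourite Sport', sports, seed=124)
print(filled)
```

Returning a copy keeps the original DataFrame unchanged, which makes it easy to compare the data before and after the replacement.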
2. To map categorical column values to codes
Mapping values to numeric codes is useful when you need numeric data in a DataFrame that is unique and tied to the values of another column. One use of this feature is to automatically assign roll numbers to a class from a list of student names.
Let’s start by repeating the prerequisites from the previous step: importing the libraries, creating the Pandas DataFrame, and running operation 1.
Run prerequisites
# Importing the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]
Create a list of codes (from the “Name” column)
list(pd.Categorical(df['Name'], ordered = True).codes)
When I run this I get:
Here, we apply Pandas’ Categorical() to the ‘Name’ column of the DataFrame and pass True to the ‘ordered’ parameter. The resulting codes follow the alphabetical order of the values in the ‘Name’ column, because Pandas sorts the categories it infers; ordered=True additionally marks those categories as ordered. For the name “Alex”, the assigned code is 0; for the name “Jimmy”, the assigned code is 2, since “Jimmy” ranks third alphabetically among the four values in the ‘Name’ column. Wrapping the whole expression in list() gives a plain list of values.
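To make the relationship between names and codes explicit, the sketch below prints the inferred categories next to the codes. It uses only the ‘Name’ values from the example; the variable names cat and mapping are illustrative.

```python
import pandas as pd

names = pd.Series(['Alex', 'Jimmy', 'Katie', 'Brute'], name='Name')
cat = pd.Categorical(names, ordered=True)

# Categories are the sorted unique values; each code is an index into them.
print(cat.categories.tolist())  # ['Alex', 'Brute', 'Jimmy', 'Katie']
print(cat.codes.tolist())       # [0, 2, 3, 1]

# Pairing each name with its code shows the mapping directly.
mapping = dict(zip(names, cat.codes.tolist()))
print(mapping)
```

Each code is simply the position of the name in the sorted list of categories, which is why “Brute” gets 1 even though it appears last in the column.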
We can also pass this list of values as columns to a DataFrame.
Create new columns from code
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
Running this creates a new column named ‘Roll Number’.
Putting this all together
# Import the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]

# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered=True).codes)

print(df)
After running this code, the DataFrame looks like this:
3. To format integers in a DataFrame
This operation improves the readability of numbers. DataFrames often contain numbers with many digits, which are easy to misread.
The following example formats values in the ‘Salary’ column.
Let’s start with the prerequisites for the main operation: importing the libraries, building the Pandas DataFrame, and running the previous two operations.
Run prerequisites
# Import the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]

# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered=True).codes)
Formatting the Salary Column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
Putting it all together
# Import the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]

# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered=True).codes)

# Format values in 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))

print(df)
Running this code gives the following results:
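Here, each value in the ‘Salary’ column is formatted with Python’s built-in format() function, applied through the Pandas .apply() method.
A potential caveat when performing this operation is that the formatted values contain commas, so they are strings rather than integers. The column stops being numeric (its dtype is typically reported as object) and can no longer be used directly in arithmetic.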
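One way to work around this caveat, sketched below under the assumption that you still need the numbers for calculations, is to keep the integer column and store the formatted text in a separate column. The column name ‘Salary (formatted)’ is illustrative. The last step shows how the integers can be recovered if only the formatted strings are available.

```python
import pandas as pd

df = pd.DataFrame({'Salary': [12134343, 21312324, 421324554, 234434325]})

# Keep the numeric column and store the formatted text separately.
df['Salary (formatted)'] = df['Salary'].apply(lambda x: format(x, ',d'))

print(df['Salary'].sum())                # arithmetic still works on the integer column
print(df.loc[0, 'Salary (formatted)'])   # 12,134,343

# If only the formatted strings are available, strip the commas to recover integers.
recovered = df['Salary (formatted)'].str.replace(',', '', regex=False).astype('int64')
print((recovered == df['Salary']).all())
```

Keeping both columns costs a little memory but avoids converting back and forth between text and numbers.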
4. To extract rows where a categorical column contains a specific substring
Sometimes you want to extract the rows that meet certain requirements. This operation is often performed on the categorical columns of a DataFrame. We will perform such an operation on one of the categorical columns of our dataset.
From the DataFrame, we extract all rows where the person’s favourite sport is a ball game, that is, where the ‘Favourite Sport’ column contains the substring ‘ball’.
We start with the prerequisites, such as importing the library, building the DataFrame, and previously completed operations.
Run prerequisites
# Import the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]

# Mapping the 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered=True).codes)

# Format values in the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
Extracting Rows of Interest
print(df[df['Favourite Sport'].str.contains('ball')])
Running this extracts all rows where the person’s favourite sport contains the text ‘ball’.
Putting it all together
# Import the libraries
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
    'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
    'Salary': [12134343, 21312324, 421324554, 234434325]
})

# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [
    i for i in np.random.choice(
        ['Volleyball', 'Football', 'Basketball', 'Cricket'],
        df['Favourite Sport'].isna().sum())
]

# Mapping the 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered=True).codes)

# Format values in the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))

# Checking if 'ball' is in the 'Favourite Sport' column
print(df[df['Favourite Sport'].str.contains('ball')])
Running this code gives the following results:
Here, we get all rows of the DataFrame in which the ‘Favourite Sport’ column contains the substring ‘ball’.
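The filter above works because the NaN values were replaced beforehand. If the column may still contain NaNs, or the casing of the text varies, a hedged variant is to pass the na and case parameters of .str.contains(). The small DataFrame below is made up for illustration and deliberately keeps one NaN and one upper-case value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
    'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', 'VOLLEYBALL'],
})

# na=False treats missing values as "no match", so the mask contains only True/False.
# case=False makes the match case-insensitive.
mask = df['Favourite Sport'].str.contains('ball', case=False, na=False)
print(df[mask])
```

With these settings, the rows for Katie and Brute are selected, while the row with the missing value is skipped instead of causing an error during filtering.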
Conclusion
In this article, we explored four simple yet powerful Pandas operations that can be used in a variety of situations. Each operation was performed in the simplest way possible, although there may be other ways to achieve the same results. This simplicity saves you time otherwise spent looking for similar solutions on StackOverflow, so the article is worth bookmarking.
Important points:
- We have seen how to replace NaN with a random value (number or string).
- We also saw how to code strings to numbers based on their alphabetical placement.
- In the third operation, we learned how to format integers and make them easier for the user to read.
- We also saw that this formatting changes the values in the column from int to str.
- In the fourth operation, we figured out how to extract rows when a certain substring is found in a specified column.
Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.