Most Important PySpark Functions with Examples

by datatabloid_difmmk

Let’s see the actual implementation:-
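
The examples in this part use the df_friends DataFrame created earlier in the article. A minimal sketch to recreate it is shown below; the column names match the examples that follow, but the row values, dates, and offsets are illustrative assumptions (the first weight is set to 58 to match the CASE WHEN walkthrough later on).

# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# illustrative friends data: name, age, location, weight, meetup date, month offset
data = [
    ("Ankit", 25, "Delhi", 58, "2022-01-15", 1),
    ("Prashant", 26, "Mumbai", 62, "2022-02-10", 2),
    ("Ramakant", 24, "Pune", 54, "2022-03-05", 3),
]
columns = ["friends_name", "age", "location", "weight", "meetup_date", "offset"]
df_friends = spark.createDataFrame(data=data, schema=columns)
df_friends.show()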

Example:- A.) Concatenate one or more columns using expr()

# concatenate friend's name, age, and location columns using expr()
df_concat = df_friends.withColumn("name-age-location", expr("friends_name || '-' || age || '-' || location"))
df_concat.show()

I combined the friends_name, age, and location columns and saved the result in a new column called "name-age-location".

Example:- B.) Use expr() to add a new column based on a condition (CASE WHEN).

# check if exercise needed based on weight
# if weight is more or equal to 60 -- Yes
# if weight is less than 55 -- No
# else -- Enjoy
df_condition = df_friends.withColumn("Exercise_Need", expr("CASE WHEN weight >= 60  THEN 'Yes' " + "WHEN  weight < 55  THEN 'No' ELSE 'Enjoy' END"))
df_condition.show()

Our "Exercise_Need" column received three values (Enjoy, No, Yes) based on the conditions given in the CASE WHEN expression. The first value in the weight column is 58, which is less than 60 and greater than 55, so the result is "Enjoy".

Example:- C.) Create a new column using the current column value in an expression.

# increment the meetup month by the offset value
df_meetup = df_friends.withColumn("new_meetup_date", expr("add_months(meetup_date,offset)"))
df_meetup.show()

The month of the "meetup_date" column is incremented by the offset value, and the newly generated date is stored in the "new_meetup_date" column.

padding function

A.) lpad():-

This function provides padding on the left side of the column. The inputs for this function are the column name, length, and padding string.

B.) rpad():-

This function adds padding to the right side of the column. Its inputs are the same: column name, length, and padding string.

Note:-

  • If the column value is longer than the specified length, the returned value is shortened to that length (in characters or bytes).
  • If no padding string is specified, the column values are padded on the left or right depending on the function used: strings are padded with space characters, and byte sequences are padded with zeros.

Let’s first create a dataframe:-

# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lpad, rpad

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# creating data
data = [("Delhi",30000),("Mumbai",50000),("Gujrat",80000)]
columns= ["state_name","state_population"]
df_states = spark.createDataFrame(data = data, schema = columns)
df_states.show()

Example:- 01 – use left padding

# left padding
df_states = df_states.withColumn('states_name_leftpad', lpad(col("state_name"), 10, '#'))
df_states.show(truncate=False)

The '#' symbol is added to the left of the 'state_name' column values, and the total length of each value is 10 after padding.

Example:- 02 – right padding

# right padding
df_states = df_states.withColumn('states_name_rightpad', rpad(col("state_name"), 10, '#'))
df_states.show(truncate=False)

The '#' symbol is added to the right of the 'state_name' column values, and the total length is 10 after right padding.

Example:- 03 – if the column string length is greater than the specified pad length

df_states = df_states.withColumn('states_name_condition', lpad(col("state_name"), 3, '#'))
df_states.show(truncate=False)

In this case, the returned column value is shortened to the specified length. You can see that the "state_name_condition" column only contains values of length 3, which is the pad length given to the function.

repeat() function

PySpark uses the repeat function to replicate column values. The repeat(str,n) function returns a string that is the specified string value repeated n times.

Example:- 01

# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, repeat

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create data
data = [("Prashant",25, 80), ("Ankit",26, 90),("Ramakant", 24, 85)]
columns= ["student_name", "student_age", "student_score"]
df_students = spark.createDataFrame(data = data, schema = columns)
df_students.show()

# repeating the column (student_name) twice and saving results in new column
df_repeated = df_students.withColumn("student_name_repeated",(expr("repeat(student_name, 2)")))
df_repeated.show()

In the example above, I repeated the values of the 'student_name' column twice.

You can also use this function with the concat() function: the string value is repeated n times and placed in front of the column value, which acts like padding, where n controls how many times the string is repeated.
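
For instance, here is a minimal sketch (the '*-' prefix and the 'student_name_prefixed' column name are illustrative choices, not from the original example):

# prepend the string '*-' repeated 3 times in front of each student name
df_prefixed = df_students.withColumn("student_name_prefixed", expr("concat(repeat('*-', 3), student_name)"))
df_prefixed.show(truncate=False)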

startswith() and endswith() functions

startswith():-

Produces a Boolean result of True or False. Returns True if the DataFrame column value starts with the string provided as a parameter to this method. Returns False if no match is found.

endswith():-

Returns a boolean value (True/False). Returns True if the DataFrame column value ends with the string provided as input to this method. False is returned if there is no match.

Note:-

  • Returns NULL if either the column value or the input string is NULL.
  • Returns True if the input check string is empty.
  • These methods are case sensitive.
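
A quick, self-contained sketch of these three behaviours (the data, the df_demo name, and the alias names are made up purely for illustration):

# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# small sample with a NULL name to illustrate the notes above
df_demo = spark.createDataFrame([("Prashant",), (None,)], ["student_name"])

df_demo.select(
    col("student_name").startswith("pra").alias("lowercase_check"),   # False for 'Prashant' -- case sensitive
    col("student_name").startswith("Pra").alias("exact_case_check"),  # True for 'Prashant'
    col("student_name").startswith("").alias("empty_string_check")    # True for any non-NULL value
).show()
# the row with the NULL name returns NULL for every check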

Create a dataframe:-

# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create dataframe
data = [("Prashant",25, 80), ("Ankit",26, 90),("Ramakant", 24, 85), (None, 23, 87)]
columns= ["student_name", "student_age", "student_score"]
df_students = spark.createDataFrame(data = data, schema = columns)
df_students.show()

Example – 01 First, check the output type.

df_internal_res = df_students.select(col("student_name").endswith("it").alias("internal_bool_val"))
df_internal_res.show()
  • The output is boolean.
  • The output value for the last row is NULL because the corresponding value in the "student_name" column is NULL.

Example – 02

Next, use the filter() method to get only the rows for which the condition is True.

df_check_start = df_students.filter(col("student_name").startswith("Pra"))
df_check_start.show()

Here, we get the first row as output because the value of the ‘student_name’ column starts with the value mentioned in the function.
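
The next paragraph refers to an endswith() filter whose code is not shown in this section. A minimal sketch, assuming the suffix being checked is 'ant' (which would match 'Prashant' and 'Ramakant' and return two rows):

# keep only the rows whose student_name ends with 'ant'
df_check_end = df_students.filter(col("student_name").endswith("ant"))
df_check_end.show()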

Here, I am getting two rows as output because the value of the “student_name” column ends with the value mentioned in the function.

In both cases, filter() returns only the rows for which the condition evaluates to True; rows that evaluate to False (or NULL) are not returned.

This article started by defining PySpark and its features. We then discussed the functions, their definitions, and their syntax. After explaining each function, I created a DataFrame and used it to practice some examples. In this article, we covered six functions.

I hope this article helped you understand these PySpark functions. If you have any thoughts or questions, please comment below. Connect with me on LinkedIn for further discussion.

Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.
