Let’s see the actual implementation:
Example A.) Concatenate two or more columns using expr()
# concatenate the friends_name, age, and location columns using expr()
df_concat = df_friends.withColumn(
    "name-age-location",
    expr("friends_name || '-' || age || '-' || location"),
)
df_concat.show()
I combined the friends_name, age, and location columns and saved the result in a new column called “name-age-location”.
Example B.) Use expr() to add a new column based on a condition (CASE WHEN).
# check whether exercise is needed based on weight
# if weight is greater than or equal to 60 -- Yes
# if weight is less than 55 -- No
# otherwise -- Enjoy
df_condition = df_friends.withColumn(
    "Exercise_Need",
    expr("CASE WHEN weight >= 60 THEN 'Yes' "
         "WHEN weight < 55 THEN 'No' ELSE 'Enjoy' END"),
)
df_condition.show()
Our “Exercise_Need” column received three values (Yes, No, Enjoy) based on the conditions given in the CASE WHEN expression. The first value in the weight column is 58, which is less than 60 but not less than 55, so the result is “Enjoy”.
Example C.) Create a new column using the current column values in an expression.
# increment the meetup month by the number of months in the offset column
df_meetup = df_friends.withColumn(
    "new_meetup_date", expr("add_months(meetup_date, offset)")
)
df_meetup.show()
The month of each “meetup_date” value is incremented by the corresponding “offset” value, and the result is stored in the new “new_meetup_date” column.
Padding functions
A.) lpad():
This function pads the left side of a column. Its inputs are the column name, the target length, and the padding string.
B.) rpad():
This function pads the right side of a column. It takes the same inputs: column name, target length, and padding string.
Note:
- If the column value is longer than the specified length, the returned value is truncated to the length characters or bytes.
- If no padding value is specified, the column values are padded left or right depending on the function you are using. Strings are padded with space characters, byte sequences are padded with zeros.
Let’s first create a dataframe:
# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lpad, rpad

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# creating data
data = [("Delhi", 30000), ("Mumbai", 50000), ("Gujrat", 80000)]
columns = ["state_name", "state_population"]
df_states = spark.createDataFrame(data=data, schema=columns)
df_states.show()
Example 01: use left padding
# left padding
df_states = df_states.withColumn(
    "states_name_leftpad", lpad(col("state_name"), 10, "#")
)
df_states.show(truncate=False)
The ‘#’ symbol is added to the left of each “state_name” value, and the total length of the padded values is 10.
Example 02: use right padding
# right padding
df_states = df_states.withColumn(
    "states_name_rightpad", rpad(col("state_name"), 10, "#")
)
df_states.show(truncate=False)
The “#” symbol is added to the right of each “state_name” value, and the total length is 10 after right padding.
Example 03: column string length greater than the padded length
df_states = df_states.withColumn(
    "states_name_condition", lpad(col("state_name"), 3, "#")
)
df_states.show(truncate=False)
In this case, the returned column value is truncated to the padded length. You can see that the “states_name_condition” column only contains values of length 3, the padded length given to the function.
repeat() function
PySpark’s repeat function replicates column values. repeat(str, n) returns a string consisting of the specified string value repeated n times.
Example 01
# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, repeat

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create data
data = [("Prashant", 25, 80), ("Ankit", 26, 90), ("Ramakant", 24, 85)]
columns = ["student_name", "student_age", "student_score"]
df_students = spark.createDataFrame(data=data, schema=columns)
df_students.show()

# repeat the student_name column twice and save the result in a new column
df_repeated = df_students.withColumn(
    "student_name_repeated", expr("repeat(student_name, 2)")
)
df_repeated.show()
In the example above, I repeated the values of the ‘student_name’ column twice.
You can also combine this function with concat(). Repeating a string n times and concatenating it before the column value acts like padding, where n controls the length of the repeated prefix.
startswith() and endswith() functions
startswith():
Produces a Boolean result (True/False). It returns True if the DataFrame column value starts with the string provided as an argument to this method, and False if there is no match.
endswith():
Returns a Boolean value (True/False). It returns True if the DataFrame column value ends with the string provided as an argument to this method, and False if there is no match.
Note:
- Returns NULL if either the column value or the input string is NULL.
- Returns True if the input check string is empty.
- These methods are case-sensitive.
Create a dataframe:
# importing necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create dataframe
data = [("Prashant", 25, 80), ("Ankit", 26, 90), ("Ramakant", 24, 85), (None, 23, 87)]
columns = ["student_name", "student_age", "student_score"]
df_students = spark.createDataFrame(data=data, schema=columns)
df_students.show()
Example – 01: First, check the output type.
df_internal_res = df_students.select(
    col("student_name").endswith("it").alias("internal_bool_val")
)
df_internal_res.show()
- The output is boolean.
- The output value for the last row is NULL because the corresponding value in the “student_name” column is NULL.
Example – 02
- Now use the filter() method to keep only the rows where the condition is True.
df_check_start = df_students.filter(col("student_name").startswith("Pra"))
df_check_start.show()
Here, we get the first row as output because its ‘student_name’ value starts with the string passed to the function.
Similarly, when filtering with endswith(), I get two rows as output because those “student_name” values end with the string passed to the function.
In both cases, filter() returns only the rows where the condition evaluates to True; rows that evaluate to False (or NULL) are excluded.
This article started by defining PySpark and its features. Next, we discussed the functions, their definitions, and their syntax. After explaining each function, I created a dataframe and used it to practice some examples. In all, we covered six functions.
I hope this article helped you understand these PySpark functions. If you have any thoughts or questions, please comment below, or connect with me on LinkedIn for further discussion.
Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.