This article was published as a part of the Data Science Blogathon.
Introduction
A key building block of big data work is the data frame, and Spark and pandas are the two most popular implementations. Spark is well suited to handling distributed data at scale, while pandas is not; in contrast, pandas’ API and syntax are easy to use. What if a user could get the best of both worlds? A library called Koalas gives users exactly that, eliminating the need to choose between them. Hence this article!
The article first explains the rationale behind using Koalas, then covers the library variants available across Spark versions. Next, we discuss the differences between Koalas and pandas and run tests to validate those differences, so the reader can establish a strong foundation in Koalas. Once that foundation is established, we look at what data scientists should consider when using it, ending with a summary and key takeaways. Let’s start.
Why Koalas?
The Koalas library was introduced to address the following issues with the existing Spark ecosystem [2]:
- Apache Spark lacks some features that are frequently needed in data science. Specifically, plotting and drawing graphs is an important function that almost every data scientist uses on a daily basis.
- Data scientists typically prefer pandas’ APIs, but it’s difficult to switch to PySpark APIs when workloads need to scale. This is because the PySpark API is harder to learn and has many limitations compared to pandas.
Koalas Library Variants
To use Koalas in your Spark notebook, you need to import the library, and there are two options. The first is “databricks.koalas”; prior to PySpark 3.2.x, this was the only option available. Since 3.2.x, however, the library ships as “pyspark.pandas”, a name that reflects the pandas API more faithfully [3]. Spark recommends using the latter, as the former will be deprecated soon.
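As a minimal sketch, the two import variants look like this (the aliases ks and ps are just the documented conventions):

# Option 1: the original Koalas package (PySpark versions before 3.2.x)
# import databricks.koalas as ks

# Option 2: the pandas API shipped with PySpark 3.2.x and later (recommended)
import pyspark.pandas as ps

kdf = ps.DataFrame({"id": [1, 2, 3]})  # requires an active Spark session
print(kdf.head())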
Koalas vs. Pandas
Simply put, Koalas can be seen as a PySpark data frame under a pandas cover: it gives you all the benefits of a Spark data frame with the interaction style of pandas. The Koalas API combines the speed of Spark and the ease of use of pandas into a powerful and versatile API. The same concept is illustrated graphically in the figure below.
The main similarity between pandas and Koalas is that the APIs used by both libraries are the same. That is, where pandas uses pd.DataFrame(), the Koalas usage is identical: ks.DataFrame(). However, it is the difference between the two data frames that makes Koalas really special: Koalas data frames are distributed in the same way as Spark data frames [1]. Pandas, unlike the Spark libraries, runs on the single driver node instead of the worker nodes, so it cannot scale. Contrary to pandas, the pandas API on Spark (aka Koalas) works exactly like a Spark library.
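As a minimal sketch of the identical syntax (assuming a Spark session is available, e.g. in a Databricks notebook; the variable names are illustrative):

import pandas as pd
import pyspark.pandas as ps

# Identical constructor calls, very different engines underneath
pdf = pd.DataFrame({"EmpId": ["A01", "A02"], "IsPermanent": [True, False]})  # single driver node
kdf = ps.DataFrame({"EmpId": ["A01", "A02"], "IsPermanent": [True, False]})  # distributed Spark frame

# Familiar pandas-style operations work on both
print(pdf.head())
print(kdf.head())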
To see this in practice, let’s use a sample program (GitHub) to run some tests confirming the above differences.
The first test looks at how the count operation performs in the Databricks environment, and whether anything different can be observed when comparing the count operation on the Spark, Koalas, and pandas data frames respectively. The output is shown pictorially in the figure below. Notice that the pandas data frame does not use the worker nodes: when the count is performed on the pandas data frame, no Spark job is launched (see image below). The Spark and Koalas data frames behave differently here: Spark created a job to complete the count operation, and these jobs were scheduled on two separate workers (i.e., machines). This test confirms two things (a sketch of the test follows the list):
- First, Spark and Koalas data frames are no different in how they work under the hood.
- Second, pandas does not scale as the data load increases (it always runs on the single driver node, regardless of data size). Koalas, on the other hand, is distributed in nature and can scale with the data size.
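Below is a minimal sketch of the count test, assuming df is the Spark data frame from the sample program; the variable names are illustrative, not part of the original notebook.

# Spark, Koalas, and pandas views of the same data
sdf = df                 # Spark data frame
kdf = sdf.pandas_api()   # Koalas view (PySpark >= 3.3; older versions use to_pandas_on_spark())
pdf = sdf.toPandas()     # plain pandas, collected onto the driver

print(sdf.count())       # triggers a Spark job on the workers
print(len(kdf))          # also triggers a Spark job
print(len(pdf))          # runs entirely on the driver; no Spark job appears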
The second test performed in the sample program is a performance check on the various data frames. Here we measured the execution time of the count operation. The table below clearly shows that the record count takes significantly longer on the Spark and Koalas data frames than on pandas, and takes roughly the same time on both, which supports the claim that underneath Koalas is just a Spark data frame. Another important thing to note is that pandas’ count performance is much better than the other two.
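A hypothetical timing harness for this test could look as follows (reusing the illustrative sdf, kdf, and pdf from the sketch above):

import time

def time_count(label, fn):
    # measure wall-clock time of a single count operation
    start = time.time()
    fn()
    print(f"{label}: {time.time() - start:.3f}s")

time_count("spark", lambda: sdf.count())
time_count("koalas", lambda: len(kdf))
time_count("pandas", lambda: len(pdf))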
The third test again checks that the two entities are the same underneath: if what is under both is the same, there shouldn’t be much work for Databricks to do when converting between them. To verify this, a performance test was run in a Databricks notebook measuring the completion time of converting a Spark data frame to Koalas and to pandas respectively. The output shows that the conversion time to Koalas is negligible compared to the conversion time to pandas. This is because a Koalas data frame has the same underlying structure, so there isn’t much for Spark to do.
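A sketch of this conversion test, again with illustrative variable names:

import time

start = time.time()
kdf = sdf.pandas_api()   # Spark -> Koalas: a thin wrapper, near-instant
print(f"to koalas: {time.time() - start:.3f}s")

start = time.time()
pdf = sdf.toPandas()     # Spark -> pandas: collects every row to the driver
print(f"to pandas: {time.time() - start:.3f}s")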
Evaluating the Koalas Read API on a Delta Table with a Complex Data Structure
The most common way to persist data in a modern data lake is the Delta format. Azure Databricks Delta tables support ACID properties similar to transactional database tables. It’s worth checking how Koalas (the pandas API) works with Delta tables containing complex nested JSON structures. Let’s get started.
Sample data structure
The sample data shown here contains two columns. The first is the bank branch ID (simple data) and the second is a payload of employee and department details (a complex nested JSON structure). This data is stored as a Delta table.
Sample data – code
Sample data can be created using the code below. The full codebase is available on GitHub.
import json

# create payloads
payload_data1 = {"EmpId": "A01", "IsPermanent": True, "Department": [{"DepartmentID": "D1", "DepartmentName": "Data Science"}]}
payload_data2 = {"EmpId": "A02", "IsPermanent": False, "Department": [{"DepartmentID": "D2", "DepartmentName": "Application"}]}
payload_data3 = {"EmpId": "A03", "IsPermanent": True, "Department": [{"DepartmentID": "D1", "DepartmentName": "Data Science"}]}
payload_data4 = {"EmpId": "A04", "IsPermanent": False, "Department": [{"DepartmentID": "D2", "DepartmentName": "Application"}]}

# create data structure
data = [
    {"BranchId": 1, "Payload": payload_data1},
    {"BranchId": 2, "Payload": payload_data2},
    {"BranchId": 3, "Payload": payload_data3},
    {"BranchId": 4, "Payload": payload_data4},
]

# dump data to json
jsonData = json.dumps(data)

# append json data to list
jsonDataList = []
jsonDataList.append(jsonData)

# parallelize json data
jsonRDD = sc.parallelize(jsonDataList)

# read json data into a spark dataframe
df = spark.read.json(jsonRDD)
Store temporary data in delta tables
Persist the temporary employee data created above to the delta table using the code shown below.
table_name = "/testautomation/EmployeeTbl"

(df.write
   .mode("overwrite")
   .format("delta")
   .option("overwriteSchema", "true")
   .save(table_name))

dbutils.fs.ls("./testautomation/")
Read Complex Nested Data Using Koalas Data Frames
import pyspark.pandas as ps

pdf = ps.read_delta(table_name)
pdf.head()
When I ran the above code, I got the output below.
Read Complex Nested Data Using Spark Data Frames
df = spark.read.load(table_name)
display(df)
Here is the output you see when you run the above code.
Figure 7: Complex JSON Data – Displayed by Display Function
The above results demonstrate that the pandas API (Koalas) does not render complex nested JSON structures well when the head function is called. This goes against the main principle of the Koalas library, since the ultimate intention is to provide a distributed mechanism for pandas while keeping the pandas functions. However, there is a workaround: use the display function on the Koalas data frame.
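One possible form of that workaround, as a minimal sketch: convert the Koalas data frame back to a Spark data frame and pass it to Databricks’ display function (whether display accepts the Koalas frame directly depends on the runtime).

# pdf is the Koalas data frame read via ps.read_delta above
display(pdf.to_spark())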
Conclusion
This article has set a strong foundation for the Koalas library. The various tests performed to validate the difference between pandas and Koalas showed that Koalas is nothing more than a Spark data frame exposed through the pandas API. We also discussed a limitation that prevents Koalas’ head function from displaying nested JSON data properly. In summary, it would not be a mistake to say that Koalas is a good choice as your primary method for analyzing and transforming big data, but be aware of its limitations, and have a fallback plan in case certain APIs don’t work well.
Key Takeaways
- Koalas increases productivity by enabling data engineers and data scientists to work with big data more efficiently.
- Do your research before using Koalas on complex, nested JSON structures, as the Koalas API may not give you the results you expect.
- Koalas significantly bridges the gap between working with the pandas API on a single node and working with distributed data.
Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.