This article was published as a part of the Data Science Blogathon.
Data lineage is the process of tracing the paths data takes and how it is transformed and used over time. Many businesses and corporations use it to understand where their data comes from, which paths it follows, and how it is being used. It helps organizations gain insights from data, plan future steps, and use data to improve the performance of their products or services.
This article describes three case studies in which data-driven companies (Netflix, Slack, and Postman) implemented and benefited from data lineage. It also describes the processes and techniques they applied during implementation and use.
Data lineage case studies
Some data-driven businesses, such as Netflix, Slack, UBS, Postman, and Airbnb, have seen the benefits of data lineage and now rely on it to create business value. The following describes how some of these companies approached data lineage and the benefits they derive from it.
Netflix saw the benefits of data lineage and implemented it. According to the Netflix team, at the outset of the project they defined design goals to guide the architecture and development efforts, aiming for a complete, accurate, reliable, and scalable lineage system that maps Netflix’s diverse data landscape. Some of these goals are:
- Ensure data integrity
- Enable seamless integration
- Design a flexible data model
Building on a standard data model at the entity level, the Netflix team built a generic relationship model that describes dependencies between arbitrary pairs of entities. This approach gave them a unified data model and repository that supports multiple use cases, such as data discovery, SLA services, and data efficiency.
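To make the idea of a generic relationship model more concrete, here is a minimal sketch in Python. It is not Netflix’s implementation: the class names (`Entity`, `Relationship`, `LineageRepository`), the fields, and the example datasets are all hypothetical, chosen only to show how one edge type can describe a dependency between any two kinds of entities.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A node in the lineage graph (hypothetical fields, not Netflix's schema)."""
    kind: str  # e.g. "table", "job", "dashboard"
    name: str


@dataclass(frozen=True)
class Relationship:
    """A directed dependency: `target` depends on `source`."""
    source: Entity
    target: Entity
    relation: str  # e.g. "read_by", "writes"


class LineageRepository:
    """In-memory store of relationships between arbitrary pairs of entities."""

    def __init__(self):
        self._downstream = defaultdict(list)
        self._upstream = defaultdict(list)

    def add(self, rel):
        # Index every relationship in both directions for cheap lookups.
        self._downstream[rel.source].append(rel)
        self._upstream[rel.target].append(rel)

    def direct_downstream(self, entity):
        return [r.target for r in self._downstream.get(entity, [])]

    def direct_upstream(self, entity):
        return [r.source for r in self._upstream.get(entity, [])]


# Illustrative entities: a raw table, a job that reads it, and the job's output.
raw_events = Entity("table", "raw.events")
sessionize = Entity("job", "sessionize")
daily_sessions = Entity("table", "mart.daily_sessions")

repo = LineageRepository()
repo.add(Relationship(raw_events, sessionize, "read_by"))
repo.add(Relationship(sessionize, daily_sessions, "writes"))
```

Because tables, jobs, and dashboards are all just entities, the same repository can answer discovery questions ("what feeds this table?") and impact questions ("what breaks if this job fails?") without a separate model per use case.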
Slack also believes in the benefits of data lineage and is making similar investments. According to Slack, as datasets become more complex and the number of contributors grows, it becomes increasingly difficult to understand the relationships between different data sources.
To make the lineage data easily available, the Slack team created a flat version of the hierarchy table and stored it in Hive. The flattened table lets users query the lineage data directly in the data warehouse and makes it easier to write and run queries for common use cases.
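The flattening step can be sketched as follows. Slack’s actual table schema is not described here, so this is only an assumption about what a flat lineage table might look like: one row per (ancestor, descendant, depth) pair, computed from parent-to-child edges. The function name `flatten_lineage` and the dataset names are hypothetical.

```python
from collections import deque


def flatten_lineage(edges):
    """Return sorted (ancestor, descendant, depth) rows for every reachable pair."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
        children.setdefault(child, set())

    rows = set()
    for root in children:
        # Breadth-first search records each descendant at its shortest depth.
        seen = {root}
        queue = deque([(root, 0)])
        while queue:
            node, depth = queue.popleft()
            for child in children[node]:
                if child not in seen:
                    seen.add(child)
                    rows.add((root, child, depth + 1))
                    queue.append((child, depth + 1))
    return sorted(rows)


# Illustrative parent -> child edges between datasets.
edges = [
    ("raw.events", "stg.sessions"),
    ("stg.sessions", "mart.daily_active"),
    ("raw.users", "mart.daily_active"),
]
flat_rows = flatten_lineage(edges)

# With a flat table, "everything downstream of X" is a simple filter.
downstream_of_events = [d for a, d, _ in flat_rows if a == "raw.events"]
```

The trade-off is storage for simplicity: the flat table holds more rows than the original hierarchy, but common questions no longer need recursive queries.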
Slack has also been working on a notification system built on data lineage. The team built a notification tool in its internal Data Portal so that lineage information can be used to notify downstream consumers. The tool includes a notification button that lets the dataset owner retrieve this information.
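As a rough illustration of how lineage can drive notifications, the sketch below walks the lineage graph from a changed dataset and groups the affected datasets by owner. This is not Slack’s tool: the function names, the `owners` mapping, and the message format are assumptions made for the example.

```python
def downstream_datasets(edges, changed):
    """All datasets reachable from `changed` via producer -> consumer edges."""
    consumers = {}
    for producer, consumer in edges:
        consumers.setdefault(producer, []).append(consumer)

    found, stack, seen = [], [changed], {changed}
    while stack:
        node = stack.pop()
        for nxt in consumers.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                stack.append(nxt)
    return found


def build_notifications(edges, owners, changed, message):
    """Group affected datasets by owner so each owner gets one notification."""
    per_owner = {}
    for dataset in downstream_datasets(edges, changed):
        owner = owners.get(dataset)
        if owner is not None:  # skip datasets with no registered owner
            per_owner.setdefault(owner, []).append(dataset)
    return {
        owner: f"{message} Affected datasets: {', '.join(sorted(names))}"
        for owner, names in per_owner.items()
    }


# Illustrative lineage edges and a hypothetical dataset -> owner mapping.
lineage_edges = [
    ("raw.events", "stg.sessions"),
    ("stg.sessions", "mart.daily_active"),
    ("stg.sessions", "mart.retention"),
]
owners = {
    "stg.sessions": "team-growth",
    "mart.daily_active": "team-analytics",
    "mart.retention": "team-analytics",
}
notes = build_notifications(
    lineage_edges, owners, "raw.events", "Schema change in raw.events."
)
```

In a real system the returned messages would be handed to a chat or email integration; here they are just strings keyed by owner.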
Postman also used data lineage to fill in a missing layer in its data stack. Early on, Postman’s data system was very simple: a set of data tables, with the knowledge about those tables living in the heads of the data team’s members. This worked while the company and its data were small, but it struggled to keep up once the company started growing exponentially.
Postman currently has hundreds of team members spread across four continents and over 17 million users across 500,000 companies using its API platform.
Postman co-founder and CTO Ankit Sobti wanted to ensure data was democratized. He said it was a daunting task for the data engineering team to be available to extract insights from data at any time of the day. He believed that everyone in the company should be able to access data and gain insights. This became especially difficult in 2020, when the COVID pandemic moved Postman completely online.
The data team decided to take on Postman’s data system as a project to address this issue. Their primary goal was to make Postman’s data easier to access and understand, both for new hires within the data team and for people throughout the company, with the help of data lineage.
They use data lineage to understand where data comes from and how it is connected to other layers. Data lineage helped the team understand data connectivity and the day-to-day bugs and errors that occur in its systems, and it helped them resolve issues faster. The team can often solve problems simply by looking at the data lineage. They also plan to take further steps with data lineage to make data management more accessible and faster.
When data lineage is less useful for some organizations
Data lineage has proven to be a good solution for most organizations dealing with data and data management. Still, in some cases it has proven less useful.
Some organizations store large amounts of data and use many data sources and storage systems. Data lineage can prove less useful here, because such organizations need the most reliable information about that data.
Data lineage provides information about the data source and the entire lifecycle of the data. A design lineage can give you an idea of where your data originates and how it is consumed. It is useful for architects who need to understand how data flows are implemented. However, business subject matter experts who wish to audit data processing may find it complex to navigate.
Business lineage provides a simplified view of the design lineage for business users. A business lineage report may show only the critical systems, or it may exclude systems and job structures and show only the transformations.
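One way to picture the difference between the two views is as a graph simplification: the design lineage contains every job and staging system, while the business lineage keeps only the critical systems and connects them through the hidden steps. The sketch below shows this under stated assumptions; the function `business_view` and the system names are hypothetical and do not describe any particular lineage tool.

```python
def business_view(edges, critical):
    """Keep only critical systems, linking them through hidden intermediate steps."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    view = set()
    for start in critical:
        stack = list(successors.get(start, []))
        seen = set(stack)
        while stack:
            node = stack.pop()
            if node in critical:
                # Stop at the next critical system; do not look past it.
                if node != start:
                    view.add((start, node))
                continue
            for nxt in successors.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return sorted(view)


# Illustrative design lineage: systems plus the jobs and staging steps between them.
design_edges = [
    ("CRM", "extract_job"),
    ("extract_job", "staging_db"),
    ("staging_db", "transform_job"),
    ("transform_job", "Warehouse"),
    ("Warehouse", "report_job"),
    ("report_job", "Finance Dashboard"),
]
critical_systems = {"CRM", "Warehouse", "Finance Dashboard"}

simplified = business_view(design_edges, critical_systems)
```

In this example, six design-level edges collapse into two business-level edges, which is the kind of simplified report a business subject matter expert can audit more easily.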
In this way, data lineage is designed to display flows quickly and easily, not to search for individual items. Suppose your organization deals with large amounts of frequently changing data or disparate data sources. Lineage can show you a flow chart or lifecycle of your data, but it may not help you find the specific information you want in that data. In addition, the results are only reliable for small or rarely changing data. So for organizations dealing with a large range of data, it has proven to be less useful.
Conclusion
This article described case studies of data-driven companies implementing, using, and benefiting from data lineage and its applications. Data-driven companies such as Netflix, Slack, and Postman have applied this concept to their data platforms with positive results. Knowing how these companies approach data lineage will help you understand how large data companies use it, and it will help you answer questions asked in data engineering interviews.
A few important points from this article:
1. Most data-driven companies today use data lineage to improve data governance and processing.
2. Enterprises with many data sources can implement data lineage efficiently and quickly gain more information about the data being used.
3. Data lineage is less useful for companies that generate less data and for startups using lighter-weight databases.
Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.