In my last article, I explained what SCD means for business intelligence solutions. He also explained that in a Power BI implementation he was not supposed to handle SCD2, but could handle scenarios similar to SCD1. This post will show you how.
scenario
We have a retail company that sells our products. Every year, the company publishes a list of products, including list prices and store prices, in Excel format. The product list will be released on July 1st, when the fiscal year begins. You need to implement a Power BI solution with up-to-date product data to analyze your sales transactions. The following image shows the product list for 2013.
So every year I receive an Excel file similar to the image above. Files are stored on your SharePoint Online site.
Scenario description
As explained in the previous post, SCD1 always keeps current data by updating old data with new data. Therefore, the ETL process reads data from the source, identifies existing data in the destination table, inserts new rows into the destination, updates existing rows, and deletes deleted rows.
Here’s why this scenario is similar to SCD1: However, there is one exception.
- It doesn’t actually update the data in the Excel file, nor does it create an ETL process that reads the data from the Excel file, identifies the changes, and applies the changes to an intermediate Excel file.
- I need to read the data from the source Excel file, filter out the old data while keeping the latest data, and load the data into the data model.
As you can see, it takes a very different implementation approach, but the result is very similar, no rows are removed.
Implementation
Here’s what we should do to reach our goal:
- to get the data in Power Query Editor. SharePoint folder connector
- combine files
- we, Product No Column for identifying duplicate products
- we, Reporting date A column that identifies the latest date
- keep only the most recent rows
Get data from SharePoint Online folders
To retrieve data from multiple files stored in SharePoint Online, SharePoint folder connector. Do the following:
- login SharePoint Online Go to the site that holds the product list Excel file and copy the site URL from your browser.
- from get data In Power BI desktop, SharePoint folder connector
- click connection
- pasting Site URL Copy in step 1
- click all right
- click data conversion
- click the filter dropdown. folder path digit
- Find and select the Products List folder that hosts your Excel file.
- click all right
- the name of the query product
So far, I’ve connected to a SharePoint Online folder in Power Query Editor. The next step is to combine the Excel files.
Combining files
There are multiple options for joining binary files to tables from Power Query Editor. This post uses the easiest method.
- click combine files button from content digit
- choose Product list table
- click all right
The above process creates several queries grouped in separate folders as shown in the following image.
So far, you have successfully merged your Excel files. The next step is to keep only the latest data.
Keep your data up-to-date
The next few steps are to dig into the data and implement mechanisms to identify and retain the most recent data and load it into the data model.
- Looking at the result of the combined data, source.name You need to keep the column, so select it and click to remove it. delete column button from House tab
So far, we’ve connected to a SharePoint Online folder and merged the Excel files it contains. Let’s look at the data and see what we got.I sorted the data by Product No To better understand data changes. The following figure shows the result.
As the image above shows, there are multiple listings displaying multiple products. It’s exactly what we expected. The goal is to keep it based on the latest product data only. Reporting date. Therefore, it is necessary to obtain Product No maximum value of Reporting date. To achieve this, grouping Features of the Power Query Editor. Using Group By from her UI in the Power Query Editor, Table.Group()
Works in Power Query. Because the Group By process does not require data to be sorted, sorted rows step. Now let’s get the job done.
- choose Product No digit
- click grouping columns from Transform tab
- come in Reporting date for new column name
- select Max from surgery drop down
- choose Reporting date from digit drop down
- click all right
The following figure shows the result.
Now you have all the product numbers and the latest reporting date. The only remaining piece of the puzzle is grouped rows Step using the data from the previous step. for that reason, query merging function to perform Table.NestedJoin()
Works in Power Query.
- choose grouped rows steps from Application procedure list from query settings pain
- click query merging button from House tab
- Select the Product (Current) table from the dropdown. Note that we have selected the current query (product).
- At the top table, press the Ctrl button on your keyboard to Product No and the Reporting date columns are sequential
- Do the same for the table below. Note that the order in which you select the columns matters
- Please make sure that join kind teeth Left Outer (first to all, second to match)
- click all right
As mentioned above, in a merge operation Table.NestedJoin()
A function that accepts two tables (highlighted in the figure) yellow List of key columns to use in the join (highlighted in formula below) red the name of the new column in type table and the type of bond. In the above operation, grouped rows is the final transformation step. grouped rows A transformation step in itself. Here is the code generated by the Power Query Editor after steps 21-27.
Table.NestedJoin(#"Grouped Rows", {"ProductNumber", "Reporting Date"}, #"Grouped Rows", {"ProductNumber", "Reporting Date"}, "Grouped Rows", JoinKind.LeftOuter)
But that’s not what we want. grouped rows A transformation step in itself. You need to combine the results of grouped rows Transformation step with the result of Deleted column step. So the above formula should be changed to:
Table.NestedJoin(#"Grouped Rows", {"ProductNumber", "Reporting Date"}, #"Removed Columns", {"ProductNumber", "Reporting Date"}, "Grouped Rows", JoinKind.LeftOuter)
The following illustration shows the changes made to the formula and the result.
- click expansion button grouped rows digit
- deselect Product No When Reporting date Column to keep other columns selected
- uncheck Use original column name as prefix option
- click all right
All done! The following figure shows the final result.
You can now apply your changes and load the data into the data model. Using this technique, when a new Excel file (new product list) arrives in SharePoint Online, Power BI will perform the transformation steps above to ensure that the latest product data is always loaded into the data model. This is very similar to behavior. of SCD1.
Have you used this method before? Do you have better techniques for handling similar scenarios? You can share your thoughts in the comments section below.