Handling Missing Values in Time-series with SQL

by datatabloid_difmmk

This morning I read Madison Schott's article on LAST_VALUE, in which she highlights the usefulness of this little-known SQL function.

It inspired me to write a follow-up article on a specific use case that comes up often when working with time series data.

Let’s say you’re building a predictive maintenance model using sensor data.

After some wrangling, you have hourly data like this:

Example of preprocessed sensor data

At this point, you've already done some pretty significant data engineering to create these evenly spaced hourly observations. How to do that is the subject of another article. Note, however, that there are some gaps in the temperature measurements. This is where LAST_VALUE comes to the rescue.

Usually, values are missing because sensors only report changes in value. This reduces the amount of data the machine has to send, but it creates data issues that must be resolved.

Using this data directly to build a model will result in a loss of accuracy whenever values are missing, because no historical context is written in the row itself. To create the most accurate model possible, we need to add features such as:

  • Last temperature reading
  • Average temperature over the last 6 hours
  • Time since the temperature reading last rose or fell
  • Rate of temperature change over the last 12 hours
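As a rough illustration of what such features look like, here is a minimal Python sketch that computes three of them on a gap-free hourly series. The function names, the sample readings, and the short window sizes are assumptions made for the example, not part of the original pipeline.

```python
from statistics import mean

def trailing_mean(values, window):
    """Mean of the last `window` readings up to and including each row."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def hours_since_change(values):
    """Rows (hours) elapsed since the reading last differed from the previous row."""
    out, since = [], 0
    for i, v in enumerate(values):
        if i > 0 and v != values[i - 1]:
            since = 0      # the reading rose or fell on this row
        elif i > 0:
            since += 1     # unchanged for one more hour
        out.append(since)
    return out

def rate_of_change(values, window):
    """Change per hour across the trailing window; None until enough history exists."""
    return [
        None if i < window else (values[i] - values[i - window]) / window
        for i in range(len(values))
    ]

# Made-up hourly temperatures with no gaps (assumption for illustration)
temps = [84, 85, 85, 85, 86, 86, 88]

avg_6h = trailing_mean(temps, 6)
since_change = hours_since_change(temps)
slope_3h = rate_of_change(temps, 3)
```

Each helper assumes one row per hour with no missing values, which is why the gaps have to be dealt with first.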


An illustration of the types of features that are useful in predictive models

The first step is to replace missing values with the last known value. We do this first because it makes writing the other features much easier.

For example, if you leave them missing and try to compute a moving average, the average will not be computed correctly (missing values will be ignored and only non-missing values will be averaged).

Average temperature over the last 4 hours (with gaps)

(null + 85 + null + null) / 1 = 85

Average temperature over the last 4 hours (replaced)

(84 + 85 + 85 + 85) / 4 = 84.75
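The two calculations above can be reproduced with a small Python sketch. The helper name is hypothetical. It mimics how an SQL average skips nulls, so the gap version divides by 1 instead of 4.

```python
def mean_ignoring_nulls(values):
    """Average only the non-missing values, as an SQL AVG would."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

with_gaps = [None, 85, None, None]   # only one reading was reported
filled = [84, 85, 85, 85]            # gaps replaced by the last known value

avg_with_gaps = mean_ignoring_nulls(with_gaps)
avg_filled = mean_ignoring_nulls(filled)

print(avg_with_gaps)  # 85.0
print(avg_filled)     # 84.75
```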

In Python, you would start with a forward fill. However, doing this in SQL means that you can leverage the power of your data warehouse.

In SQL, use LAST_VALUE. See this article if you need a more detailed explanation.
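For comparison, a forward fill can be sketched in plain Python as follows. In pandas the usual tool is `ffill`; this version uses only the standard library. The function name, column names, and sample rows are assumptions made for illustration.

```python
def forward_fill(rows, key, time_col, value_col):
    """Replace missing values with the last known value, separately per key."""
    last_seen = {}
    out = []
    # Process each machine's rows in time order
    for row in sorted(rows, key=lambda r: (r[key], r[time_col])):
        if row[value_col] is not None:
            last_seen[row[key]] = row[value_col]
        # Rows before a machine's first reading stay None
        out.append({**row, value_col: last_seen.get(row[key])})
    return out

# Made-up readings for two machines (assumption for illustration)
rows = [
    {"machine_id": "A", "observed_at": "2022-01-01T00:00", "casing_temp_f": 84},
    {"machine_id": "A", "observed_at": "2022-01-01T01:00", "casing_temp_f": None},
    {"machine_id": "B", "observed_at": "2022-01-01T00:00", "casing_temp_f": None},
    {"machine_id": "A", "observed_at": "2022-01-01T02:00", "casing_temp_f": 85},
    {"machine_id": "B", "observed_at": "2022-01-01T01:00", "casing_temp_f": 70},
    {"machine_id": "A", "observed_at": "2022-01-01T03:00", "casing_temp_f": None},
]

filled = forward_fill(rows, "machine_id", "observed_at", "casing_temp_f")
```

Keeping a separate last-seen value per machine plays the same role as partitioning by machine in a window function: a reading from one machine never fills a gap on another.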

The syntax is:

SELECT
  MACHINE_ID,
  OBSERVATION_DATETIME,
  -- For each machine, carry the most recent non-null reading forward in time.
  -- The placement of IGNORE NULLS varies by SQL dialect.
  LAST_VALUE(CASING_TEMPERATURE_F IGNORE NULLS) OVER (
    PARTITION BY MACHINE_ID
    ORDER BY OBSERVATION_DATETIME
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS LATEST_CASING_TEMPERATURE_F,
  LAST_VALUE(BEARING_TEMPERATURE_F IGNORE NULLS) OVER (
    PARTITION BY MACHINE_ID
    ORDER BY OBSERVATION_DATETIME
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS LATEST_BEARING_TEMPERATURE_F,
  LAST_VALUE(FLYWHEEL_RPM IGNORE NULLS) OVER (
    PARTITION BY MACHINE_ID
    ORDER BY OBSERVATION_DATETIME
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS LATEST_FLYWHEEL_RPM
-- The source query is cut off here; SENSOR_DATA is a hypothetical table name.
FROM SENSOR_DATA;

Result of replacing missing values with LAST_VALUE

There you have it!

Hopefully I was able to shed some light on LAST_VALUE and its cousin, FIRST_VALUE, two lesser-known SQL window functions.

Josh Berry (@twitter) leads Customer Facing Data Science at Rasgo and has been a data and analytics professional since 2008. Josh worked at Comcast for 10 years, where he built the data science team and was the primary owner of the in-house Comcast feature store, one of the first feature stores on the market. Following Comcast, Josh was a key leader in building data science for customers at DataRobot. In his spare time, Josh does intricate analysis on interesting topics such as baseball, Formula 1 racing, and housing market forecasts.

Original. Reprinted with permission.
