swhh

Reputation: 87

Smoothing a time series, taking seasonality into account

My time series looks like the figure below, which shows outliers:

X-axis is date and Y-axis is Load

What is the best way to smooth the time series in Python pandas, taking seasonality into consideration?

The intended result is as below:

[image: intended smoothed result]

Time series sample:

dat_time             system_load
01-01-2013 00:00:00          200
01-01-2013 00:00:00          210
01-01-2013 00:00:00          230
...
31-12-2013 21:00:00          500
31-12-2013 22:00:00          220
31-12-2013 23:00:00          200
...
01-01-2020 00:00:00          300
01-01-2020 00:00:00          310
01-01-2020 00:00:00          730
...
31-12-2020 21:00:00          350
31-12-2020 22:00:00          270
31-12-2020 23:00:00          280

Upvotes: 1

Views: 714

Answers (1)

Frodnar

Reputation: 2252

First, what you're asking about is called "time-series anomaly detection," and it's a real-world problem with significant scientific and business applications. For a great overview of available libraries in modern programming languages (including Python), I recommend awesome-TS-anomaly-detection.

To answer your question truthfully, there isn't really a convenient, pandas-only way to accomplish this task. There are so many choices, assumptions, and tradeoffs involved, and so many possible approaches, that it doesn't make sense to try to bake a solution into a library as general as pandas.
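To give a feel for the range of approaches, here is one equally crude but slightly more robust pandas-only sketch (entirely my own illustration; the synthetic series, the 31-day window, and the threshold `k` are all assumptions). It deseasonalizes with a rolling median and flags points using the median absolute deviation (MAD) of the residual; both statistics are less distorted by the outliers themselves than a mean/std pair would be:

```python
import pandas as pd
import numpy as np

# Hypothetical series: a yearly sine wave plus noise and three injected spikes
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=365, freq="D")
season = np.sin(np.arange(365) * 2 * np.pi / 365)
s = pd.Series(season + rng.normal(0, 0.1, 365), index=idx)
s.iloc[[50, 180, 300]] += 5.0  # the spikes to remove

# Rolling median tracks the seasonal shape without chasing the spikes
med = s.rolling(31, center=True, min_periods=1).median()
resid = s - med

# Rolling MAD: a robust estimate of the residual's spread
mad = resid.abs().rolling(31, center=True, min_periods=1).median()

# Flag points whose residual exceeds k * MAD, then fill the gaps
k = 6
cleaned = s.mask(resid.abs() > k * mad)
cleaned = cleaned.interpolate(method="time")
```

Because the median and MAD largely ignore the spikes, the threshold can be tuned without the outliers inflating their own cutoff.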

Here is one VERY crude solution using only pandas and numpy functionality. From a statistics perspective, it is overly simplified, not robust, and makes a lot of assumptions about your data and the nature of the problem. However, it appears to generally answer your question.

  1. Generate synthetic data that looks kind of like yours:
import pandas as pd
import numpy as np

size = 500
num_outliers = 50

# Smooth seasonal signal: a sine wave offset well above zero
sin_x = pd.Series(np.sin(0.05 * np.arange(size) - np.pi / 2) + 25)

# Uniform noise on top of the signal
noise_s = pd.Series(np.random.uniform(-1.0, 1.0, size=size))

# Sprinkle outliers at random positions
outlier_x = np.random.randint(0, size - 2, num_outliers)
outlier_y = np.random.uniform(-3.0, 10.0, num_outliers)
outliers = pd.Series(outlier_y, index=outlier_x)

date_idx = pd.date_range(start='20170101', end='20200101', periods=size)
data_vals = (noise_s + sin_x).add(outliers, fill_value=0).values[:size]
data = pd.Series(data=data_vals, index=date_idx)

# Plot
ax1 = data.plot()
ax1.set_ylim(data.min() - 1, data.max() + 1)

[plot: noisy time series data with outliers]

  2. Since your seasonality looks yearly, find the rolling average with a window that is roughly a couple of months wide. For some applications, this may already be close to what you want. I took your question literally to mean "only get rid of outliers," so let's move past this and simply use the rolling average to filter the original data.

The window_param can be adjusted: larger values make the window narrower. A good balance is usually when the rolling-average plot is fairly smooth but the periodic nature is still clearly visible.

window_param = 12  # adjustable: larger values give a narrower window
window = int(np.round(size / window_param))
rolling_ave = data.rolling(window=window, center=True).mean().dropna()

# Plot
ax2 = rolling_ave.plot()
ax2.set_ylim(data.min() - 1, data.max() + 1)

[plot: rolling average of time series data]

  3. Find the difference between the data and the rolling average. Note that the very beginning and end of the time series did not get subtracted because there weren't enough surrounding points to fill the rolling average window.
# With fill_value=0, the ends (where no rolling average exists) keep their raw values
diff = data.subtract(rolling_ave, fill_value=0)

# Plot
ax3 = diff.plot()
ax3.set_ylim(data.min() - 1, data.max() + 1)

[plot: difference between time series data and rolling average]
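A side note on the dropped ends (my own observation, not part of the original approach): `rolling()` accepts a `min_periods` argument, and with `min_periods=1` the truncated windows at the ends still produce an estimate, so the ends would not need special-casing later. `.mean()` also folds the sum-and-divide into one call:

```python
import pandas as pd
import numpy as np

# Small stand-in for the `data` series built in step 1
idx = pd.date_range("2017-01-01", periods=100, freq="D")
data = pd.Series(np.sin(np.linspace(0, 4 * np.pi, 100)), index=idx)

window = 20
# center=True with min_periods=1: the ends average over however many
# points actually fall inside the window, so no rows come back as NaN
rolling_full = data.rolling(window=window, center=True, min_periods=1).mean()
```

The trade-off is that the end estimates average fewer points and are therefore noisier.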

  4. Remove outliers where the absolute value of diff is greater than std_param_1 times the standard deviation (excluding the beginning and end bits that didn't get subtracted in step 3). This is the part that is statistically cringe-worthy, but it's practical. std_param_1 is an adjustable parameter: larger values remove fewer points, smaller values remove more.
std_param_1 = 1.0  # adjustable
data_outliers_removed = data[diff.abs() < diff[rolling_ave.index].std() * std_param_1]

# Plot
ax4 = data_outliers_removed.plot()
ax4.set_ylim(data.min() - 1, data.max() + 1)

[plot: time series with outliers removed]

[Optional] 5. Put back the original beginning and end data, and remove outliers from these sections with a similar standard-deviation criterion. Once again, std_param_2 is adjustable; larger values remove fewer outliers.

std_param_2 = 2.0  # adjustable
data_begin = data[data.index < rolling_ave.index[0]]
data_end = data[data.index > rolling_ave.index[-1]]
data_capped = pd.concat([data_begin, data_outliers_removed, data_end])
data_w_ends = data_capped[(data_capped - data_capped.mean()).abs() < data_outliers_removed.std() * std_param_2]

# Plot
ax5 = data_w_ends.plot()
ax5.set_ylim(data.min() - 1, data.max() + 1)

[plot: time series data with ends put back]

[Optional] 6. Fill the holes left by the removed outliers using linear interpolation from the surrounding points. There are too many data points for the effect to be obvious in the plot. This step is useful if your application needs a regularly spaced index without NaN values.

data_holey = data_w_ends.reindex(data.index)
data_patched = data_holey.interpolate()

# Plot
ax6 = data_patched.plot()
ax6.set_ylim(data.min() - 1, data.max() + 1)

[plot: final time series data with gaps filled]
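One last refinement worth knowing (my addition, not part of the answer above): `interpolate()` with no arguments treats the points as equally spaced and ignores the datetime index. With `method='time'`, the fill is weighted by the actual time gaps, which matters when the surviving points around a hole are unevenly spaced:

```python
import pandas as pd
import numpy as np

# A hole with unevenly spaced neighbors: values known at day 1 and day 10
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-10"])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

linear = s.interpolate()                # positional: fills the midpoint, 5.0
by_time = s.interpolate(method="time")  # time-weighted: 1 of 9 days, ~1.11
```

After removing many adjacent outliers, `method='time'` keeps the fills anchored to the calendar rather than to row positions.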

Upvotes: 1
