Reputation: 87
My time series has the following figure showing outliers:
What the best way to smooth the time series in python pandas taking into consideration seasonality.
The intended result as below:
Time series sample:
dat_time system_load
01-01-2013 00:00:00 200
01-01-2013 00:00:00 210
01-01-2013 00:00:00 230
...
31-12-2013 21:00:00 500
31-12-2013 22:00:00 220
31-12-2013 23:00:00 200
...
01-01-2020 00:00:00 300
01-01-2020 00:00:00 310
01-01-2020 00:00:00 730
...
31-12-2020 21:00:00 350
31-12-2020 22:00:00 270
31-12-2020 23:00:00 280
Upvotes: 1
Views: 714
Reputation: 2252
First, what you're asking about is called "time-series anomaly detection," and it's a real-world problem with significant scientific and business applications. For a great overview of available libraries in modern programming languages (including Python), I recommend awesome-TS-anomaly-detection.
To answer your question truthfully, there isn't really a convenient, pandas
-only way to accomplish this task. There are so many choices and assumptions to be made with varying tradeoffs and so many approaches to a solution, that it doesn't make sense to try to simplify it for a library as general as pandas
.
Here is one VERY crude solution using only pandas
and numpy
functionality. From a statistics perspective, it is overly simplified, not robust, and makes a lot of assumptions about your data and the nature of the problem. However, it appears to generally answer your question.
import pandas as pd
import numpy as np
size = 500
num_outliers = 50
sin_x = pd.Series(np.sin(0.05*x - np.pi / 2) + 25 for x in range(size))
noise_s = pd.Series(np.random.uniform(-1.0, 1.0, size=size))
outlier_x = np.random.randint(0,size - 2, num_outliers)
outlier_y = np.random.uniform(-3.0, 10.0, num_outliers)
outliers = pd.Series(outlier_y, index=outlier_x)
date_idx = pd.date_range(start='20170101', end='20200101', periods=size)
data_vals = (noise_s + sin_x).add(outliers, fill_value=0).values[:size]
data = pd.Series(data=data_vals, index=date_idx)
# Plot
ax1 = data.plot()
ax1.set_ylim(data.min() - 1, data.max() + 1)
The window_param
can be adjusted: larger values make narrower windows. A good balance is usually when the plot rolling average plot is somewhat smooth but the periodic nature is still strongly visible.
window_param = 12 # adjustable
window = int(np.round(size / window_param))
rolling_ave = data.rolling(window=window, center=True).sum().dropna() / window
# Plot
ax2 = rolling_ave.plot()
ax2.set_ylim(data.min() - 1, data.max() + 1)
diff = data.subtract(rolling_ave, fill_value=0)
# Plot
ax3 = diff.plot()
ax3.set_ylim(data.min() - 1, data.max() + 1)
diff
is greater than the standard deviation (excluding the beginning and end bits that didn't get subtracted in step 3). This is the part that is statistically cringe-worthy, but it's practical. std_param
is an adjustable parameter. Larger values remove fewer values, and smaller values remove more values.std_param_1 = 1.0 # adjustable
data_outliers_removed = data[(diff.abs() < diff[rolling_ave.index].std() * std_param)]
# Plot
ax4 = data_outliers_removed.plot()
ax4.set_ylim(data.min() - 1, data.max() + 1)
[Optional] 5. Put back the original end data, and remove outliers from this section with a similar standard deviation criterion. Once again, std_param_2
is adjustable. Larger values remove fewer outliers.
std_param_2 = 2.0 # adjustable
data_begin = data[data.index < rolling_ave.index[0]]
data_end = data[data.index > rolling_ave.index[-1]]
data_capped = pd.concat([data_begin, data_outliers_removed, data_end])
data_w_ends = data_capped[(data_capped - data_capped.mean()).abs() < data_outliers_removed.std() * std_param_2]
# Plot
ax5 = data_w_ends.plot()
ax5.set_ylim(data.min() - 1, data.max() + 1)
[Optional] 6. Fill the holes created by eliminating outlier data by performing linear interpolation from surrounding points. There are too many data points for this to be obvious in the plot. This step can be useful if your application needs data with a regularly spaced index and without NaN
values.
data_holey = data_w_ends.reindex(data.index)
data_patched = data_holey.interpolate()
# Plot
ax6 = data_patched.plot()
ax6.set_ylim(data.min() - 1, data.max() + 1)
Upvotes: 1