rocksNwaves

Reputation: 6154

How to choose the correct arguments of statsmodels STL function?

I've been reading about time-series decomposition, and have a fairly good idea of how it works on simple examples, but am having trouble extending the concepts.

For example, some simple synthetic data I'm playing with:

[Image: plot of the synthetic series]

So there is no actual time associated with this data. It could be sampled every second or every year. Whatever the sampling frequency, the period is roughly 160 time steps, and using this as the period argument yields the expected results:

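import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL

# synth is the DataFrame holding the synthetic series in its "value" column
# seasonal=13 based on example in the statsmodels user guide
decomp = STL(synth.value, period=160, seasonal=13).fit()

# Plot each component in its own panel
fig, ax = plt.subplots(3, 1, figsize=(12, 6))
decomp.trend.plot(title='Trend', ax=ax[0])
decomp.seasonal.plot(title='Seasonal', ax=ax[1])
decomp.resid.plot(title='Residual', ax=ax[2])
plt.tight_layout()
plt.show()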

[Image: trend, seasonal, and residual panels of the STL decomposition]

But looking at other datasets, it's not really that easy to see the period of the seasonality, so it leads me to a couple of questions:

How do you find the correct arguments for real-world, messy data, particularly period but also the others? Is it just a parameter search that you perform until the decomposition looks sane?

Parameters

endog : array_like Data to be decomposed. Must be squeezable to 1-d.

period : Periodicity of the sequence. If None and endog is a pandas Series or DataFrame, attempts to determine from endog. If endog is a ndarray, period must be provided.

seasonal : Length of the seasonal smoother. Must be an odd integer, and should normally be >= 7 (default).

trend : Length of the trend smoother. Must be an odd integer. If not provided uses the smallest odd integer greater than 1.5 * period / (1 - 1.5 / seasonal), following the suggestion in the original implementation.
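To make the default for trend concrete, here is a minimal sketch that evaluates the formula quoted above for period=160 and seasonal=13. The helper name default_trend_length is hypothetical; it follows the wording of the docstring rather than the statsmodels source, so treat it as an illustration only.

```python
import math


def default_trend_length(period: int, seasonal: int = 7) -> int:
    """Smallest odd integer greater than 1.5 * period / (1 - 1.5 / seasonal).

    Hypothetical helper that mirrors the documented formula; it is not
    part of statsmodels.
    """
    bound = 1.5 * period / (1 - 1.5 / seasonal)
    candidate = math.floor(bound) + 1  # smallest integer strictly above the bound
    if candidate % 2 == 0:             # bump even values to the next odd integer
        candidate += 1
    return candidate


# Arguments used in the question: period=160, seasonal=13
print(default_trend_length(160, 13))
# Monthly data with the default seasonal smoother
print(default_trend_length(12))
```

For the values in the question, the bound is 240 / 0.8846, roughly 271.3, so the trend smoother would span 273 observations if trend is left unset.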

Upvotes: 7

Views: 5843

Answers (1)

Bouke

Reputation: 185

I had the same question. After tracing some of their codebase, I have found the following. This may help:

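  • For automatic period detection, statsmodels expects a pandas Series or DataFrame with a DatetimeIndex. A plain ndarray works too, but then period must be passed explicitly.
  • This DatetimeIndex can have a frequency. You can either resample your data with pandas or explicitly set a frequency on the index. To check, inspect df.index and look for the freq attribute.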
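As a rough sketch of that check, the snippet below builds a small frame whose index has no frequency attached and then sets one with asfreq. The dates, the "value" column, and the daily frequency are made up for illustration.

```python
import pandas as pd

# An index built from a list of dates carries no frequency
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)
print(df.index.freq)

# asfreq conforms the data to a regular daily grid and attaches the frequency
df_daily = df.asfreq("D")
print(df_daily.index.freq)
```

Note that asfreq inserts NaN rows for any dates missing from the grid, so gaps in real data need to be filled before decomposing.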

This leads to two situations:

Your index has frequency set

If you have set a frequency on your index, statsmodels will inherit it and automatically use it to determine a period. Internally it uses the freq_to_period function, defined in the tsatools submodule.

To summarise what this does: the period is the number of observations that make up one seasonal cycle. For the frequencies listed below, statsmodels assumes the cycle is one year, so the period is the number of observations per year.

In other words: "how many observations of your series fall within one year". For reference, read the note on the freq_to_period function definition: annual maps to 1, quarterly maps to 4, monthly to 12, weekly to 52.
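You can call freq_to_period yourself to see which period a given frequency maps to. This is a small sketch; the frequency aliases below ("QS", "MS", "W", "D") are my own picks, chosen because they parse the same way across pandas versions.

```python
from statsmodels.tsa.tsatools import freq_to_period

# Map a few pandas frequency aliases to the period statsmodels would use
periods = {alias: freq_to_period(alias) for alias in ("QS", "MS", "W", "D")}

for alias, period in periods.items():
    print(f"{alias}: period={period}")
```

The daily alias is worth a look: it maps to 7, so for daily data the assumed cycle is a week rather than a year.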

This is done both for seasonal_decompose and for STL.

Your index has no frequency set

It gets a bit more complicated if your index does not have its freq attribute set. seasonal_decompose then checks whether the index has an inferred_freq attribute, and STL takes the same approach.

This inferred_freq is computed by the pandas function infer_freq, which is documented as: "Infer the most likely frequency given the input index." Pandas exposes it automatically as index.inferred_freq on a DatetimeIndex, provided the index has at least 3 elements.
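A quick way to see what pandas infers is to compare a regularly spaced index with an irregular one. The dates here are invented for the example.

```python
import pandas as pd

# Evenly spaced daily timestamps: no freq is set, but one can be inferred
regular = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-03"])
print(regular.freq, regular.inferred_freq)

# Uneven gaps (1, 3, and 6 days): pandas finds no frequency to infer
irregular = pd.DatetimeIndex(
    ["2021-01-01", "2021-01-02", "2021-01-05", "2021-01-11"]
)
print(irregular.inferred_freq)
```

If inferred_freq comes back as None, there is nothing for statsmodels to convert, and passing period explicitly is the safer route.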

TLDR: The period parameter should be set to the number of observations in one seasonal cycle; for dated data with a yearly cycle, that is the number of observations per year (12 for monthly data). You can set this explicitly, or otherwise statsmodels will infer it from the freq attribute of your DatetimeIndex. If the freq attribute is None, it falls back on pandas' index.inferred_freq attribute to determine the frequency, and then converts this to a preset periodicity.

Upvotes: 9
