What's the best way to fill missing data in a time series using Python?

This is my first time working on a case study in Python. The data is a continuous DataFrame: a time series of properties covering the period 2006-2016.

However, values are missing for 2015-16 in columns A, B, C, and D, and for 2006-07 in columns E and F. I am trying to impute these values to fill in the data.

I have tried MICE and interpolation, but I am not sure whether either is appropriate. Which method should I apply, and how do I apply it in Python? I have gone through these links:

https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
https://www.researchgate.net/post/What_is_a_reliable_method_of_dealing_with_missing_data_in_time_series_records

Should I be using a forecasting method instead of imputation to fill in the data?

Please help.

Upvotes: 0

Answers (2)

LittleHealth

Reputation: 122

In fact, there isn't always one best way to fill missing values. Here are some methods used in Python to fill values in a time series: missing-values-in-time-series-in-python

Filling missing values, a.k.a. imputation, is a well-studied topic in computer science and statistics.

Previously, data was often imputed with mean values regardless of data type. A big problem with mean imputation (or any constant-value imputation) is that it introduces abrupt jumps into the time series.
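To make the jump concrete, here is a minimal sketch on a made-up yearly series (the values and the 2006-2016 index are assumptions chosen to resemble the question, not real data). It fills a trailing gap with the mean of the observed values and shows the resulting discontinuity.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly series with a steady upward trend, 2006-2016.
years = pd.Index(range(2006, 2017), name="year")
s = pd.Series(np.arange(100.0, 210.0, 10.0), index=years)  # 100, 110, ..., 200
s.loc[2015:2016] = np.nan  # pretend the last two years are missing

# Constant imputation: every gap gets the mean of the observed values.
mean_filled = s.fillna(s.mean())

# The observed values run from 100 to 180, so their mean is 140.
# The filled series therefore falls from 180 (2014) to 140 (2015).
print(mean_filled.loc[2014:2016])
```

On trending data, the filled values sit far from their neighbours, which is the kind of artificial break described above.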

Later, autoregressive (AR) and moving average (MA) models, which are used for modeling time series, were applied to imputation. These methods have a strong theoretical basis (see STAT510) and are used to forecast and impute time series.

Matrix factorization (MF) is another important family of methods, including TRMF, SVD, and PCA. A recent benchmark of MF imputation was published in PVLDB: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series.
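As a rough illustration of the matrix factorization idea, the sketch below fills gaps by repeatedly replacing the missing entries with a truncated-SVD (low-rank) approximation. This is a simple iterative SVD scheme, not TRMF and not the setup used in the benchmark; the function name svd_impute, its parameters, and the toy matrix are all assumptions for illustration. It also assumes every column has at least one observed value.

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=200, tol=1e-8):
    """Fill NaNs in X by iterating a rank-`rank` SVD approximation.

    Hypothetical helper for illustration; observed entries are never changed.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from the column means of the observed entries.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Only the missing cells are overwritten with the low-rank estimate.
        updated = np.where(missing, low_rank, X)
        converged = np.max(np.abs(updated - filled)) < tol
        filled = updated
        if converged:
            break
    return filled

# Toy data: rows are time steps, columns are three perfectly correlated series.
t = np.arange(1.0, 7.0)
X = np.outer(t, [1.0, 2.0, 3.0])
X[5, 0] = np.nan  # gap at the end of the first series
X[0, 2] = np.nan  # gap at the start of the third series

completed = svd_impute(X, rank=1)
print(completed)
```

The rank is a modelling choice: a small rank assumes the columns share a few common patterns, which is the same cross-column correlation assumption that other multivariate methods rely on.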

In addition, other machine learning and deep learning methods have been proposed recently. There is a survey of imputation methods for time series, Time Series Data Imputation: A Survey on Deep Learning Approaches, which may help you a lot. However, the methods mentioned in this survey are not accurate enough.

Back to your question: MICE is just a framework in which you can use any regression algorithm. It assumes that the different columns (A, B, C, D and E, F) are correlated.
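One way to try a MICE-style approach in Python is scikit-learn's IterativeImputer, which is an assumption here since the question does not say which library was used. The sketch below builds a made-up frame with two correlated columns (A missing at the end, E missing at the start, mirroring the question) and imputes both. Note that it produces a single imputation, whereas MICE proper draws several.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: two noisy columns driven by the same underlying trend.
years = pd.Index(range(2006, 2017), name="year")
rng = np.random.default_rng(0)
base = np.linspace(100.0, 200.0, len(years))
df = pd.DataFrame(
    {
        "A": base + rng.normal(0, 2, len(years)),
        "E": 0.5 * base + rng.normal(0, 2, len(years)),
    },
    index=years,
)
df.loc[2015:2016, "A"] = np.nan  # missing at the end
df.loc[2006:2007, "E"] = np.nan  # missing at the start

# Each column with gaps is regressed on the other columns, round-robin.
# The default regressor is BayesianRidge; any regressor can be passed
# through the `estimator` argument.
imputer = IterativeImputer(max_iter=20, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(df), index=df.index, columns=df.columns
)
print(imputed.round(1))
```

The imputer only sees relationships between columns, not the time order of the rows. If the trend matters, one option is to add the year as an extra feature column before imputing.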

Forecasting and imputation are the same by nature. You can think of forecasting as a special case of imputation in which there is no succeeding data.

You should try several imputation methods to find the best one for your data.
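When the gap sits at the end of the series, as with 2015-16 in the question, filling it is effectively a forecast. A minimal sketch under strong assumptions (a made-up, perfectly linear series and a straight-line trend as the "model") might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical series that follows an exact linear trend, 2006-2016.
years = np.arange(2006, 2017)
s = pd.Series(50.0 + 3.0 * (years - 2006), index=years)
s.loc[2015:2016] = np.nan  # the trailing gap we want to fill

# Fit a straight line to the observed part only.
# Years are shifted to start at 0 to keep the fit well conditioned.
observed = s.dropna()
t_obs = (observed.index - s.index.min()).to_numpy(dtype=float)
slope, intercept = np.polyfit(t_obs, observed.to_numpy(), deg=1)

# Evaluate the fitted trend on every year and use it only where s is missing.
t_all = (s.index - s.index.min()).to_numpy(dtype=float)
trend = pd.Series(slope * t_all + intercept, index=s.index)
filled = s.fillna(trend)
print(filled.loc[2013:2016])
```

A real series would call for a proper forecasting model (for example the AR and MA models mentioned earlier) and some check of how well it extrapolates. The same idea run backwards covers a gap at the start of a series.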

Upvotes: 2

Nikaido

Reputation: 4629

In your case, if you fill the empty cells with estimated values, the results of your analysis will be very skewed, because you have a very limited sample size.

If you have more data (e.g. more years), you can try different methods to fill the empty values in your dataset (interpolation, mean, etc.). There are pros and cons for every method. It depends on what you need to do with this time series.

If that is all the data you have, it would make sense to use only the period in which every column has data. But, again, having so few rows will lead to results that are not very interesting.

Anyway, pandas offers many utilities for handling this problem.
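Restricting the analysis to the fully observed period can be done with dropna. The frame below is made up to mirror the question (column A missing in 2015-16, column E missing in 2006-07); the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by year, 2006-2016.
years = pd.Index(range(2006, 2017), name="year")
df = pd.DataFrame({"A": np.arange(11.0), "E": np.arange(11.0) * 2}, index=years)
df.loc[2015:2016, "A"] = np.nan  # missing at the end
df.loc[2006:2007, "E"] = np.nan  # missing at the start

# Keep only the rows in which every column has a value.
complete = df.dropna()
print(complete.index.min(), complete.index.max(), len(complete))
```

Here only 2008-2014 survives, which shows how quickly an already short series shrinks.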

For example, the DataFrame method ffill (the older spelling fillna(method='ffill') is deprecated in recent pandas versions):

# df is your DataFrame
df = df.ffill()

This propagates the last valid observation forward until the next valid one.

Or the interpolate method:

df.interpolate(method='linear', limit_direction='forward')
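Both calls behave differently at the edges of a series, which matters here because the gaps in the question are at the start and the end. The sketch below uses a made-up frame (column names and values are assumptions) to show what each call fills.

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by year, 2006-2016.
years = pd.Index(range(2006, 2017), name="year")
df = pd.DataFrame({"A": np.arange(11.0), "E": np.arange(11.0) * 2}, index=years)
df.loc[2015:2016, "A"] = np.nan  # gap at the end
df.loc[2006:2007, "E"] = np.nan  # gap at the start

# Forward fill repeats the 2014 value of A, but cannot fill the start of E
# because there is no earlier observation to propagate.
forward = df.ffill()

# With limit_direction='both', linear interpolation also covers the leading
# gap; at either edge it simply holds the nearest observed value.
both = df.interpolate(method="linear", limit_direction="both")

print(forward.isna().sum())
print(both.loc[[2006, 2007, 2015, 2016]])
```

Neither call extrapolates a trend beyond the observed range: edge gaps end up as flat lines, so for gaps like these the result is closer to "repeat the nearest value" than to a real estimate.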

But there is no perfect answer to your question. You need to reason about your data and make a decision based on the context.

Upvotes: 2
