user308827

Reputation: 21981

fill dataframe with NaN when multiple days data is missing

I have a pandas dataframe which I interpolate to get a daily dataframe. The original dataframe looks like this:

               col_1      vals 
2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-12  0.003750  0.117274 
2017-10-14  0.000000  0.161556 
2017-10-17  0.000000  0.116264   

In the interpolated dataframe, I want to change data values to NaN where the gap in dates exceeds 5 days. E.g. in the dataframe above, the gap between 2017-10-02 and 2017-10-12 exceeds 5 days, so in the interpolated dataframe all values between these two dates should be removed. I am not sure how to do this; maybe combine_first?

--EDIT: The interpolated dataframe looks like this:

            col_1      vals 
2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-03  0.015804  0.113309 
2017-10-04  0.014464  0.113750 
2017-10-05  0.013125  0.114190 
2017-10-06  0.011786  0.114631 
2017-10-07  0.010446  0.115071 
2017-10-08  0.009107  0.115512 
2017-10-09  0.007768  0.115953 
2017-10-10  0.006429  0.116393 
2017-10-11  0.005089  0.116834 
2017-10-12  0.003750  0.117274 
2017-10-13  0.001875  0.139415 
2017-10-14  0.000000  0.161556 
2017-10-15  0.000000  0.146459 
2017-10-16  0.000000  0.131361 
2017-10-17  0.000000  0.116264

Expected output:

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
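
For reference, the original dataframe can be reconstructed like so (a minimal sketch; the index is assumed to be a DatetimeIndex):

import pandas as pd

df = pd.DataFrame(
    {'col_1': [0.000000, 0.017143, 0.003750, 0.000000, 0.000000],
     'vals':  [0.112869, 0.112869, 0.117274, 0.161556, 0.116264]},
    index=pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                          '2017-10-14', '2017-10-17']))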

Upvotes: 3

Views: 1138

Answers (4)

hussam

Reputation: 859

I added a few more rows to your example in order to have two blocks with a gap of more than 5 days between rows.
I saved the two tables locally as .csv files and added Date as the first column name so the merge below works:

Setup

import pandas as pd
import numpy as np
df_1=pd.read_csv('df_1.csv', delimiter=r"\s+")
df_2=pd.read_csv('df_2.csv', delimiter=r"\s+")

Merge (join) the two datasets and rename the columns; notice the two groups with more than a 5-day gap:

df=df_2.merge(df_1, how='left', on='Date').reset_index(drop=True)
df.columns=['date','col','val','col_na','val_na']    #purely aesthetic

df

    date        col         val         col_na      val_na
0   2017-10-01  0.000000    0.112869    0.000000    0.112869
1   2017-10-02  0.017143    0.112869    0.017143    0.112869
2   2017-10-03  0.015804    0.113309    NaN         NaN
3   2017-10-04  0.014464    0.113750    NaN         NaN
4   2017-10-05  0.013125    0.114190    NaN         NaN
5   2017-10-06  0.011786    0.114631    NaN         NaN
6   2017-10-07  0.010446    0.115071    NaN         NaN
7   2017-10-08  0.009107    0.115512    NaN         NaN
8   2017-10-09  0.007768    0.115953    NaN         NaN
9   2017-10-10  0.006429    0.116393    NaN         NaN
10  2017-10-11  0.005089    0.116834    NaN         NaN
11  2017-10-12  0.003750    0.117274    0.003750    0.117274
12  2017-10-13  0.001875    0.139415    NaN         NaN
13  2017-10-14  0.000000    0.161556    0.000000    0.161556
14  2017-10-15  0.000000    0.146459    NaN         NaN
15  2017-10-16  0.000000    0.131361    NaN         NaN
16  2017-10-17  0.000000    0.989999    0.000000    0.116264
17  2017-10-18  0.000000    0.412311    NaN         NaN
18  2017-10-19  0.000000    0.166264    NaN         NaN
19  2017-10-20  0.000000    0.123464    NaN         NaN
20  2017-10-21  0.000000    0.149767    NaN         NaN
21  2017-10-22  0.000000    0.376455    NaN         NaN
22  2017-10-23  0.000000    0.000215    NaN         NaN
23  2017-10-24  0.000000    0.940219    NaN         NaN
24  2017-10-25  0.000000    0.030352    0.000000    0.030352
25  2017-10-26  0.000000    0.111112    NaN         NaN
26  2017-10-27  0.000000    0.002500    NaN         NaN

Method to carry out the task

def my_func(my_df):
    non_na_index = []                                     # empty list for indexes of non-NaN rows
    for i in range(len(my_df)):
        if not pd.isnull(my_df.iloc[i, 3]):
            non_na_index.append(i)                        # add indexes of rows that have a non-NaN value
    sub = np.roll(non_na_index, shift=-1) - non_na_index  # subtract consecutive indexes to find the NaN row counts
    sub = sub[:-1]                                        # drop the last element (wrap-around artifact of np.roll)
    for i in reversed(range(len(sub))):
        if sub[i] >= 5:                                   # identify gaps of 5 or more rows between observations
            b = non_na_index[i + 1]                       # assign end index
            a = non_na_index[i] + 1                       # assign start index
            my_df = my_df.drop(my_df.index[a:b])          # drop the rows within the range
    return my_df

Execute the function using df and drop the two helper columns:

new_df = my_func(df)
new_df = new_df.drop(['col_na', 'val_na'], axis=1)    # drop the two extra columns
new_df

    date        col         val
0   2017-10-01  0.000000    0.112869
1   2017-10-02  0.017143    0.112869
11  2017-10-12  0.003750    0.117274
12  2017-10-13  0.001875    0.139415
13  2017-10-14  0.000000    0.161556
14  2017-10-15  0.000000    0.146459
15  2017-10-16  0.000000    0.131361
16  2017-10-17  0.000000    0.989999
24  2017-10-25  0.000000    0.030352
25  2017-10-26  0.000000    0.111112
26  2017-10-27  0.000000    0.002500

Upvotes: 1

piRSquared

Reputation: 294338

I'd first identify where the gaps exceed 5 days. From there, I'd generate an array that identifies the groups between such gaps. Finally, I'd use groupby to resample each group to daily frequency and interpolate.

import numpy as np

# convenience: assign string to variable for easier access
daytype = 'timedelta64[D]'

# define five days for use when evaluating size of gaps
five = np.array(5, dtype=daytype)

# get the size of gaps
deltas = np.diff(df.index.values).astype(daytype)

# identify groups between gaps
groups = np.append(False, deltas > five).cumsum()

# handy function to turn to daily frequency and interpolate
to_daily = lambda x: x.asfreq('D').interpolate()

# and finally...
df.groupby(groups, group_keys=False).apply(to_daily)

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
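
For intuition, here is how the group labels work out on the sample index (the values in the comments are what the expressions evaluate to):

deltas = np.diff(df.index.values).astype(daytype)
# deltas        -> [1, 10, 2, 3] days between consecutive rows
# deltas > five -> [False, True, False, False]
groups = np.append(False, deltas > five).cumsum()
# groups        -> [0, 0, 1, 1, 1]
# 2017-10-01/02 form group 0; 2017-10-12/14/17 form group 1, so
# asfreq('D') only creates daily rows within each group and the
# 2017-10-03 .. 2017-10-11 stretch is never generated.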

In the event you want to provide your own interpolation method, you can modify the above like this:

daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(df.index.values).astype(daytype)
groups = np.append(False, deltas > five).cumsum()

# custom interpolation function that takes a dataframe
def my_interpolate(df):
    """This can be whatever you want.
    I just provided what will result
    in the same thing as before."""
    return df.interpolate()

to_daily = lambda x: x.asfreq('D').pipe(my_interpolate)

df.groupby(groups, group_keys=False).apply(to_daily)

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
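
As an example of a custom choice (not part of the original answer), my_interpolate could weight by the actual timestamps instead of row position, using pandas' built-in method='time':

def my_interpolate(df):
    # interpolate in proportion to the time elapsed between observations
    return df.interpolate(method='time')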

Upvotes: 10

Ken Wei

Reputation: 3130

Is this what you want?

data0 = """2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-12  0.003750  0.117274 
2017-10-14  0.000000  0.161556 
2017-10-17  0.000000  0.116264"""
data = [row.split('  ') for row in data0.split('\n')]

df = pd.DataFrame(data, columns = ['date','col_1','vals'])
df.date = pd.to_datetime(df.date)
last_observation = df.assign(last_observation = df.date.diff().dt.days)
df.set_index(['date'], inplace = True)

all_dates = pd.date_range(start = last_observation.date.min(), 
                          end = last_observation.date.max())
df_interpolated = df.reindex(all_dates).astype(np.float64).interpolate()
df_interpolated = df_interpolated.join(last_observation.set_index('date').last_observation)
df_interpolated['discard'] = (df_interpolated.last_observation.bfill() > 5) & df_interpolated.last_observation.isnull()
df_interpolated[['col_1','vals']] = df_interpolated[['col_1','vals']].where(~df_interpolated.discard)

The output is:

               col_1      vals  last_observation  discard
2017-10-01  0.000000  0.112869               NaN    False
2017-10-02  0.017143  0.112869               1.0    False
2017-10-03       NaN       NaN               NaN     True
2017-10-04       NaN       NaN               NaN     True
2017-10-05       NaN       NaN               NaN     True
2017-10-06       NaN       NaN               NaN     True
2017-10-07       NaN       NaN               NaN     True
2017-10-08       NaN       NaN               NaN     True
2017-10-09       NaN       NaN               NaN     True
2017-10-10       NaN       NaN               NaN     True
2017-10-11       NaN       NaN               NaN     True
2017-10-12  0.003750  0.117274              10.0    False
2017-10-13  0.001875  0.139415               NaN    False
2017-10-14  0.000000  0.161556               2.0    False
2017-10-15  0.000000  0.146459               NaN    False
2017-10-16  0.000000  0.131361               NaN    False
2017-10-17  0.000000  0.116264               3.0    False

The idea is that you generate the interpolation first (as you did), then decide which rows to drop. Start by assigning the number of days between each observation and the previous one. Since you want to discard the interpolated entries inside any gap where this number exceeds 5, use .bfill to propagate the gap size back onto the interpolated rows before comparing to 5. Notice, however, that the backfilled value also lands on the observation at the end of the gap, which you don't want to discard. So you need the extra condition that only interpolated rows (those where last_observation is null, checked with .isnull()) can be discarded.

Finally, use the .where method to keep entries that do not meet the discard criterion; by default, the others are replaced by NAs.
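
If, like the expected output in the question, you want the flagged rows removed entirely rather than set to NAs, you could finish with a selection like this (a sketch, reusing df_interpolated from above):

result = df_interpolated.loc[~df_interpolated.discard, ['col_1', 'vals']]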

Upvotes: 1

akilat90

Reputation: 5696

If I understood correctly, you can remove the unnecessary rows by boolean indexing. Assuming that you have the difference in days in a column called diff, you can use df.loc[df['diff'].dt.days < 5]


Here's a demo

import pandas as pd

df = pd.read_clipboard()

               col_1    vals
2017-10-01  0.000000    0.112869
2017-10-02  0.017143    0.112869
2017-10-12  0.003750    0.117274
2017-10-14  0.000000    0.161556
2017-10-17  0.000000    0.116264

Converting the index to datetime and obtaining a new column for the difference from the previous row in days

df = df.reset_index()
df['index']=pd.to_datetime(df['index'])
df['diff'] = df['index'] - df['index'].shift(1)


       index    col_1       vals       diff
0   2017-10-01  0.000000    0.112869    NaT
1   2017-10-02  0.017143    0.112869    1 days
2   2017-10-12  0.003750    0.117274    10 days
3   2017-10-14  0.000000    0.161556    2 days
4   2017-10-17  0.000000    0.116264    3 days

Adding a boolean filter

new_df = df.loc[df['diff'].dt.days < 5]
new_df = new_df.drop('diff', axis=1)
new_df.set_index('index', inplace=True)
new_df

               col_1    vals
index       
2017-10-02  0.017143    0.112869
2017-10-14  0.000000    0.161556
2017-10-17  0.000000    0.116264
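
Note that the first row drops out as well because its diff is NaT; if you want to keep it, one option (a sketch) is to fill the missing difference before filtering:

new_df = df.loc[df['diff'].dt.days.fillna(0) < 5]   # NaT becomes 0 days, so the first row survives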

Upvotes: 1
