Reputation: 21981
I have a pandas dataframe which I interpolate to get a daily dataframe. The original dataframe looks like this:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
In the interpolated dataframe, I want to change the data values to NaN wherever the gap between dates in the original exceeds 5 days. E.g. in the dataframe above, the gap between 2017-10-02 and 2017-10-12 exceeds 5 days, so in the interpolated dataframe all values between these two dates should be removed. I am not sure how to do this; maybe combine_first?
EDIT: The interpolated dataframe looks like this:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-03 0.015804 0.113309
2017-10-04 0.014464 0.113750
2017-10-05 0.013125 0.114190
2017-10-06 0.011786 0.114631
2017-10-07 0.010446 0.115071
2017-10-08 0.009107 0.115512
2017-10-09 0.007768 0.115953
2017-10-10 0.006429 0.116393
2017-10-11 0.005089 0.116834
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
Expected output:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
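For reference, a minimal sketch of how the original frame could be built and interpolated to daily frequency (assuming a DatetimeIndex; asfreq plus interpolate is one way to get the daily frame shown in the edit, not necessarily what I used):
import pandas as pd

idx = pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                      '2017-10-14', '2017-10-17'])
df = pd.DataFrame({'col_1': [0.000000, 0.017143, 0.003750, 0.000000, 0.000000],
                   'vals':  [0.112869, 0.112869, 0.117274, 0.161556, 0.116264]},
                  index=idx)
daily = df.asfreq('D').interpolate()  # daily interpolated frame as in the edit above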
Upvotes: 3
Views: 1138
Reputation: 859
I added a few more rows to your example in order to have two blocks with a gap of more than 5 days between rows.
I saved the two tables locally as .csv files and added date as the first column name to complete the merge below:
Setup
import pandas as pd
import numpy as np
df_1=pd.read_csv('df_1.csv', delimiter=r"\s+")
df_2=pd.read_csv('df_2.csv', delimiter=r"\s+")
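If you prefer not to save files locally, the same setup can be sketched with in-memory strings. The df_1 contents below are the question's original rows (the version used in this answer has a few extra rows appended); df_2 would hold the interpolated daily table in the same whitespace-separated layout with Date as the first header:
import io
import pandas as pd
import numpy as np

# hypothetical in-memory equivalent of df_1.csv (the sparse original frame)
df_1_txt = """Date col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264"""
df_1 = pd.read_csv(io.StringIO(df_1_txt), delimiter=r"\s+")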
Merge (left join) the two datasets and rename the columns. Notice the two groups with more than a 5-day gap:
df=df_2.merge(df_1, how='left', on='Date').reset_index(drop=True)
df.columns=['date','col','val','col_na','val_na'] #purely aesthetic
df
date col val col_na val_na
0 2017-10-01 0.000000 0.112869 0.000000 0.112869
1 2017-10-02 0.017143 0.112869 0.017143 0.112869
2 2017-10-03 0.015804 0.113309 NaN NaN
3 2017-10-04 0.014464 0.113750 NaN NaN
4 2017-10-05 0.013125 0.114190 NaN NaN
5 2017-10-06 0.011786 0.114631 NaN NaN
6 2017-10-07 0.010446 0.115071 NaN NaN
7 2017-10-08 0.009107 0.115512 NaN NaN
8 2017-10-09 0.007768 0.115953 NaN NaN
9 2017-10-10 0.006429 0.116393 NaN NaN
10 2017-10-11 0.005089 0.116834 NaN NaN
11 2017-10-12 0.003750 0.117274 0.003750 0.117274
12 2017-10-13 0.001875 0.139415 NaN NaN
13 2017-10-14 0.000000 0.161556 0.000000 0.161556
14 2017-10-15 0.000000 0.146459 NaN NaN
15 2017-10-16 0.000000 0.131361 NaN NaN
16 2017-10-17 0.000000 0.989999 0.000000 0.116264
17 2017-10-18 0.000000 0.412311 NaN NaN
18 2017-10-19 0.000000 0.166264 NaN NaN
19 2017-10-20 0.000000 0.123464 NaN NaN
20 2017-10-21 0.000000 0.149767 NaN NaN
21 2017-10-22 0.000000 0.376455 NaN NaN
22 2017-10-23 0.000000 0.000215 NaN NaN
23 2017-10-24 0.000000 0.940219 NaN NaN
24 2017-10-25 0.000000 0.030352 0.000000 0.030352
25 2017-10-26 0.000000 0.111112 NaN NaN
26 2017-10-27 0.000000 0.002500 NaN NaN
Method to carry out the task
def my_func(my_df):
    non_na_index = []                       # indexes of rows with a non-NaN value
    for i in range(len(my_df)):
        if not pd.isnull(my_df.iloc[i, 3]):
            non_na_index.append(i)          # add indexes of rows that have a non-NaN value
    sub = np.roll(non_na_index, shift=-1) - non_na_index  # distance between consecutive non-NaN rows
    sub = sub[:-1]                          # drop last element (wrap-around artifact of np.roll)
    for i in reversed(range(len(sub))):
        if sub[i] >= 5:                     # flag gaps where observed rows are 5 or more days apart
            b = non_na_index[i + 1]         # end index
            a = non_na_index[i] + 1         # start index
            my_df = my_df.drop(my_df.index[a:b])  # drop the rows within the range
    return my_df
Execute the function on df, then drop the two helper columns:
new_df = my_func(df)
new_df = new_df.drop(['col_na', 'val_na'], axis=1)  # drop the two extra columns
new_df
date col val
0 2017-10-01 0.000000 0.112869
1 2017-10-02 0.017143 0.112869
11 2017-10-12 0.003750 0.117274
12 2017-10-13 0.001875 0.139415
13 2017-10-14 0.000000 0.161556
14 2017-10-15 0.000000 0.146459
15 2017-10-16 0.000000 0.131361
16 2017-10-17 0.000000 0.989999
24 2017-10-25 0.000000 0.030352
25 2017-10-26 0.000000 0.111112
26 2017-10-27 0.000000 0.002500
Upvotes: 1
Reputation: 294338
I'd first identify where the gaps exceed 5 days. From there, I'd generate an array that identifies the groups between such gaps. Finally, I'd use groupby to convert each group to daily frequency and interpolate within it.
# convenience: assign string to variable for easier access
daytype = 'timedelta64[D]'
# define five days for use when evaluating size of gaps
five = np.array(5, dtype=daytype)
# get the size of gaps
deltas = np.diff(df.index.values).astype(daytype)
# identify groups between gaps
groups = np.append(False, deltas > five).cumsum()
# handy function to turn to daily frequency and interpolate
to_daily = lambda x: x.asfreq('D').interpolate()
# and finally...
df.groupby(groups, group_keys=False).apply(to_daily)
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
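As a quick sanity check of the grouping step (assuming numpy is imported as np; the dates are the ones from the question), the groups array splits the rows exactly at the 10-day gap:
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                        '2017-10-14', '2017-10-17'])
deltas = np.diff(dates.values).astype('timedelta64[D]')   # [1, 10, 2, 3] days
groups = np.append(False, deltas > np.array(5, dtype='timedelta64[D]')).cumsum()
print(groups)  # [0 0 1 1 1] -> interpolation never crosses the 10-day gap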
In the event you want to provide your own interpolation method, you can modify the above like this:
daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(df.index.values).astype(daytype)
groups = np.append(False, deltas > five).cumsum()
# custom interpolation function that takes a dataframe
def my_interpolate(df):
    """This can be whatever you want.
    I just provided what will result
    in the same thing as before."""
    return df.interpolate()
to_daily = lambda x: x.asfreq('D').pipe(my_interpolate)
df.groupby(groups, group_keys=False).apply(to_daily)
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
Upvotes: 10
Reputation: 3130
Is this what you want?
data0 = """2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264"""
data = [row.split(' ') for row in data0.split('\n')]
df = pd.DataFrame(data, columns = ['date','col_1','vals'])
df.date = pd.to_datetime(df.date)
last_observation = df.assign(last_observation = df.date.diff().dt.days)
df.set_index(['date'], inplace = True)
all_dates = pd.date_range(start = last_observation.date.min(),
end = last_observation.date.max())
df_interpolated = df.reindex(all_dates).astype(np.float64).interpolate()
df_interpolated = df_interpolated.join(last_observation.set_index('date').last_observation)
df_interpolated['discard'] = (df_interpolated.last_observation.bfill() > 5) & df_interpolated.last_observation.isnull()
df_interpolated[['col_1','vals']] = df_interpolated[['col_1','vals']].where(~df_interpolated.discard)
The output is:
col_1 vals last_observation discard
2017-10-01 0.000000 0.112869 NaN False
2017-10-02 0.017143 0.112869 1.0 False
2017-10-03 NaN NaN NaN True
2017-10-04 NaN NaN NaN True
2017-10-05 NaN NaN NaN True
2017-10-06 NaN NaN NaN True
2017-10-07 NaN NaN NaN True
2017-10-08 NaN NaN NaN True
2017-10-09 NaN NaN NaN True
2017-10-10 NaN NaN NaN True
2017-10-11 NaN NaN NaN True
2017-10-12 0.003750 0.117274 10.0 False
2017-10-13 0.001875 0.139415 NaN False
2017-10-14 0.000000 0.161556 2.0 False
2017-10-15 0.000000 0.146459 NaN False
2017-10-16 0.000000 0.131361 NaN False
2017-10-17 0.000000 0.116264 3.0 False
The idea is that you generate the interpolation first (as you did), then decide which entries to drop. Start by recording the number of days between each observation and the previous one (the last_observation column). Since you want to discard interpolated entries that fall inside a gap of more than 5 days, use .bfill to propagate that gap size backwards onto the interpolated rows before comparing to 5. Notice, however, that this alone would also flag the real observation at the end of the gap, which you don't want to drop. So add the condition that only interpolated rows may be discarded, which you check with .isnull() on the last_observation column (real observations have a number there, interpolated rows do not).
Finally, use the .where method to keep entries that do not meet the discard criterion; by default, the others are replaced by NaN.
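As a small sanity check of the discard rule (made-up gap values, not from the data above):
import pandas as pd
import numpy as np

# gap sizes are only recorded on observed rows; interpolated rows are NaN here
gap_days = pd.Series([np.nan, 1.0, np.nan, np.nan, 10.0, np.nan, 2.0])
discard = (gap_days.bfill() > 5) & gap_days.isnull()
print(discard.tolist())
# [False, False, True, True, False, False, False]
# -> only the interpolated rows sitting inside the 10-day gap get dropped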
Upvotes: 1
Reputation: 5696
If I understood correctly, you can remove the unnecessary rows by boolean indexing. Assuming that you have the difference in days in a column called diff, you can use df.loc[df['diff'].dt.days < 5]
Here's a demo
df = pd.read_clipboard()
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
Convert the index to a datetime column and add a new column with the difference from the previous row in days:
df = df.reset_index()
df['index']=pd.to_datetime(df['index'])
df['diff'] = df['index'] - df['index'].shift(1)
index col_1 vals diff
0 2017-10-01 0.000000 0.112869 NaT
1 2017-10-02 0.017143 0.112869 1 days
2 2017-10-12 0.003750 0.117274 10 days
3 2017-10-14 0.000000 0.161556 2 days
4 2017-10-17 0.000000 0.116264 3 days
Add a boolean filter to keep only rows where the gap to the previous row is less than 5 days:
new_df = df.loc[df['diff'].dt.days < 5]
new_df = new_df.drop('diff', axis=1)
new_df.set_index('index', inplace=True)
new_df
col_1 vals
index
2017-10-02 0.017143 0.112869
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
Upvotes: 1