Reputation: 13660
I have two columns. I want to check if the difference between them is between 0 and 10 days. One of the fields often contains null values.
df['Diff'] = (df['Dt1'] - df['Dt2'])
def wdw(x):
if pd.notnull(x):
if type(x) !=long:
if type(timedelta(days=10)) != long:
if x > timedelta(days=10):
return 1
else:
return 0
df['Diff'].df(wdw)
When I run this I get the following error.
TypeError: can't compare datetime.timedelta to long
When I look at the values of df['Diff'] They all appear to be timedeltas. Any idea what is going on here? It seems like creating an indicator based on the difference between two date fields should be easier than this...
Upvotes: 1
Views: 3437
Reputation: 4233
use assign to create a diff date dt1 and dt2 column. Then use timedelta to get a 0 day and 10 day variable to compare diff then mask the results for output.
df = pd.DataFrame({'Dt1':pd.date_range('2010-1-1', freq='5D', periods=10),
'Dt2':pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan
print(df)
zero_days=timedelta(days=0)
ten_days=timedelta(days=10)
print(zero_days,ten_days)
df['Diff']=np.empty(len(df))
df=df.assign(Diff=lambda row: row['Dt1']-row['Dt2'])
mask=(df['Diff'] >=zero_days)&(df['Diff'] <=ten_days)
print(df[mask])
output:
Dt1 Dt2 Diff
1 2010-01-06 2010-01-05 1 days
2 2010-01-11 2010-01-08 3 days
4 2010-01-21 2010-01-14 7 days
5 2010-01-26 2010-01-17 9 days
Upvotes: 0
Reputation: 879143
The values in df['Diff']
are numpy timedelta64s. You can compare them with pd.Timedelta
s; see below.
Moreover, you do not need to call df['Diff'].apply(wdw)
, which calls wdw
for each value in the Series; you can compare whole Series with a pd.Timedelta
:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Dt1':pd.date_range('2010-1-1', freq='5D', periods=10),
'Dt2':pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan
df['Diff'] = df['Dt1'] - df['Dt2']
print(df)
# Dt1 Dt2 Diff
# 0 2010-01-01 NaT NaT
# 1 2010-01-06 2010-01-05 1 days
# 2 2010-01-11 2010-01-08 3 days
# 3 2010-01-16 NaT NaT
# 4 2010-01-21 2010-01-14 7 days
# 5 2010-01-26 2010-01-17 9 days
# 6 2010-01-31 NaT NaT
# 7 2010-02-05 2010-01-23 13 days
# 8 2010-02-10 2010-01-26 15 days
# 9 2010-02-15 NaT NaT
mask = (df['Diff'] < pd.Timedelta(days=10)) & (pd.Timedelta(days=0) < df['Diff'])
print(mask)
yields
0 False
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 False
9 False
Name: Diff, dtype: bool
pd.Timedelta
s were introduced in Pandas v.0.15. Here is a workaround for older version of Pandas which uses np.timedela64s
:
mask = ((df['Diff'] / np.timedelta64(10, 'D') < 10)
& (df['Diff'] / np.timedelta64(10, 'D') > 0))
Upvotes: 2
Reputation: 13660
This also worked but isn't as good as the answer unutbu provided.
def wdw(x):
if pd.notnull(x):
if x/np.timedelta64(1,'D') <= 10:
if x/np.timedelta64(1,'D') >0:
return 1
else:
return 0
df['Diff'].df(wdw)
Upvotes: 1