Reputation: 628
I am trying to calculate the difference in days between when the value in one column (One) is 1 and the value in another column (Value) is greater than 0.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017', '02.01.2017', '03.01.2017', '02.12.2017', '03.12.2017', '04.12.2017'],
                   'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
                   'Value': [0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
                   'One': [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]})

def dayDiff(groupby):
    # No One == 1 row or no positive Value in this group: return all zeros
    if (not (groupby['One'] == 1).any()) or (not (groupby['Value'] > 0).any()):
        return np.zeros(groupby['Date'].count())
    min_date = groupby[groupby['One'] == 1]['Date'].iloc[0]
    max_date = groupby[groupby['Value'] > 0]['Date'].iloc[0]
    delta = max_date - min_date
    return np.where(groupby['Value'] > 0, delta.days, 0)

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
DateDiff = df.groupby('CustomerId').apply(dayDiff).explode().rename('DateDiff').reset_index(drop=True)
df = pd.concat([df, DateDiff], axis=1)
df
The result is:
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    0         2
3 2017-01-05         02       100    0         2
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         1
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
The problem is that row 2 displays the wrong value: I expect row 2 to display 1 and row 6 to display 2. I want the difference in days between the date of the last 1 in One for the respective customer and each date on which Value is greater than zero. It seems that dayDiff() calculates the same difference in days regardless of the dates.
I tried changing the iloc[0] values, but the result wasn't entirely correct.
Expectations (notice that rows 2 and 6 of DateDiff are now correct):
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    0         1
3 2017-01-05         02       100    0         2
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
Edit: Using @jezrael's suggestions, I realised there is a problem when One still contains 1s on the rows where Value is already greater than zero: the day differences become negative. I want row 2 to display 0, because 2017-01-04 - 2017-01-04 is zero. In other words, the reference date should be the last date with a 1 in One that is before or equal to the row's own date.
df = pd.DataFrame({'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017', '02.01.2017', '03.01.2017', '02.12.2017', '03.12.2017', '04.12.2017'],
                   'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
                   'Value': [0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
                   'One': [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]})
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    1        -1
3 2017-01-05         02       100    1         0
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
Upvotes: 1
Views: 119
Reputation: 863451
I believe you need the difference between the last Date where One == 1 and all dates where Value > 0, per group:
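def dayDiff(groupby):
    # No One == 1 row or no positive Value in this group: DateDiff is 0 everywhere
    if (not (groupby['One'] == 1).any()) or (not (groupby['Value'] > 0).any()):
        groupby['DateDiff'] = 0
        return groupby
    # Date of the last row with One == 1 (a scalar)
    min_date = groupby.loc[groupby['One'] == 1, 'Date'].iloc[-1]
    # Dates of all rows with Value > 0 (a Series)
    max_date = groupby.loc[groupby['Value'] > 0, 'Date']
    delta = max_date - min_date
    # Rows without a positive Value get 0
    groupby['DateDiff'] = delta.dt.days.reindex(groupby.index, fill_value=0)
    return groupby

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# group_keys=False keeps CustomerId out of the result index (needed in pandas 2.x)
df = df.groupby('CustomerId', group_keys=False).apply(dayDiff)
print(df)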
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    0         1
3 2017-01-05         02       100    0         2
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
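To see why the question's dayDiff() returned one constant per customer while the per-row subtraction varies, it may help to compare the two on a single group. This is only an illustrative sketch: the frame g and the variable names are made up for the example and hold customer 02's rows from the question.

```python
import pandas as pd

# Customer 02's rows from the question's sample data (illustrative copy).
g = pd.DataFrame({
    'Date': pd.to_datetime(['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017'],
                           format='%d.%m.%Y'),
    'Value': [0, 0, 10, 100],
    'One': [1, 1, 0, 0],
})

one_dates = g.loc[g['One'] == 1, 'Date']
positive_dates = g.loc[g['Value'] > 0, 'Date']

# Question's approach: both operands are scalars (iloc[0]), so there is a
# single gap that np.where then repeats on every positive-Value row.
scalar_days = (positive_dates.iloc[0] - one_dates.iloc[0]).days

# Per-row approach: subtract the last One == 1 date (a scalar) from the whole
# Series of positive-Value dates, giving one gap per row.
per_row_days = (positive_dates - one_dates.iloc[-1]).dt.days

print(scalar_days)
print(per_row_days)
```

Here scalar_days is a single integer, whereas per_row_days keeps the original row labels, which is what makes the reindex(groupby.index, fill_value=0) step possible.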
EDIT: Another idea is to filter the rows with a mask before the groupby and then append the rows that do not match:
def dayDiff(groupby):
    # No One == 1 row or no positive Value in this group: DateDiff is 0 everywhere
    if (not (groupby['One'] == 1).any()) or (not (groupby['Value'] > 0).any()):
        groupby['DateDiff'] = 0
        return groupby
    min_date = groupby.loc[groupby['One'] == 1, 'Date'].iloc[-1]
    max_date = groupby.loc[groupby['Value'] > 0, 'Date']
    delta = max_date - min_date
    groupby['DateDiff'] = delta.dt.days.reindex(groupby.index, fill_value=0)
    return groupby

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Keep rows that are either a "1 without a value" or a "value without a 1"
m1 = (df['One'] == 1) & (df['Value'] <= 0)
m2 = (df['Value'] > 0) & (df['One'] != 1)
mask = m1 | m2
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
matched = df[mask].groupby('CustomerId', group_keys=False).apply(dayDiff)
df = pd.concat([matched, df[~mask]], sort=False).sort_index()
# Rows excluded by the mask have no DateDiff yet and get 0
df['DateDiff'] = df['DateDiff'].fillna(0).astype(int)
print(df)
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    1         0
3 2017-01-05         02       100    1         0
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
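A possible alternative for the "last 1 on or before the same date" rule, sketched without apply: keep Date only where One == 1, forward-fill it within each customer, and subtract. The function name days_since_last_one is hypothetical, and the sketch assumes the rows are already sorted by date within each CustomerId, as in the sample data.

```python
import pandas as pd

def days_since_last_one(df):
    """Hypothetical helper: days since the latest One == 1 date at or before each row."""
    # Keep Date only on rows where One == 1 (NaT elsewhere), then carry the
    # most recent such date forward within each customer.
    last_one = df['Date'].where(df['One'].eq(1)).groupby(df['CustomerId']).ffill()
    days = (df['Date'] - last_one).dt.days
    # Rows without a positive Value, or with no earlier 1 in the group, get 0.
    return days.where(df['Value'].gt(0), 0).fillna(0).astype(int)

# Sample data from the question's edit (1s continue on rows 2 and 3).
df = pd.DataFrame({
    'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017',
             '02.01.2017', '03.01.2017', '02.12.2017', '03.12.2017', '04.12.2017'],
    'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
    'Value': [0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
    'One': [1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df['DateDiff'] = days_since_last_one(df)
print(df)
```

On this data the sketch should give the same DateDiff column as the table above. Because the reference date is chosen per row, it never lies after the row's own date, so no negative differences arise.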
Upvotes: 1
Reputation: 25269
Use idxmin with transform to find the location of the first 0 in One for each group; subtracting 1 gives the location of the last 1, which is assigned to s. Slice df with s and subtract the resulting dates from df.Date. Finally, use where to set the result to 0 wherever df.Value is 0.
df['Date'] = pd.to_datetime(df.Date, format='%d.%m.%Y')
s = df.groupby('CustomerId').One.transform('idxmin').sub(1)
s1 = df.Date.sub(df.loc[s, ['Date']].reset_index(drop=True)['Date']).dt.days
df['DateDiff'] = s1.where(df.Value.gt(0), 0)
Out[363]:
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    0         1
3 2017-01-05         02       100    0         2
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
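The intermediate values may make the idxmin step easier to follow. The sketch below rebuilds the question's original sample and exposes each step; the names first_zero and ref_dates are introduced here only for illustration. It assumes a default RangeIndex and that the 1s in One form a leading run in every customer's rows.

```python
import pandas as pd

# Original sample data from the question.
df = pd.DataFrame({
    'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017',
             '02.01.2017', '03.01.2017', '02.12.2017', '03.12.2017', '04.12.2017'],
    'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
    'Value': [0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
    'One': [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')

# Index label of the first 0 in One per customer, broadcast to every row of the group.
first_zero = df.groupby('CustomerId')['One'].transform('idxmin')

# One row earlier is the last 1 (this relies on the assumptions stated above).
s = first_zero.sub(1)

# Date of that last-1 row, repeated for every row of the customer.
ref_dates = df['Date'].loc[s].to_numpy()

days = (df['Date'] - ref_dates).dt.days
df['DateDiff'] = days.where(df['Value'].gt(0), 0)
print(s.tolist())
print(df)
```

With the question's edited data, where customer 02 has no 0 in One, idxmin would point at the group's first row and s would fall outside the group, so this shortcut would not carry over as is.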
Upvotes: 1