Mataunited18

Reputation: 628

Calculate difference in days given values from two columns

I am trying to calculate, per customer, the difference in days between the date on which one column (One) is 1 and the dates on which another column (Value) is greater than 0.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Date':['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017', '02.01.2017', '03.01.2017', '02.12.2017', '03.12.2017', '04.12.2017'],
                   'CustomerId':['02','02','02','02','03','03','03', '05', '05', '05'],
                   'Value':[0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
                   'One':[1, 1, 0, 0, 1, 0, 0, 1, 0, 0]})

def dayDiff(groupby):
    if (not (groupby['One'] == 1).any()) or (not (groupby['Value'] > 0).any()):
        return np.zeros(groupby['Date'].count())

    min_date = groupby[groupby['One'] == 1]['Date'].iloc[0]
    max_date = groupby[groupby['Value'] > 0]['Date'].iloc[0]
    delta = max_date - min_date
    return np.where(groupby['Value'] > 0 , delta.days, 0)


df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
DateDiff = df.groupby('CustomerId').apply(dayDiff).explode().rename('DateDiff').reset_index(drop=True)
df = pd.concat([df, DateDiff], axis=1)
df

The result is:

          Date  CustomerId     Value    One DateDiff
0   2017-01-02          02         0    1   0
1   2017-01-03          02         0    1   0
2   2017-01-04          02        10    0   2
3   2017-01-05          02       100    0   2
4   2017-01-01          03         0    1   0
5   2017-01-02          03     10000    0   1
6   2017-01-03          03     10000    0   1
7   2017-12-02          05         0    1   0
8   2017-12-03          05         0    0   0
9   2017-12-04          05  12312312    0   2

The problem is that row 2 displays the wrong value. I expect row 2 to display 1 and row 6 to display 2, because I want the difference in days between the last 1 in One for the respective customer and each date on which Value is greater than zero. It seems that dayDiff() returns the same difference for every row of a customer, regardless of that row's date.

I tried changing the iloc[0] indices, but the result still wasn't entirely correct.

Expected output (note that rows 2 and 6 of DateDiff are now correct):

          Date  CustomerId     Value    One DateDiff
0   2017-01-02          02         0    1   0
1   2017-01-03          02         0    1   0
2   2017-01-04          02        10    0   1
3   2017-01-05          02       100    0   2
4   2017-01-01          03         0    1   0
5   2017-01-02          03     10000    0   1
6   2017-01-03          03     10000    0   2
7   2017-12-02          05         0    1   0
8   2017-12-03          05         0    0   0
9   2017-12-04          05  12312312    0   2

Edit: Using @jezrael's suggestion, I realised there is a problem when further 1s appear in One on later rows: the day differences become negative. I want row 2 to display 0, because 2017-01-04 - 2017-01-04 is zero. In other words, the reference date should be the last date with One equal to 1 that falls on or before the row's own date.

df = pd.DataFrame({'Date':['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017', '02.01.2017', '03.01.2017', '02.12.2017', '03.12.2017', '04.12.2017'],
                   'CustomerId':['02','02','02','02','03','03','03', '05', '05', '05'],
                   'Value':[0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
                   'One':[1, 1, 1, 1, 1, 0, 0, 1, 0, 0]})

        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    1        -1
3 2017-01-05         02       100    1         0
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2

Upvotes: 1

Views: 119

Answers (2)

jezrael

Reputation: 863451

I believe you need, per group, the difference between the last Date where One == 1 and every Date where Value > 0:

def dayDiff(groupby):
    if (not (groupby['One'] == 1).any()) or (not (groupby['Value'] > 0).any()):
        groupby['DateDiff'] = 0
        return groupby

    min_date = groupby.loc[groupby['One'] == 1, 'Date'].iloc[-1]
    max_date = groupby.loc[groupby['Value'] > 0, 'Date']
    delta = max_date - min_date
    groupby['DateDiff'] = delta.dt.days.reindex(groupby.index, fill_value=0)
    return groupby

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
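# group_keys=False keeps the original row index in the result;
# pandas 2.0 and later otherwise add CustomerId as an index level
df = df.groupby('CustomerId', group_keys=False).apply(dayDiff)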
print (df)
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    0         1
3 2017-01-05         02       100    0         2
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
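
The same per-group logic can also be sketched without apply. This is only an illustration, not part of the answer above: it assumes the question's original sample data, and the names last_one and days are made up for the example. Because transform('last') skips missing values, masking Date with One == 1 first leaves the last such date for each customer.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017',
             '01.01.2017', '02.01.2017', '03.01.2017',
             '02.12.2017', '03.12.2017', '04.12.2017'],
    'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
    'Value': [0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
    'One': [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Keep Date only where One == 1 (NaT elsewhere), then broadcast the last
# such date of each customer to every row of that customer.
last_one = (df['Date'].where(df['One'].eq(1))
                      .groupby(df['CustomerId'])
                      .transform('last'))

# Day difference to that date, forced to 0 where Value is not positive
# or where the customer has no row with One == 1.
days = (df['Date'] - last_one).dt.days
df['DateDiff'] = days.where(df['Value'].gt(0), 0).fillna(0).astype(int)
print(df)
```

On the sample data this should reproduce the DateDiff column shown above.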

EDIT: Another idea is to filter the rows with a mask before the groupby and then add back the rows that do not match:

def dayDiff(groupby):
    if (not (groupby['One'] == 1).any()) or (not (groupby['Value'] > 0).any()):
        groupby['DateDiff'] = 0
        return groupby

    min_date = groupby.loc[groupby['One'] == 1, 'Date'].iloc[-1]
    max_date = groupby.loc[groupby['Value'] > 0, 'Date']
    delta = max_date - min_date
    groupby['DateDiff'] = delta.dt.days.reindex(groupby.index, fill_value=0)
    return groupby

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
m1 = (df['One'] == 1) & (df['Value'] <= 0)
m2 = (df['Value'] > 0) & (df['One'] != 1)
mask = m1 | m2

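# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
out = df[mask].groupby('CustomerId', group_keys=False).apply(dayDiff)
df = pd.concat([out, df[~mask]], sort=False).sort_index()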
df['DateDiff'] = df['DateDiff'].fillna(0).astype(int)
print (df)
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    1         0
3 2017-01-05         02       100    1         0
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
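
If the requirement is "the last date with One == 1 on or before the current row", a running forward fill per customer may be a simpler way to get there. The sketch below is an assumption-laden alternative rather than the method above: it uses the edited sample data from the question, assumes rows are already sorted by date within each customer, and the name last_one_so_far is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017',
             '01.01.2017', '02.01.2017', '03.01.2017',
             '02.12.2017', '03.12.2017', '04.12.2017'],
    'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
    'Value': [0, 0, 10, 100, 0, 10000, 10000, 0, 0, 12312312],
    'One': [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Date where One == 1, NaT elsewhere; forward-filling within each customer
# gives, for every row, the most recent One == 1 date up to that row.
last_one_so_far = (df['Date'].where(df['One'].eq(1))
                             .groupby(df['CustomerId'])
                             .ffill())

# The reference date never lies after the row's own date, so the
# difference cannot be negative.
days = (df['Date'] - last_one_so_far).dt.days
df['DateDiff'] = days.where(df['Value'].gt(0), 0).fillna(0).astype(int)
print(df)
```

Rows that precede the first 1 of their customer have no reference date and end up as 0 through the fillna step.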

Upvotes: 1

Andy L.

Reputation: 25269

Use idxmin with transform to find the index of the first 0 in One within each group; subtracting 1 gives the index of the last 1, which is assigned to s. Look up the dates at s and subtract them from df.Date. Finally, use where to set the result to 0 wherever df.Value is not greater than 0.

df['Date'] = pd.to_datetime(df.Date, format='%d.%m.%Y')
s = df.groupby('CustomerId').One.transform('idxmin').sub(1)
s1 = df.Date.sub(df.loc[s, ['Date']].reset_index(drop=True)['Date']).dt.days

df['DateDiff'] = s1.where(df.Value.gt(0), 0)

Out[363]:
        Date CustomerId     Value  One  DateDiff
0 2017-01-02         02         0    1         0
1 2017-01-03         02         0    1         0
2 2017-01-04         02        10    0         1
3 2017-01-05         02       100    0         2
4 2017-01-01         03         0    1         0
5 2017-01-02         03     10000    0         1
6 2017-01-03         03     10000    0         2
7 2017-12-02         05         0    1         0
8 2017-12-03         05         0    0         0
9 2017-12-04         05  12312312    0         2
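
To make the idxmin step easier to follow, here is a small sketch of what s contains for the question's original One column. It assumes a default integer index (0, 1, 2, ...) and that the 1s come before the 0s within each customer; the intermediate name first_zero is added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerId': ['02', '02', '02', '02', '03', '03', '03', '05', '05', '05'],
    'One': [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]})

# idxmin returns the index label of the first minimum (the first 0) in
# each group; transform broadcasts it to every row of that group.
first_zero = df.groupby('CustomerId')['One'].transform('idxmin')

# One step back from the first 0 is the last 1 of the group.
s = first_zero.sub(1)
print(s.tolist())  # [1, 1, 1, 1, 4, 4, 4, 7, 7, 7]
```

If a customer has no 0 at all, idxmin returns the first row of that group, so s would no longer point at a row belonging to that customer.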

Upvotes: 1
