Ajinkya Deshmukh
Ajinkya Deshmukh

Reputation: 53

Calculate day's difference between successive pandas dataframe rows with condition

I have a dataframe as following:

Company     Date        relTweet    GaplastRel
XYZ         3/2/2020    1               
XYZ         3/3/2020    1   
XYZ         3/4/2020    1   
XYZ         3/5/2020    1   
XYZ         3/5/2020    0   
XYZ         3/6/2020    1   
XYZ         3/8/2020    1   
ABC         3/9/2020    0   
ABC         3/10/2020   1   
ABC         3/11/2020   0   
ABC         3/12/2020   1   

The relTweet displays whether the tweet is relevant (1) or not (0). \nI need to find the days difference (GaplastRel) between each successive rows for each company, with a condition that the previous day's tweet should be relevant tweet (i.e. relTweet =1 ). e.g. For the first record relTweet should be 0. For the 2nd record, relTweet should be 1 as the last relevant tweet was made one day ago.

Below is the example of needed output:

Company     Date        relTweet    GaplastRel
XYZ         3/2/2020    1               0
XYZ         3/3/2020    1               1
XYZ         3/4/2020    1               1
XYZ         3/5/2020    1               1
XYZ         3/5/2020    0               1
XYZ         3/6/2020    1               1
XYZ         3/8/2020    1               2
ABC         3/9/2020    0               0
ABC         3/10/2020   1               0
ABC         3/11/2020   0               1
ABC         3/12/2020   1               2

Following is my code:

dataDf['Date'] = pd.to_datetime(dataDf['Date'], format='%m/%d/%Y')

dataDf['relTweet'] = (dataDf.groupby('Company', group_keys=False)
                .apply(lambda g: g['Date'].diff().replace(0, np.nan).ffill()))

This code gives the days difference between successive rows for each company without conisidering the relTweet =1 condition. I am not sure how to apply the condition. Following is the output of the above code:

Company     Date        relTweet    GaplastRel
XYZ         3/2/2020    1               NaT
XYZ         3/3/2020    1               1 days
XYZ         3/4/2020    1               1 days
XYZ         3/5/2020    1               1 days
XYZ         3/5/2020    0               0 days
XYZ         3/6/2020    1               1 days
XYZ         3/8/2020    1               2 days
ABC         3/9/2020    0               NaT
ABC         3/10/2020   1               1 days
ABC         3/11/2020   0               1 days
ABC         3/12/2020   1               1 days

Upvotes: 3

Views: 102

Answers (1)

BENY
BENY

Reputation: 323226

Change your mind sometime we need merge_asof rather than groupby

df1=df.loc[df['relTweet']==1,['Company','Date']]
df=pd.merge_asof(df,df1.assign(Date1=df1.Date),by='Company',on='Date', allow_exact_matches=False)
df['GaplastRel']=(df.Date-df.Date1).dt.days.fillna(0)
df
Out[31]: 
   Company       Date  relTweet      Date1  GaplastRel
0      XYZ 2020-03-02         1        NaT         0.0
1      XYZ 2020-03-03         1 2020-03-02         1.0
2      XYZ 2020-03-04         1 2020-03-03         1.0
3      XYZ 2020-03-05         1 2020-03-04         1.0
4      XYZ 2020-03-05         0 2020-03-04         1.0
5      XYZ 2020-03-06         1 2020-03-05         1.0
6      XYZ 2020-03-08         1 2020-03-06         2.0
7      ABC 2020-03-09         0        NaT         0.0
8      ABC 2020-03-10         1        NaT         0.0
9      ABC 2020-03-11         0 2020-03-10         1.0
10     ABC 2020-03-12         1 2020-03-10         2.0

Upvotes: 2

Related Questions