Reputation: 131
How to extract and compare values of second column in a data frame for all the same values of first column in same data frame?
I have a data frame as 'df':
Name Datetime
Bob 26-04-2018 12:00:00
Claire 26-04-2018 12:00:00
Bob 26-04-2018 12:30:00
Grace 27-04-2018 08:30:00
Bob 27-04-2018 09:30:00
I want to add a new column to the data frame as df['Id'] such that, for users having same names, if the datetime values have a difference of not more than 30 min, they would be assigned the same value of Id, and if the datetime difference is greater than 30 min, then it would be assigned a different id.
I think it could be achieved using iterating over loops but I am not sure how to do it. Also, is there a better way to do this as I have a huge data set?
My expected output of the data frame would be as:
Name Datetime Id
Bob 26-04-2018 12:00:00 1
Claire 26-04-2018 12:00:00 2
Bob 26-04-2018 12:10:00 1
Bob 26-04-2018 12:20:00 1
Claire 27-04-2018 08:30:00 3
Bob 27-04-2018 09:30:00 4
Any help would be appreciated. Thanks
Upvotes: 1
Views: 1218
Reputation: 25239
I think it is simple using groupby
, grouper
and ngroup
as follows:
df['Id'] = df.groupby([pd.Grouper(freq='30T', key='Datetime'), 'Name']).ngroup().add(1)
Out[423]:
Name Datetime Id
0 Bob 2018-04-26 12:00:00 1
1 Claire 2018-04-26 12:00:00 2
2 Bob 2018-04-26 12:10:00 1
3 Bob 2018-04-26 12:20:00 1
4 Claire 2018-04-27 08:30:00 3
5 Bob 2018-04-27 09:30:00 4
Upvotes: 1
Reputation: 148890
I would sort the dataframe on Name, Datetime to identify the different groups, then assign each group an Id value in original Dataframe order.
Code could be:
# sort data frame on Name and datetime
df.sort_values(['Name', 'Datetime'], inplace=True)
df1 = df.shift()
# identify new Ids
df.loc[(df1.Name!=df.Name)
|(df.Datetime-df1.Datetime>pd.Timedelta(minutes=30)), 'tmp'] = 1
del df1 # non longer usefull
# ok, one different tmp value for each group
df['tmp'] = df['tmp'].cumsum().ffill()
# compute Ids in original dataframe orders
ids = pd.DataFrame(df['tmp'].drop_duplicates().sort_index())
ids['Id'] = ids.reset_index(drop=True).index + 1
# and get the expected result
df = df.reset_index().merge(ids, on='tmp').set_index('index').sort_index()\
.drop(columns='tmp').rename_axis(None)
It gives as expected:
Name Datetime Id
0 Bob 2018-04-26 12:00:00 1
1 Claire 2018-04-26 12:00:00 2
2 Bob 2018-04-26 12:10:00 1
3 Bob 2018-04-26 12:20:00 1
4 Claire 2018-04-27 08:30:00 3
5 Bob 2018-04-27 09:30:00 4
Upvotes: 2