Jupyter
Jupyter

Reputation: 131

How to compare value of second column with same values of first column in pandas dataframe?

How to extract and compare values of second column in a data frame for all the same values of first column in same data frame?

I have a data frame as 'df':

Name         Datetime
Bob          26-04-2018 12:00:00
Claire       26-04-2018 12:00:00
Bob          26-04-2018 12:30:00
Grace        27-04-2018 08:30:00
Bob          27-04-2018 09:30:00

I want to add a new column to the data frame as df['Id'] such that, for users having same names, if the datetime values have a difference of not more than 30 min, they would be assigned the same value of Id, and if the datetime difference is greater than 30 min, then it would be assigned a different id.

I think it could be achieved using iterating over loops but I am not sure how to do it. Also, is there a better way to do this as I have a huge data set?

My expected output of the data frame would be as:

Name         Datetime                 Id
Bob          26-04-2018 12:00:00      1
Claire       26-04-2018 12:00:00      2
Bob          26-04-2018 12:10:00      1
Bob          26-04-2018 12:20:00      1
Claire       27-04-2018 08:30:00      3
Bob          27-04-2018 09:30:00      4

Any help would be appreciated. Thanks

Upvotes: 1

Views: 1218

Answers (2)

Andy L.
Andy L.

Reputation: 25239

I think it is simple using groupby, grouper and ngroup as follows:

df['Id'] = df.groupby([pd.Grouper(freq='30T', key='Datetime'), 'Name']).ngroup().add(1)


Out[423]:
     Name            Datetime  Id
0     Bob 2018-04-26 12:00:00   1
1  Claire 2018-04-26 12:00:00   2
2     Bob 2018-04-26 12:10:00   1
3     Bob 2018-04-26 12:20:00   1
4  Claire 2018-04-27 08:30:00   3
5     Bob 2018-04-27 09:30:00   4

Upvotes: 1

Serge Ballesta
Serge Ballesta

Reputation: 148890

I would sort the dataframe on Name, Datetime to identify the different groups, then assign each group an Id value in original Dataframe order.

Code could be:

# sort data frame on Name and datetime
df.sort_values(['Name', 'Datetime'], inplace=True)
df1 = df.shift()
# identify new Ids
df.loc[(df1.Name!=df.Name)
       |(df.Datetime-df1.Datetime>pd.Timedelta(minutes=30)), 'tmp'] = 1
del df1   # non longer usefull

# ok, one different tmp value for each group
df['tmp'] = df['tmp'].cumsum().ffill()

# compute Ids in original dataframe orders
ids = pd.DataFrame(df['tmp'].drop_duplicates().sort_index())
ids['Id'] = ids.reset_index(drop=True).index + 1

# and get the expected result
df = df.reset_index().merge(ids, on='tmp').set_index('index').sort_index()\
     .drop(columns='tmp').rename_axis(None)

It gives as expected:

     Name            Datetime  Id
0     Bob 2018-04-26 12:00:00   1
1  Claire 2018-04-26 12:00:00   2
2     Bob 2018-04-26 12:10:00   1
3     Bob 2018-04-26 12:20:00   1
4  Claire 2018-04-27 08:30:00   3
5     Bob 2018-04-27 09:30:00   4

Upvotes: 2

Related Questions