Reputation: 29511
I have the following dataframe:
userid date
1 2010-01-03
2 2009-01-04
3 2004-02-03
1 2007-01-01
I want to create a column that outputs true if there is another similar userid in another row with an earlier date. i.e.
userid date userid_seen
1 2010-01-03 t
2 2009-01-04 f
3 2004-02-03 f
1 2007-01-01 f
How do I do it?
Upvotes: 2
Views: 127
Reputation: 20477
The way I would do this is to calculate the earliest date for each user ID, and then check whether the row has a more recent date. Assuming df
is your DataFrame:
min_date = pd.DataFrame(df.groupby('userid')['date'].agg({'min_date': min}))
df = df.merge(min_date, left_on='userid', right_index=True)
df['userid_seen'] = df.date > df.min_date
df = df[['userid', 'date', 'userid_seen']] # get rid of the 'min_date' column
Upvotes: 2
Reputation: 393963
This will work by calling apply
and passing param axis=1
to apply it row-wise:
In [88]:
def func(x):
if len(df.loc[(df['userid'] == x.userid) & (df['date'] != x.date), 'date']) > 0:
return (df.loc[(df['userid'] == x.userid) & (df['date'] != x.date), 'date'] < x.date).values.max()
return False
df['user_id_seen'] = df.apply(lambda row: func(row), axis=1)
df
Out[88]:
userid date user_id_seen
0 1 2010-01-03 True
1 2 2009-01-04 False
2 3 2004-02-03 False
3 1 2007-01-01 False
UPDATE
Although the above works it will be slow for large dataframes as @MattiJohn correctly points out as this effectively iterates over each row.
The following is a more compact answer similar to @MattiJohn's answer:
In [102]:
df['user_id_seen'] = df.groupby('userid')['date'].transform('min') < df.date
df
Out[102]:
userid date user_id_seen
0 1 2010-01-03 True
1 2 2009-01-04 False
2 3 2004-02-03 False
3 1 2007-01-01 False
Upvotes: 3