Reputation: 171
I am currently running into a problem and hoping that someone could assist. Currently have 2 dataframes of items which are hundreds of thousands of lines long. (one has over 200k and one over 180k). the larger of the 2 dataframes is going to contain unique values of users, while the smaller one does not so for example:
df1:
user1
user2
user3
user4
user5
df2:
user1
user1
user5
user4
user5
user5
What i need to do is take each user from df1 and efficiently see if it is in df2 and how many times it occurs.
Thanks!
Upvotes: 7
Views: 4675
Reputation: 294218
Use pd.factorize
and np.bincount
to efficiently produce the value counts
f, u = pd.factorize(df2.user)
d = dict(zip(u, np.bincount(f)))
df1['Counts'] = df1.user.map(d)
df1
user Counts
0 user1 2.0
1 user2 NaN
2 user3 NaN
3 user4 1.0
4 user5 3.0
df1 = pd.Series(list('12345')).radd('user').to_frame('user')
df2 = pd.Series(list('115455')).radd('user').to_frame('user')
Upvotes: 3
Reputation: 323226
Using value_counts
df1['Newcount']=df1['df1:'].map(df2['df2:'].value_counts())
df1
Out[117]:
df1: Newcount
0 user1 2.0
1 user2 NaN
2 user3 NaN
3 user4 1.0
4 user5 3.0
Upvotes: 13
Reputation: 76297
Assuming the relevant column in each DataFrame is called 'user'
, you can use
pd.merge(
df1,
df2.user.groupby(df2.user).count(),
left_on='user',
right_index=True,
how='left')
Explanation:
The groupby
+ count
will find the number of occurrences of each user. It will create a DataFrame whose index is the user, and the value is the count.
The merge left-merges the resulting DataFrame onto df1
.
Upvotes: 2