Reputation: 23
After 24 hours of meaningless tries I don't have another word than "Help!" I have the dataframe contains 2 columns: 'uid' int, 'course_id' as a tuple of 2 or more other ids (int).
uid | course_id |
---|---|
1 | (1,2) |
2 | (1,2) |
3 | (1,3) |
4 | (1,2,3) |
5 | (2,3) |
A kind of this but with 13k of rows. Although original df where I have ungrouped lines uid-course_id. Like
uid | course_id |
---|---|
1 | 1 |
1 | 2 |
2 | 1 |
etc. I need another dataframe with all unique pairs of course_id and number of users that have these pairs in the second column. Like this but bigger (4k of rows):
pair | users |
---|---|
(1,2) | 3 |
(1,3) | 2 |
(2,3) | 2 |
I can make a column of all unique pairs but I have no idea how to count users. The closest function was
def user_count(pair):
total_users = df[
np.isin(df['course_id'], pair)
].groupby(
'uid', as_index=False
).filter(
lambda x: len(x)==2
).groupby(
'course_id', as_index=False
).count()['uid'][0]
return total_users
for original dataframe and in works when I give an exact tuple, but it doesn't work with df.apply(user_count). Is it easier to use the original df or filtered and combined to tuples? And how can I count all the users with every pair?
Upvotes: 2
Views: 67
Reputation: 71689
>>> df # original df
uid course_id
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 3 3
6 4 1
7 4 2
8 4 3
9 5 2
10 5 3
from itertools import combinations
onehot = pd.get_dummies(df['course_id'])\
.set_index(df['uid']).sum(level=0)
cnt = {}
for c in combinations(onehot, r=2):
cnt[c] = onehot[[*c]].ge(1).all(1).sum()
counts = pd.DataFrame(cnt.items(), columns=['pairs', 'count'])
One hot encode the couse_id
column to create indicator variables for each unique value in course_id
, then reduce the encoded variables along axis=0
by taking sum
on level=0
>>> onehot
1 2 3
uid
1 1 1 0
2 1 1 0
3 1 0 1
4 1 1 1
5 0 1 1
Now iterate over all the pairs obtain from the combinations of unique values in course_id
column, and for each pair calculate the number of rows in the onehot encoded dataframe where these pairs are present
>>> cnt
{(1, 2): 3, (1, 3): 2, (2, 3): 2}
Now create a new dataframe from the above dictionary which contains the counts of unique users for every possible combination of values in course_id
>>> counts
pairs count
0 (1, 2) 3
1 (1, 3) 2
2 (2, 3) 2
Upvotes: 1
Reputation: 23217
Assuming you already have a column of all unique pairs, here is a way you can count users:
You can use .map()
and count number of items in pair
as subset of course_id
, as follows:
df2['users'] = df2['pair'].map(lambda x: len([y for y in df1['course_id'].values if set(x).issubset(y)]))
print(df2)
pair users
0 (1, 2) 3
1 (1, 3) 2
2 (2, 3) 2
One caveat of this solution maybe on the performance issue for large number of rows. I haven't tested it out yet. You may just try and see if it can fulfill your system performance consideration.
Upvotes: 0