How to make a function for dataframe that checks if big tuple contains all the elements of the small one?

Question

After 24 hours of meaningless tries I don't have another word than "Help!" I have the dataframe contains 2 columns: 'uid' int, 'course_id' as a tuple of 2 or more other ids (int).

uid	course_id
1	(1,2)
2	(1,2)
3	(1,3)
4	(1,2,3)
5	(2,3)

A kind of this but with 13k of rows. Although original df where I have ungrouped lines uid-course_id. Like

uid	course_id
1	1
1	2
2	1

etc. I need another dataframe with all unique pairs of course_id and number of users that have these pairs in the second column. Like this but bigger (4k of rows):

pair	users
(1,2)	3
(1,3)	2
(2,3)	2

I can make a column of all unique pairs but I have no idea how to count users. The closest function was

def user_count(pair):
        total_users = df[
            np.isin(df['course_id'], pair)
        ].groupby(
            'uid', as_index=False
        ).filter(
            lambda x: len(x)==2
        ).groupby(
            'course_id', as_index=False
        ).count()['uid'][0]
        return total_users

for original dataframe and in works when I give an exact tuple, but it doesn't work with df.apply(user_count). Is it easier to use the original df or filtered and combined to tuples? And how can I count all the users with every pair?

Shubham Sharma · Accepted Answer

Setup

>>> df # original df

    uid course_id
0     1         1
1     1         2
2     2         1
3     2         2
4     3         1
5     3         3
6     4         1
7     4         2
8     4         3
9     5         2
10    5         3

Solution

from itertools import combinations

onehot = pd.get_dummies(df['course_id'])\
           .set_index(df['uid']).sum(level=0)

cnt = {}
for c in combinations(onehot, r=2):
    cnt[c] = onehot[[*c]].ge(1).all(1).sum()

counts = pd.DataFrame(cnt.items(), columns=['pairs', 'count'])

Explanations

One hot encode the couse_id column to create indicator variables for each unique value in course_id, then reduce the encoded variables along axis=0 by taking sum on level=0

>>> onehot

     1  2  3
uid         
1    1  1  0
2    1  1  0
3    1  0  1
4    1  1  1
5    0  1  1

Now iterate over all the pairs obtain from the combinations of unique values in course_id column, and for each pair calculate the number of rows in the onehot encoded dataframe where these pairs are present

>>> cnt

{(1, 2): 3, (1, 3): 2, (2, 3): 2}

Now create a new dataframe from the above dictionary which contains the counts of unique users for every possible combination of values in course_id

>>> counts

    pairs  count
0  (1, 2)      3
1  (1, 3)      2
2  (2, 3)      2

How to make a function for dataframe that checks if big tuple contains all the elements of the small one?

Answers (2)

Setup

Solution

Explanations

Related Questions