Alex
Alex

Reputation: 23

How to make a function for dataframe that checks if big tuple contains all the elements of the small one?

After 24 hours of meaningless tries I don't have another word than "Help!" I have the dataframe contains 2 columns: 'uid' int, 'course_id' as a tuple of 2 or more other ids (int).

uid course_id
1 (1,2)
2 (1,2)
3 (1,3)
4 (1,2,3)
5 (2,3)

A kind of this but with 13k of rows. Although original df where I have ungrouped lines uid-course_id. Like

uid course_id
1 1
1 2
2 1

etc. I need another dataframe with all unique pairs of course_id and number of users that have these pairs in the second column. Like this but bigger (4k of rows):

pair users
(1,2) 3
(1,3) 2
(2,3) 2

I can make a column of all unique pairs but I have no idea how to count users. The closest function was

def user_count(pair):
        total_users = df[
            np.isin(df['course_id'], pair)
        ].groupby(
            'uid', as_index=False
        ).filter(
            lambda x: len(x)==2
        ).groupby(
            'course_id', as_index=False
        ).count()['uid'][0]
        return total_users

for original dataframe and in works when I give an exact tuple, but it doesn't work with df.apply(user_count). Is it easier to use the original df or filtered and combined to tuples? And how can I count all the users with every pair?

Upvotes: 2

Views: 67

Answers (2)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Setup

>>> df # original df

    uid course_id
0     1         1
1     1         2
2     2         1
3     2         2
4     3         1
5     3         3
6     4         1
7     4         2
8     4         3
9     5         2
10    5         3

Solution

from itertools import combinations

onehot = pd.get_dummies(df['course_id'])\
           .set_index(df['uid']).sum(level=0)

cnt = {}
for c in combinations(onehot, r=2):
    cnt[c] = onehot[[*c]].ge(1).all(1).sum()

counts = pd.DataFrame(cnt.items(), columns=['pairs', 'count'])

Explanations

One hot encode the couse_id column to create indicator variables for each unique value in course_id, then reduce the encoded variables along axis=0 by taking sum on level=0

>>> onehot

     1  2  3
uid         
1    1  1  0
2    1  1  0
3    1  0  1
4    1  1  1
5    0  1  1

Now iterate over all the pairs obtain from the combinations of unique values in course_id column, and for each pair calculate the number of rows in the onehot encoded dataframe where these pairs are present

>>> cnt

{(1, 2): 3, (1, 3): 2, (2, 3): 2}

Now create a new dataframe from the above dictionary which contains the counts of unique users for every possible combination of values in course_id

>>> counts

    pairs  count
0  (1, 2)      3
1  (1, 3)      2
2  (2, 3)      2

Upvotes: 1

SeaBean
SeaBean

Reputation: 23217

Assuming you already have a column of all unique pairs, here is a way you can count users:

You can use .map() and count number of items in pair as subset of course_id, as follows:

df2['users'] = df2['pair'].map(lambda x: len([y for y in df1['course_id'].values if set(x).issubset(y)]))


print(df2)

     pair  users
0  (1, 2)      3
1  (1, 3)      2
2  (2, 3)      2

One caveat of this solution maybe on the performance issue for large number of rows. I haven't tested it out yet. You may just try and see if it can fulfill your system performance consideration.

Upvotes: 0

Related Questions