Look up and running totals in Pandas

Question

I have 3 dataframes in Pandas:

1) user_interests:

With 'user' as an id, and 'interest' as an interest:

2) similarity_score:

With 'user' as a unique id matching ids in user_interests:

3) similarity_total:

With 'interest' being a list of all the unique interests in user_interests:

What I need to do:

Step 1: Look up each interest from similarity_total in user_interests

Step 2: Take the corresponding user from user_interests and match it to the user in similarity_score

Step 3: Take the corresponding similarity_score from similarity_score and add it to the corresponding interest in similarity_total

The ultimate objective is to total the similarity scores of all users interested in each subject in similarity_total. A diagram may help:

I know this can be done in Pandas in one line; however, I am not there yet. If anyone can point me in the right direction, that would be amazing. Thanks!

Scott Boston · Accepted Answer

IIUC, I think you need:

user_interests['similarity_score'] = user_interests['user'].map(similarity_score.set_index('user')['similarity_score'])

similarity_total = user_interests.groupby('interest', as_index=False)['similarity_score'].sum()

Output:

            interest  similarity_score
0           Big Data          1.000000
1          Cassandra          1.338062
2              HBase          0.338062
3              Hbase          1.000000
4               Java          1.154303
5            MongoDB          0.338062
6              NoSQL          0.338062
7           Postgres          0.338062
8             Python          0.154303
9                  R          0.154303
10             Spark          1.000000
11             Storm          1.000000
12     decision tree          0.000000
13            libsvm          0.000000
14  machine learning          0.000000
15             numpy          0.000000
16            pandas          0.000000
17       probability          0.000000
18        regression          0.000000
19      scikit-learn          0.000000
20             scipy          0.000000
21        statistics          0.000000
22       statsmodels          0.000000
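
For anyone who wants to try the approach end to end, here is a minimal self-contained sketch. The toy data is invented for illustration (the question's actual tables are not shown), and the frame and column names are assumed from the question's description: 'user' and 'interest' in user_interests, 'user' and 'similarity_score' in similarity_score.

```python
import pandas as pd

# Toy data, invented for illustration only.
user_interests = pd.DataFrame({
    "user": [0, 0, 1, 1, 2],
    "interest": ["Java", "Spark", "Java", "Python", "Python"],
})
similarity_score = pd.DataFrame({
    "user": [0, 1, 2],
    "similarity_score": [1.0, 0.5, 0.25],
})

# Steps 1-2: build a user -> score lookup and attach the score
# to every (user, interest) row.
scores_by_user = similarity_score.set_index("user")["similarity_score"]
user_interests["similarity_score"] = user_interests["user"].map(scores_by_user)

# Step 3: add up the scores of all users sharing each interest.
similarity_total = (
    user_interests.groupby("interest", as_index=False)["similarity_score"].sum()
)
print(similarity_total)
```

With this toy data, Java totals 1.5 (users 0 and 1), Python 0.75 (users 1 and 2), and Spark 1.0 (user 0 only). Note that a user missing from similarity_score would map to NaN, which sum() skips by default.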
