Merging two data frames on a common index (without having to create separate rows)

Question

I am new to Pandas and have been playing around a bit. I want to essential join two data frames based on a common index, user id. However, I want to do a little bit more as well. In the first table (which contains user ids and their age), each user id is unique. In the second table (which contains user ids and their courses, but is not sorted by user id), there can be multiple of the same user id (since a user can take multiple courses, but each row only contains one user id and one course). The dataset is fairly large, so this is not something I can manually do. If I want to join so that I still maintain the property that user id's in the resulting table is unique so that each row contains {user id, age, all their courses}, how would I do so.

For example, I might have something like

Table 1:             Table 2:
User Id Age          User Id  Coursework
1       18           1        Pre Calculus
2       17           2        Chemistry
3       18           3        English
4       16           2        Linear Algebra
                     1        World History
                     4        Multivariable Calculus
                     1        Psychology
                     3        Government
                     4        Physics (E&M)

And I want the result to look like:

User Id Age  Coursework
1       18   Pre Calculus, World History, Psychology
2       17   Chemistry, Linear Algebra
3       18   English, Government
4       16   Multivariable Calculus, Physics (E&M)

How would I go about doing this in a way that is as simple as possible? It might require a different approach than just merging and doing something on top of that. Anyway, this would be really useful for something I am doing with a large dataset. Thanks for your help in advance!

jezrael · Accepted Answer

I think you need groupby with apply join with merge:

df2 = df2.groupby('User Id')['Coursework'].apply(', '.join).reset_index()
#default inner join
df = pd.merge(df1, df2, on='User Id')
#for left join (if some values in df2 are missing)
#df = pd.merge(df1, df2, on='User Id', how='left')
print (df)
   User Id  Age                               Coursework
0        1   18  Pre Calculus, World History, Psychology
1        2   17                Chemistry, Linear Algebra
2        3   18                      English, Government
3        4   16    Multivariable Calculus, Physics (E&M)

Another solution with concat:

df2 = df2.groupby('User Id')['Coursework'].apply(', '.join)
df1 = df1.set_index('User Id')
df = pd.concat([df1, df2], axis=1, join='inner').reset_index()
print (df)
   User Id  Age                               Coursework
0        1   18  Pre Calculus, World History, Psychology
1        2   17                Chemistry, Linear Algebra
2        3   18                      English, Government
3        4   16    Multivariable Calculus, Physics (E&M)

Merging two data frames on a common index (without having to create separate rows)

Answers (2)

Related Questions