Jane Sully
Jane Sully

Reputation: 3337

Merging two data frames on a common index (without having to create separate rows)

I am new to Pandas and have been playing around a bit. I want to essential join two data frames based on a common index, user id. However, I want to do a little bit more as well. In the first table (which contains user ids and their age), each user id is unique. In the second table (which contains user ids and their courses, but is not sorted by user id), there can be multiple of the same user id (since a user can take multiple courses, but each row only contains one user id and one course). The dataset is fairly large, so this is not something I can manually do. If I want to join so that I still maintain the property that user id's in the resulting table is unique so that each row contains {user id, age, all their courses}, how would I do so.

For example, I might have something like

Table 1:             Table 2:
User Id Age          User Id  Coursework
1       18           1        Pre Calculus
2       17           2        Chemistry
3       18           3        English
4       16           2        Linear Algebra
                     1        World History
                     4        Multivariable Calculus
                     1        Psychology
                     3        Government
                     4        Physics (E&M)

And I want the result to look like:

User Id Age  Coursework
1       18   Pre Calculus, World History, Psychology
2       17   Chemistry, Linear Algebra
3       18   English, Government
4       16   Multivariable Calculus, Physics (E&M)

How would I go about doing this in a way that is as simple as possible? It might require a different approach than just merging and doing something on top of that. Anyway, this would be really useful for something I am doing with a large dataset. Thanks for your help in advance!

Upvotes: 1

Views: 324

Answers (2)

jezrael
jezrael

Reputation: 863176

I think you need groupby with apply join with merge:

df2 = df2.groupby('User Id')['Coursework'].apply(', '.join).reset_index()
#default inner join
df = pd.merge(df1, df2, on='User Id')
#for left join (if some values in df2 are missing)
#df = pd.merge(df1, df2, on='User Id', how='left')
print (df)
   User Id  Age                               Coursework
0        1   18  Pre Calculus, World History, Psychology
1        2   17                Chemistry, Linear Algebra
2        3   18                      English, Government
3        4   16    Multivariable Calculus, Physics (E&M)

Another solution with concat:

df2 = df2.groupby('User Id')['Coursework'].apply(', '.join)
df1 = df1.set_index('User Id')
df = pd.concat([df1, df2], axis=1, join='inner').reset_index()
print (df)
   User Id  Age                               Coursework
0        1   18  Pre Calculus, World History, Psychology
1        2   17                Chemistry, Linear Algebra
2        3   18                      English, Government
3        4   16    Multivariable Calculus, Physics (E&M)

Upvotes: 1

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210892

I'd first transform (group) the second DF and then merge it with the first DF:

In [11]: b.groupby('User_Id', as_index=False)[['Coursework']] \
          .agg(', '.join).merge(a, on='User_Id')
Out[11]:
   User_Id                               Coursework  Age
0        1  Pre Calculus, World History, Psychology   18
1        2                Chemistry, Linear Algebra   17
2        3                      English, Government   18
3        4    Multivariable Calculus, Physics (E&M)   16

Data:

In [12]: a
Out[12]:
   User_Id  Age
0        1   18
1        2   17
2        3   18
3        4   16

In [13]: b
Out[13]:
   User_Id              Coursework
0        1            Pre Calculus
1        2               Chemistry
2        3                 English
3        2          Linear Algebra
4        1           World History
5        4  Multivariable Calculus
6        1              Psychology
7        3              Government
8        4           Physics (E&M)

Explanation:

In [15]: b.groupby('User_Id', as_index=False)[['Coursework']].agg(', '.join)
Out[15]:
   User_Id                               Coursework
0        1  Pre Calculus, World History, Psychology
1        2                Chemistry, Linear Algebra
2        3                      English, Government
3        4    Multivariable Calculus, Physics (E&M)

Upvotes: 1

Related Questions