Reputation: 3337
I am new to Pandas and have been playing around a bit. I want to essential join two data frames based on a common index, user id. However, I want to do a little bit more as well. In the first table (which contains user ids and their age), each user id is unique. In the second table (which contains user ids and their courses, but is not sorted by user id), there can be multiple of the same user id (since a user can take multiple courses, but each row only contains one user id and one course). The dataset is fairly large, so this is not something I can manually do. If I want to join so that I still maintain the property that user id's in the resulting table is unique so that each row contains {user id, age, all their courses}, how would I do so.
For example, I might have something like
Table 1: Table 2:
User Id Age User Id Coursework
1 18 1 Pre Calculus
2 17 2 Chemistry
3 18 3 English
4 16 2 Linear Algebra
1 World History
4 Multivariable Calculus
1 Psychology
3 Government
4 Physics (E&M)
And I want the result to look like:
User Id Age Coursework
1 18 Pre Calculus, World History, Psychology
2 17 Chemistry, Linear Algebra
3 18 English, Government
4 16 Multivariable Calculus, Physics (E&M)
How would I go about doing this in a way that is as simple as possible? It might require a different approach than just merging and doing something on top of that. Anyway, this would be really useful for something I am doing with a large dataset. Thanks for your help in advance!
Upvotes: 1
Views: 324
Reputation: 863176
I think you need groupby
with apply
join
with merge
:
df2 = df2.groupby('User Id')['Coursework'].apply(', '.join).reset_index()
#default inner join
df = pd.merge(df1, df2, on='User Id')
#for left join (if some values in df2 are missing)
#df = pd.merge(df1, df2, on='User Id', how='left')
print (df)
User Id Age Coursework
0 1 18 Pre Calculus, World History, Psychology
1 2 17 Chemistry, Linear Algebra
2 3 18 English, Government
3 4 16 Multivariable Calculus, Physics (E&M)
Another solution with concat
:
df2 = df2.groupby('User Id')['Coursework'].apply(', '.join)
df1 = df1.set_index('User Id')
df = pd.concat([df1, df2], axis=1, join='inner').reset_index()
print (df)
User Id Age Coursework
0 1 18 Pre Calculus, World History, Psychology
1 2 17 Chemistry, Linear Algebra
2 3 18 English, Government
3 4 16 Multivariable Calculus, Physics (E&M)
Upvotes: 1
Reputation: 210892
I'd first transform (group) the second DF and then merge it with the first DF:
In [11]: b.groupby('User_Id', as_index=False)[['Coursework']] \
.agg(', '.join).merge(a, on='User_Id')
Out[11]:
User_Id Coursework Age
0 1 Pre Calculus, World History, Psychology 18
1 2 Chemistry, Linear Algebra 17
2 3 English, Government 18
3 4 Multivariable Calculus, Physics (E&M) 16
Data:
In [12]: a
Out[12]:
User_Id Age
0 1 18
1 2 17
2 3 18
3 4 16
In [13]: b
Out[13]:
User_Id Coursework
0 1 Pre Calculus
1 2 Chemistry
2 3 English
3 2 Linear Algebra
4 1 World History
5 4 Multivariable Calculus
6 1 Psychology
7 3 Government
8 4 Physics (E&M)
Explanation:
In [15]: b.groupby('User_Id', as_index=False)[['Coursework']].agg(', '.join)
Out[15]:
User_Id Coursework
0 1 Pre Calculus, World History, Psychology
1 2 Chemistry, Linear Algebra
2 3 English, Government
3 4 Multivariable Calculus, Physics (E&M)
Upvotes: 1