Nicholas
Nicholas

Reputation: 3737

Python Pandas - Joining Dataframe with 'series' into another Dataframe

I have taken a dataframe with multiple rows per reference (because of a score), and have taken those scores and created a new dataframe which is one line per reference and the scores in a series. This is all working.

However, when I try to add the series dataframe back to the old dataframe, it is working, but is showing 'nan' for every cell.

import pandas as pd

df = pd.read_csv('GAPData.csv', encoding='cp1252')
df1 = df[['GrantRefNumber','Academic_Reviews']].copy()
dfGroup1 = df1.groupby('GrantRefNumber').apply(lambda x: list(x.Academic_Reviews))

dfGroup1 produces this...

GrantRefNumber
D/G001118       [5, 6, 6]
D/P041236       [5, 2, 6]
D/P753396    [2, 2, 6, 5]
D/P043434    [2, 5, 4, 5]
D/P034285    [4, 4, 6, 3]

Then I run:

df['Academic Reviews'] = dfGroup1.groupby(
    'GrantRefNumber').apply(lambda x: list(x.Academic_Reviews))
df.drop_duplicates('GrantRefNumber', inplace=True)

Which all compiles, but I am left with 'nan' for every cell under the new column 'Academic reviews', not the series.

Any advice on what I am doing wrong?

Upvotes: 1

Views: 53

Answers (1)

jezrael
jezrael

Reputation: 863116

There is problem your indexes are no aligned, so get all NaNs in column.

I think you need set_index with original df and then join:

df = df.set_index('GrantRefNumber')
df = df.join(dfGroup1.rename('Academic Reviews')).reset_index()

Sample:

df = pd.DataFrame({'GrantRefNumber':['D/G001118'] * 3 + ['D/P041236'] * 3 + ['D/P753396'] * 4,
                   'Academic_Reviews':[5, 6, 6, 5, 2, 6, 2, 2, 6, 5],
                   'another_data':[7, 8, 1, 8, 2, 8, 2, 0, 1, 5]})

print (df)
   Academic_Reviews GrantRefNumber  another_data
0                 5      D/G001118             7
1                 6      D/G001118             8
2                 6      D/G001118             1
3                 5      D/P041236             8
4                 2      D/P041236             2
5                 6      D/P041236             8
6                 2      D/P753396             2
7                 2      D/P753396             0
8                 6      D/P753396             1
9                 5      D/P753396             5

dfGroup1 = df.groupby('GrantRefNumber')['Academic_Reviews'].apply(list)
print (dfGroup1)
GrantRefNumber
D/G001118       [5, 6, 6]
D/P041236       [5, 2, 6]
D/P753396    [2, 2, 6, 5]
Name: Academic_Reviews, dtype: object

df = df.set_index('GrantRefNumber')
df = df.join(dfGroup1.rename('Academic_Reviews_list')).reset_index()
print (df)
  GrantRefNumber  Academic_Reviews  another_data Academic_Reviews_list
0      D/G001118                 5             7             [5, 6, 6]
1      D/G001118                 6             8             [5, 6, 6]
2      D/G001118                 6             1             [5, 6, 6]
3      D/P041236                 5             8             [5, 2, 6]
4      D/P041236                 2             2             [5, 2, 6]
5      D/P041236                 6             8             [5, 2, 6]
6      D/P753396                 2             2          [2, 2, 6, 5]
7      D/P753396                 2             0          [2, 2, 6, 5]
8      D/P753396                 6             1          [2, 2, 6, 5]
9      D/P753396                 5             5          [2, 2, 6, 5]

Upvotes: 1

Related Questions