Reputation: 1968
I have two Pandas DataFrames with similar but incomplete data. It's mostly country-related data; here is some fake Population (millions) data as an example:
Table A
| Country | 2014 | 2015 | 2016 | 2017 | 2018 |
|--------:|------|------|------|------|------|
| USA | nan | nan | 323 | 325 | 328 |
| UK | nan | nan | nan | 63 | 65 |
| India | nan | nan | 800 | nan | 1100 |
| China | nan | nan | 1100 | 1200 | 1300 |
Table B
| Country | 2014 | 2015 | 2016 | 2017 | 2019 |
|--------:|------|------|------|------|------|
| USA | 319 | 321 | 324 | nan | 330 |
| UK | 58 | 60 | nan | nan | 68 |
| India | 780 | 810 | 820 | nan | 1300 |
| Nigeria | 90 | 100 | 105 | nan | 110 |
The tables above demonstrate a few traits of the datasets I've found: the two sources cover overlapping but different sets of years and countries, and where they overlap, some values conflict slightly.
I want to combine them such that the nans in Table A are filled from Table B where values exist, and columns & rows are added when they are missing. If there is a conflict, I don't care which value is used (the differences are not significant enough).
Basically I would like the result to look like this (where both tables have a value, either one may be used):
Table A union Table B
| Country | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|--------:|------|------|------|------|------|------|
| USA | 319 | 321 | 323 | 325 | 328 | 330 |
| UK | 58 | 60 | nan | 63 | 65 | 68 |
| India | 780 | 810 | 820 | nan | 1100 | 1300 |
| China | nan | nan | 1100 | 1200 | 1300 | nan |
| Nigeria | 90 | 100 | 105 | nan | nan | 110 |
Intuitively, I just want the table to be as complete as possible. Is there a way to do this without running a for loop? The tables are big and I don't want to deal with writing a long loop.
Upvotes: 1
Views: 41
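For anyone wanting to reproduce the answers below, here is one way to build the example tables as DataFrames (the answers refer to them variously as A/B or df1/df2):

```python
import numpy as np
import pandas as pd

# Table A (df1) and Table B (df2) from the question;
# np.nan marks the missing entries.
df1 = pd.DataFrame({
    'Country': ['USA', 'UK', 'India', 'China'],
    '2014': [np.nan] * 4,
    '2015': [np.nan] * 4,
    '2016': [323, np.nan, 800, 1100],
    '2017': [325, 63, np.nan, 1200],
    '2018': [328, 65, 1100, 1300],
})
df2 = pd.DataFrame({
    'Country': ['USA', 'UK', 'India', 'Nigeria'],
    '2014': [319, 58, 780, 90],
    '2015': [321, 60, 810, 100],
    '2016': [324, np.nan, 820, 105],
    '2017': [np.nan] * 4,
    '2019': [330, 68, 1300, 110],
})
```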
Reputation: 61920
You could do:
result = pd.concat([
    A.melt(id_vars='Country', var_name='Year', value_name='Population'),
    B.melt(id_vars='Country', var_name='Year', value_name='Population'),
]).dropna()
print(pd.crosstab(index=result.Country, columns=result.Year,
                  values=result.Population, aggfunc='mean'))
Output
Year 2014 2015 2016 2017 2018 2019
Country
China NaN NaN 1100.0 1200.0 1300.0 NaN
India 780.0 810.0 810.0 NaN 1100.0 1300.0
Nigeria 90.0 100.0 105.0 NaN NaN 110.0
UK 58.0 60.0 NaN 63.0 65.0 68.0
USA 319.0 321.0 323.5 325.0 328.0 330.0
Upvotes: 0
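One small variation, in case averaging conflicting values is unwanted: passing aggfunc='first' to crosstab keeps one source's value per cell instead of the mean. A minimal self-contained sketch (the two-country melted frame here is made up to show a conflict):

```python
import pandas as pd

# Hypothetical excerpt of the melted data, with a conflicting
# USA/2016 value from each source (323 vs 324).
result = pd.DataFrame({
    'Country': ['USA', 'USA', 'India', 'India'],
    'Year': ['2016', '2016', '2016', '2016'],
    'Population': [323, 324, 800, 820],
})

# aggfunc='first' keeps the first value per cell instead of averaging.
out = pd.crosstab(index=result.Country, columns=result.Year,
                  values=result.Population, aggfunc='first')
print(out)
```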
Reputation: 30930
Use:
df_join = (df2.merge(df1, how='outer')
              .groupby('Country')
              .apply(lambda x: x.ffill().bfill())
              .drop_duplicates('Country'))
print(df_join)
Country 2014 2015 2016 2017 2019 2018
0 USA 319.0 321.0 324.0 325.0 330.0 328.0
1 UK 58.0 60.0 NaN 63.0 68.0 65.0
2 India 780.0 810.0 820.0 NaN 1300.0 1100.0
3 Nigeria 90.0 100.0 105.0 NaN 110.0 NaN
7 China NaN NaN 1100.0 1200.0 NaN 1300.0
Upvotes: 3
Reputation: 150785
You can use concat and groupby:
(pd.concat([df1, df2], sort=False)
   .groupby('Country', sort=False, as_index=False)
   .agg('first')
)
Output:
Country 2014 2015 2016 2017 2018 2019
0 USA 319.0 321.0 323.0 325.0 328.0 330.0
1 UK 58.0 60.0 NaN 63.0 65.0 68.0
2 India 780.0 810.0 800.0 NaN 1100.0 1300.0
3 China NaN NaN 1100.0 1200.0 1300.0 NaN
4 Nigeria 90.0 100.0 105.0 NaN NaN 110.0
Upvotes: 2
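One more option, not covered in the answers above: if Country is moved into the index, combine_first does exactly this kind of union-and-fill in one call, keeping the caller's value where present and falling back to the other frame over the union of rows and columns. A minimal sketch on a cut-down version of the example data:

```python
import numpy as np
import pandas as pd

# Small excerpt of the example tables, with Country as a column.
df1 = pd.DataFrame({'Country': ['USA', 'UK'],
                    '2016': [323.0, np.nan],
                    '2018': [328.0, 65.0]})
df2 = pd.DataFrame({'Country': ['USA', 'Nigeria'],
                    '2016': [324.0, 105.0],
                    '2019': [330.0, 110.0]})

# combine_first aligns on the index, so set Country as the index first;
# df1's values win on conflicts, df2 fills the gaps.
out = df1.set_index('Country').combine_first(df2.set_index('Country'))
print(out)
```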