Reputation: 1968
I have two Pandas DataFrames with similar but incomplete data. It's mostly country-related data; here is some fake Population (millions) data as an example:
Table A
| Country | 2014 | 2015 | 2016 | 2017 | 2018 |
|--------:|------|------|------|------|------|
| USA | nan | nan | 323 | 325 | 328 |
| UK | nan | nan | nan | 63 | 65 |
| India | nan | nan | 800 | nan | 1100 |
| China | nan | nan | 1100 | 1200 | 1300 |
Table B
| Country | 2014 | 2015 | 2016 | 2017 | 2019 |
|--------:|------|------|------|------|------|
| USA | 319 | 321 | 324 | nan | 330 |
| UK | 58 | 60 | nan | nan | 68 |
| India | 780 | 810 | 820 | nan | 1300 |
| Nigeria | 90 | 100 | 105 | nan | 110 |
The tables above demonstrate a few traits of the datasets I've found: the two sources cover overlapping but different sets of years and countries, and where they overlap, some values conflict slightly.
I want to combine them such that the nans in Table A are filled from Table B where values exist, and columns & rows are added when they are missing. If there is a conflict, I don't care which value is used (the differences are not significant enough).
Basically I would like the result to look like this (where both tables have a value, either one may be used):
Table A union Table B
| Country | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|--------:|------|------|------|------|------|------|
| USA | 319 | 321 | 323 | 325 | 328 | 330 |
| UK | 58 | 60 | nan | 63 | 65 | 68 |
| India | 780 | 810 | 820 | nan | 1100 | 1300 |
| China | nan | nan | 1100 | 1200 | 1300 | nan |
| Nigeria | 90 | 100 | 105 | nan | nan | 110 |
Intuitively, I just want the table to be as complete as possible. Is there a way to do this without running a for loop? The tables are big and I don't want to deal with writing a long loop.
Upvotes: 1
Views: 41
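For anyone wanting to reproduce the answers below, here is one way to build the example tables as DataFrames (the answers refer to them variously as A/B or df1/df2):

```python
import numpy as np
import pandas as pd

# Table A (df1) and Table B (df2) from the question;
# np.nan marks the missing entries.
df1 = pd.DataFrame({
    'Country': ['USA', 'UK', 'India', 'China'],
    '2014': [np.nan] * 4,
    '2015': [np.nan] * 4,
    '2016': [323, np.nan, 800, 1100],
    '2017': [325, 63, np.nan, 1200],
    '2018': [328, 65, 1100, 1300],
})
df2 = pd.DataFrame({
    'Country': ['USA', 'UK', 'India', 'Nigeria'],
    '2014': [319, 58, 780, 90],
    '2015': [321, 60, 810, 100],
    '2016': [324, np.nan, 820, 105],
    '2017': [np.nan] * 4,
    '2019': [330, 68, 1300, 110],
})
```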
Reputation: 61920
You could do:
result = pd.concat([
    A.melt(id_vars='Country', var_name='Year', value_name='Population'),
    B.melt(id_vars='Country', var_name='Year', value_name='Population'),
]).dropna()
print(pd.crosstab(index=result.Country, columns=result.Year,
                  values=result.Population, aggfunc='mean'))
Output
Year 2014 2015 2016 2017 2018 2019
Country
China NaN NaN 1100.0 1200.0 1300.0 NaN
India 780.0 810.0 810.0 NaN 1100.0 1300.0
Nigeria 90.0 100.0 105.0 NaN NaN 110.0
UK 58.0 60.0 NaN 63.0 65.0 68.0
USA 319.0 321.0 323.5 325.0 328.0 330.0
Upvotes: 0
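One small variation, in case averaging conflicting values is unwanted: passing aggfunc='first' to crosstab keeps one source's value per cell instead of the mean. A minimal self-contained sketch (the two-country melted frame here is made up to show a conflict):

```python
import pandas as pd

# Hypothetical excerpt of the melted data, with a conflicting
# USA/2016 value from each source (323 vs 324).
result = pd.DataFrame({
    'Country': ['USA', 'USA', 'India', 'India'],
    'Year': ['2016', '2016', '2016', '2016'],
    'Population': [323, 324, 800, 820],
})

# aggfunc='first' keeps the first value per cell instead of averaging.
out = pd.crosstab(index=result.Country, columns=result.Year,
                  values=result.Population, aggfunc='first')
print(out)
```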
Reputation: 30930
Use:
df_join = (df2.merge(df1, how='outer')
              .groupby('Country')
              .apply(lambda x: x.ffill().bfill())
              .drop_duplicates('Country'))
print(df_join)
Country 2014 2015 2016 2017 2019 2018
0 USA 319.0 321.0 324.0 325.0 330.0 328.0
1 UK 58.0 60.0 NaN 63.0 68.0 65.0
2 India 780.0 810.0 820.0 NaN 1300.0 1100.0
3 Nigeria 90.0 100.0 105.0 NaN 110.0 NaN
7 China NaN NaN 1100.0 1200.0 NaN 1300.0
Upvotes: 3
Reputation: 150785
You can use concat and groupby:
(pd.concat([df1, df2], sort=False)
   .groupby('Country', sort=False, as_index=False)
   .agg('first')
)
Output:
Country 2014 2015 2016 2017 2018 2019
0 USA 319.0 321.0 323.0 325.0 328.0 330.0
1 UK 58.0 60.0 NaN 63.0 65.0 68.0
2 India 780.0 810.0 800.0 NaN 1100.0 1300.0
3 China NaN NaN 1100.0 1200.0 1300.0 NaN
4 Nigeria 90.0 100.0 105.0 NaN NaN 110.0
Upvotes: 2
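One more option, not covered in the answers above: if Country is moved into the index, combine_first does exactly this kind of union-and-fill in one call, keeping the caller's value where present and falling back to the other frame over the union of rows and columns. A minimal sketch on a cut-down version of the example data:

```python
import numpy as np
import pandas as pd

# Small excerpt of the example tables, with Country as a column.
df1 = pd.DataFrame({'Country': ['USA', 'UK'],
                    '2016': [323.0, np.nan],
                    '2018': [328.0, 65.0]})
df2 = pd.DataFrame({'Country': ['USA', 'Nigeria'],
                    '2016': [324.0, 105.0],
                    '2019': [330.0, 110.0]})

# combine_first aligns on the index, so set Country as the index first;
# df1's values win on conflicts, df2 fills the gaps.
out = df1.set_index('Country').combine_first(df2.set_index('Country'))
print(out)
```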