B Furtado
B Furtado

Reputation: 1500

Pandas outer merge returning out of place values and extra Nans

I have two dataframes: fpm and real. See the samples below:

          month   fpm          region_id
94934     11  3.106522e+07    5300108
94935     23  3.476453e+07    5300108
94936     35  4.480962e+07    5300108
94937     47  4.148533e+07    5300108
94938     59  4.324909e+07    5300108
94939     71  5.908792e+07    5300108
94940     83  6.218772e+07    5300108
94941     95  6.881312e+07    5300108

        region_id  month    gdp_region
72397    5300108     35  5.390220e+10
72398    5300108     47  5.845612e+10
72399    5300108     59  6.707650e+10
72400    5300108     71  7.573268e+10
72401    5300108     83  8.466141e+10
72402    5300108     95  9.340400e+10

I want to merge them by region_idand month.

In order to do so, I used the command:

j = pd.merge(real, fpm, how='outer', on=['region_id', 'month'], left_index='off', right_index='off')

What happens is that some alien values from fpm column joins the same row of region_id 5300108 and the actual gdp_region is not actually joined at the correct region_id and month.

I have checked and both columns are of the same type: int

Here is the WRONG result. Notice that the INDEXES are different. I explicitl set index='off'.

   region_id  month    gdp_region           fpm
72397    5300108     35  5.390220e+10  1.649367e+07
72398    5300108     47  5.845612e+10  1.968157e+07
72399    5300108     59  6.707650e+10  2.088269e+07
72400    5300108     71  7.573268e+10  4.027545e+06
72401    5300108     83  8.466141e+10  1.197713e+06
72402    5300108     95  9.340400e+10  1.383501e+06
72403    5300108    107  1.102996e+11  1.718117e+06
72404    5300108    119  1.243238e+11  1.827867e+06
72405    5300108    131  1.441741e+11  2.053814e+06
72406    5300108    143  1.545690e+11  2.597804e+06
72407    5300108    155  1.641013e+11  2.908494e+06
72408    5300108    167  1.759067e+11  3.394452e+06
72409    5300108    179  1.974321e+11  4.022392e+06
94934    5300108     11           NaN  3.106522e+07
94935    5300108     23           NaN  3.476453e+07
94936    5300108     35           NaN  4.480962e+07
94937    5300108     47           NaN  4.148533e+07
94938    5300108     59           NaN  4.324909e+07
94939    5300108     71           NaN  5.908792e+07
94940    5300108     83           NaN  6.218772e+07
94941    5300108     95           NaN  6.881312e+07

Appreciate any suggestions. Thanks.

Upvotes: 2

Views: 370

Answers (3)

vestland
vestland

Reputation: 61074

A snippet with reproducible input would be helpful. The same goes for your desired output. You could try this at first with left_index = False and not 'Off':

import pandas as pd

fpm = pd.DataFrame({'idx':[94934,94935,94936,94937,94938,94939,94940,   94941],
                   'month': [11,23,35,47,59,71,83,95],
                   'fpm':   [3.106522e+07,3.476453e+07,4.480962e+07,    4.148533e+07,   4.324909e+07,   5.908792e+07,6.218772e+07,6.881312e+07],
                   'region_id':[5300108,5300108,5300108,5300108,5300108,5300108,5300108,5300108]})
fpm = fpm.set_index(['idx'])

real = pd.DataFrame({'idx':[72397,72398,72399,72400,72401,72402],
                   'region_id':[5300108,5300108,5300108,5300108,5300108,5300108],
                   'month':[35,47,59,71,83,95],
                    'gdp_region':[5.390220e+10,5.845612e+10,6.707650e+10,7.573268e+10,8.466141e+10,9.340400e+10]})

real = real.set_index(['idx'])

j = pd.merge(real, fpm, how='outer', on=['region_id', 'month'], left_index=False, right_index=False).sort_values(['region_id', 'month'])
print(j)

Does this represent your desired output?

enter image description here

Upvotes: 1

Uvar
Uvar

Reputation: 3462

For a nearly symmetric solution in which the months are in the same order no matter which merge you execute, you can use (nearly) equivalent expressions:

j = fpm.merge(real, how = 'outer', on = ['region_id', 'month']).sort_values('month')

j2 = real.merge(fpm, how = 'outer', on = ['region_id', 'month']).sort_values('month')

j3 = pd.merge(real, fpm, how = 'outer', on = ['region_id', 'month']).sort_values('month')

Please notice that left_index='off' and the like is not how it is supported by pandas. It uses booleans, of which the default value is set to False, which is exactly the behaviour you'd like.

It returns for me:

   month         fpm  region_id    gdp_region
0     11  31065220.0    5300108           NaN
1     23  34764530.0    5300108           NaN
2     35  44809620.0    5300108  5.390220e+10
3     47  41485330.0    5300108  5.845612e+10
4     59  43249090.0    5300108  6.707650e+10
5     71  59087920.0    5300108  7.573268e+10
6     83  62187720.0    5300108  8.466141e+10
7     95  68813120.0    5300108  9.340400e+10

If that is not your result, then probably, there is an extra column called "fpm" in your dataframe called "real". I can, however, recreate your original problem by setting left_index=True, right_index=True, so I think your keyword definition using "off" made python interpret a boolean True instead of the required behaviour.

Now, using left_index=True will fill up missing indices because of data length difference with the last known index value:

j4 = pd.merge(fpm, real, how = 'outer', on = ['region_id', 'month'], left_index=True, right_index=False).sort_values('month')
           month         fpm  region_id    gdp_region
72402     11  31065220.0    5300108           NaN
72402     23  34764530.0    5300108           NaN
72397     35  44809620.0    5300108  5.390220e+10
72398     47  41485330.0    5300108  5.845612e+10
72399     59  43249090.0    5300108  6.707650e+10
72400     71  59087920.0    5300108  7.573268e+10
72401     83  62187720.0    5300108  8.466141e+10
72402     95  68813120.0    5300108  9.340400e+10

The other way around, the index is retained:

j5 = pd.merge(fpm, real, how = 'outer', on = ['region_id', 'month'], left_index=False, right_index=True).sort_values('month')
       month         fpm  region_id    gdp_region
94934     11  31065220.0    5300108           NaN
94935     23  34764530.0    5300108           NaN
94936     35  44809620.0    5300108  5.390220e+10
94937     47  41485330.0    5300108  5.845612e+10
94938     59  43249090.0    5300108  6.707650e+10
94939     71  59087920.0    5300108  7.573268e+10
94940     83  62187720.0    5300108  8.466141e+10
94941     95  68813120.0    5300108  9.340400e+10

And using both as True will just use all of the index-value combinations, and you end up with a lot of NaNs, or possibly if you have the actual data, fill with the values of that particular index-value combination:

j6 = pd.merge(fpm, real, how = 'outer', on = ['region_id', 'month'], left_index=True, right_index=True).sort_values('month')
       month         fpm  region_id    gdp_region
94934     11  31065220.0    5300108           NaN
94935     23  34764530.0    5300108           NaN
72397     35         NaN    5300108  5.390220e+10
94936     35  44809620.0    5300108           NaN
72398     47         NaN    5300108  5.845612e+10
94937     47  41485330.0    5300108           NaN
72399     59         NaN    5300108  6.707650e+10
94938     59  43249090.0    5300108           NaN
72400     71         NaN    5300108  7.573268e+10
94939     71  59087920.0    5300108           NaN
72401     83         NaN    5300108  8.466141e+10
94940     83  62187720.0    5300108           NaN
72402     95         NaN    5300108  9.340400e+10
94941     95  68813120.0    5300108           NaN

Upvotes: 2

i.n.n.m
i.n.n.m

Reputation: 3046

Try something simple like this,

j = real.merge(fpm, how = 'right', on = ['region_id', 'month'])

As pandas document mentions, "If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on".

Upvotes: 2

Related Questions