Reputation: 1500
I have two dataframes: fpm
and real
. See the samples below:
month fpm region_id
94934 11 3.106522e+07 5300108
94935 23 3.476453e+07 5300108
94936 35 4.480962e+07 5300108
94937 47 4.148533e+07 5300108
94938 59 4.324909e+07 5300108
94939 71 5.908792e+07 5300108
94940 83 6.218772e+07 5300108
94941 95 6.881312e+07 5300108
region_id month gdp_region
72397 5300108 35 5.390220e+10
72398 5300108 47 5.845612e+10
72399 5300108 59 6.707650e+10
72400 5300108 71 7.573268e+10
72401 5300108 83 8.466141e+10
72402 5300108 95 9.340400e+10
I want to merge them by region_id
and month
.
In order to do so, I used the command:
j = pd.merge(real, fpm, how='outer', on=['region_id', 'month'], left_index='off', right_index='off')
What happens is that some alien values from fpm
column joins the same row of region_id 5300108
and the actual gdp_region
is not actually joined at the correct region_id
and month
.
I have checked and both columns are of the same type: int
Here is the WRONG result. Notice that the INDEXES are different. I explicitl set index='off'
.
region_id month gdp_region fpm
72397 5300108 35 5.390220e+10 1.649367e+07
72398 5300108 47 5.845612e+10 1.968157e+07
72399 5300108 59 6.707650e+10 2.088269e+07
72400 5300108 71 7.573268e+10 4.027545e+06
72401 5300108 83 8.466141e+10 1.197713e+06
72402 5300108 95 9.340400e+10 1.383501e+06
72403 5300108 107 1.102996e+11 1.718117e+06
72404 5300108 119 1.243238e+11 1.827867e+06
72405 5300108 131 1.441741e+11 2.053814e+06
72406 5300108 143 1.545690e+11 2.597804e+06
72407 5300108 155 1.641013e+11 2.908494e+06
72408 5300108 167 1.759067e+11 3.394452e+06
72409 5300108 179 1.974321e+11 4.022392e+06
94934 5300108 11 NaN 3.106522e+07
94935 5300108 23 NaN 3.476453e+07
94936 5300108 35 NaN 4.480962e+07
94937 5300108 47 NaN 4.148533e+07
94938 5300108 59 NaN 4.324909e+07
94939 5300108 71 NaN 5.908792e+07
94940 5300108 83 NaN 6.218772e+07
94941 5300108 95 NaN 6.881312e+07
Appreciate any suggestions. Thanks.
Upvotes: 2
Views: 370
Reputation: 61074
A snippet with reproducible input would be helpful. The same goes for your desired output. You could try this at first with left_index = False
and not 'Off':
import pandas as pd
fpm = pd.DataFrame({'idx':[94934,94935,94936,94937,94938,94939,94940, 94941],
'month': [11,23,35,47,59,71,83,95],
'fpm': [3.106522e+07,3.476453e+07,4.480962e+07, 4.148533e+07, 4.324909e+07, 5.908792e+07,6.218772e+07,6.881312e+07],
'region_id':[5300108,5300108,5300108,5300108,5300108,5300108,5300108,5300108]})
fpm = fpm.set_index(['idx'])
real = pd.DataFrame({'idx':[72397,72398,72399,72400,72401,72402],
'region_id':[5300108,5300108,5300108,5300108,5300108,5300108],
'month':[35,47,59,71,83,95],
'gdp_region':[5.390220e+10,5.845612e+10,6.707650e+10,7.573268e+10,8.466141e+10,9.340400e+10]})
real = real.set_index(['idx'])
j = pd.merge(real, fpm, how='outer', on=['region_id', 'month'], left_index=False, right_index=False).sort_values(['region_id', 'month'])
print(j)
Does this represent your desired output?
Upvotes: 1
Reputation: 3462
For a nearly symmetric solution in which the months are in the same order no matter which merge you execute, you can use (nearly) equivalent expressions:
j = fpm.merge(real, how = 'outer', on = ['region_id', 'month']).sort_values('month')
j2 = real.merge(fpm, how = 'outer', on = ['region_id', 'month']).sort_values('month')
j3 = pd.merge(real, fpm, how = 'outer', on = ['region_id', 'month']).sort_values('month')
Please notice that left_index='off'
and the like is not how it is supported by pandas. It uses booleans, of which the default value is set to False, which is exactly the behaviour you'd like.
It returns for me:
month fpm region_id gdp_region
0 11 31065220.0 5300108 NaN
1 23 34764530.0 5300108 NaN
2 35 44809620.0 5300108 5.390220e+10
3 47 41485330.0 5300108 5.845612e+10
4 59 43249090.0 5300108 6.707650e+10
5 71 59087920.0 5300108 7.573268e+10
6 83 62187720.0 5300108 8.466141e+10
7 95 68813120.0 5300108 9.340400e+10
If that is not your result, then probably, there is an extra column called "fpm" in your dataframe called "real". I can, however, recreate your original problem by setting left_index=True, right_index=True
, so I think your keyword definition using "off" made python interpret a boolean True
instead of the required behaviour.
Now, using left_index=True
will fill up missing indices because of data length difference with the last known index value:
j4 = pd.merge(fpm, real, how = 'outer', on = ['region_id', 'month'], left_index=True, right_index=False).sort_values('month')
month fpm region_id gdp_region
72402 11 31065220.0 5300108 NaN
72402 23 34764530.0 5300108 NaN
72397 35 44809620.0 5300108 5.390220e+10
72398 47 41485330.0 5300108 5.845612e+10
72399 59 43249090.0 5300108 6.707650e+10
72400 71 59087920.0 5300108 7.573268e+10
72401 83 62187720.0 5300108 8.466141e+10
72402 95 68813120.0 5300108 9.340400e+10
The other way around, the index is retained:
j5 = pd.merge(fpm, real, how = 'outer', on = ['region_id', 'month'], left_index=False, right_index=True).sort_values('month')
month fpm region_id gdp_region
94934 11 31065220.0 5300108 NaN
94935 23 34764530.0 5300108 NaN
94936 35 44809620.0 5300108 5.390220e+10
94937 47 41485330.0 5300108 5.845612e+10
94938 59 43249090.0 5300108 6.707650e+10
94939 71 59087920.0 5300108 7.573268e+10
94940 83 62187720.0 5300108 8.466141e+10
94941 95 68813120.0 5300108 9.340400e+10
And using both as True
will just use all of the index-value combinations, and you end up with a lot of NaNs, or possibly if you have the actual data, fill with the values of that particular index-value combination:
j6 = pd.merge(fpm, real, how = 'outer', on = ['region_id', 'month'], left_index=True, right_index=True).sort_values('month')
month fpm region_id gdp_region
94934 11 31065220.0 5300108 NaN
94935 23 34764530.0 5300108 NaN
72397 35 NaN 5300108 5.390220e+10
94936 35 44809620.0 5300108 NaN
72398 47 NaN 5300108 5.845612e+10
94937 47 41485330.0 5300108 NaN
72399 59 NaN 5300108 6.707650e+10
94938 59 43249090.0 5300108 NaN
72400 71 NaN 5300108 7.573268e+10
94939 71 59087920.0 5300108 NaN
72401 83 NaN 5300108 8.466141e+10
94940 83 62187720.0 5300108 NaN
72402 95 NaN 5300108 9.340400e+10
94941 95 68813120.0 5300108 NaN
Upvotes: 2
Reputation: 3046
Try something simple like this,
j = real.merge(fpm, how = 'right', on = ['region_id', 'month'])
As pandas document mentions, "If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on".
Upvotes: 2