Reputation: 199
I’m having some trouble while merging 2 different-size DataFrames in Pandas with overlapped columns.
df1 =
+-------+--------+--------+-------+
| value | method | number | price |
+-------+--------+--------+-------+
| 0.5 | add | 489245 | 600 |
| 0.7 | add | 489245 | NaN |
| 0.7 | mul | 584682 | 225 |
| 0.9 | mul | 624602 | NaN |
| 0.95 | mul | 624602 | NaN |
| 0.99 | mul | 624602 | NaN |
| NaN | NaN | 900000 | 300 |
| NaN | add | 900000 | NaN |
+-------+--------+--------+-------+
df2 =
+--------+-------+-----+-----+
| number | price | loc | get |
+--------+-------+-----+-----+
| 489245 | 200 | aa | up |
| 584682 | NaN | ab | NaN |
| 624602 | NaN | bb | NaN |
| 900000 | NaN | cc | dn |
+--------+-------+-----+-----+
The result I expected:
+-------+--------+--------+-------+-----+-----+
| value | method | number | price | loc | get |
+-------+--------+--------+-------+-----+-----+
| 0.5 | add | 489245 | 200 | aa | up |
| 0.7 | add | 489245 | 200 | aa | up |
| 0.7 | mul | 584682 | 225 | ab | NaN |
| 0.9 | mul | 624602 | NaN | bb | NaN |
| 0.95 | mul | 624602 | NaN | bb | NaN |
| 0.99 | mul | 624602 | NaN | bb | NaN |
| NaN | NaN | 900000 | 300 | cc | dn |
| NaN | add | 900000 | NaN | cc | dn |
+-------+--------+--------+-------+-----+-----+
E.g., column to join on = 'number'. In case both df1 and df2 has non-NaN value on overlapped column (like 'price'), df2 will be preferable. Otherwise, one with non-NaN should be written.
Upvotes: 1
Views: 3401
Reputation: 1026
Summary: merge
followed by a little DataFrame manipulation and then update
.
First load in the data (included here for reference).
In [1]: import pandas
In [2]: df1 = pandas.read_csv('df1.csv')
In [3]: df2 = pandas.read_csv('df2.csv')
In [4]: df1
Out[4]:
value method number price
0 0.50 add 489245 600
1 0.70 add 489245 NaN
2 0.70 mul 584682 225
3 0.90 mul 624602 NaN
4 0.95 mul 624602 NaN
5 0.99 mul 624602 NaN
6 NaN NaN 900000 300
7 NaN add 900000 NaN
8 NaN NaN NaN NaN
In [5]: df2
Out[5]:
number price loc get
0 489245 200 aa up
1 584682 NaN ab NaN
2 624602 NaN bb NaN
3 900000 NaN cc dn
Next merge the two data frames on 'number'
In [6]: mdf = pandas.merge(df1,df2,on='number')
In [7]: mdf
Out[7]:
value method number price_x price_y loc get
0 0.50 add 489245 600 200 aa up
1 0.70 add 489245 NaN 200 aa up
2 0.70 mul 584682 225 NaN ab NaN
3 0.90 mul 624602 NaN NaN bb NaN
4 0.95 mul 624602 NaN NaN bb NaN
5 0.99 mul 624602 NaN NaN bb NaN
6 NaN NaN 900000 300 NaN cc dn
7 NaN add 900000 NaN NaN cc dn
When presented with identical columns, pandas.merge
puts both in the merged DataFrame with suffixes appended. In this case we want to update 'price_x'
with 'price_y'
and call the result 'price'
. To do this we can create a data frame that only consists of the data in 'price_y'
, remove the column from the merged DataFrame, and rename the price columns back to 'price'
.
In [8]: pdf = mdf.price_y
In [9]: pdf = pandas.DataFrame(pdf)
In [10]: del mdf['price_y']
In [11]: mdf.rename(columns = {'price_x':'price'},inplace=True)
In [12]: pdf.rename(columns = {'price_y':'price'},inplace=True)
In [13]: mdf
Out[13]:
value method number price loc get
0 0.50 add 489245 600 aa up
1 0.70 add 489245 NaN aa up
2 0.70 mul 584682 225 ab NaN
3 0.90 mul 624602 NaN bb NaN
4 0.95 mul 624602 NaN bb NaN
5 0.99 mul 624602 NaN bb NaN
6 NaN NaN 900000 300 cc dn
7 NaN add 900000 NaN cc dn
In [14]: pdf
Out[14]:
price
0 200
1 200
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
Now we can use .update
to put any non-NaN value in pdf
into mdf
.
In [15]: mdf.update(pdf)
In [16]: mdf
Out[16]:
value method number price loc get
0 0.50 add 489245 200 aa up
1 0.70 add 489245 200 aa up
2 0.70 mul 584682 225 ab NaN
3 0.90 mul 624602 NaN bb NaN
4 0.95 mul 624602 NaN bb NaN
5 0.99 mul 624602 NaN bb NaN
6 NaN NaN 900000 300 cc dn
7 NaN add 900000 NaN cc dn
If a more complicated set of rules for choosing values were desired then replace mdf.update(pdf)
with mdf.combine(pdf, function_of_two_variables_returning_preferred_value)
.
Upvotes: 3
Reputation: 117370
You can use numpy.where()
after merge:
>>> df1 = pd.DataFrame({'number':[1,1,2,2,3], 'price':[600,np.NaN,225,np.NaN,np.NaN], 'method':['add','add','mul','mul','mul']})
>>> df2 = pd.DataFrame({'number':[1,2,3], 'price':[200,np.NaN,np.NaN], 'loc':['aa','bb','cc']})
>>> df3 = pd.merge(df1, df2, on='number', suffixes=['_1', ''])
>>> df3
method number price_1 loc price
0 add 1 600 aa 200
1 add 1 NaN aa 200
2 mul 2 225 bb NaN
3 mul 2 NaN bb NaN
4 mul 3 NaN cc NaN
>>> df3['price'] = np.where(df3['price'].isnull(), df3['price_1'], df3['price'])
>>> df3
method number price_1 loc price
0 add 1 600 aa 200
1 add 1 NaN aa 200
2 mul 2 225 bb 225
3 mul 2 NaN bb NaN
4 mul 3 NaN cc NaN
>>> del df3['price_1']
>>> df3
method number loc price
0 add 1 aa 200
1 add 1 aa 200
2 mul 2 bb 225
3 mul 2 bb NaN
4 mul 3 cc NaN
Another way to do this is to use pandas.Series.fillna()
method:
>>> df3['price'] = df3['price'].fillna(df3['price_1'])
>>> del df3['price_1']
>>> df3
method number loc price
0 add 1 aa 200
1 add 1 aa 200
2 mul 2 bb 225
3 mul 2 bb NaN
4 mul 3 cc NaN
Upvotes: 2