user2915556
user2915556

Reputation: 199

Partial merge in Pandas

I’m having some trouble while merging 2 different-size DataFrames in Pandas with overlapped columns.

df1 =

+-------+--------+--------+-------+
| value | method | number | price |
+-------+--------+--------+-------+
| 0.5   | add    | 489245 | 600   |
| 0.7   | add    | 489245 | NaN   |
| 0.7   | mul    | 584682 | 225   |
| 0.9   | mul    | 624602 | NaN   |
| 0.95  | mul    | 624602 | NaN   |
| 0.99  | mul    | 624602 | NaN   |
| NaN   | NaN    | 900000 | 300   |
| NaN   | add    | 900000 | NaN   |
+-------+--------+--------+-------+

df2 =

+--------+-------+-----+-----+
| number | price | loc | get |
+--------+-------+-----+-----+
| 489245 | 200   | aa  | up  |
| 584682 | NaN   | ab  | NaN |
| 624602 | NaN   | bb  | NaN |
| 900000 | NaN   | cc  | dn  |
+--------+-------+-----+-----+

The result I expected:

+-------+--------+--------+-------+-----+-----+
| value | method | number | price | loc | get |
+-------+--------+--------+-------+-----+-----+
| 0.5   | add    | 489245 | 200   | aa  | up  |
| 0.7   | add    | 489245 | 200   | aa  | up  |
| 0.7   | mul    | 584682 | 225   | ab  | NaN |
| 0.9   | mul    | 624602 | NaN   | bb  | NaN |
| 0.95  | mul    | 624602 | NaN   | bb  | NaN |
| 0.99  | mul    | 624602 | NaN   | bb  | NaN |
| NaN   | NaN    | 900000 | 300   | cc  | dn  |
| NaN   | add    | 900000 | NaN   | cc  | dn  |
+-------+--------+--------+-------+-----+-----+

E.g., column to join on = 'number'. In case both df1 and df2 has non-NaN value on overlapped column (like 'price'), df2 will be preferable. Otherwise, one with non-NaN should be written.

Upvotes: 1

Views: 3401

Answers (2)

Chris H.
Chris H.

Reputation: 1026

Summary: merge followed by a little DataFrame manipulation and then update.

First load in the data (included here for reference).

In [1]: import pandas
In [2]: df1 = pandas.read_csv('df1.csv')
In [3]: df2 = pandas.read_csv('df2.csv')
In [4]: df1
Out[4]: 
   value method  number  price
0   0.50    add  489245    600
1   0.70    add  489245    NaN
2   0.70    mul  584682    225
3   0.90    mul  624602    NaN
4   0.95    mul  624602    NaN
5   0.99    mul  624602    NaN
6    NaN    NaN  900000    300
7    NaN    add  900000    NaN
8    NaN    NaN     NaN    NaN

In [5]: df2
Out[5]: 
   number  price loc  get
0  489245    200  aa  up 
1  584682    NaN  ab  NaN
2  624602    NaN  bb  NaN
3  900000    NaN  cc  dn 

Next merge the two data frames on 'number'

In [6]: mdf = pandas.merge(df1,df2,on='number')
In [7]: mdf
Out[7]: 
   value method  number  price_x  price_y loc  get
0   0.50    add  489245      600      200  aa  up 
1   0.70    add  489245      NaN      200  aa  up 
2   0.70    mul  584682      225      NaN  ab  NaN
3   0.90    mul  624602      NaN      NaN  bb  NaN
4   0.95    mul  624602      NaN      NaN  bb  NaN
5   0.99    mul  624602      NaN      NaN  bb  NaN
6    NaN    NaN  900000      300      NaN  cc  dn 
7    NaN    add  900000      NaN      NaN  cc  dn 

When presented with identical columns, pandas.merge puts both in the merged DataFrame with suffixes appended. In this case we want to update 'price_x' with 'price_y' and call the result 'price'. To do this we can create a data frame that only consists of the data in 'price_y', remove the column from the merged DataFrame, and rename the price columns back to 'price'.

In [8]: pdf = mdf.price_y
In [9]: pdf = pandas.DataFrame(pdf)

In [10]: del mdf['price_y']

In [11]: mdf.rename(columns = {'price_x':'price'},inplace=True)
In [12]: pdf.rename(columns = {'price_y':'price'},inplace=True)
In [13]: mdf
Out[13]: 
   value method  number  price loc  get
0   0.50    add  489245    600  aa  up 
1   0.70    add  489245    NaN  aa  up 
2   0.70    mul  584682    225  ab  NaN
3   0.90    mul  624602    NaN  bb  NaN
4   0.95    mul  624602    NaN  bb  NaN
5   0.99    mul  624602    NaN  bb  NaN
6    NaN    NaN  900000    300  cc  dn 
7    NaN    add  900000    NaN  cc  dn 

In [14]: pdf
Out[14]: 
   price
0    200
1    200
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN

Now we can use .update to put any non-NaN value in pdf into mdf.

In [15]: mdf.update(pdf)

In [16]: mdf
Out[16]: 
   value method  number  price loc  get
0   0.50    add  489245    200  aa  up 
1   0.70    add  489245    200  aa  up 
2   0.70    mul  584682    225  ab  NaN
3   0.90    mul  624602    NaN  bb  NaN
4   0.95    mul  624602    NaN  bb  NaN
5   0.99    mul  624602    NaN  bb  NaN
6    NaN    NaN  900000    300  cc  dn 
7    NaN    add  900000    NaN  cc  dn 

If a more complicated set of rules for choosing values were desired then replace mdf.update(pdf) with mdf.combine(pdf, function_of_two_variables_returning_preferred_value).

Upvotes: 3

roman
roman

Reputation: 117370

You can use numpy.where() after merge:

>>> df1 = pd.DataFrame({'number':[1,1,2,2,3], 'price':[600,np.NaN,225,np.NaN,np.NaN], 'method':['add','add','mul','mul','mul']})
>>> df2 = pd.DataFrame({'number':[1,2,3], 'price':[200,np.NaN,np.NaN], 'loc':['aa','bb','cc']})
>>> df3 = pd.merge(df1, df2, on='number', suffixes=['_1', ''])
>>> df3
  method  number  price_1 loc    price
0    add       1      600  aa      200
1    add       1      NaN  aa      200
2    mul       2      225  bb      NaN
3    mul       2      NaN  bb      NaN
4    mul       3      NaN  cc      NaN

>>> df3['price'] = np.where(df3['price'].isnull(), df3['price_1'], df3['price'])
>>> df3
  method  number  price_1 loc  price
0    add       1      600  aa    200
1    add       1      NaN  aa    200
2    mul       2      225  bb    225
3    mul       2      NaN  bb    NaN
4    mul       3      NaN  cc    NaN
>>> del df3['price_1']
>>> df3
  method  number loc  price
0    add       1  aa    200
1    add       1  aa    200
2    mul       2  bb    225
3    mul       2  bb    NaN
4    mul       3  cc    NaN

Another way to do this is to use pandas.Series.fillna() method:

>>> df3['price'] = df3['price'].fillna(df3['price_1'])
>>> del df3['price_1']
>>> df3
  method  number loc  price
0    add       1  aa    200
1    add       1  aa    200
2    mul       2  bb    225
3    mul       2  bb    NaN
4    mul       3  cc    NaN

Upvotes: 2

Related Questions