susim
susim

Reputation: 231

How do I combine_first, indexing specific column two DataFrames into one?

After mapping by column c, If column A has a value, insert the value of column A; if not, insert column B.

data1                               data2

a    b    c                      a    c    d
a1   b1   c1                     1a   c1   1d   
     b2   c2                     2a   c2   2d
a3        c3                     3a   c3   3d
                                 4a   c4   4d

The result I want

  result
   a    b   c     
   a1   b1  c1
   2a   b2  c2
   a3       c3

I tried the following, but I was not satisfied.

->>> result = data1.merge(data2, on=['c'])
Prefixes _x and _y are created. combine_first is not applied.

->>> result = data1.combine_first(data2)
It is not mapped by column c.

How do I get good results? I ask for your help. thank you

Upvotes: 2

Views: 3212

Answers (3)

Scott Boston
Scott Boston

Reputation: 153460

Using @IdoS setup:

import pandas as pd
data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']})

data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})

You can use set_index, combine_first, and reindex:

df_out = data1.set_index('c').combine_first(data2.set_index('c'))\
     .reindex(data1.c)\
     .reset_index()

df_out

Output:

    c   a     b   d
0  c1  a1    b1  1d
1  c2  2a    b2  2d
2  c3  a3  None  3d

Upvotes: 2

YOLO
YOLO

Reputation: 21709

You can try this way as well:

# set indexes
data1 = data1.set_index('c')
data2 = data2.set_index('c')

# join data on indexes
datax = data1.join(data2.drop('d', axis=1), rsuffix='_rr').reset_index()

# fill missing value in column a
datax['a'] = datax['a'].fillna(datax['a_rr'])

# drop unwanted columns
datax.drop('a_rr', axis=1, inplace=True)

# fill missing values with blank spaces
datax.fillna('', inplace=True)

# output
    a   b   c
0   a1  b1  c1
1   2a  b2  c2
2   a3      c3

# data used
data1 = pd.DataFrame({'a':['a1','','a3'],
                      'b':['b1','b2',''],
                      'c':['c1','c2','c3']})

data2 = pd.DataFrame({'a':['1a','2a','3a','4a'],
                      'c':['c1','c2','c3','c4'],
                      'd':['1d','2d','3d','4d']})

Upvotes: 1

Ido S
Ido S

Reputation: 1352

I'm not 100% clear on how you indexed your dataframes (data1 and data2), but if you index them on column 'c' it should work.

This is how I created your data:

import pandas as pd
data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']})

data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})

I then set the index of both to be column 'c':

data1 = data1.set_index('c')
data2 = data2.set_index('c')

Then I use combine_first as you do:

data_combined = data1.combine_first(data_2)

And I get this:

    a   b   d
c           
c1  a1  b1  1d
c2  2a  b2  2d
c3  a3  None    3d
c4  4a  NaN 4d

Not sure why you don't want the row with index 'c4' or column 'd', but it's easy enough to remove them:

data_combined = data_combined.drop('d', axis=1)
data_combined = data_combined.loc[data_combined.index != 'c4']

And then I do some re-ordering to get your desired result:

data_combined = data_combined.reset_index()
data_combined = data_combined[['a', 'b', 'c']]
data_combined = data_combined.fillna('')


    a   b   c
0   a1  b1  c1
1   2a  b2  c2
2   a3      c3

Upvotes: 1

Related Questions