Eli O.

Reputation: 2101

Combine and complete values of two pandas dataframes from each other

I have two dataframes with missing values. I want to merge them and fill in each one's missing values from the other.

A simple illustration:

df1 :
A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3

df2 :
A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2

The result wanted:
A,B,C
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4

Basically, I want to merge the dataframes so that each value of column "A" appears only once, and fill any missing value in a row from the row with the same "A" value in the other dataframe.

I tried many things I saw in the pandas documentation and in solutions on Stack Exchange, but failed every time.

These are the different things I tried:

pd.merge_ordered(df1, df2, fill_method='ffill', left_by='A')
df1.combine_first(df2)
df1.update(df2)
pd.concat([df1, df2])
pd.merge(df1, df2, on=['A','B','C'], how='right')
pd.merge(df1, df2, on=['A','B','C'], how='outer')
pd.merge(df1, df2, on=['A','B','C'], how='left')
df1.join(df2, how='outer')
df1.join(df2, how='left')
df1.set_index('A').join(df2.set_index('A'))

(You can see I was quite desperate at the end)

Any idea how to do this?

Upvotes: 3

Views: 212

Answers (4)

piRSquared

Reputation: 294488

Setup
Since you wrote them as CSVs, I'm going to assume they were CSV files.

import pandas as pd

df1 = pd.read_csv('df1.csv', sep=',', index_col=0)
df2 = pd.read_csv('df2.csv', sep=',', index_col=0)

Solution
Use align first, then fillna:

pd.DataFrame.fillna(*df1.align(df2))

     B   C
A         
A1  B1  C1
A2  B2  C2
A3  B3  C3
A4  B4  C4

You can call reset_index() to turn A back into a column if you insist, but I think it's prettier to leave it as the index.
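For readers without the CSV files, here is a minimal self-contained sketch of the same align-then-fillna idea. It assumes the question's tables are read from in-memory text through io.StringIO (the csv1 and csv2 strings are stand-ins for df1.csv and df2.csv, not part of the original answer), so blank fields arrive as NaN.

```python
import io

import pandas as pd

# In-memory stand-ins for df1.csv and df2.csv (an assumption for this sketch)
csv1 = "A,B,C\nA1,B1,C1\nA2,B2,\nA3,B3,C3\n"
csv2 = "A,B,C\nA1,,C1\nA4,B4,C4\nA2,B2,C2\n"

df1 = pd.read_csv(io.StringIO(csv1), index_col=0)
df2 = pd.read_csv(io.StringIO(csv2), index_col=0)

# align reindexes both frames to the union of their row and column labels,
# so rows that exist in only one frame show up as all-NaN in the other
left, right = df1.align(df2)

# fillna with a DataFrame fills each NaN in `left` from the same label in `right`
result = left.fillna(right)
print(result)
```

Splitting the one-liner into align and fillna steps makes it easier to inspect `left` and `right` and see which cells get filled.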

Upvotes: 4

BENY

Reputation: 323326

Or you can use groupby with first, which takes the first non-null value of each column within each group:

pd.concat([df1,df2]).replace('',np.nan).groupby('A',as_index=False).first()
Out[53]: 
    A   B   C
0  A1  B1  C1
1  A2  B2  C2
2  A3  B3  C3
3  A4  B4  C4
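As a runnable sketch of this approach, assuming the frames are built directly with empty strings for the missing cells (if they were read from CSV, the blanks would already be NaN and the replace step would be a no-op):

```python
import numpy as np
import pandas as pd

# Sample frames from the question, with '' marking the missing cells (assumed)
df1 = pd.DataFrame({"A": ["A1", "A2", "A3"],
                    "B": ["B1", "B2", "B3"],
                    "C": ["C1", "", "C3"]})
df2 = pd.DataFrame({"A": ["A1", "A4", "A2"],
                    "B": ["", "B4", "B2"],
                    "C": ["C1", "C4", "C2"]})

result = (
    pd.concat([df1, df2])            # stack the rows of both frames
    .replace("", np.nan)             # turn empty strings into real missing values
    .groupby("A", as_index=False)    # one group per value of A
    .first()                         # first non-null value per column in each group
)
print(result)
```

Because first skips nulls column by column, a key whose rows are each incomplete in a different column still ends up with a complete row.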

Upvotes: 4

datawrestler

Reputation: 1567

You can use the pandas categorical data type to set an ordered list of categories, sort on these ordered categories, and drop rows with null values to get your desired results:

import pandas as pd
from pandas.api.types import CategoricalDtype
# Create first dataframe from OP values
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', '', 'C3']})

# create second dataframe from original values
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': ['', 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']})

# concatenate the two together for a long dataframe
final = pd.concat([df1, df2])

# specify the letters in your dataset
letters = ['A', 'B', 'C']
# create a placeholder dictionary to store the categorical datatypes
cat_dict = {}

# iterate over the letters
for let in letters:
    # create the ordered categories - set the range for the max # of values
    cats = ['{}{}'.format(let, num) for num in range(1000)]
    # create ordered categorical datatype
    cat_type = CategoricalDtype(cats, ordered=True)
    # insert into placeholder
    cat_dict[let] = cat_type

# properly format your columns as the ordered categories
final['A'] = final['A'].astype(cat_dict['A'])
final['B'] = final['B'].astype(cat_dict['B'])
final['C'] = final['C'].astype(cat_dict['C'])
# finally sort on the three columns and drop rows with NA values
final.sort_values(['A', 'B', 'C']).dropna(how='any')

# which outputs desired results
    A   B   C
0  A1  B1  C1
2  A2  B2  C2
2  A3  B3  C3
1  A4  B4  C4

While this is a bit longer, one nice thing about doing it this way is that your input data can be in any order. It gives the values within each column an inherent rank, so A1 < A2 < A3 and so on, which also lets you sort on the columns.
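To illustrate what that rank buys you, here is a small sketch comparing a plain string sort with a sort on an ordered categorical. The labels series and the A10 value are made up for the illustration and do not appear in the question's data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same category scheme as above: A0, A1, ..., A999 in numeric order
cats = [f"A{num}" for num in range(1000)]
a_type = CategoricalDtype(cats, ordered=True)

# Hypothetical labels, chosen so string order and numeric order disagree
labels = pd.Series(["A10", "A2", "A1"])

plain_order = labels.sort_values().tolist()                  # lexicographic
ranked_order = labels.astype(a_type).sort_values().tolist()  # category order

print(plain_order)   # ['A1', 'A10', 'A2']
print(ranked_order)  # ['A1', 'A2', 'A10']
```

Plain strings sort character by character, so A10 lands before A2. The ordered categories sort by their position in `cats` instead.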

Upvotes: 1

cs95

Reputation: 402854

Did you try combine_first with A as the index?

df1.set_index('A').combine_first(df2.set_index('A')).reset_index()

    A   B   C
0  A1  B1  C1
1  A2  B2  C2
2  A3  B3  C3
3  A4  B4  C4
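A self-contained sketch of the same call, assuming the missing cells are real NaN values (combine_first only treats null values as missing, so empty strings would need to be replaced with NaN first). combine_first matches rows by index label, which is why the bare df1.combine_first(df2) from the question lined rows up by their default positions rather than by "A".

```python
import numpy as np
import pandas as pd

# Sample frames from the question, with NaN marking the missing cells (assumed)
df1 = pd.DataFrame({"A": ["A1", "A2", "A3"],
                    "B": ["B1", "B2", "B3"],
                    "C": ["C1", np.nan, "C3"]})
df2 = pd.DataFrame({"A": ["A1", "A4", "A2"],
                    "B": [np.nan, "B4", "B2"],
                    "C": ["C1", "C4", "C2"]})

# Index both frames by A so rows are matched on the key, not on position;
# values from df1 win, and df2 fills the gaps and contributes the extra A4 row
result = (
    df1.set_index("A")
    .combine_first(df2.set_index("A"))
    .reset_index()
)
print(result)
```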

Upvotes: 4
