Reputation: 2101
I have two dataframes with missing values, and I want to merge them so that each one fills in the other's missing data.
A simple visualisation:
df1 :
A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3
df2 :
A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2
The result wanted:
A,B,C
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
Basically, I want to merge the dataframes without duplicating values of column "A", and fill in any missing values in a row from the row with the same "A" value in the other dataframe.
I tried many things I saw in the Pandas documentation, plus solutions on Stack Exchange, but failed every time.
These are all the different things I tried :
pd.merge_ordered(df1, df2, fill_method='ffill', left_by='A')
df1.combine_first(df2)
df1.update(df2)
pd.concat([df1, df2])
pd.merge(df1, df2, on=['A','B','C'], how='right')
pd.merge(df1, df2, on=['A','B','C'], how='outer')
pd.merge(df1, df2, on=['A','B','C'], how='left')
df1.join(df2, how='outer')
df1.join(df2, how='left')
df1.set_index('A').join(df2.set_index('A'))
(You can see I was quite desperate at the end)
Any idea how to do that?
Upvotes: 3
Views: 212
Reputation: 294488
Setup
Since you wrote them as csvs, I'm going to assume they were csvs.
df1 = pd.read_csv('df1.csv', sep=',', index_col=0)
df2 = pd.read_csv('df2.csv', sep=',', index_col=0)
Solution
Use fillna after having used align:
pd.DataFrame.fillna(*df1.align(df2))
B C
A
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
You can use reset_index if you insist, but I think it's prettier to leave it as it is.
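For anyone who wants to run this without creating the CSV files, here is a minimal sketch of the same idea. The inline strings csv1 and csv2 and the names left, right and result are stand-ins I made up for illustration; the rows are the ones from the question.

```python
import io

import pandas as pd

# Inline stand-ins for df1.csv and df2.csv (same rows as in the question)
csv1 = "A,B,C\nA1,B1,C1\nA2,B2,\nA3,B3,C3\n"
csv2 = "A,B,C\nA1,,C1\nA4,B4,C4\nA2,B2,C2\n"

# Column A becomes the index, so rows are matched on it
df1 = pd.read_csv(io.StringIO(csv1), index_col=0)
df2 = pd.read_csv(io.StringIO(csv2), index_col=0)

# align() reindexes both frames to the union of their row and column labels
left, right = df1.align(df2)

# Fill the gaps in the aligned df1 with the values at the same labels in df2
result = left.fillna(right)
print(result)
```

Spelling out the two steps is equivalent to the one-liner above: the star just unpacks the aligned pair into fillna as the frame to fill and the values to fill it with.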
Upvotes: 4
Reputation: 323326
Or you can use first
pd.concat([df1,df2]).replace('',np.nan).groupby('A',as_index=False).first()
Out[53]:
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
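A self-contained sketch of the same approach, in case it helps. It assumes the missing cells are empty strings, as they would be if the frames were typed in by hand; the names stacked and result are just illustrative.

```python
import numpy as np
import pandas as pd

# Frames from the question, with missing cells written as empty strings
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', '', 'C3']})
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': ['', 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']})

# Stack the rows, then turn empty strings into real missing values
stacked = pd.concat([df1, df2], ignore_index=True).replace('', np.nan)

# first() returns the first non-missing value per column within each group of A
result = stacked.groupby('A', as_index=False).first()
print(result)
```

Because first() skips missing values column by column, a key whose data is split across both frames still ends up as one complete row.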
Upvotes: 4
Reputation: 1567
You can use the pandas categorical data type to set an ordered list of categories, sort on these ordered categories, and drop rows with null values to get your desired results:
import pandas as pd
from pandas.api.types import CategoricalDtype
# Create first dataframe from OP values
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', '', 'C3']})
# create second dataframe from original values
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
'B': ['', 'B4', 'B2'],
'C': ['C1', 'C4', 'C2']})
# concatenate the two together for a long dataframe
final = pd.concat([df1, df2])
# specify the letters in your dataset
letters = ['A', 'B', 'C']
# create a placeholder dictionary to store the categorical datatypes
cat_dict = {}
# iterate over the letters
for let in letters:
    # create the ordered categories - set the range for the max # of values
    cats = ['{}{}'.format(let, num) for num in range(1000)]
    # create ordered categorical datatype
    cat_type = CategoricalDtype(cats, ordered=True)
    # insert into placeholder
    cat_dict[let] = cat_type
# properly format your columns as the ordered categories
final['A'] = final['A'].astype(cat_dict['A'])
final['B'] = final['B'].astype(cat_dict['B'])
final['C'] = final['C'].astype(cat_dict['C'])
# finally sort on the three columns and drop rows with NA values
final.sort_values(['A', 'B', 'C']).dropna(how='any')
# which outputs desired results
A B C
0 A1 B1 C1
2 A2 B2 C2
2 A3 B3 C3
1 A4 B4 C4
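While this is a bit longer, one nice thing about doing it this way is that your data can be in any order on input. It gives the values within each column an inherent rank, so A1 < A2 < A3 and so on, which also lets you sort on the columns. Note that the empty strings become nulls because they are not among the categories, and dropna only keeps rows that are already complete in one of the two dataframes.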
Upvotes: 1
Reputation: 402854
Did you try combine_first with A as the index?
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
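Here is a runnable sketch of that one-liner on the question's data. One assumption to flag: combine_first only fills real missing values, so the gaps are written as np.nan here; if your frames hold empty strings instead, they would need converting first. The name result is just illustrative.

```python
import numpy as np
import pandas as pd

# Frames from the question, with missing cells stored as NaN (an assumption)
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', np.nan, 'C3']})
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': [np.nan, 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']})

# Index both frames by A so rows are matched on it; values from df1 win,
# and df2 supplies anything missing plus the rows df1 does not have
result = df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
print(result)
```

The earlier df1.combine_first(df2) attempt matched rows on the default integer index rather than on A, which is why setting the index first makes the difference.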
Upvotes: 4