Reputation: 14656
The pandas docs give an example of concat that combines DataFrames with overlapping indices by concatenating along the columns (axis=1):
In [1]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
   ...:                     'B': ['B0', 'B1', 'B2', 'B3'],
   ...:                     'C': ['C0', 'C1', 'C2', 'C3'],
   ...:                     'D': ['D0', 'D1', 'D2', 'D3']},
   ...:                    index=[0, 1, 2, 3])

In [2]: df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
   ...:                     'D': ['D2', 'D3', 'D6', 'D7'],
   ...:                     'F': ['F2', 'F3', 'F6', 'F7']},
   ...:                    index=[2, 3, 6, 7])

In [3]: result = pd.concat([df1, df4], axis=1)
Note that df1 and df4 share indices 2 and 3, and columns B and D. concat has not duplicated the shared indices, but it has duplicated the columns. How can I also avoid duplicating the columns? That is, I want result to have index 0, 1, 2, 3, 6, 7 and columns A, B, C, D, F, with no duplicate columns. If any data clashes, I want an exception to be raised.
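For reference, the duplication can be seen by re-running the example above as a plain script and inspecting the column labels:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = pd.concat([df1, df4], axis=1)
# The index is the union of the two indices, but the columns are
# simply df1's columns followed by df4's columns:
print(list(result.columns))  # ['A', 'B', 'C', 'D', 'B', 'D', 'F']
```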
Upvotes: 4
Views: 5087
Reputation: 794
This approach worked for me, with just one line of code, using pandas append():
df1 = df1.append(df4)
(Note that DataFrame.append takes no axis argument; it always appends rows.)
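For what it's worth, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A one-liner that does produce the de-duplicated index and columns the question asks for (though without raising an exception on clashes) is combine_first, sketched here:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

# Align on both index and columns; keep df1's values and fill the
# gaps from df4. On a clash, df1 silently wins -- no exception.
result = df1.combine_first(df4)
print(result)
```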
Upvotes: 0
Reputation: 14656
I'm essentially asking for an "upsert" (insert, update) operation, so here is an approach along those lines.
First, the "insert" of the rows that don't currently exist in df1:
# Add all rows from df4 that don't currently exist in df1
result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])
Then, check for clashes in the rows that are common to both DataFrames and thus must be updated:
# Obtain a sliced version of df1, showing only
# the rows and columns shared with df4.
# (Slice df1 rather than result: the two slices must carry
# identical labels, or the comparison below raises ValueError.)
df1_sliced = df1.loc[df1.index.isin(df4.index),
                     df1.columns.isin(df4.columns)]
df4_sliced = df4.loc[df4.index.isin(df1.index),
                     df4.columns.isin(df1.columns)]
# Obtain a mask of the conflicts in the current segment
# as compared with all previously loaded data. That is:
#   NaN NaN = False
#   NaN 2   = False
#   2   2   = False
#   2   3   = True
#   2   NaN = True
data_conflicts = (pd.notnull(df1_sliced) &
                  (df1_sliced != df4_sliced))
if data_conflicts.any().any():
    raise AssertionError("Data from this segment conflicted "
                         "with previously loaded data:\n%s"
                         % data_conflicts)
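As a standalone illustration of that truth table (a sketch with toy Series, not part of the answer's code):

```python
import pandas as pd
import numpy as np

# One column of "previously loaded" values vs. one column of new ones,
# matching the five rows of the comment table above
old = pd.Series([np.nan, np.nan, 2, 2, 2])
new = pd.Series([np.nan, 2, 2, 3, np.nan])

mask = pd.notnull(old) & (old != new)
print(mask.tolist())  # [False, False, False, True, True]
```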
Finally, perform the update:
# Replace any rows that do exist with the df4 version
result.update(df4)
The result is the same as in Happy001's answer. I'm not sure which is more efficient; coming from an SQL background, my version is more understandable to me.
print(result)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2
3 A3 B3 C3 D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
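For convenience, the three steps can be wrapped into one helper ("upsert" is just my name for it, not a pandas API; the slices are taken from df1 rather than result so the comparison operands stay identically labelled):

```python
import pandas as pd

def upsert(df1, df4):
    """Insert rows of df4 that are missing from df1, then update the
    shared cells, raising if a non-NaN value in df1 disagrees with df4."""
    # Insert: rows of df4 whose index labels are not already in df1
    result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])
    # Clash check on the overlapping rows and columns
    d1 = df1.loc[df1.index.isin(df4.index), df1.columns.isin(df4.columns)]
    d4 = df4.loc[df4.index.isin(df1.index), df4.columns.isin(df1.columns)]
    conflicts = pd.notnull(d1) & (d1 != d4)
    if conflicts.any().any():
        raise AssertionError("Conflicting data:\n%s" % conflicts)
    # Update: overwrite shared cells with df4's non-NaN values
    result.update(df4)
    return result

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = upsert(df1, df4)
print(result)
```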
Upvotes: 4
Reputation: 6383
result = df1.join(df4, rsuffix='_dup', how='outer')

# Check for data clashes between each duplicated column pair
dup_cols = [c for c in result if c.endswith('_dup')]
for c in dup_cols:
    if (result[[c[:-4], c]].dropna()
            .apply(pd.Series.nunique, axis=1) > 1).any():
        raise Exception("There are conflicts in column %s "
                        "from the two DataFrames" % c[:-4])

result.update(df4)
# Remove the duplicated cols, since the data has been merged
# into the first occurrence of each column
result = result[[c for c in result if not c.endswith('_dup')]]
print(result)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2
3 A3 B3 C3 D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
Upvotes: 0
Reputation: 2582
Try:
pandas.merge(df1, df4, left_index = True, right_index = True, how = 'outer')
You may have to rename or combine columns to match your expectations: by default, merge suffixes the shared columns as B_x/B_y and D_x/D_y rather than de-duplicating them, and it raises no exception on clashes.
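One possible sketch of that clean-up step, coalescing each suffixed pair while preferring df1's values (still no clash check):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

merged = pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
# The shared columns come back suffixed as B_x/B_y and D_x/D_y

# Coalesce each suffixed pair into one column, preferring df1's values
for col in ['B', 'D']:
    merged[col] = merged[col + '_x'].combine_first(merged[col + '_y'])
    merged = merged.drop(columns=[col + '_x', col + '_y'])
merged = merged[['A', 'B', 'C', 'D', 'F']]
print(merged)
```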
Upvotes: 0