Michael B. Currie

Reputation: 14656

Pandas concatenate, with no duplicate indices or columns

The pandas docs give an example of concat that aligns rows on their shared indices while concatenating along the columns (axis=1):

In [1]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
   ...:                     'B': ['B0', 'B1', 'B2', 'B3'],
   ...:                     'C': ['C0', 'C1', 'C2', 'C3'],
   ...:                     'D': ['D0', 'D1', 'D2', 'D3']},
   ...:                     index=[0, 1, 2, 3])
   ...: 
In [2]: df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
   ...:                  'D': ['D2', 'D3', 'D6', 'D7'],
   ...:                  'F': ['F2', 'F3', 'F6', 'F7']},
   ...:                 index=[2, 3, 6, 7])
   ...: 

In [3]: result = pd.concat([df1, df4], axis=1)

The result:

     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7

Note that df1 and df4 share indices 2 and 3, and columns B and D.

concat has not duplicated the shared indices, but it has duplicated the columns.

How can I also avoid duplicating the columns?

That is, I want the result to contain each shared row index and each shared column only once.

If any data clashes, I want an exception to be raised.

Upvotes: 4

Views: 5087

Answers (4)

RusJaI

Reputation: 794

This approach worked for me: just one line of code, with pandas append():

df1 = df1.append(df4, ignore_index=True)  # append() takes no axis argument
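Note that DataFrame.append was removed in pandas 2.0, so on newer versions the equivalent one-liner uses pd.concat. A minimal sketch with small stand-in frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'B': ['B2', 'B3'], 'D': ['D2', 'D3']}, index=[2, 3])

# pd.concat is the replacement for the removed DataFrame.append;
# ignore_index=True renumbers the rows 0..n-1, as append() did
result = pd.concat([df1, df4], ignore_index=True)
print(result)
```

As in the question, this stacks rows and takes the union of the columns; it does not deduplicate shared columns.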

Upvotes: 0

Michael B. Currie

Reputation: 14656

I'm essentially asking for an "upsert" (insert, update) operation. So here's an approach that works:

First, the "insert", of rows that don't currently exist in df1:

# Add all rows from df4 that don't currently exist in df1
result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])

Then, check for clashes in the rows that are common to both DataFrames and thus must be updated:

# Obtain a sliced version of df1, showing only
# the rows and columns it shares with df4.
# (Slicing from df1, not result, keeps the two slices
# identically labelled so they can be compared.)
df1_sliced = \
    df1.loc[df1.index.isin(df4.index),
            df1.columns.isin(df4.columns)]
df4_sliced = \
    df4.loc[df4.index.isin(df1.index),
            df4.columns.isin(df1.columns)]

# Obtain a mask of the conflicts in the current segment
# as compared with all previously loaded data.  That is:
# NaN NaN = False
# NaN 2   = False
# 2   2   = False
# 2   3   = True
# 2   NaN = True
data_conflicts = (pd.notnull(df1_sliced) & 
                  (df1_sliced != df4_sliced))

if data_conflicts.any().any():
    raise AssertionError("Data from this segment conflicted "
                         "with previously loaded data:\n%s"
                         % data_conflicts)

Finally, perform the update:

# Overwrite the shared cells with the df4 version
result.update(df4)

The result is the same as Happy001's answer. I'm not sure which is more efficient, but coming from an SQL background, my answer is easier for me to follow.

print(result)

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2   F2
3   A3  B3   C3  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7
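Putting the three steps together, here is the whole approach as one runnable snippet, using the question's df1 and df4:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

# Insert: rows from df4 whose index is new to df1
result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])

# Check: compare the rows and columns the two frames share
df1_sliced = df1.loc[df1.index.isin(df4.index),
                     df1.columns.isin(df4.columns)]
df4_sliced = df4.loc[df4.index.isin(df1.index),
                     df4.columns.isin(df1.columns)]
conflicts = df1_sliced.notna() & (df1_sliced != df4_sliced)
if conflicts.any().any():
    raise AssertionError("Data clash:\n%s" % conflicts)

# Update: overwrite the shared cells with df4's values
result.update(df4)
print(result)
```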

Upvotes: 4

Happy001

Reputation: 6383

result = df1.join(df4, rsuffix='_dup', how='outer')

# Check for data clashes
dup_cols = [c for c in result if c.endswith('_dup')]
for c in dup_cols:
    if (result[[c[:-4], c]].dropna().apply(pd.Series.nunique, axis=1) > 1).any():
        raise Exception("There are conflicts in column %s from two DataFrames" % c[:-4])

result.update(df4)

# Remove the duplicated columns, since their data have been
# merged into the first occurrence of each column
result = result[[c for c in result if not c.endswith('_dup')]]

print(result)

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2   F2
3   A3  B3   C3  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

Upvotes: 0

lowtech

Reputation: 2582

Try:

pandas.merge(df1, df4, left_index = True, right_index = True, how = 'outer')

You may have to rename the resulting columns to match your expectations.
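For example (small stand-in frames; the _x/_y names are merge's default suffixes for shared columns), the duplicated column can be coalesced back into one with combine_first:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'B': ['B1', 'B2'], 'D': ['D1', 'D2']}, index=[1, 2])

merged = pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
# The shared column B comes out duplicated as B_x and B_y

# Coalesce the suffixed pair back into a single column,
# preferring df1's value where both are present
merged['B'] = merged['B_x'].combine_first(merged['B_y'])
merged = merged.drop(columns=['B_x', 'B_y'])
```

Note this silently prefers the left frame's values on overlap; it does not raise on a clash as the question asks.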

Upvotes: 0
