user1911092

Reputation: 4241

Pandas Merge - How to avoid duplicating columns

I am attempting a merge between two data frames. Each data frame has two index levels (date, cusip). Some columns appear in both data frames (currency and adj_date, for example).

What is the best way to merge these by index without ending up with two copies of currency and adj_date?

Each data frame is 90 columns, so I am trying to avoid writing everything out by hand.

df:                 currency  adj_date   data_col1 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

df2:                currency  adj_date   data_col2 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

If I do:

dfNew = merge(df, df2, left_index=True, right_index=True, how='outer')

I get

dfNew:              currency_x  adj_date_x   data_col2 ... currency_y adj_date_y
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45             USD         2012-01-03

Thank you! ...

Upvotes: 193

Views: 334688

Answers (11)

rprog

Reputation: 2130

I use the suffixes option in .merge() followed by drop():

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_y'))

dfNew.drop(dfNew.filter(regex='_y$').columns, axis=1, inplace=True)

Thanks @ijoseph
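
For anyone who wants to try this end to end, here is a minimal sketch. The two small frames are made-up stand-ins for the question's data (the values are assumptions, not the asker's real data), and the drop is written without inplace=True:

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question.
idx = pd.MultiIndex.from_tuples(
    [("2012-01-01", "XSDP"), ("2012-01-02", "XSDP")], names=["date", "cusip"]
)
df = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col1": [0.45, 0.50]},
    index=idx,
)
df2 = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col2": [1.10, 1.20]},
    index=idx,
)

# Left-hand columns keep their names; clashing right-hand columns get '_y'.
dfNew = df.merge(df2, left_index=True, right_index=True,
                 how="outer", suffixes=("", "_y"))

# Drop every column whose name ends in '_y'.
dfNew = dfNew.drop(columns=dfNew.filter(regex="_y$").columns)

print(list(dfNew.columns))
```

One thing to watch: in an outer merge, rows that exist only in df2 carry their shared values in the '_y' columns, so dropping those columns leaves NaN in currency and adj_date for such rows.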

Upvotes: 198

william_grisaitis

Reputation: 5961

If the indexes are the same (big if true!) you can do:

df = df1.copy()
df[df2.columns] = df2

This is similar to a merge on the index:

pd.merge(df1, df2, left_index=True, right_index=True)

but with no duplicated columns.
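
Since the "big if" matters, here is a small sketch that checks the indexes before assigning. The sample frames and the fallback branch are assumptions added for illustration, and the column assignment relies on a reasonably recent pandas version accepting new column names in a list-like key:

```python
import pandas as pd

# Hypothetical frames that share an index and one column name.
df1 = pd.DataFrame({"currency": ["USD", "EUR"], "data_col1": [0.45, 0.50]},
                   index=["a", "b"])
df2 = pd.DataFrame({"currency": ["USD", "EUR"], "data_col2": [1.1, 1.2]},
                   index=["a", "b"])

# Index.equals is True only when both indexes hold the same labels in the same order.
same_index = df1.index.equals(df2.index)

if same_index:
    df = df1.copy()
    # Shared columns are overwritten with df2's values; new columns are appended.
    df[df2.columns] = df2
else:
    # Fallback: merge only the columns that df1 does not already have.
    extra = df2.columns.difference(df1.columns)
    df = pd.merge(df1, df2[extra], left_index=True, right_index=True, how="outer")

print(list(df.columns))
```

Note that where a column exists in both frames, the values from df2 win.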

Upvotes: 3

Wim Yedema

Reputation: 41

If you're merging on arbitrary columns and don't want to keep the right key this will do the trick:

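mrg = pd.merge(a, b, how="left", left_on="A_KEY", right_on="B_KEY")
# cols_to_use: the columns of b you want to keep (defined elsewhere).
# drop() returns a new DataFrame, so assign the result.
mrg = mrg.drop(columns=b.columns.difference(cols_to_use))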

Upvotes: 0

ThomasAFink

Reputation: 1387

You can remove the duplicate y columns you don't want after the join:

# Join df and df2
dfNew = pd.merge(df, df2, left_index=True, right_index=True, how='inner')

Output: currency_x | adj_date_x | data_col1 | ... | currency_y | adj_date_y | data_col2

# Remove the y columns by selecting the columns you want to keep
dfNew = dfNew.loc[:, ["currency_x", "adj_date_x", "data_col1", "data_col2"]]

Output: currency_x | adj_date_x | data_col1 | data_col2

Upvotes: 2

Till Hoffmann

Reputation: 9887

You can include duplicate columns in the key to merge on to ensure only a single copy appears in the result.

# Generate some dummy data.
shared = pd.DataFrame({'key': range(5), 'name': list('abcde')})
a = shared.copy()
a['value_a'] = np.random.normal(0, 1, 5)
b = shared.copy()
b['value_b'] = np.random.normal(0, 1, 5)

# Standard merge.
merged = pd.merge(a, b, on='key')
print(merged.columns)  # Index(['key', 'name_x', 'value_a', 'name_y', 'value_b'], dtype='object')

# Merge with both keys.
merged = pd.merge(a, b, on=['key', 'name'])
print(merged.columns)  # Index(['key', 'name', 'value_a', 'value_b'], dtype='object')

This method also ensures that values in columns that appear in both data frames are consistent (e.g. that the currency in both columns is the same). If they are not, the corresponding row will be dropped (if how = 'inner') or occur with missing values (if how = 'outer').
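
To make the consistency behaviour concrete, here is a small sketch with one deliberately mismatched name (the data is invented for illustration):

```python
import pandas as pd

# Key 2 has name 'c' in a but 'X' in b, so the shared column disagrees there.
a = pd.DataFrame({"key": [0, 1, 2], "name": ["a", "b", "c"],
                  "value_a": [1.0, 2.0, 3.0]})
b = pd.DataFrame({"key": [0, 1, 2], "name": ["a", "b", "X"],
                  "value_b": [10.0, 20.0, 30.0]})

# Inner merge: the inconsistent row is dropped.
inner = pd.merge(a, b, on=["key", "name"], how="inner")

# Outer merge: each side of the inconsistent row survives as its own row,
# with a missing value in the other side's data column.
outer = pd.merge(a, b, on=["key", "name"], how="outer")

print(len(inner), len(outer))
print(outer)
```

The inner result has two rows and the outer result has four, because (2, 'c') and (2, 'X') are treated as different keys.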

Upvotes: 0

Abimael Domínguez

Reputation: 507

When the number of columns you want to exclude is smaller than the number you want to keep, you can use this kind of filtering:

df.loc[:, ~df.columns.isin(['currency', 'adj_date'])]

This selects every column in the DataFrame except 'currency' and 'adj_date'. Apply the same filter to df2 inside the merge, something like this:

    dfNew = merge(df, 
                  df2.loc[:, ~df2.columns.isin(['currency', 'adj_date'])], 
                  left_index=True,
                  right_index=True,
                  how='outer')

Note the "~", it means "not".

Upvotes: 0

user6046760

Reputation: 514

Can't you just subset the columns in either DataFrame first?

cols = [i for i in df.columns if i not in df2.columns]
dfNew = pd.merge(df[cols], df2, left_index=True, right_index=True, how='outer')

Upvotes: 0

sophocles

Reputation: 13841

This is a bit of a workaround, but I have written a function that deals with the extra columns:

def merge_fix_cols(df_company, df_product, uniqueID):
    df_merged = pd.merge(df_company,
                         df_product,
                         how='left', on=uniqueID)
    # Drop the right-hand duplicates (suffix '_y').
    to_drop = [col for col in df_merged if col.endswith('_y')]
    df_merged = df_merged.drop(columns=to_drop)
    # Remove the '_x' suffix from the left-hand copies. Slice it off rather than
    # using rstrip('_x'), which strips characters and would turn 'tax_x' into 'ta'.
    df_merged = df_merged.rename(
        columns=lambda col: col[:-2] if col.endswith('_x') else col)
    return df_merged

Seems to work well with my merges!

Upvotes: 3

Elliott Collins

Reputation: 770

Building on @rprog's answer, you can combine the various pieces of the suffix & filter step into one line using a negative regex:

dfNew = df.merge(df2, left_index=True, right_index=True,
             how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')

Or using df.join:

dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")

The regex here keeps every column whose name does not contain "_DROP" ("DROP" in the join version), so make sure to use a suffix that doesn't appear anywhere in the existing column names.
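
As a quick check of what the negative lookahead matches, the sketch below compares it with a variant anchored to the end of the name. The column names are invented for illustration, and the anchored pattern is an alternative of mine, not part of the answer above:

```python
import pandas as pd

# Hypothetical columns: one real '_DROP' suffix and one name that merely contains '_DROP'.
frame = pd.DataFrame(
    [[0, 0, 0, 0]],
    columns=["currency", "adj_date_DROP", "pct_DROPOUT", "data_col1"],
)

# Rejects any name that contains '_DROP' anywhere.
not_containing = frame.filter(regex=r"^(?!.*_DROP)").columns.tolist()

# Rejects only names that end with '_DROP'.
not_ending = frame.filter(regex=r"^(?!.*_DROP$)").columns.tolist()

print(not_containing)
print(not_ending)
```

With the unanchored pattern the pct_DROPOUT column is filtered out as well, which is why the choice of suffix matters.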

Upvotes: 28

EdChum

Reputation: 394459

You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note that cols_to_use is an Index object, but it has a handy tolist() method if you need a list).

dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

This will avoid any columns clashing in the merge.
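
Here is a runnable sketch of the same approach on made-up data shaped like the question's frames. The values, and the extra row in df2, are assumptions chosen to show what the outer merge does:

```python
import pandas as pd

# Hypothetical data: df2 has one row that df does not.
df = pd.DataFrame(
    {"currency": ["USD"], "adj_date": ["2012-01-03"], "data_col1": [0.45]},
    index=pd.MultiIndex.from_tuples(
        [("2012-01-01", "XSDP")], names=["date", "cusip"]),
)
df2 = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col2": [1.10, 1.20]},
    index=pd.MultiIndex.from_tuples(
        [("2012-01-01", "XSDP"), ("2012-01-02", "XSDP")],
        names=["date", "cusip"]),
)

# Columns that exist only in df2 (Index.difference returns them sorted).
cols_to_use = df2.columns.difference(df.columns)

dfNew = pd.merge(df, df2[cols_to_use],
                 left_index=True, right_index=True, how="outer")

print(list(dfNew.columns))
print(dfNew)
```

Because only df's copies of currency and adj_date are kept, the row that exists only in df2 ends up with NaN in those columns. If that matters, fill them from df2 afterwards or use an inner or left merge.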

Upvotes: 238

JulienD

Reputation: 3652

I'm new to pandas, but I wanted to achieve the same thing: automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stack Overflow.

sales.csv

    city;state;units
    Mendocino;CA;1
    Denver;CO;4
    Austin;TX;2

revenue.csv

    branch_id;city;revenue;state_id
    10;Austin;100;TX
    20;Austin;83;TX
    30;Austin;4;TX
    47;Austin;200;TX
    20;Denver;83;CO
    30;Springfield;4;I

merge.py

import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)


sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')

result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')

When executing the merge command I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.

output.csv

    id;city;state;units;branch_id;revenue;state_id
    0;Denver;CO;4;20;83;CO
    1;Austin;TX;2;10;100;TX
    2;Austin;TX;2;20;83;TX
    3;Austin;TX;2;30;4;TX
    4;Austin;TX;2;47;200;TX

Upvotes: 7
