lucky1928
lucky1928

Reputation: 8905

pandas - merge multiple dataframe with reduce

I saw many post talk about use reduce to merge multiple dataframe. I try below code, it works with three dataframes, if it greater than three, tan, error reported:

pandas.errors.MergeError: Passing 'suffixes' which cause duplicate columns {'value_x'} is not allowed.

code:

import pandas as pd
from functools import reduce

data1 = [
    ['A',1],
    ['B',2],
    ['C',3]]
data2 = [
    ['A',4],
    ['B',5],
    ['C',6]]
data3 = [
    ['A',7],
    ['B',8],
    ['C',9]]
data4 = [
    ['A',10],
    ['B',11],
    ['C',12]]

dfs = []
for data in [data1, data2, data3, data4]:
    df = pd.DataFrame(data, columns=['name', 'value'])
    dfs.append(df)

dfm = reduce(
    lambda left, right: pd.merge(
        left, right,
        on=['name'],
        how='outer',
        suffixes=('_x', '_y')),
    dfs)
print(dfm)

What's proper way to fix this error?

Upvotes: 0

Views: 72

Answers (2)

PaulS
PaulS

Reputation: 25418

Another approach, which renames the value column as the list of dfs is being created. Afterwards, just use reduce:

# creating the list of dfs and renaming the 'value' column
dfs = []
for idx, data in enumerate([data1,data2,data3,data4]):
    df = pd.DataFrame(data,columns=['name',f'value_{idx+1}'])
    dfs.append(df)

reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)

Alternatively, we can rename column value inside reduce, by using list comprehension:

reduce(
    lambda left, right: pd.merge(left, right, on='name', how='outer'),
    # renaming columns of dfs
    [x.rename(columns={'value': f'value_{idx+1}'}) for idx, x in enumerate(dfs)])

Output:

  name  value_1  value_2  value_3  value_4
0    A        1        4        7       10
1    B        2        5        8       11
2    C        3        6        9       12

Upvotes: 0

mozway
mozway

Reputation: 262224

The problem is that after the third iteration, the left dataframe contains value_x and value as columns, then right one contains value, but adding the _x suffix would cause value to become value_x that is already existing.

You should rather use concat here:

out = (pd.concat([d.set_index('name').add_suffix(f'_{i}')
                  for i, d in enumerate(dfs, start=1)],
                 axis=1)
         .reset_index()
      )

Output:

  name  value_1  value_2  value_3  value_4
0    A        1        4        7       10
1    B        2        5        8       11
2    C        3        6        9       12

To achieve the same with reduce would be a bit convoluted:

dfm = reduce(lambda left, right:
             pd.merge(left, right[1], on=['name'],
                      how='outer', suffixes=(None, f'_{right[0]}')),
             list(enumerate(dfs[1:], start=2)),
             dfs[0])

Output:

  name  value  value_2  value_3  value_4
0    A      1        4        7       10
1    B      2        5        8       11
2    C      3        6        9       12

Upvotes: 0

Related Questions