Mahipal Singh

Reputation: 154

Pandas: concat with duplicated index

I am trying to concatenate four DataFrames. df has a unique index, but the other three have duplicated values in their index. Here's my code:

import pandas as pd

data = {'id':['1','2','3','4','5','6'],
        'name':['A1','A2','A2','A1','A3','A3'],
        'date_create':['28-01-2020','15-02-2020','15-03-2020',
                       '25-03-2020','01-04-2020','30-05-2020']}
df = pd.DataFrame(data)

data = {'userid':['1','2','3','3','5','6'],
        'likesDate':['28-01-2020','15-02-2020','15-03-2020',
                       '25-03-2020','01-04-2020','30-05-2020']}
df1 = pd.DataFrame(data)

data = {'userid':['1','1','2','4','6','6'],
        'dislikesDate':['8-01-2020','1-02-2020','15-03-2020',
                       '25-03-2020','01-04-2020','30-05-2020']}
df2 = pd.DataFrame(data)

data = {'userid':['1','2','2','4','4','6'],
        'DeleteDate':['28-01-2020','15-02-2020','15-03-2020',
                       '25-03-2020','05-04-2020','23-05-2020']}
df3 = pd.DataFrame(data)

df.set_index('id', inplace=True)
df1.set_index('userid', inplace=True)
df2.set_index('userid', inplace=True)
df3.set_index('userid', inplace=True)

df_final = pd.concat([df,df1, df2,df3],axis=1)

print(df_final)

I am getting this error:

ValueError: Shape of passed values is (7, 5), indices imply (6, 5)

What am I doing wrong, and where?

Upvotes: 0

Views: 1065

Answers (1)

Ben.T

Reputation: 29635

IIUC, you can chain join calls instead of using concat, because some of the indexes contain duplicated values. With only a handful of DataFrames, you can write the chain out in full:

df_final = df.join(df1).join(df2).join(df3)
print(df_final.head())
#   name date_create   likesDate dislikesDate  DeleteDate
# 1   A1  28-01-2020  28-01-2020    8-01-2020  28-01-2020
# 1   A1  28-01-2020  28-01-2020    1-02-2020  28-01-2020
# 2   A2  15-02-2020  15-02-2020   15-03-2020  15-02-2020
# 2   A2  15-02-2020  15-02-2020   15-03-2020  15-03-2020
# 3   A2  15-03-2020  15-03-2020          NaN         NaN
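
A quick way to confirm which frames are the problem is to check `Index.is_unique` on each one. The sketch below rebuilds the question's frames in compact form (the `frames` dict and the `uniqueness` and `joined` names are only illustrative) and then runs the chained join. As far as I understand, `concat` with `axis=1` has to align rows by label, which is ambiguous when labels repeat, whereas `join` pairs every matching row.

```python
import pandas as pd

# Same data as in the question, built directly with the id as index.
df = pd.DataFrame(
    {'name': ['A1', 'A2', 'A2', 'A1', 'A3', 'A3'],
     'date_create': ['28-01-2020', '15-02-2020', '15-03-2020',
                     '25-03-2020', '01-04-2020', '30-05-2020']},
    index=pd.Index(['1', '2', '3', '4', '5', '6'], name='id'))
df1 = pd.DataFrame(
    {'likesDate': ['28-01-2020', '15-02-2020', '15-03-2020',
                   '25-03-2020', '01-04-2020', '30-05-2020']},
    index=pd.Index(['1', '2', '3', '3', '5', '6'], name='userid'))
df2 = pd.DataFrame(
    {'dislikesDate': ['8-01-2020', '1-02-2020', '15-03-2020',
                      '25-03-2020', '01-04-2020', '30-05-2020']},
    index=pd.Index(['1', '1', '2', '4', '6', '6'], name='userid'))
df3 = pd.DataFrame(
    {'DeleteDate': ['28-01-2020', '15-02-2020', '15-03-2020',
                    '25-03-2020', '05-04-2020', '23-05-2020']},
    index=pd.Index(['1', '2', '2', '4', '4', '6'], name='userid'))

# Illustrative helper dict: which frames have a unique index?
frames = {'df': df, 'df1': df1, 'df2': df2, 'df3': df3}
uniqueness = {name: frame.index.is_unique for name, frame in frames.items()}
print(uniqueness)

# Chained left joins on the index tolerate the duplicated labels.
joined = df.join(df1).join(df2).join(df3)
print(len(joined))
```

Only `df` reports a unique index here, which is why it makes sense to use it as the left-most frame of the join chain.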

If you want a more flexible solution that accepts any number of DataFrames, use reduce:

from functools import reduce

df_final = (
    reduce(lambda x,y:x.join(y), [df,df1, df2,df3]) # same as chaining joins
      .rename_axis('user_id').reset_index() # user id back as a column if needed
)
print(df_final)
#    user_id name date_create   likesDate dislikesDate  DeleteDate
# 0        1   A1  28-01-2020  28-01-2020    8-01-2020  28-01-2020
# 1        1   A1  28-01-2020  28-01-2020    1-02-2020  28-01-2020
# 2        2   A2  15-02-2020  15-02-2020   15-03-2020  15-02-2020
# 3        2   A2  15-02-2020  15-02-2020   15-03-2020  15-03-2020
# 4        3   A2  15-03-2020  15-03-2020          NaN         NaN
# 5        3   A2  15-03-2020  25-03-2020          NaN         NaN
# 6        4   A1  25-03-2020         NaN   25-03-2020  25-03-2020
# 7        4   A1  25-03-2020         NaN   25-03-2020  05-04-2020
# 8        5   A3  01-04-2020  01-04-2020          NaN         NaN
# 9        6   A3  30-05-2020  30-05-2020   01-04-2020  23-05-2020
# 10       6   A3  30-05-2020  30-05-2020   30-05-2020  23-05-2020
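
The 11 rows above come from the left join pairing each row of `df` with every matching row in the other frames, so the row count per user is the product of the match counts (a user missing from a frame still contributes one NaN-filled row). A rough check of that reasoning, rebuilding only the index of each frame; `per_user` and `expected_rows` are illustrative names:

```python
import pandas as pd

# Index labels only, copied from the question's frames.
ids = ['1', '2', '3', '4', '5', '6']                  # df (unique)
likes = pd.Index(['1', '2', '3', '3', '5', '6'])      # df1
dislikes = pd.Index(['1', '1', '2', '4', '6', '6'])   # df2
deletes = pd.Index(['1', '2', '2', '4', '4', '6'])    # df3

# How often each user id appears in each of the joined frames.
counts = [idx.value_counts() for idx in (likes, dislikes, deletes)]

per_user = {}
for uid in ids:
    rows = 1
    for c in counts:
        # A left join keeps the row even with zero matches, hence max(..., 1).
        rows *= max(int(c.get(uid, 0)), 1)
    per_user[uid] = rows

expected_rows = sum(per_user.values())
print(per_user)
print(expected_rows)
```

If one row per user is what you need, the duplicates have to be aggregated or dropped before joining, since the join itself will always expand them.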

Upvotes: 1
