user14419478

Reputation: 33

How to merge two dataframes with different lengths in python

I am trying to merge two weekly DataFrames, each made up of a single column, but with different lengths.

Could I please know how to merge them, maintaining the 'Week' indexing?

[df1]

Week              Coeff1      
1               -0.456662
1               -0.533774
1               -0.432871
1               -0.144993
1               -0.553376
...                   ...
53              -0.501221
53              -0.025225
53               1.529864
53               0.044380
53              -0.501221
[16713 rows x 1 columns]

[df2]

Week               Coeff    
1                 0.571707
1                 0.086152
1                 0.824832
1                -0.037042
1                 1.167451
...                    ...
53               -0.379374
53                1.076622
53               -0.547435
53               -0.638206
53                0.067848
[63265 rows x 1 columns]

I've tried this code:

df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3

But it gave me a new df (df3) with 13386431 rows × 2 columns

Desired outcome: a new df with 3 columns (Week, Coeff1, Coeff2). As df2 is longer, I expect some NaNs in Coeff1 to fill the gaps.

Upvotes: 2

Views: 3786

Answers (3)

Lukas Kaspras

Reputation: 458

I assume your output should look somewhat like this:

Week Coeff1 Coeff2
1 -0.456662 0.571707
1 -0.533774 0.086152
1 -0.432871 0.824832
2 3 3
2 NaN 3

Don't mind the actual numbers, though. The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week index is not unique. On a left join, pandas joins all the Coeff2 values where df2.Week == 1 to every single row in df1 where df1.Week == 1. That is why you get millions of rows.
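To make the row explosion concrete, here is a minimal sketch on made-up toy data. It assumes Week is an ordinary column (call reset_index() first if it is the index) and that the second column is named Coeff2; neither frame is the asker's real data.

```python
import pandas as pd

# Toy data (made up for illustration): Week is an ordinary column here.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Every df1 row for a week is paired with every df2 row for that week.
merged = pd.merge(df1, df2, how="inner", on="Week")

# Rows per week = (df1 rows for that week) * (df2 rows for that week).
per_week = df1.groupby("Week").size() * df2.groupby("Week").size()

print(len(merged))     # week 1: 2 * 3 = 6, week 2: 1 * 2 = 2, total 8
print(per_week.sum())  # same total, computed from the group sizes
```

The same per-week product, summed over 53 weeks with hundreds of rows each, is what produces a result with millions of rows.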

I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!

Now is later:

What you actually want to do is concatenate the DataFrames "per week". You achieve that by iterating over every week, building a per-week subset that concatenates df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0 afterwards:

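import pandas as pd

# Assumes Week is a regular column named "Week" and the value columns are
# "Coeff1" / "Coeff2"; if Week is the index, call reset_index() on both first.
weekly_dfs = []
for week in df1.Week.unique():
    # Renumber the rows of each week from 0 so the two pieces align by position.
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff2"].reset_index(drop=True)
    # Side-by-side concat; the shorter piece is padded with NaN.
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
# Stack the weekly pieces and renumber the rows of the final frame.
df3 = pd.concat(weekly_dfs).reset_index(drop=True)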

The last reset of the index is optional, but I recommend it anyway!
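As an alternative to the explicit loop, a common pandas idiom is to number the rows within each week using groupby(...).cumcount() and then merge on both Week and that position. This is only a sketch under assumptions: toy data, Week as an ordinary column, a second column named Coeff2, and a helper column "pos" that is purely illustrative.

```python
import pandas as pd

# Toy data (made up for illustration): Week is an ordinary column here.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Number the rows inside each week: 0, 1, 2, ... ("pos" is a hypothetical helper column).
left = df1.assign(pos=df1.groupby("Week").cumcount())
right = df2.assign(pos=df2.groupby("Week").cumcount())

# (Week, pos) is unique in each frame, so the outer merge is one-to-one and
# leaves NaN in Coeff1 wherever df1 has fewer rows for a week.
df3 = (
    pd.merge(left, right, how="outer", on=["Week", "pos"])
    .sort_values(["Week", "pos"])
    .drop(columns="pos")
    .reset_index(drop=True)
)
print(df3)
```

Because the merge keys are unique, the result has as many rows per week as the longer of the two frames, rather than their product.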

Upvotes: 1

Salma Elshahawy

Reputation: 1190

According to the pandas merge documentation, you can use merge like this:

What you are looking for is a left join. However, the default option is an inner join. You can change this by passing a different how argument:

df2.merge(df1,how='left', left_on='Week', right_on='Week')

Note that this keeps every row of the bigger df (df2) and assigns NaN wherever the shorter df (df1) has no matching Week.

Upvotes: 0

Giuseppe Accaputo

Reputation: 2642

Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:

df3 = pd.concat([df1,df2], ignore_index=True, axis=1)

The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
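A small sketch of what this concat produces, on made-up toy data and assuming both frames have a default RangeIndex with Week as an ordinary column (with Week as a non-unique index, row alignment is not well defined, so resetting the index first is the safer route). The restored column names are chosen here for illustration only.

```python
import pandas as pd

# Toy data (made up for illustration): default RangeIndex, Week as a column.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]})

# axis=1 places the frames side by side, aligned on the row index;
# ignore_index=True relabels the resulting columns 0..3.
wide = pd.concat([df1, df2], ignore_index=True, axis=1)

# Restore readable names (illustrative choice, not part of the original answer).
wide.columns = ["Week1", "Coeff1", "Week2", "Coeff2"]
print(wide)
```

The result has as many rows as the longer frame, but rows are paired purely by position, so the two Week columns can disagree on the same row. That is part of the follow-up work the answer refers to.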

Upvotes: 0
