ASH

Reputation: 20362

How to merge two dataframes and eliminate dupes

I am trying to merge two dataframes. One has 1.5M rows and the other has 15M rows. I was expecting the merged dataframe to have 15M rows, but it actually has 178M rows! I think my merge is doing some kind of Cartesian product, and that is not what I want.

This is what I tried, and got 178M rows.

df_merged = pd.merge(left=df_nat, right=df_stack, how='inner', left_on='eno', right_on='eno')

I tried the code below and got an out of memory error.

df_merged = pd.merge(df_nat, df_stack, how='inner', on='eno')

I'm guessing there are dupes in these dataframes, and that's causing the final merge to blow up. How can I do this so the final merged dataframe has 15M rows? Also, the schemas are different; only the 'eno' field is shared between the two.
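Here is a quick check I could run to see whether 'eno' actually repeats on both sides (which would explain the row blow-up):

# Count how often each 'eno' value appears on each side.
# Keys appearing more than once on both sides multiply rows in an inner merge.
print(df_nat['eno'].value_counts().head())
print(df_stack['eno'].value_counts().head())

# Total number of duplicated keys in each dataframe
print(df_nat['eno'].duplicated().sum())
print(df_stack['eno'].duplicated().sum())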

Thanks.

Upvotes: 0

Views: 35

Answers (1)

Sandi

Reputation: 180

Try removing the duplicates from both dataframes before the merge. It will greatly reduce memory usage:

# Keep only the last row for each 'eno' key before merging
df_nat = df_nat.drop_duplicates(subset=['eno'], keep='last')
df_stack = df_stack.drop_duplicates(subset=['eno'], keep='last')

If the datasets are too large to fit in memory, it may be a good idea to use dask or vaex for out-of-core processing.
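For example, with dask the merge could look roughly like this (a minimal sketch, assuming the data is read from CSV files; the file names below are placeholders):

import dask.dataframe as dd

# Load both datasets lazily in partitions instead of all at once
ddf_nat = dd.read_csv('nat.csv')
ddf_stack = dd.read_csv('stack.csv')

# Drop duplicate keys on each side, then do the inner merge out-of-core
ddf_nat = ddf_nat.drop_duplicates(subset=['eno'])
ddf_stack = ddf_stack.drop_duplicates(subset=['eno'])
ddf_merged = ddf_nat.merge(ddf_stack, on='eno', how='inner')

# compute() materialises the result as a regular pandas DataFrame
df_merged = ddf_merged.compute()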

Upvotes: 1
