Reputation: 20362
I am trying to merge two dataframes: one has 1.5M rows and the other has 15M rows. I was expecting the merged dataframe to have 15M rows, but it actually has 178M rows! I think my merge is doing some kind of Cartesian product, which is not what I want.
This is what I tried, which gave the 178M rows:
df_merged = pd.merge(left=df_nat, right=df_stack, how='inner', left_on='eno', right_on='eno')
I tried the code below and got an out-of-memory error.
df_merged = pd.merge(df_nat, df_stack, how='inner', on='eno')
I'm guessing there are dupes in these dataframes, and that's causing the final merge job to blow up. How can I do this so I have a final merged dataframe with 15M rows? Finally, the schemas are different, and only the 'eno' field is the same.
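I suppose I could confirm the dupes with something like this:
# Count repeated 'eno' values on each side; anything above zero
# means the merge will multiply the matching rows together.
print(df_nat['eno'].duplicated().sum())
print(df_stack['eno'].duplicated().sum())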
Thanks.
Upvotes: 0
Views: 35
Reputation: 180
Try removing the duplicates from both frames before merging; it will greatly reduce memory usage:
df_nat = df_nat.drop_duplicates(subset=['eno'], keep='last')
df_stack = df_stack.drop_duplicates(subset=['eno'], keep='last')
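If you actually want to keep all 15M rows of df_stack, deduplicate only the small frame so the merge becomes many-to-one. pandas can assert that shape for you with the validate argument of pd.merge (frame names are from the question):
import pandas as pd

# Keep one row per 'eno' in the small frame only, so each df_stack
# row matches at most one df_nat row.
df_nat = df_nat.drop_duplicates(subset=['eno'], keep='last')

# validate='many_to_one' raises MergeError if 'eno' is still
# duplicated on the right side, instead of silently exploding.
df_merged = pd.merge(df_stack, df_nat, how='inner', on='eno',
                     validate='many_to_one')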
If the datasets are too large to fit in memory, it may be a good idea to use dask or vaex to do out-of-core processing.
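A minimal dask sketch (the file names here are placeholders; the 'eno' column is from the question):
import dask.dataframe as dd

# Read both tables lazily, in partitions, instead of loading them whole.
df_nat = dd.read_csv('nat.csv')      # placeholder file name
df_stack = dd.read_csv('stack.csv')  # placeholder file name

# The same pandas-style calls work; nothing runs until .compute().
df_nat = df_nat.drop_duplicates(subset=['eno'], keep='last')
df_merged = dd.merge(df_stack, df_nat, how='inner', on='eno')

result = df_merged.compute()  # materializes a pandas DataFrame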
Upvotes: 1