Himanshu Singhal

Reputation: 115

Pandas Merge Two DataFrames results in Memory Leak

Problem Statement:

I have to perform SQL-style table joins on multiple CSV files recursively. Example: I have files CSV1, CSV2, CSV3, ..., CSVn.

I need to perform a join (inner/outer/left/full) between two CSVs at a time, then join that result with the third CSV, and so on until all CSVs are merged.

What I have Tried:

I am using the pandas library's merge method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) to merge the DataFrames of the CSV files.

Code Snippet:

import pandas as pd
 
df1 = pd.read_csv(path_of_csv1)
df2 = pd.read_csv(path_of_csv2)
resultant_df = df1.merge(df2, left_on='left_csv_column_name', right_on='right_csv_column_name', how='inner')
.....

I am using pandas version 1.1.0 and Python version 3.8.5.
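The pairwise chain described above can be written as a fold over the list of files. Here is a minimal sketch; the column name `key_column` and the in-memory stand-ins for the CSV files are hypothetical, and in practice each entry would be a path passed to `pd.read_csv`:

```python
import functools
import io

import pandas as pd

# Stand-in CSV contents so the sketch is self-contained; in practice these
# would be paths to the real files. 'key_column' is a hypothetical join key.
csv_sources = [
    io.StringIO("key_column,x\n1,10\n2,20\n"),
    io.StringIO("key_column,y\n1,30\n2,40\n"),
    io.StringIO("key_column,z\n1,50\n"),
]

def merge_pair(left, right):
    # Inner join on the shared column; choose how='inner'/'outer'/'left'/... per step.
    return left.merge(right, on='key_column', how='inner')

# Fold the frames into one result, two at a time, just like the manual chain.
frames = (pd.read_csv(src) for src in csv_sources)
merged = functools.reduce(merge_pair, frames)
print(merged)
```

Using a generator for `frames` means each CSV is only read when its turn in the fold comes, so at most the running result and one new frame are held at once.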

Problem I am facing:

I am using a MacBook Pro with 8 GB RAM and trying to merge the DataFrames both inside and outside a Docker container. For smaller CSV files of around 10 MB each, I am able to merge some files successfully, but for bigger CSV files, say 50 MB each, I am facing the memory leak issue. Before starting the merge operation, my system has 3.5 GB of RAM available to Docker out of 6 GB (checked with docker stats <container_name>). Once the merge starts, Docker consumes the entire available RAM and the merge process is terminated partway through with a kill -9 signal error.

I have also tried merging them outside the container. The same memory issue persists and my process/terminal hangs partway through.

PS: Pardon me if I wrote something wrong.

Any help would be much appreciated. I am totally stuck in this merging process.

Upvotes: 1

Views: 3738

Answers (1)

efont

Reputation: 246

Your problem

I don't think you have a memory leak; rather, the resulting merged dataframe is too big to fit in memory. A memory leak is when objects are not properly deleted by the system (garbage collected) and accumulate, causing memory usage to grow over time.

Even if both of your dataframes fit in RAM, the merge operation can produce a much bigger dataframe, causing your memory issues. This can happen, for example, if you have many duplicated values in your merging columns:

>>> df1 = pd.DataFrame({'col': ['a', 'a', 'a', 'b']})
>>> df2 = pd.DataFrame({'col': ['a', 'a', 'a', 'c']})
>>> df1.merge(df2, on='col')
  col
0   a
1   a
2   a
3   a
4   a
5   a
6   a
7   a
8   a

Here we have 9 rows in the resulting dataframe, more than in the two initial dataframes combined! That is because the merge operation creates the Cartesian product of the matching rows (here: row 1 of df1 is merged with rows 1, 2 and 3 of df2; row 2 of df1 is merged with rows 1, 2 and 3 of df2; and so on).

Now imagine the worst case, where both merging columns contain only a single repeated value. If you have, say, 10^5 rows in each dataframe, you will end up with 10^10 rows in the result. This could be your issue.
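You can check for this blow-up before merging: for an inner join, each key produces (count in df1) × (count in df2) rows, so multiplying the per-key value counts predicts the output size. A minimal sketch, reusing the tiny frames from the example above:

```python
import pandas as pd

# Same tiny frames as in the example above.
df1 = pd.DataFrame({'col': ['a', 'a', 'a', 'b']})
df2 = pd.DataFrame({'col': ['a', 'a', 'a', 'c']})

# Each shared key contributes (count in df1) * (count in df2) rows to an
# inner join, so the sum of those products predicts the output size
# without actually performing the merge.
counts1 = df1['col'].value_counts()
counts2 = df2['col'].value_counts()
# Keys present on only one side align to NaN and are skipped by sum().
estimated_rows = int((counts1 * counts2).sum())
print(estimated_rows)  # 9, matching the 9-row result above
```

If the estimate is much larger than what your RAM can hold, the merge will fail regardless of how it is executed, and the join keys themselves need rethinking.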

Solution

To merge dataframes that can't fit in memory, or whose result is too big, you can try the dask library. See for example the answers to this SO question.

Upvotes: 2
