Merging two Pandas DataFrames with many sparse columns results in a DataFrame that requires a disproportionately large amount of memory

Question

When I merge two sparse DataFrames, the resulting DataFrame becomes disproportionately large in memory, and operations on it are quite slow. I am wondering why this is the case. I have tried several approaches to reduce the memory footprint, but none of them worked: using different fill_values (0 or 0.0), converting back and forth between dense and sparse columns, resetting the index, dropping the indicator column, and making a copy of the merged DataFrame.

Any ideas what causes this issue and how it can be fixed? I'm working with pandas version 1.1.1.

Here is some info about the dataframes:

DF1:

Int64Index: 113774 entries, 0 to 113773  
Columns: 24155 entries  
dtypes: Sparse[float32, 0](1), Sparse[float64, 0](24149), float32(2), int32(2), int8(1)  
memory usage: 7.3 MB

DF2:

Int64Index: 128507 entries, 0 to 128506  
Columns: 1962 entries  
dtypes: Sparse[float64, 0](1957), float32(1), int16(1), int32(2), int8(1)  
memory usage: 10.0 MB

Merged DF:

Int64Index: 136333 entries, 0 to 136332  
Columns: 26115 entries  
dtypes: Sparse[float64, 0](26107), category(1), float32(4), int32(2), int8(1)  
memory usage: 6.3 GB

This is how I constructed the new dataframe:

df_joined = df1.merge(
    df2,
    on=key_cols,
    how='outer',
    indicator='df_indicator',
    suffixes=['_DF1', '_DF2']
)

# replace null values
null_cols = pp.get_null_columns(df_joined)
for field in null_cols:
    df_joined[field] = df_joined[field].fillna(0.0)
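
One thing that may be worth checking is how many values each sparse column actually stores after the merge. An outer merge gives every unmatched row a NaN, and NaN differs from the fill value 0, so a Sparse[float64, 0] column has to store those entries explicitly. It is possible that fillna(0.0) then keeps them as explicitly stored zeros instead of dropping them. As a rough estimate, assuming unique keys and about 12 bytes per stored entry (an 8-byte float64 value plus a 4-byte position): 136333 - 113774 = 22559 unmatched rows across the 24149 sparse columns of DF1, plus 136333 - 128507 = 7826 unmatched rows across the 1957 sparse columns of DF2, comes to about 6.7 billion bytes, which is in the range of the reported 6.3 GB. The sketch below is only an illustration of how this could be checked and undone; sparse_density_report, recompress_sparse and the demo frame are hypothetical names, not part of pandas or of the code above.

```python
import numpy as np
import pandas as pd


def sparse_density_report(df):
    """List dtype, density and size in bytes for every sparse column of df."""
    rows = []
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.SparseDtype):
            rows.append({
                "column": name,
                "dtype": str(col.dtype),
                # fraction of entries that are stored explicitly
                "density": col.sparse.density,
                "bytes": col.memory_usage(index=False, deep=True),
            })
    return pd.DataFrame(rows, columns=["column", "dtype", "density", "bytes"])


def recompress_sparse(df, fill_value=0.0):
    """Rebuild each sparse column so only values != fill_value are stored.

    Goes through a dense copy one column at a time, so peak memory stays
    at roughly one dense column rather than the whole dense frame.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if isinstance(col.dtype, pd.SparseDtype):
            dense = col.sparse.to_dense().fillna(fill_value)
            out[name] = pd.arrays.SparseArray(dense.to_numpy(), fill_value=fill_value)
    return out


# Hypothetical stand-in for a merged frame: the two NaNs play the role of
# rows that had no match in the outer merge.
demo = pd.DataFrame({
    "key": [1, 2, 3, 4],
    "x": pd.arrays.SparseArray([np.nan, np.nan, 0.0, 5.0], fill_value=0.0),
})

print(sparse_density_report(demo))                     # NaNs are stored: 3 of 4 entries
print(sparse_density_report(recompress_sparse(demo)))  # only the 5.0 is stored: 1 of 4
```

If the report on df_joined shows densities far above those of df1 and df2, rebuilding the columns this way (instead of, or after, the fillna loop) would be the thing to try. With more than 26000 columns the column-by-column assignment may be slow, so this is meant as a diagnostic sketch rather than a tuned solution.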
