Reputation: 31
When merging two sparse dataframes the resulting dataframe becomes disproportionate large in memory. I am wondering why this is the case. Operations on the new dataframe are quite slow. I have tried different approaches to reduce the memory footprint but it didn't work. For instance using different fill_values (0 or 0.0), converting back and forth between dense and sparse columns, resetting the index, dropping the indicator column, making a copy of the merged dataframe.
Any ideas what causes this issue and how it can be fixed? I'm working with pandas version 1.1.1.
Here is some info about the dataframes:
DF1:
Int64Index: 113774 entries, 0 to 113773
Columns: 24155 entries
dtypes: Sparse[float32, 0](1), Sparse[float64, 0](24149), float32(2), int32(2), int8(1)
memory usage: 7.3 MB
DF2:
Int64Index: 128507 entries, 0 to 128506
Columns: 1962 entries
dtypes: Sparse[float64, 0](1957), float32(1), int16(1), int32(2), int8(1)
memory usage: 10.0 MB
Merged DF:
Int64Index: 136333 entries, 0 to 136332
Columns: 26115 entries
dtypes: Sparse[float64, 0](26107), category(1), float32(4), int32(2), int8(1)
memory usage: 6.3 GB
This is how I constructed the new dataframe:
df_joined= df1.merge(
df2,
on=key_cols,
how='outer',
indicator='df_indicator',
suffixes=['_DF1', '_DF2']
)
# replace null values
null_cols = pp.get_null_columns(df_joined)
for field in null_cols:
df_joined[field]= df_joined[field].fillna(0.0)
Upvotes: 3
Views: 914
Reputation: 4521
I investigated a bit in this problem and still find it strange. So this is not meant as an answer, but just to show part of the strange behavior of pandas.
It looks like the problem is, that if fillna
is used to fill in the value which is set up as fill_value
of the sparse array, it is not recognized as the fill_value
and so stored as a regular float. Please see the following code. Only after reorganizing the sparse column, the space get freed.
import pandas as pd
import numpy as np
def make_sparse(df):
for col in df.columns[1:]:
# make sure we have plenty zeros in the dataframe columns
# by setting everything lower than 0 and all greather than 0.2
# to zero
# skip the first column
indexer= df[col].between(0.0, 0.2)
df.loc[~indexer, col]= 0.0
df[col]= pd.arrays.SparseArray(df[col], fill_value=0.0)
return df
# build a dataframe with one regular float column and two sparse columns
df1= pd.DataFrame(np.random.randn(10000, 3), columns=['num1', 'num2', 'num3'])
df1= make_sparse(df1)
df1['id']= list(range(df1.shape[0]))
df1.set_index('id', inplace=True)
df1.memory_usage()
"""
This results in the following output in my example
num1 80000
num2 9408
num3 9432
You can see, that num1 and num2 need much less space due to the sparse structure
"""
# create another dataframe with one regular and one sparse column
# the index is the same as for df1, but we only have one row for
# each other record in df1 (the ones with an even id
df2= pd.DataFrame(np.random.randn(5000, 2), columns=['num_a', 'num_b'])
df2= make_sparse(df2)
df2['id']= list(range(0, df2.shape[0]*2, 2))
df2.set_index('id', inplace=True)
df2.memory_usage()
"""
Index 40000
num_a 40000
num_b 5124
"""
# now merge the two
df3= df1.merge(df2, left_index=True, right_index=True, how='left')
df3.memory_usage()
"""
We still have the same memory usage of the columns from df1
also the column num_a is ok, since we have it now for each row
in df1, but num_b needs much more space as before, because it contains
Na-Values
Index 407680
num1 80000
num2 9408
num3 9432
num_a 80000
num_b 65124
"""
# so let's try replacing the Na-values by 0.0
df4['num_b']= df4['num_b'].fillna(0.0)
df3.memory_usage()
"""
This doesn't change anything.
It looks like the 0.0 inserted for the Nas are just
treated as any other float values and not like the fill_value
of the sparse array:
Index 407680
num1 80000
num2 9408
num3 9432
num_a 80000
num_b 65124
"""
# now rebuild the sparse column num_b
df4= df3.copy()
df4['num_b']= pd.arrays.SparseArray(df4['num_b'], fill_value=0.0)
df4.memory_usage()
"""
Now num_b has the same memeory usage as in df2
Index 407680
num1 80000
num2 9408
num3 9432
num_a 80000
num_b 5124
"""
Is there a reason for this behavior? Is this maybe a bug in pandas?
Upvotes: 3