Reputation: 645
I have several large dataframes which are built up from a vehicle log. Since only one message can be present on the CAN bus (the vehicle communication protocol) at any time, each row holds values for the signals of just one message.
This is a simplified dataframe without any interpolation:
time  messageA1  messageA2  messageA3  messageB1  messageB2  messageC1  messageC2
0     1          2          1          NaN        NaN        NaN        NaN
1     NaN        NaN        NaN        NaN        NaN        3          2
2     NaN        NaN        NaN        3          7          NaN        NaN
This can continue for millions of rows, with NaN values making up about 95% of the entire dataframe. I have read that when a NaN/null/None value is present in a dataframe, it is stored as a float64 value.
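A quick check with made-up values seems to confirm that upcast:

import numpy as np
import pandas as pd

# An all-integer column keeps an int64 dtype...
print(pd.Series([1, 2, 3]).dtype)        # int64

# ...but a single NaN forces the whole column to float64
print(pd.Series([1, 2, np.nan]).dtype)   # float64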
My questions:
1. Is a float64 value allocated for every NaN value?
2. If yes, does it do this memory efficiently?
3. Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to processing performance?
Upvotes: 1
Views: 941
Reputation: 3490
Is a float64 value allocated for every NaN value?
If yes, does it do this memory efficiently?
No, it does not; instead, you are supposed to use a sparse data structure, as sketched below.
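Here is a minimal sketch of building the sparse frame (named sdf, the name used in the size comparison further down) with pandas' SparseDtype; the random data and the ~95% NaN threshold are just stand-ins for your log:

import numpy as np
import pandas as pd

# Dense frame that is roughly 95% NaN, as a stand-in for the vehicle log
df = pd.DataFrame(np.random.randn(10000, 7))
df[df < 1.65] = np.nan

# Sparse equivalent: only the non-NaN values are stored explicitly
sdf = df.astype(pd.SparseDtype("float64", np.nan))
print(sdf.sparse.density)  # fraction of values actually stored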
Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to processing performance?
Yes, it will, for all those operations that are O(f(N)), depending on the f(N). Think of averaging the data, for instance: you have to check whether each value is NaN and, if it is, skip it (or maybe treat it as 0, it depends), and this is pure overhead.
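As a rough illustration of that overhead (hypothetical values, plain NumPy):

import numpy as np

values = np.array([1.0, np.nan, 3.0, np.nan, 7.0])

# A NaN-aware mean needs an extra pass just to find the valid entries...
mask = ~np.isnan(values)
mean = values[mask].sum() / mask.sum()

# ...matching what np.nanmean computes
assert np.isclose(mean, np.nanmean(values))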
You might want to compare the sheer size of the dense (your current implementation) and sparse data structures in your case:
print('dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3))
The two numbers should be pretty different.
Upvotes: 1