Reputation: 645
I have several large dataframes which are built up from a vehicle log. Since only one message can be present on the CAN bus (the vehicle communication protocol) at any time, each row holds values for the signals of just one message.
This is a simplified dataframe without any interpolation:
time  messageA1  messageA2  messageA3  messageB1  messageB2  messageC1  messageC2
0     1          2          1          NaN        NaN        NaN        NaN
1     NaN        NaN        NaN        NaN        NaN        3          2
2     NaN        NaN        NaN        3          7          NaN        NaN
This can continue for millions of rows, with NaN values making up about 95% of the entire dataframe. I have read that when a NaN/null/None value is present in a dataframe, it is stored as a float64 value.
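A quick check with made-up values seems to confirm that upcast:

import numpy as np
import pandas as pd

# An all-integer column keeps an int64 dtype...
print(pd.Series([1, 2, 3]).dtype)        # int64

# ...but a single NaN forces the whole column to float64
print(pd.Series([1, 2, np.nan]).dtype)   # float64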
My questions:
1. Is a float64 value allocated for every NaN value?
2. If yes, does it do this memory efficiently?
3. Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to processing performance?
Upvotes: 1
Views: 941
Reputation: 3490
Is a float64 value allocated for every NaN value?
If yes, does it do this memory efficiently?
No, it does not; instead, you are supposed to use a sparse data structure, as sketched below.
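Here is a minimal sketch of building the sparse frame (named sdf, the name used in the size comparison further down) with pandas' SparseDtype; the random data and the ~95% NaN threshold are just stand-ins for your log:

import numpy as np
import pandas as pd

# Dense frame that is roughly 95% NaN, as a stand-in for the vehicle log
df = pd.DataFrame(np.random.randn(10000, 7))
df[df < 1.65] = np.nan

# Sparse equivalent: only the non-NaN values are stored explicitly
sdf = df.astype(pd.SparseDtype("float64", np.nan))
print(sdf.sparse.density)  # fraction of values actually stored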
Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to processing performance?
Yes, it will, for all those operations that are O(f(N)), depending on the f(N). Think of averaging the data, for instance: you have to check whether each value is NaN and, if it is, skip it (or maybe treat it as 0, it depends), and this is pure overhead.
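As a rough illustration of that overhead (hypothetical values, plain NumPy):

import numpy as np

values = np.array([1.0, np.nan, 3.0, np.nan, 7.0])

# A NaN-aware mean needs an extra pass just to find the valid entries...
mask = ~np.isnan(values)
mean = values[mask].sum() / mask.sum()

# ...matching what np.nanmean computes
assert np.isclose(mean, np.nanmean(values))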
You might want to compare the sheer size of the dense (your current implementation) and sparse data structures in your case:
print('dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3))
The two numbers should be pretty different.
Upvotes: 1