I have a customers dataset with three columns: a key, plus X and Y representing the customer coordinates.
In [1]: customers
Out[1]:
cle X Y
0 b305b0b7-fb24-4eef-9055-827cf7f39b93 NaN NaN
1 edf5e4bc-d553-4285-9de3-45165f96dd02 987269.0 6236836.0
2 00ded895-ae97-4d27-b317-91c6931662b7 880460.0 6267799.0
3 d957d117-72db-444f-ac9a-338c2034f5db 830645.0 6287647.0
4 ac504435-7eb8-4275-a3da-6f673ff324ae 837826.0 6293434.0
customers.info()
returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378195 entries, 0 to 378194
Data columns (total 3 columns):
cle 378195 non-null object
X 375850 non-null float64
Y 375850 non-null float64
dtypes: float64(2), object(1)
memory usage: 8.7+ MB
I only keep data points that have X and Y values (not NaN) using:
customers2 = customers[customers['X'].notna() & customers['Y'].notna()]
This removes every row with a NaN in X or Y (for example, index 0).
In [2]: customers2
Out[2]:
cle X Y
1 edf5e4bc-d553-4285-9de3-45165f96dd02 987269.0 6236836.0
2 00ded895-ae97-4d27-b317-91c6931662b7 880460.0 6267799.0
3 d957d117-72db-444f-ac9a-338c2034f5db 830645.0 6287647.0
4 ac504435-7eb8-4275-a3da-6f673ff324ae 837826.0 6293434.0
5 5d4c8fe0-56ce-4498-a4ee-2fd4314abc1e 987025.0 6427374.0
But the info of this new data frame shows a higher memory usage, despite the reduction in the number of entries (from 378195 originally to 375850):
In [2]: customers2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375850 entries, 1 to 378194
Data columns (total 3 columns):
cle 375850 non-null object
X 375850 non-null float64
Y 375850 non-null float64
dtypes: float64(2), object(1)
memory usage: 11.5+ MB
I can't explain why, even though I can continue to manipulate customers2 like a normal data frame.
Any explanation? Thanks.
Reputation: 6270
I would say your memory usage increased because of the index. Before filtering you have:
RangeIndex: 378195 entries, 0 to 378194
and afterwards:
Int64Index: 375850 entries, 1 to 378194
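The Int64Index stores every label explicitly, at 8 bytes per entry, whereas a RangeIndex only stores its start, stop, and step.
See the RangeIndex documentation: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.RangeIndex.html
You can convert it back like this: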
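As a rough check, the two figures reported by info() can be reproduced from the row counts in the question. This is only a back-of-the-envelope sketch: it assumes a 64-bit build where each float64 value, each object pointer, and each int64 index label takes 8 bytes, it ignores the few bytes a RangeIndex needs, and it assumes info() divides by 1024 even though it prints "MB".

```python
# Assumption: 8 bytes per float64 value, object pointer, and int64 index label
BYTES_PER_VALUE = 8
MIB = 1024 ** 2  # assumed unit behind the "MB" printed by info()

rows_before = 378195  # entries in customers
rows_after = 375850   # entries in customers2
n_columns = 3         # cle, X, Y

# Before filtering: three data columns, RangeIndex cost is negligible
before = rows_before * n_columns * BYTES_PER_VALUE

# After filtering: three data columns plus one explicit int64 label per row
after = rows_after * (n_columns + 1) * BYTES_PER_VALUE

print(f"before: {before / MIB:.1f} MB")  # before: 8.7 MB
print(f"after:  {after / MIB:.1f} MB")   # after:  11.5 MB
```

Under those assumptions the extra column of index labels accounts for the whole increase, even though there are fewer rows.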
customers2.index = pd.RangeIndex(start=0, stop=len(customers2), step=1)
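To see the effect on a small scale, here is a minimal sketch with a made-up frame (the data, the size n, and the variable names are illustrative assumptions, not the customers dataset). It compares the memory used by the index before filtering, after filtering, and after resetting it with reset_index(drop=True), which is an alternative to assigning a RangeIndex by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customers frame
n = 1000
df = pd.DataFrame({
    "X": np.arange(n, dtype="float64"),
    "Y": np.arange(n, dtype="float64"),
})
df.loc[[0, 500], "X"] = np.nan  # two rows that the filter will drop

# Same kind of boolean filter as in the question
filtered = df[df["X"].notna() & df["Y"].notna()]

# Dropping the old labels gives back a compact RangeIndex
reset = filtered.reset_index(drop=True)

index_before = df.index.memory_usage()        # RangeIndex: a few bytes
index_after = filtered.index.memory_usage()   # one int64 label per row
index_reset = reset.index.memory_usage()      # RangeIndex again

print(type(df.index).__name__, index_before)
print(type(filtered.index).__name__, index_after)
print(type(reset.index).__name__, index_reset)
```

Note that resetting the index discards the original row labels, so only do it if you no longer need to match rows of customers2 back to customers by label.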