Segmented

Reputation: 2044

Defining dtype of df.to_sparse() result

I have a dataframe df whose contents are sparse, and for memory efficiency I wish to convert it using to_sparse().

However, it seems that the new representation ends up with dtype=float64, even when my df is dtype=int8.

Is there a way to specify the data type / prevent automatic conversion to dtype=float64 when using to_sparse()?
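Roughly what I am doing (a minimal sketch, assuming a mostly-zero int8 frame):

    import numpy as np
    import pandas as pd

    # A mostly-zero frame stored as int8
    df = pd.DataFrame(np.zeros((1000, 10), dtype=np.int8))
    df.iloc[0, 0] = 1

    print(df.dtypes.unique())   # [dtype('int8')]

    sdf = df.to_sparse()        # convert to SparseDataFrame
    print(sdf.dtypes.unique())  # [dtype('float64')] -- the unwanted upcast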

Upvotes: 1

Views: 155

Answers (2)

Segmented

Reputation: 2044

Looking under the hood at the Pandas sparse frame implementation in pandas.sparse.frame, we see that the astype() method is still waiting to be implemented as of release 0.18.0. Ref. Github

Once that implementation is in place, dtype conversion should work just as it does for pandas.core.frame (the Pandas DataFrame). Given a Pandas DataFrame df, we could then convert it to a SparseDataFrame and specify the dtype:

df.to_sparse().astype(dtype)

At the moment, SparseDataFrame does not have much support for dtype, but it is currently being developed. Refer to this issue that I opened on Github.
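For illustration, a minimal sketch: astype() already works on a dense DataFrame, and the sparse call shown above is the intended usage once the implementation lands (left commented out here, since it is not available in 0.18.0):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [0.0, 1.0, 0.0], 'b': [2.0, 0.0, 0.0]})

    # Dense DataFrame: astype() works today and is the behaviour the
    # sparse implementation is expected to mirror.
    print(df.astype(np.int8).dtypes)   # a    int8
                                       # b    int8

    # Sparse equivalent -- intended usage, not yet implemented in 0.18.0:
    # df.to_sparse().astype(np.int8)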

Upvotes: 0

firelynx

Reputation: 32224

In short. No.

You see, dtypes are not a pandas-controlled entity. Dtypes are typically a numpy thing. They are not directly controllable in any way: they are automagically asserted by numpy and can only change when you change the data inside the dataframe or numpy array.

That being said, the typical reason for ending up with a float instead of an int dtype is the introduction of NaN values into the series or numpy array. Some call this a pandas gotcha; I personally would argue it is due to the (too) close coupling between pandas and numpy.
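A quick sketch of that gotcha, using a plain Series:

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 3], dtype=np.int8)
    print(s.dtype)    # int8

    # NaN is a float value; integer dtypes cannot represent it,
    # so the whole series is silently upcast.
    s[1] = np.nan
    print(s.dtype)    # float64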

In general, dtypes should never be trusted for anything; they are incredibly unreliable. I think everyone working with numpy/pandas would live a better life if they were never exposed to dtypes at all.

If you really really hate floats, the only other option for you as far as I know is to use string representations, which of course causes even more problems in most cases.
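To illustrate why, a tiny sketch of the string route backfiring:

    import pandas as pd

    s = pd.Series([1, 2, 3]).astype(str)
    print(s.dtype)    # object -- numpy no longer sees numbers at all
    print(s + s)      # '11', '22', '33' -- arithmetic becomes concatenation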

Upvotes: 1
