Julia

Reputation: 109

Convert column float64/int64 to column with float/int as type in pandas dataframe

I want to save my pandas DataFrame as a Stata file, but there seems to be a problem with columns of type int64 or float64, so I think they need to be converted to the standard Python types int and float. I have searched a lot, but none of the solutions I found has worked for me.

I have tried using something like:

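import numpy as np

def conversion(obj):
    # Convert a NumPy scalar to the matching built-in Python type.
    # (np.asscalar was removed from recent NumPy releases; .item() is the replacement.)
    if isinstance(obj, np.generic):
        return obj.item()
    return obj  # leave anything else unchanged instead of returning None

mergeddfnew["speech_main_wordspersentcount_wc"] = mergeddfnew["speech_main_wordspersentcount_wc"].apply(conversion)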

I also tried astype. The type of the column always stays the same.

Upvotes: 1

Views: 5274

Answers (1)

Andy Hayden

Reputation: 375415

See the IO section of the docs:

Stata data files have limited data type support; only strings with 244 or fewer characters, int8, int16, int32, float32 and float64 can be stored in .dta files. Additionally, Stata reserves certain values to represent missing data. Exporting a non-missing value that is outside of the permitted range in Stata for a particular data type will retype the variable to the next larger size. For example, int8 values are restricted to lie between -127 and 100 in Stata, and so variables with values above 100 will trigger a conversion to int16. nan values in floating points data types are stored as the basic missing data type (. in Stata).

However, pandas tries its best to overcome some of these limitations and convert for you:

The Stata writer gracefully handles other data types including int64, bool, uint8, uint16, uint32 by casting to the smallest supported type that can represent the data. For example, data with a type of uint8 will be cast to int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or, if values are outside of this range, the variable is cast to int16.

Which is to say, it seems your column doesn't satisfy these conditions.
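To check whether the writer copes with an int64 column in your pandas version, a small round trip can help. This is only a sketch: the column name word_count, the sample values and the file name example.dta are made up for illustration, not taken from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one int64 column with small values.
df = pd.DataFrame({"word_count": np.array([12, 40, 7], dtype=np.int64)})
print(df["word_count"].dtype)  # dtype before writing

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    # write_index=False keeps the DataFrame index out of the .dta file.
    df.to_stata(path, write_index=False)
    back = pd.read_stata(path)

# The values should survive; the dtype read back is whatever Stata type
# the writer chose for the column.
print(back["word_count"].tolist())
print(back["word_count"].dtype)
```

If this works but your real column fails, looking at that column's dtype, min() and max() and checking it for missing values is probably the next step.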

I would try manually converting it to something supported in dta like int32 (supposing it's int):

import numpy as np
# astype returns a new Series, so assign the result back to the column
df["speech_main_wordspersentcount_wc"] = df["speech_main_wordspersentcount_wc"].astype(np.int32)
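As for why the apply approach in the question leaves the dtype unchanged: pandas stores the column in a NumPy array, so even if every element is turned into a Python int, the rebuilt Series is inferred as int64 again. A minimal sketch with a made-up column (word_count is a placeholder name, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the column in the question.
df = pd.DataFrame({"word_count": np.array([12, 40, 7], dtype=np.int64)})

# Element-wise conversion to Python int: pandas infers int64 again
# when it builds the resulting Series.
as_python = df["word_count"].apply(int)
print(as_python.dtype)

# astype changes the storage dtype, provided the result is assigned back.
df["word_count"] = df["word_count"].astype(np.int32)
print(df["word_count"].dtype)
```

So the column never holds "plain" Python ints unless it has object dtype; picking a NumPy dtype that Stata supports is the relevant knob here.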

Upvotes: 1
