Reputation: 551
I want to run some Python functions on data stored in an '.RData' file. I am using the 'pyreadr' Python package to read the file.
Here is an example of the R code:
library(data.table)
# Example
data <- data.table(x_num=c(1,1.5,2),
x_int=c(1,2,3))
data$x_int <- as.integer(data$x_int) # Making sure the data is in integer type
data_missing <- data.table(x_num=c(1.5,2,NA,5,6),
x_int=c(1,2,3,NA,5))
data_missing$x_int <- as.integer(data_missing$x_int) # Making sure the data is in integer type
# checking the classes
sapply(data,class)
sapply(data_missing,class)
# Storing the data in RData file
save(data, file = "test_data.RData")
save(data_missing, file = "test_missing_data.RData")
I am storing the tables in separate files because 'test_data.RData' loads correctly in Python, whereas in 'test_missing_data.RData' the integer column containing NA values is loaded as object rather than as an integer datatype.
Here is the Python code:
# Working example
import pyreadr
result=pyreadr.read_r('test_data.RData')
data=result['data']
data.dtypes
# Output
# x_num float64
# x_int int32
# Example where NA values are converted to object datatype
import pyreadr
result=pyreadr.read_r('test_missing_data.RData') # No error is raised, but x_int is loaded as object
data=result['data_missing']
data.dtypes
# Output
# x_num float64
# x_int object
There is no error message, but I need the column to keep an integer datatype even when it contains missing (NA) values.
Thank you for your time and help.
Upvotes: 0
Views: 65
Reputation: 3417
At the moment, what you describe is the expected behavior of the package. In older versions of pandas, integer columns were backed by numpy integer arrays, and those cannot hold a numpy NaN value: NaN is a float, and it was the only missing value representation available. The column type therefore had to be set to object so that it could hold data of two different types: integer and float.
More recent versions of pandas have introduced a nullable integer column type.
Pyreadr will convert those object columns back to an R integer when writing back to R.
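As a rough sketch of how the nullable integer type could be applied on the Python side, the object column can be converted after reading. The DataFrame below is built by hand as a stand-in for what `pyreadr.read_r` returns for `data_missing` (an assumption, so that the snippet runs without the file); the conversion through `pd.to_numeric` followed by `astype("Int32")` is one possible approach, not something pyreadr does for you.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for result['data_missing'] (assumption: the object
# column holds Python integers plus NaN for the missing entry).
df = pd.DataFrame({
    "x_num": [1.5, 2.0, np.nan, 5.0, 6.0],
    "x_int": pd.Series([1, 2, 3, np.nan, 5], dtype=object),
})

# Parse the object column as numbers first, then cast to the nullable
# 32 bit integer type. Note the capital "I" in "Int32".
df["x_int"] = pd.to_numeric(df["x_int"]).astype("Int32")

print(df.dtypes)
print(df)
```

After the conversion the missing entry is shown as `<NA>` and the remaining values stay integers instead of being promoted to float.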
When writing integers to R, you have to make sure that they are 32 bit integers or smaller. This is because all integers in R are 32 bit, whereas pandas supports 64, 32, 16 and 8 bit integers. 64 bit integers cannot be translated to 32 bit integers because of the risk of overflow. If you set your own integer columns, it is best to convert them to the type 'Int32' (note the capital I), and pyreadr will convert them correctly to R integers.
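To illustrate the overflow concern, here is a minimal sketch that checks a 64 bit nullable column before narrowing it to 'Int32'. The helper name `to_int32_checked` is hypothetical (it is not part of pandas or pyreadr), and using numpy's int32 limits as the allowed range is an assumption about what fits in an R integer.

```python
import numpy as np
import pandas as pd


def to_int32_checked(col: pd.Series) -> pd.Series:
    """Cast a nullable integer column to Int32, refusing values that would overflow.

    Hypothetical helper for illustration only.
    """
    limits = np.iinfo(np.int32)
    present = col.dropna()  # ignore missing values for the range check
    if ((present < limits.min) | (present > limits.max)).any():
        raise OverflowError("column contains values outside the 32 bit integer range")
    return col.astype("Int32")


small = pd.Series([1, 2, None, 5], dtype="Int64")
converted = to_int32_checked(small)
print(converted.dtype)
```

A column holding a value such as `2**40` makes the helper raise `OverflowError` rather than being narrowed.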
Upvotes: 0