Reputation: 1949
I have an R dataframe that I've processed:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
import pandas as pd
%%R
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
r_df = data.frame(n, s, b)
r_df[['c']]=NA
r_df
#out:
# n s b c
#1 2 aa 1 NA
#2 3 bb 0 NA
#3 5 cc 1 NA
When I convert it to pandas, it replaces NA with integers.
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(ro.r('r_df'))
pd_from_r_df
#Out:
# n s b c
#1 2.0 aa 1 -2147483648
#2 3.0 bb 0 -2147483648
#3 5.0 cc 1 -2147483648
I have tried to set different data types in the columns of r_df, but to no avail. How can I fix this issue?
Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue. But it should be simpler than this.
Upvotes: 1
Views: 800
Reputation: 11545
The likely issue is that R has an "NA" value for boolean values ("logical vectors" in R lingo) and integer values while Python/numpy does not.
Look at how the dtype changes between the following two examples:
In [1]: import pandas
In [2]: pandas.Series([True, False, True])
Out[2]:
0 True
1 False
2 True
dtype: bool
In [3]: pandas.Series([True, False, None])
Out[3]:
0 True
1 False
2 None
dtype: object
What is happening here is that the column "c" in your R data frame is of type "logical" (LGLSXP), but in C this is an R array of integer values that only takes the values 0, 1, and -2147483648 (for FALSE, TRUE, and NA respectively). The rpy2 converter is converting to a numpy vector of integers because:
- rpy2 implements the numpy array interface to allow matching C arrays across the two languages.
- numpy uses that interface (numpy.array() is called by rpy2).
This is admittedly only one of the ways to approach conversion, and there are situations where it is not the most convenient. A custom converter can give you behavior that suits you better.
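As a simpler alternative to a custom converter, the sentinel can also be cleaned up on the pandas side after conversion. The sketch below is only an illustration, not part of rpy2: the helper name restore_logical_na and the stand-in data frame are hypothetical, and it assumes the column arrived as integers holding 0, 1, or -2147483648 as described above.

```python
import numpy as np
import pandas as pd

# R stores a logical NA as the smallest 32-bit integer.
R_NA_LOGICAL = np.iinfo(np.int32).min  # -2147483648


def restore_logical_na(col: pd.Series) -> pd.Series:
    """Hypothetical helper: turn a 0/1/sentinel integer column into
    pandas' nullable "boolean" dtype, with the sentinel becoming <NA>."""
    values = col.to_numpy()
    is_na = values == R_NA_LOGICAL
    # Any value other than 1 is treated as False in this sketch.
    out = pd.array(values == 1, dtype="boolean")
    out[is_na] = pd.NA
    return pd.Series(out, index=col.index, name=col.name)


# Stand-in for the converted data frame from the question (assumed shape).
converted = pd.DataFrame(
    {
        "b": np.array([1, 0, 1], dtype="int32"),
        "c": np.array([R_NA_LOGICAL] * 3, dtype="int32"),
    }
)
converted["c"] = restore_logical_na(converted["c"])
print(converted)
```

The nullable "boolean" dtype is used here because a plain numpy bool array, like an integer array, has no way to represent a missing value.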
PS: One more note about your workaround below:
Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue. But it should be simpler than this
What is happening here is that assigning the string 'None' converts the R boolean vector into a vector of strings.
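One consequence worth checking on the pandas side: a column of the string 'None' is not the same as a column of missing values. A small sketch, assuming the converted column arrives as ordinary Python strings (the variable names are illustrative only):

```python
import pandas as pd

# What the workaround is assumed to yield: the literal string "None".
as_strings = pd.Series(["None", "None", "None"])
# What an actually missing column looks like in pandas.
as_missing = pd.Series([None, None, None], dtype="object")

print(as_strings.isna().tolist())  # [False, False, False]
print(as_missing.isna().tolist())  # [True, True, True]
```

So functions such as isna() or dropna() will not treat the workaround's values as missing unless they are replaced afterwards.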
Upvotes: 2