Rpy2 issue with converting df back to pandas

Question

I have an R dataframe that I've processed:

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
import pandas as pd

%%R
n = c(2, 3, 5) 
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE) 
r_df = data.frame(n, s, b)
r_df[['c']]=NA

r_df

#out:
#  n  s b  c
#1 2 aa 1 NA
#2 3 bb 0 NA
#3 5 cc 1 NA

When I convert it to pandas, the NA values are replaced with integers.

with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(ro.r('r_df'))

pd_from_r_df
#Out:
#   n        s  b   c
#1  2.0     aa  1   -2147483648
#2  3.0     bb  0   -2147483648
#3  5.0     cc  1   -2147483648

I have tried to set different data types in the columns of r_df, but to no avail. How can I fix this issue?

Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue. But it should be simpler than this

lgautier · Accepted Answer

The likely issue is that R has an "NA" value for boolean values ("logical vectors" in R lingo) and integer values while Python/numpy does not.

Look at how the dtype changes between the following two examples:

In [1]: import pandas                     

In [2]: pandas.Series([True, False, True])
Out[2]: 
0     True
1    False
2     True
dtype: bool

In [3]: pandas.Series([True, False, None])
Out[3]: 
0     True
1    False
2     None
dtype: object

What is happening here is that column "c" in your R data frame is of type "logical" (LGLSXP), but at the C level this is an R array of integers that takes only the values 0, 1, and -2147483648 (for FALSE, TRUE, and NA respectively). The rpy2 converter turns it into a numpy vector of integers because:

- rpy2 implements the numpy array interface to allow matching C arrays across the two languages.
- numpy uses that interface (numpy.array() is called by rpy2).
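To make the NA encoding concrete, here is a minimal sketch in plain numpy (no rpy2 or R involved). The array raw is a hand-built stand-in for the integer buffer behind a logical vector such as c(TRUE, FALSE, NA); it is an illustration only, not something rpy2 produced.

```python
import numpy as np

# R marks NA in logical and integer vectors with the smallest 32-bit integer.
R_NA_INT = int(np.iinfo(np.int32).min)  # -2147483648

# Hand-built stand-in for the C buffer behind c(TRUE, FALSE, NA).
raw = np.array([1, 0, R_NA_INT], dtype=np.int32)

# numpy has no missing-value slot for integers, so the sentinel shows up
# as an ordinary number. It can still be located by comparing against it.
is_na = raw == R_NA_INT
print(raw)
print(is_na)
```

The value -2147483648 in your pandas output is this sentinel, read as a regular integer.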

This is admittedly only one of the ways to approach conversion, and there are situations where it is not the most convenient. A custom converter can be used to get behavior that suits you better.
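As one possible alternative to a custom converter, the sentinel can be mapped back to a missing value on the pandas side after conversion. The sketch below assumes the converted frame looks like the one in the question; restore_r_na is a hypothetical helper name, not part of rpy2 or pandas, and the input frame is built by hand so the example runs without R.

```python
import numpy as np
import pandas as pd

# Sentinel R uses for NA in logical and integer vectors.
R_NA_INT = int(np.iinfo(np.int32).min)


def restore_r_na(df):
    """Return a copy of df in which the R NA sentinel becomes pd.NA.

    Hypothetical helper: only integer columns that contain the
    sentinel are changed; they are moved to the nullable Int64 dtype.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col.dtype):
            mask = col == R_NA_INT
            if mask.any():
                out[name] = col.astype("Int64").mask(mask)
    return out


# Hand-built stand-in for the frame that the conversion returned.
pd_from_r_df = pd.DataFrame({
    "n": [2.0, 3.0, 5.0],
    "s": ["aa", "bb", "cc"],
    "b": np.array([1, 0, 1], dtype=np.int32),
    "c": np.array([R_NA_INT] * 3, dtype=np.int32),
})

fixed = restore_r_na(pd_from_r_df)
print(fixed)
```

A caveat with this approach: a genuine integer equal to -2147483648 would also be treated as missing, although R itself cannot store that value in an integer vector.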

PS: One more note about your workaround, quoted below:

Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue. But it should be simpler than this

What is happening here is that you are converting the R boolean vector into a vector of strings.
