Sos

Reputation: 1949

Rpy2 issue with converting df back to pandas

I have an R dataframe that I've processed:

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
import pandas as pd

%%R
n = c(2, 3, 5) 
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE) 
r_df = data.frame(n, s, b)
r_df[['c']]=NA

r_df

#out:
#  n  s b  c
#1 2 aa 1 NA
#2 3 bb 0 NA
#3 5 cc 1 NA

When I convert it to pandas, it replaces NA with integers.

with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(ro.r('r_df'))

pd_from_r_df
#Out:
#   n        s  b   c
#1  2.0     aa  1   -2147483648
#2  3.0     bb  0   -2147483648
#3  5.0     cc  1   -2147483648

I have tried to set different data types in the columns of r_df, but to no avail. How can I fix this issue?

Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue. But it should be simpler than this

Upvotes: 1

Views: 800

Answers (1)

lgautier

Reputation: 11545

The likely issue is that R has an "NA" value for boolean values ("logical vectors" in R lingo) and integer values while Python/numpy does not.

Look at how the dtype changes between the following two examples:

In [1]: import pandas                     

In [2]: pandas.Series([True, False, True])
Out[2]: 
0     True
1    False
2     True
dtype: bool

In [3]: pandas.Series([True, False, None])
Out[3]: 
0     True
1    False
2     None
dtype: object

What is happening here is that column "c" in your R data frame is of type "logical" (LGLSXP), but at the C level this is an R array of integers that takes only the values 0, 1, and -2147483648 (for FALSE, TRUE, and NA respectively). The rpy2 converter turns it into a numpy vector of integers because, as noted above, numpy's boolean dtype has no NA value, so a boolean array could not carry the missing entries.
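To make the integer encoding concrete, here is a minimal pure-Python sketch of how those three codes map back to Python values. The names `R_NA_LOGICAL` and `decode_r_logical` are made up for this illustration and are not part of rpy2.

```python
# R encodes NA in logical (and integer) vectors as the smallest 32-bit integer.
R_NA_LOGICAL = -2**31  # -2147483648


def decode_r_logical(codes):
    """Map R logical codes (0, 1, NA sentinel) to False, True, None."""
    decoded = []
    for code in codes:
        if code == R_NA_LOGICAL:
            decoded.append(None)  # missing value
        else:
            decoded.append(bool(code))  # 0 -> False, 1 -> True
    return decoded


# Codes as they appear in columns "b" and "c" of the converted data frame.
column_b = decode_r_logical([1, 0, 1])
column_c = decode_r_logical([R_NA_LOGICAL] * 3)
```

Note that the result is a list of Python objects, which is why a pandas Series built from it would have dtype object rather than bool.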

This is admittedly only one of the ways to approach conversion, and there are situations where it is not the most convenient. A custom converter can be used to get behavior that suits you better.
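Short of writing a full rpy2 converter, one possible approach is to post-process the pandas data frame after conversion. The sketch below assumes you know which columns were logical in R (R integer columns use the same NA sentinel, so they cannot be told apart by value alone). It rebuilds those columns with pandas' nullable "boolean" dtype. The helper `restore_logical` and the constant `R_NA_INT` are hypothetical names, and `converted` is a hand-built stand-in for the frame in the question, not output from rpy2.

```python
import numpy as np
import pandas as pd

# R stores NA for logical and integer vectors as the smallest 32-bit integer.
R_NA_INT = np.iinfo(np.int32).min  # -2147483648


def restore_logical(df, columns):
    """Return a copy of df with the named columns as nullable booleans."""
    out = df.copy()
    for name in columns:
        codes = out[name].to_numpy()
        # First array holds the values, second array marks the missing entries.
        out[name] = pd.Series(
            pd.arrays.BooleanArray(codes == 1, codes == R_NA_INT),
            index=out.index,
        )
    return out


# Stand-in for the converted frame shown in the question.
converted = pd.DataFrame({
    "n": [2.0, 3.0, 5.0],
    "s": ["aa", "bb", "cc"],
    "b": np.array([1, 0, 1], dtype=np.int32),
    "c": np.array([R_NA_INT] * 3, dtype=np.int32),
})
fixed = restore_logical(converted, ["b", "c"])
```

With this, "b" holds True/False values and "c" is entirely missing, while "n" and "s" are left untouched.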

PS: One more note about your workaround, quoted below:

Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue. But it should be simpler than this

What is happening here is that you are converting the R boolean vector into a vector of strings, so pandas receives the string 'None' rather than an actual missing value.

Upvotes: 2
