MrMoog

Reputation: 417

Pandas convert_to_r_dataframe function KeyError

I create a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(x.toarray(), columns = colnames)

Then I convert it to an R data frame:

import pandas.rpy.common as com

rdf = com.convert_to_r_dataframe(df)

Under Windows with this configuration there are no problems:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: AMD64 Family 16 Model 4
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...

But when I execute it on Linux with this configuration:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...

I get this:

Traceback (most recent call last):
  File "app.py", line 232, in <module>
    clf.global_cl(df, df2)
  File "/home/uzer/app/util/clftool.py", line 202, in global_cl
    rdf = com.convert_to_r_dataframe(df)
  File "/home/uzer/app/venv/local/lib/python2.7/site-packages/pandas/rpy/common.py", line 324, in convert_to_r_dataframe
    value = VECTOR_TYPES[value_type](value)
KeyError: <type 'numpy.int64'>

The error suggests that VECTOR_TYPES does not have <type 'numpy.int64'> as a key, but the dictionary does define it:

VECTOR_TYPES = {np.float64: robj.FloatVector,
                np.float32: robj.FloatVector,
                np.float: robj.FloatVector,
                np.int: robj.IntVector,
                np.int32: robj.IntVector,
                np.int64: robj.IntVector,
                np.object_: robj.StrVector,
                np.str: robj.StrVector,
                np.bool: robj.BoolVector}

So I printed the value type inside convert_to_r_dataframe (in ../pandas/rpy/common.py):

for column in df:
    value = df[column]
    value_type = value.dtype.type
    print("value_type: %s" % value_type)
    if value_type == np.datetime64:
        value = convert_to_r_posixct(value)
    else:
        value = [item if pd.notnull(item) else NA_TYPES[value_type]
                 for item in value]
        print("Is value_type == np.int64: %s" % (value_type is np.int64))
        value = VECTOR_TYPES[value_type](value)
        ...

This is the result:

value_type: <type 'numpy.int64'>
Is value_type == np.int64: False

How is this possible? Given that the 32-bit Windows version has no problems, could this be a problem with the 64-bit Linux Python version?
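
One possible explanation, not confirmed anywhere in this thread, is that two distinct type objects share the same printed name. A dictionary keyed by type objects looks them up by identity, so a second type with an identical name would still miss. The sketch below reproduces that effect in plain Python; the int64 classes and the registry are hypothetical stand-ins, not NumPy or pandas objects.

```python
def make_type(name):
    # Build a brand-new class on every call, even when the name repeats.
    return type(name, (int,), {})


first = make_type("int64")
second = make_type("int64")

# Hypothetical registry keyed by type object, similar in shape to VECTOR_TYPES.
registry = {first: "IntVector"}

print(repr(first) == repr(second))  # True: both print the same way
print(first is second)              # False: they are different objects
print(second in registry)           # False: the lookup would raise KeyError
```

If this is what happens here, comparing printed names is not enough to tell whether a type is a valid key; only an identity check such as `is` shows the difference.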

EDIT: As suggested by @lgautier, I changed this:

rdf = com.convert_to_r_dataframe(df)

to:

from rpy2.robjects import pandas2ri
rdf = pandas2ri.pandas2ri(df)

And that worked.

NOTE: My DataFrame has column names that contain UTF-8 special characters, decoded to unicode. When the DataFrame constructor in rpy2/robjects/vectors.py is called, this line tries to encode those unicode strings (including the special characters) as ASCII:

kv = [(str(k), conversion.py2ri(obj[k])) for k in obj]

This raises an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To solve this I had to change that line to:

kv = [(k.encode('UTF-8'), conversion.py2ri(obj[k])) for k in obj]

rpy2 should provide a way to change the encoding.
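
The encoding failure and the workaround can be reproduced without rpy2. In Python 2, calling str() on a unicode object encodes it with the default ASCII codec, which is the same codec the explicit call below uses. This is a minimal sketch; the column name is a hypothetical example chosen so that the non-ASCII character sits at position 4, as in the error message.

```python
# Hypothetical column name; u"\xe0" is at index 4.
column = u"citt\xe0"

failing_index = None
try:
    # The ASCII codec cannot represent u"\xe0".
    column.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    failing_index = exc.start  # index of the first unencodable character

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_name = column.encode("UTF-8")

print(ascii_failed, failing_index)
print(utf8_name)
```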

Upvotes: 3

Views: 1468

Answers (1)

lgautier

Reputation: 11543

Consider using rpy2's own conversion (which appears to work with int64 on Linux):

# create a test DataFrame
import numpy
import pandas

i2d = numpy.array([[1, 2, 3], [4, 5, 6]], dtype="int64")
colnames = ('a', 'b', 'c')
dataf = pandas.DataFrame(i2d, columns=colnames)

# rpy2's conversion of pandas objects
from rpy2.robjects import pandas2ri
pandas2ri.activate()

Now pandas DataFrame objects will be converted automatically to rpy2/R DataFrame objects on each call to the embedded R. For example:

from rpy2.robjects.packages import importr
# R's "base" package
base = importr('base')
# call the R function "summary"
print(base.summary(dataf))

One can also call the conversion explicitly:

from rpy2.robjects import conversion
rpy2_dataf = conversion.py2ro(dataf)

edit: (added customization to work around the str(k) issue)

Should anything related to the conversion require local customization, this can be achieved relatively easily. One way to change how the R DataFrame is built is:

from pandas import DataFrame as PandasDataFrame
from rpy2.robjects.vectors import DataFrame as RDataFrame
from rpy2.robjects import conversion
from rpy2 import rinterface

@conversion.py2ro.register(PandasDataFrame)
def py2ro_pandasdataframe(obj):
    ri_dataf = conversion.py2ri(obj)
    # cast down to an R list (goes through a different code path
    # in the DataFrame constructor, avoiding `str(k)`)
    ri_list = rinterface.SexpVector(ri_dataf)
    return RDataFrame(ri_list)

From now on, the conversion function above will be used whenever a pandas DataFrame is converted:

rpy2_dataf = conversion.py2ro(dataf)
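
The `.register(...)` decorator resembles the interface of functools.singledispatch. Assuming conversion.py2ro dispatches that way (an assumption, not something stated above), registering a second function for the same type replaces the earlier one. The sketch below shows that behaviour with the standard library only; `convert`, `Table`, and both handlers are hypothetical names, not part of rpy2.

```python
from functools import singledispatch


class Table(object):
    """Hypothetical stand-in for a pandas DataFrame."""


@singledispatch
def convert(obj):
    # Fallback used for types without a registered handler.
    raise NotImplementedError("no conversion for %s" % type(obj).__name__)


@convert.register(Table)
def convert_table_default(obj):
    return "default conversion"


@convert.register(Table)
def convert_table_custom(obj):
    # Registered later for the same type, so it replaces the handler above.
    return "custom conversion"


result = convert(Table())
print(result)
```

Under that assumption, a local override only needs to be registered once, after the library's own handlers are in place.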

Upvotes: 3
