Reputation: 417
I create a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(x.toarray(), columns = colnames)
Then I convert it to an R data frame:
import pandas.rpy.common as com
rdf = com.convert_to_r_dataframe(df)
Under Windows with this configuration there are no problems:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: AMD64 Family 16 Model 4
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...
But when I execute it on Linux with this configuration:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...
I get this:
Traceback (most recent call last):
File "app.py", line 232, in <module>
clf.global_cl(df, df2)
File "/home/uzer/app/util/clftool.py", line 202, in global_cl
rdf = com.convert_to_r_dataframe(df)
File "/home/uzer/app/venv/local/lib/python2.7/site-packages/pandas/rpy/common.py", line 324, in convert_to_r_dataframe
value = VECTOR_TYPES[value_type](value)
KeyError: <type 'numpy.int64'>
It seems that VECTOR_TYPES does not have <type 'numpy.int64'> as a key. But it does:
VECTOR_TYPES = {np.float64: robj.FloatVector,
                np.float32: robj.FloatVector,
                np.float: robj.FloatVector,
                np.int: robj.IntVector,
                np.int32: robj.IntVector,
                np.int64: robj.IntVector,
                np.object_: robj.StrVector,
                np.str: robj.StrVector,
                np.bool: robj.BoolVector}
So I printed the value type inside convert_to_r_dataframe (in ../pandas/rpy/common.py):
for column in df:
    value = df[column]
    value_type = value.dtype.type
    print("value_type: %s" % value_type)
    if value_type == np.datetime64:
        value = convert_to_r_posixct(value)
    else:
        value = [item if pd.notnull(item) else NA_TYPES[value_type]
                 for item in value]
        print("Is value_type == np.int64: %s" % (value_type is np.int64))
        value = VECTOR_TYPES[value_type](value)
...
And that's the result:
value_type: <type 'numpy.int64'>
Is value_type == np.int64: False
How is this possible? Given that the 32-bit Windows version has no problems, could this be a problem with the 64-bit Linux Python version?
EDIT: As suggested by @lgautier, I changed this:
rdf = com.convert_to_r_dataframe(df)
to:
from rpy2.robjects import pandas2ri
rdf = pandas2ri.pandas2ri(df)
And that worked.
NOTE: My dataframe has column names containing special (non-ASCII) characters, decoded from UTF-8 to unicode. When the DataFrame constructor (in rpy2/robjects/vectors.py) is called, this line tries to encode the unicode strings to ASCII:
kv = [(str(k), conversion.py2ri(obj[k])) for k in obj]
This raises an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
To solve this I had to change that line to:
kv = [(k.encode('UTF-8'), conversion.py2ri(obj[k])) for k in obj]
rpy2 should provide a way to change the encoding.
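The failure and the fix can be reproduced without rpy2. The sketch below is illustrative only: the column name and the helper encode_name are hypothetical, and the explicit ASCII encode stands in for the implicit one that str(k) performs on a unicode object in Python 2.

```python
# Hypothetical column name with a non-ASCII character at index 4.
column_name = u"citt\xe0"


def encode_name(name, encoding="UTF-8"):
    """Encode a text column name to bytes with an explicit encoding."""
    return name.encode(encoding)


# Encoding to ASCII fails, like the implicit str(k) does on Python 2.
try:
    encode_name(column_name, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding to UTF-8 succeeds and round-trips back to the original name.
utf8_name = encode_name(column_name)
round_trip = utf8_name.decode("UTF-8")
```

Making the encoding a parameter, as encode_name does here, is one way the suggested option could look; it is not an existing rpy2 API.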
Upvotes: 3
Views: 1468
Reputation: 11543
Consider using rpy2's own conversion (which appears to work with int64 on Linux):
# create a test DataFrame
import numpy
import pandas
i2d = numpy.array([[1, 2, 3], [4, 5, 6]], dtype="int64")
colnames = ('a', 'b', 'c')
dataf = pandas.DataFrame(i2d, columns=colnames)
# rpy2's conversion of pandas objects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Now pandas DataFrame objects will be converted automatically to rpy2/R DataFrame objects on each call using the embedded R. For example:
from rpy2.robjects.packages import importr
# R's "base" package
base = importr('base')
# call the R function "summary"
print(base.summary(dataf))
One can also call the conversion explicitly:
from rpy2.robjects import conversion
rpy2_dataf = conversion.py2ro(dataf)
edit: (added customization to work around the str(k) issue)
Should anything in the conversion require local customization, this can be achieved relatively easily. One way to change how the R DataFrame is built is:
# DataFrame is a class, not a module, so it has to be imported with from ... import
from pandas import DataFrame as PandasDataFrame
from rpy2.robjects.vectors import DataFrame as RDataFrame
from rpy2.robjects import conversion
from rpy2 import rinterface
@conversion.py2ro.register(PandasDataFrame)
def py2ro_pandasdataframe(obj):
    ri_dataf = conversion.py2ri(obj)
    # cast down to an R list (goes through a different code path
    # in the DataFrame constructor, avoiding `str(k)`)
    ri_list = rinterface.SexpVector(ri_dataf)
    return RDataFrame(ri_list)
From now on, the conversion function above will be used when a pandas DataFrame is present:
rpy2_dataf = conversion.py2ro(dataf)
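As for why the VECTOR_TYPES lookup in the question can fail even though np.int64 is a key: one plausible explanation (an assumption, not confirmed in this thread) is that NumPy on 64-bit Linux has two distinct 8-byte signed integer scalar types, one for C long and one for C long long, and older releases print both as numpy.int64. A dict keyed on the type object matches only one of them. Below is a minimal sketch of a lookup keyed on dtype.kind instead. KIND_TO_VECTOR and vector_name_for are hypothetical names, and the vector classes are represented by strings so that the sketch runs without rpy2 or R:

```python
import numpy as np

# Hypothetical stand-ins for the rpy2 vector classes, named as strings so
# the sketch runs without R or rpy2 installed.
KIND_TO_VECTOR = {
    "i": "IntVector",    # signed integers of any width or C origin
    "f": "FloatVector",  # floating point
    "b": "BoolVector",   # booleans
    "O": "StrVector",    # Python objects (typically strings in pandas)
}


def vector_name_for(dtype):
    """Pick a vector class name from the dtype kind, not the scalar type."""
    return KIND_TO_VECTOR[np.dtype(dtype).kind]


# 'l' is C long and 'q' is C long long. Their scalar types may or may not
# be the same object on a given platform, but both have kind 'i'.
long_type = np.dtype("l").type
longlong_type = np.dtype("q").type
print(long_type is longlong_type)
print(vector_name_for("l"), vector_name_for("q"))
```

Whether the two scalar types are the same object depends on the platform and NumPy version, so the first print is a diagnostic only; the kind-based lookup returns the same result either way.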
Upvotes: 3