Reputation: 195
I am writing R code in a Databricks notebook that performs several operations on a dataframe. Once the dataframe is cleaned up, I would like to access it in a Python cell (using '%python') and continue operating on it with Python code.
Within the Python block, I would therefore like to transform my R dataframe into a pandas dataframe. Does anybody know how to do this? Thanks!
Upvotes: 6
Views: 7340
Reputation: 1
There isn't a straightforward way to do this; it takes a few steps in Databricks:
1. In an R cell, convert the R data.frame to a Spark DataFrame with SparkR.
2. Register that Spark DataFrame as a temporary view.
3. In a Python cell, query the view through the SparkSession.
The second step is necessary for the Python cell to be able to "find" the data frame. Otherwise you'll get the dreaded NameError mentioned previously.
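For example, if you skip the temp view and reference the R variable straight from a Python cell (a hypothetical cell, just to show the failure mode), Python raises a NameError because the two kernels don't share a namespace:
%python
df.head()  # NameError: name 'df' is not defined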
Here's an example of what that might look like:
%r
library(SparkR)
# Convert the R data.frame to a Spark DataFrame, then expose it as a temp view
df <- as.DataFrame(df)
createOrReplaceTempView(df, "df")
%python
from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession named `spark` already exists;
# getOrCreate() simply returns it
spark = SparkSession.builder.appName("Temp View").getOrCreate()

# Read the temp view and convert the result to pandas
df_pandas = spark.sql("SELECT * FROM df").toPandas()
Upvotes: 0
Reputation: 699
I think the namespace between different kernels is separate on Databricks, so even in the same notebook you will not see an R variable in Python, or vice versa.
My understanding is that there are two methods to share data between kernels: 1) using the filesystem (CSV, etc.) and 2) temporary Databricks tables. I believe the latter is the more typical route[1].
The filesystem route writes the dataframe out from R and reads it back with pandas:
%r
# Write the cleaned dataframe somewhere the Python kernel can also read it
# (write.csv includes row names by default; pass row.names = FALSE to avoid an extra column)
write.csv(df, "/FileStore/tmp.csv")

%python
import pandas as pd

# Read the same file back as a pandas DataFrame
df = pd.read_csv("/FileStore/tmp.csv")
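The temp-table route registers the data with Spark, so the Python kernel's SparkSession can query it directly: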
%r
library(SparkR)
sparkR.session()

# Load the data as a Spark DataFrame and register it as a temporary table
# (registerTempTable is deprecated; createOrReplaceTempView is the newer equivalent)
df <- read.df("path/to/original_file.csv", source = "csv")
registerTempTable(df, "tmp_df")

%python
# Query the temp table and convert the result to pandas
df = spark.sql("select * from tmp_df").toPandas()
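Note that read.df re-reads the source file as a Spark DataFrame. If your cleaned-up data is already a plain R data.frame, SparkR's as.DataFrame(df) (as in the answer above) converts it in place before you register the temp table.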
Upvotes: 11
Reputation: 462
Note: since rpy2 release 3.3.0, explicit conversion is done as follows:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

dt = pd.DataFrame()
# To R data.frame
with localconverter(ro.default_converter + pandas2ri.converter):
    r_dt = ro.conversion.py2rpy(dt)
# To pandas DataFrame
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_dt = ro.conversion.rpy2py(r_dt)
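If you want the conversion to happen implicitly everywhere instead, pandas2ri.activate() registers the converter globally, though I believe the explicit localconverter pattern above is what the rpy2 docs recommend.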
Upvotes: 2