Nick

Reputation: 195

Databricks: How to switch from R Dataframe to Pandas Dataframe (R to python in the same notebook)

I am writing R code in a Databricks notebook that performs several operations in R. Once the dataframe is cleaned up, I would like to invoke it in a Python cell using '%python' and continue operating on it in Python.

Within the Python block, I would therefore like to convert my R data.frame into a pandas DataFrame. Does anybody know how to do this? Thanks!

Upvotes: 6

Views: 7340

Answers (3)

TheDandyGent

Reputation: 1

There isn't a straightforward way to do this; it takes a few steps in Databricks:

  1. Convert R data.frame to SparkDataFrame
  2. Register SparkDataFrame as a temp view (this can't be done on a regular data.frame or data.table)
  3. Convert temp view to pandas or pyspark DataFrame

The second step is necessary for the Python cmd cell to be able to "find" the data frame; R and Python cells do not share a namespace, so without the temp view you'll get the dreaded NameError.

Here's an example of what that might look like:

  1. R cmd cell:

library(SparkR)
# Convert the R data.frame to a SparkDataFrame and register it as a temp view
df <- as.DataFrame(df)
createOrReplaceTempView(df, "df")

  2. Python cmd cell:

import pyspark
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Temp View").getOrCreate()

# Read the temp view and convert it to a pandas DataFrame
df_pandas = spark.sql("SELECT * FROM df").toPandas()

Upvotes: 0

Keith

Reputation: 699

I think the namespaces of the different kernels are separate on Databricks, so even in the same notebook you will not see an R variable in Python, or vice versa.

My understanding is that there are two methods to share data between kernels: 1) using the filesystem (CSV, etc.) and 2) temporary Databricks tables. I believe the latter is the more typical route[1].

  1. Filesystem:

%r
write.csv(df, "/FileStore/tmp.csv")

%python
import pandas as pd
df = pd.read_csv("/FileStore/tmp.csv")

  2. Temporary Databricks table:

%r
library(SparkR)
sparkR.session()
df <- read.df("path/to/original_file.csv", source="csv")
registerTempTable(df, "tmp_df")

%python
df = spark.sql("select * from tmp_df").toPandas()
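One gotcha in the filesystem route: R's write.csv includes row names by default, so the CSV gains an extra leading column that pandas would otherwise read as "Unnamed: 0". A minimal pandas-only sketch of handling that on the Python side (the CSV text here just simulates what the R cell would have written; it is not real data from the question):

```python
import pandas as pd
from io import StringIO

# Simulated contents of the file written by R's write.csv(df, ...):
# R emits a quoted header with an empty name for the row-name column.
csv_text = '"","x","y"\n"1",1,"a"\n"2",2,"b"\n'

# index_col=0 folds R's row-name column into the pandas index instead of
# leaving a stray "Unnamed: 0" column; alternatively, write.csv(df, path,
# row.names=FALSE) on the R side avoids the extra column entirely.
df = pd.read_csv(StringIO(csv_text), index_col=0)
print(list(df.columns))  # ['x', 'y']
```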

[1] https://forums.databricks.com/questions/16039/use-python-and-r-variable-in-the-same-notebook-amo.html

Upvotes: 11

Twinkle Patel

Reputation: 462

Note: since rpy2 release 3.3.0, explicit conversion is done as follows. The pandas converter has to be active for DataFrame conversion, e.g. via a localconverter context:

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

dt = pd.DataFrame()

To an R data.frame:

with localconverter(ro.default_converter + pandas2ri.converter):
    r_dt = ro.conversion.py2rpy(dt)

Back to a pandas DataFrame:

with localconverter(ro.default_converter + pandas2ri.converter):
    pd_dt = ro.conversion.rpy2py(r_dt)

Upvotes: 2
