kathystehl

Reputation: 841

Reference two separate DataFrames in SparkR operations

I am trying to split a DataFrame B into two distinct (by row) subsets. First I sample B to produce a new DataFrame, b1, containing roughly half of B's rows. Then I try to filter B so that the second DataFrame, b2, contains every row of B whose z value does not appear among the z values in b1.

This seems like it should be fairly straightforward, but the filter expression just returns an empty DataFrame. Am I misunderstanding the syntax for filter, or is it simply not possible to reference two distinct DataFrames in a SparkR operation?

w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")

d2 <- cbind.data.frame(w, z)
B <- as.DataFrame(sqlContext, d2)

b1 <- sample(B, FALSE, 0.5)
b2 <- filter(B, B$z != b1$z)

Upvotes: 1

Views: 120

Answers (2)

SpiritusPrana

Reputation: 480

SparkR offers a set of functions built on Spark SQL that provide useful data-manipulation abilities. One way of achieving this split is with SparkR's except() command, which returns the rows of the first DataFrame that are not present in the second (analogous to SQL's EXCEPT set operator):

w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")

d2 <- cbind.data.frame(w, z)
B <- as.DataFrame(sqlContext, d2)

b1 <- sample(B, FALSE, 0.5)
b2 <- except(B, b1)
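A quick sanity check on the split (assuming the same sqlContext session as above): the two subsets should partition B. One caveat worth knowing: except() is a set operation, so exact-duplicate rows in B would be collapsed; that is not an issue here because the rows of d2 are distinct.

```r
# The subsets should account for every row of B exactly once
# (this holds here because B contains no duplicate rows)
count(b1) + count(b2) == count(B)
```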

Upvotes: 1

zero323

Reputation: 330423

There are actually two different problems here:

  • In general, you cannot reference another table inside filter without first performing some kind of join.
  • In this particular case, due to the common lineage of B and b1, the condition effectively compares B$z with itself (a similar issue to SPARK-6231), hence no exception, just an empty result set.
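To illustrate the first point, the anti-join the question is reaching for can be written explicitly with a left outer join followed by a null filter (a sketch, assuming b1's z column is first renamed to avoid ambiguity between the two sides):

```r
# Keep only the rows of B whose z has no match in b1
b1r <- withColumnRenamed(select(b1, "z"), "z", "z1")
joined <- join(B, b1r, B$z == b1r$z1, "left_outer")
b2 <- select(filter(joined, isNull(joined$z1)), "w", "z")
```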

Since, as of now (Spark 1.6), SparkR doesn't provide randomSplit, you can apply the split manually:

library(magrittr)  # provides the %>% pipe used below

seed <- stats::runif(1)
b <- B %>% withColumn("rand", rand(seed))
b1 <- b %>% where(b$rand <= 0.5)
b2 <- b %>% where(b$rand > 0.5)

Upvotes: 1
