Reputation: 841
I am trying to split a DataFrame (DF) B into two distinct (by row) subsets. I first sample B to produce a new DF, b1, that includes approximately half of the rows of B. Then I try to filter B with the condition that the result, b2, should include every row of B whose z value is not equal to any of the z values in b1.
This seems like it should be fairly straightforward. However, the filter expression just results in an empty DataFrame. Am I misunderstanding the syntax for filter, or can you simply not refer to a second, distinct DataFrame in SparkR operations?
w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")
d2 <- cbind.data.frame(w, z)
B <- as.DataFrame(sqlContext, d2)
b1 <- sample(B, FALSE, 0.5)   # sample roughly half of B's rows
b2 <- filter(B, B$z != b1$z)  # comes back empty
Upvotes: 1
Views: 120
Reputation: 480
SparkR offers a set of functions built on Spark SQL that provide useful data manipulation abilities. One way of achieving this is with the SparkR except() command (think of it like SQL's EXCEPT, i.e. an anti-join: it returns the rows of the first DataFrame that do not appear in the second):
w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")
d2 <- cbind.data.frame(w, z)
B <- as.DataFrame(sqlContext, d2)
b1 <- sample(B, FALSE, 0.5)
b2 <- except(B, b1)
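As a quick sanity check (a sketch, not part of the original answer; count() and head() are standard SparkR functions), every row of B should land in exactly one of the two halves, since all of B's rows here are distinct:
count(b1) + count(b2) == count(B)  # TRUE: the two halves partition B
head(b1)                           # inspect the sampled half
head(b2)                           # inspect the complement
Note that except() compares whole rows and, like SQL's EXCEPT, returns distinct rows only, so duplicate rows in B would be collapsed in b2.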
Upvotes: 1
Reputation: 330423
There are actually two different problems here. First, you cannot compare columns across two DataFrames inside filter; to relate rows of B to rows of b1 you would have to perform a join operation first (see the sketch after the code below). Note that because b1 is derived from B, both sides of B$z != b1$z resolve to the same underlying column, so the condition amounts to z != z, which is never true, hence the empty result. Second, since as of now (Spark 1.6) SparkR doesn't provide randomSplit, you can apply the split manually:
library(magrittr)  # provides the %>% pipe used below
seed <- stats::runif(1)
b <- B %>% withColumn("rand", rand(seed))  # a fixed seed keeps rand stable across both filters
b1 <- b %>% where(b$rand <= 0.5)
b2 <- b %>% where(b$rand > 0.5)
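For the join route mentioned above, a minimal sketch (not part of the original answer; b1r and z1 are made-up names, while withColumnRenamed(), join(), and isNull() are all standard SparkR functions) could look like this. Renaming the sampled column first keeps the join condition from resolving back to B's own z:
b1r <- withColumnRenamed(select(b1, "z"), "z", "z1")    # rename to avoid the ambiguous self-reference
joined <- join(B, b1r, B$z == b1r$z1, "left_outer")     # keep every row of B
b2 <- select(filter(joined, isNull(b1r$z1)), "w", "z")  # rows of B with no z match in b1
Also note that the manual split leaves the helper rand column in b1 and b2; select(b1, "w", "z") drops it.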
Upvotes: 1