How to subset a SparkR data frame

Question

Assume we have a dataset 'people' with two variables, Id and Age, and three observations:

Id = 1 2 3
Age= 21 18 30

In SparkR I want to create a new dataset people2 that contains every Id whose Age is greater than 18; in this case, Ids 1 and 3. I tried this:

people2 <- people$Age > 18

but it does not work. How would you create the new dataset?

SpiritusPrana · Accepted Answer

For those who appreciate R's multitude of options to do any given task, you can also use the SparkR::subset() function:

> people <- createDataFrame(sqlContext, data.frame(Id=1:3, Age=c(21, 18, 30)))
> people2 <- subset(people, people$Age > 18, select = c(1,2))
> head(people2)
  Id Age
1  1  21
2  3  30
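As a rough sketch of why the assignment in the question returns something other than a data set, and of one of the other options alluded to above: in SparkR, people$Age > 18 builds a Column expression rather than a data frame, and filter() can apply that expression to the rows. The setup lines assume the Spark 1.x style API (sparkR.init and sparkRSQL.init) to match the createDataFrame(sqlContext, ...) call used in this answer; the names cond and local2 are only illustrative.

```r
library(SparkR)

# Assumption: Spark 1.x style setup, matching createDataFrame(sqlContext, ...) above
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

people <- createDataFrame(sqlContext, data.frame(Id = 1:3, Age = c(21, 18, 30)))

# people$Age > 18 is a Column (an unevaluated expression), not a DataFrame
cond <- people$Age > 18

# filter() applies the Column condition and returns a new SparkR DataFrame
people2 <- filter(people, cond)

# collect() brings the (small) result back as a local R data.frame
local2 <- collect(people2)
```

Unlike subset(), filter() keeps every column, so it is the shorter choice when no column selection is needed.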

To answer the additional detail in the comment, here is how to keep only the rows with specific id values:

id <- 1:99
age <- 99:1
myRDF <- data.frame(id, age)
mySparkDF <- createDataFrame(sqlContext, myRDF)

newSparkDF <- subset(mySparkDF, 
        mySparkDF$id==3 | mySparkDF$id==32 | mySparkDF$id==43 | mySparkDF$id==55, 
        select = 1:2)
take(newSparkDF,5)

  id age
1  3  97
2 32  68
3 43  57
4 55  45
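When the list of ids grows, chaining == comparisons with | gets unwieldy. A possible alternative, sketched here under the same Spark 1.x setup assumption as the rest of this answer, is the %in% operator that SparkR defines for Column objects. The vector name wantedIds is hypothetical.

```r
library(SparkR)

# Assumption: Spark 1.x style setup, matching createDataFrame(sqlContext, ...) above
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

id <- 1:99
age <- 99:1
mySparkDF <- createDataFrame(sqlContext, data.frame(id, age))

# Hypothetical helper vector holding the ids to keep (integers, like the id column)
wantedIds <- c(3L, 32L, 43L, 55L)

# %in% on a Column builds a single membership condition instead of a chain of ORs
newSparkDF <- subset(mySparkDF, mySparkDF$id %in% wantedIds, select = 1:2)

# Bring the matching rows back as a local R data.frame
result <- collect(newSparkDF)
```

This should select the same four rows as the explicit comparison above, and the condition stays a single line however many ids are listed.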
