Reputation: 680
Assume we have a dataset 'people' which contains ID and Age as a 2 times 3 matrix.
Id = 1 2 3
Age= 21 18 30
In sparkR I want to create a new dataset people2
which contains all ID who are older than 18. In this case it's ID 1 and 3. In sparkR I would do this
people2 <- people$Age > 18
but it does not work. How would you create the new dataset?
Upvotes: 1
Views: 3184
Reputation: 480
For those who appreciate R's multitude of options to do any given task, you can also use the SparkR::subset() function:
> people <- createDataFrame(sqlContext, data.frame(Id=1:3, Age=c(21, 18, 30)))
> people2 <- subset(people, people$Age > 18, select = c(1,2))
> head(people2)
Id Age
1 1 21
2 3 30
To answer the additional detail in the comment:
id <- 1:99
age <- 99:1
myRDF <- data.frame(id, age)
mySparkDF <- createDataFrame(sqlContext, myRDF)
newSparkDF <- subset(mySparkDF,
mySparkDF$id==3 | mySparkDF$id==32 | mySparkDF$id==43 | mySparkDF$id==55,
select = 1:2)
take(newSparkDF,5)
(1) Spark Jobs
id age
1 3 97
2 32 68
3 43 57
4 55 45
Upvotes: 1
Reputation: 330283
You can use SparkR::filter
with either condition:
> people <- createDataFrame(sqlContext, data.frame(Id=1:3, Age=c(21, 18, 30)))
> filter(people, people$Age > 18) %>% head()
Id Age
1 1 21
2 3 30
or SQL string:
> filter(people, "Age > 18") %>% head()
Id Age
1 1 21
2 3 30
It is also possible to use SparkR::sql
function with raw SQL query on a registered table:
> registerTempTable(people, "people"
> sql(sqlContext, "SELECT * FROM people WHERE Age > 18") %>% head()
Id Age
1 1 21
2 3 30
Upvotes: 2