Ole Petersen

Reputation: 680

How to test entries in a DataFrame in SparkR

I have a DataFrame in SparkR called pgz. It contains user_id and time. For a fixed user_id k I get

y <- filter(pgz, pgz$user_id == k)

When I type head(y), I can see some of the times for user_id k, e.g. "2005-02-04", "2005-06-06", ... They are sorted in increasing order. For this user_id I want to test whether there are times later than a fixed time, which I set to

fixtime <- "2010-01-01"

I would like to save all user_ids that have times later than fixtime. How can this be done?

Upvotes: 1

Views: 66

Answers (1)

csgillespie

Reputation: 60492

To get started, let's create some example data to test with:

set.seed(1)
dd = data.frame(id = base::sample(1:3, 4, TRUE),
                times = base::sample(c("2005-02-04", "2005-06-06", "2007-02-04", "2006-06-06"),
                                     12, TRUE))
dd$times = as.Date(dd$times)
NROW(dd[dd$id == 1 & dd$times > as.Date("2006-01-01"), ])

For this data set, we should get the answer 2.
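Before moving to Spark, note that the question as posed asks for all ids with at least one time after the cut-off, not a count for one id. On this toy data that can be sketched in base R (using the dd created above; fixtime here is an illustrative cut-off, not the questioner's "2010-01-01"):

# Hypothetical cut-off for the toy data
fixtime <- as.Date("2006-01-01")
# unique() collapses repeated ids; the subset keeps only rows after fixtime
ids_after <- unique(dd$id[dd$times > fixtime])

The Spark version below does the equivalent with filter/select/distinct on the distributed data frame.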

Create Spark data frame

dd_sp = createDataFrame(sqlContext, dd) 

and then filter

dd_sp_k = filter(dd_sp, dd_sp$id == 1 &
                 dd_sp$times > as.Date("2006-01-01"))

Then we can use summarize to get the number of rows in the data frame:

## This seems a bit clunky, but it works (%>% comes from magrittr)
summarize(dd_sp_k, count = n(dd_sp_k$times)) %>%
  head

which gives 2.
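To get what the question actually asks for, namely all user_ids with a time later than fixtime rather than a count for one id, a sketch with the same SparkR API would be (assuming the dd_sp data frame above; filter, select, and distinct are SparkR functions):

fixtime <- "2010-01-01"
# Keep rows after the cut-off, project the id column, and deduplicate
ids_after <- distinct(select(filter(dd_sp, dd_sp$times > as.Date(fixtime)), "id"))
head(ids_after)

On the questioner's pgz data frame the same pattern applies with pgz$user_id and pgz$time in place of dd_sp$id and dd_sp$times.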

Upvotes: 1
