I have a DataFrame in SparkR called pgz. It contains user_id and time. For a fixed user_id k I get

y <- filter(pgz, pgz$user_id == k)

When I type head(y) I can see some of the times for user_id k: "2005-02-04", "2005-06-06", ... They are sorted in increasing order.

For this user_id I want to test whether that user has times larger than a fixed time, which I set to

fixtime <- "2010-01-01"

I would like to save all user_ids that have times larger than fixtime. How can this be done?
To get started, let's create some example data to test:

set.seed(1)
dd = data.frame(id = base::sample(1:3, 4, TRUE),
                times = base::sample(c("2005-02-04", "2005-06-06",
                                       "2007-02-04", "2006-06-06"),
                                     12, TRUE))
dd$times = as.Date(dd$times)
NROW(dd[dd$id == 1 & dd$times > as.Date("2006-01-01"), ])

For this data set we should get the answer 2 (the exact rows depend on R's sampling RNG, which changed in R 3.6.0; the counting logic is the same either way).
Create the Spark DataFrame

dd_sp = createDataFrame(sqlContext, dd)

and then filter:

dd_sp_k = filter(dd_sp, dd_sp$id == 1 &
                        dd_sp$times > as.Date("2006-01-01"))
Then we can use summarize to count the rows of the filtered data frame:

## This seems a bit clunky, but it works.
summarize(dd_sp_k, count = n(dd_sp_k$times)) %>%
  head
which gives 2.
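The question actually asked for the set of user_ids with times past the cutoff, not just a count for one id. Here is a minimal sketch of that last step, runnable locally on the same example data; the SparkR equivalent is shown in comments and is an untested sketch (filter, select, distinct, and collect are standard SparkR DataFrame operations):

```r
# The same example data as above.
set.seed(1)
dd <- data.frame(id = base::sample(1:3, 4, TRUE),
                 times = base::sample(c("2005-02-04", "2005-06-06",
                                        "2007-02-04", "2006-06-06"),
                                      12, TRUE))
dd$times <- as.Date(dd$times)

fixtime <- as.Date("2006-01-01")

# Base R: keep rows past the cutoff, then take the distinct ids.
late_ids <- unique(dd[dd$times > fixtime, "id"])
late_ids

# SparkR (1.x) equivalent, assuming dd_sp = createDataFrame(sqlContext, dd):
#   late   <- filter(dd_sp, dd_sp$times > fixtime)
#   ids_df <- distinct(select(late, "id"))
#   collect(ids_df)$id
```

collect() pulls the small result set of ids back into a local R vector, which you can then save with saveRDS or similar.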