Reputation: 543
I have a data frame with thousands of rows but a sample is given below:
userid event
1 123 view
2 123 view
3 123 order
4 345 view
5 345 view
6 345 view
7 345 order
8 111 view
9 111 order
10 111 view
11 111 view
12 111 view
13 333 view
14 333 view
15 333 view
dput(data)
structure(list(userid = c(123, 123, 123, 345, 345, 345, 345,
111, 111, 111, 111, 111, 333, 333, 333), eventaction = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("order",
"view"), class = "factor")), .Names = c("userid", "event"
), row.names = c(NA, -15L), class = "data.frame")
What I am looking at doing is to extract all the rows of userid's that contains the word "order" under event. The result will contain all the rows for userid's excluding userid = 333 as the eventaction does not contain an order entry.
The second task is to count the number of occurences for "view" before the order entry. I will appreciate help and pointers.
Thanks.
Upvotes: 2
Views: 974
Reputation: 7435
To do your second task with the possibility that there are more than one order for a userid
:
library(dplyr)
df %>% group_by(userid) %>%
mutate(row_num = row_number()) %>%
filter(event=="order") %>%
mutate(num_views_before=c(first(row_num),diff(row_num))-1)
Notes:
group_by
userid
.diff
on the row numbers created beforehand.To test, I modified your data to change the event in row 12 to "order" so that userid=111
has two orders.
Modified data:
structure(list(userid = c(123, 123, 123, 345, 345, 345, 345,
111, 111, 111, 111, 111, 333, 333, 333), event = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("order",
"view"), class = "factor")), .Names = c("userid", "event"), row.names = c(NA,
-15L), class = "data.frame")
## userid event
##1 123 view
##2 123 view
##3 123 order
##4 345 view
##5 345 view
##6 345 view
##7 345 order
##8 111 view
##9 111 order
##10 111 view
##11 111 view
##12 111 order
##13 333 view
##14 333 view
##15 333 view
With this data, we get:
##Source: local data frame [4 x 4]
##Groups: userid [3]
##
## userid event row_num num_views_before
## <dbl> <fctr> <int> <dbl>
##1 123 order 3 2
##2 345 order 4 3
##3 111 order 2 1
##4 111 order 5 2
Upvotes: 1
Reputation: 12937
You could do this:
df[df$userid %in% df[df$event=="order",]$userid,]
OR with subset
:
subset(df, df$userid %in% subset(df, event=="order")$userid)
OR match
function:
subset(df, match(df$userid, subset(df, event=="order")$userid, nomatch = 0)>0)
OR using sqldf
library:
library(sqldf)
sqldf("select * from df where df.userid in (select df.userid from df where df.event=='order')")
# userid event
# 1 123 view
# 2 123 view
# 3 123 order
# 4 345 view
# 5 345 view
# 6 345 view
# 7 345 order
# 8 111 view
# 9 111 order
# 10 111 view
# 11 111 view
# 12 111 view
Upvotes: 0
Reputation: 2165
With standard R, if you call mydat
to your data.frame:
myusers <- mydat[mydat$event == "order", "userid"]
mydat[mydat$userid %in% myusers,]
Upvotes: 0
Reputation: 886948
We can try with data.table
. Convert the 'data.frame' to 'data.table' (setDT(data)
), grouped by 'userid', if
there are any
'event' that is an 'order' in the 'userid', return the Subset of Data.table' (.SD
)
library(data.table)
setDT(data)[,if(any(event=="order")) .SD , by = userid]
Or using dplyr
, we filter
for any
'order' in the 'event' after grouping by 'userid'.
library(dplyr)
data %>%
group_by(userid) %>%
filter(any(event == "order"))
Upvotes: 3