Reputation: 543

Extracting multiple rows for each ID based on a condition

I have a data frame with thousands of rows but a sample is given below:

     userid     event
1     123        view
2     123        view
3     123       order
4     345        view
5     345        view
6     345        view
7     345       order
8     111        view
9     111       order
10    111        view
11    111        view
12    111        view
13    333        view
14    333        view
15    333        view

dput(data)

structure(list(userid = c(123, 123, 123, 345, 345, 345, 345, 
111, 111, 111, 111, 111, 333, 333, 333), eventaction = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("order", 
"view"), class = "factor")), .Names = c("userid", "event"
), row.names = c(NA, -15L), class = "data.frame")

What I am looking at doing is to extract all the rows of userid's that contains the word "order" under event. The result will contain all the rows for userid's excluding userid = 333 as the eventaction does not contain an order entry.

The second task is to count the number of occurences for "view" before the order entry. I will appreciate help and pointers.

Thanks.

Upvotes: 2

Answers (4)

aichao

Reputation: 7455

To do your second task with the possibility that there are more than one order for a userid:

library(dplyr)
df %>% group_by(userid) %>% 
       mutate(row_num = row_number()) %>% 
       filter(event=="order") %>% 
       mutate(num_views_before=c(first(row_num),diff(row_num))-1)

Notes:

We group_by userid.
We add a column to track the rows for the group.
We keep only those rows with "order".
We compute the number of views before each order using diff on the row numbers created beforehand.

To test, I modified your data to change the event in row 12 to "order" so that userid=111 has two orders.

Modified data:

structure(list(userid = c(123, 123, 123, 345, 345, 345, 345, 
111, 111, 111, 111, 111, 333, 333, 333), event = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("order", 
"view"), class = "factor")), .Names = c("userid", "event"), row.names = c(NA, 
-15L), class = "data.frame")
##   userid event
##1     123  view
##2     123  view
##3     123 order
##4     345  view
##5     345  view
##6     345  view
##7     345 order
##8     111  view
##9     111 order
##10    111  view
##11    111  view
##12    111 order
##13    333  view
##14    333  view
##15    333  view

With this data, we get:

##Source: local data frame [4 x 4]
##Groups: userid [3]
##
##  userid  event row_num num_views_before
##   <dbl> <fctr>   <int>            <dbl>
##1    123  order       3                2
##2    345  order       4                3
##3    111  order       2                1
##4    111  order       5                2

Upvotes: 1

989

Reputation: 12935

You could do this:

df[df$userid %in% df[df$event=="order",]$userid,]

OR with subset:

subset(df, df$userid %in% subset(df, event=="order")$userid)

OR match function:

subset(df, match(df$userid, subset(df, event=="order")$userid, nomatch = 0)>0)

OR using sqldf library:

library(sqldf)
sqldf("select * from df where df.userid in (select df.userid from df where df.event=='order')")

   # userid event
# 1     123  view
# 2     123  view
# 3     123 order
# 4     345  view
# 5     345  view
# 6     345  view
# 7     345 order
# 8     111  view
# 9     111 order
# 10    111  view
# 11    111  view
# 12    111  view

Upvotes: 0

dmontaner

Reputation: 2165

With standard R, if you call mydat to your data.frame:

myusers <- mydat[mydat$event == "order", "userid"]
mydat[mydat$userid %in% myusers,]

Upvotes: 0

akrun

Reputation: 887851

We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'userid', if there are any 'event' that is an 'order' in the 'userid', return the Subset of Data.table' (.SD)

library(data.table)
setDT(data)[,if(any(event=="order")) .SD , by = userid]

Or using dplyr, we filter for any 'order' in the 'event' after grouping by 'userid'.

library(dplyr)
data %>%
    group_by(userid) %>%
    filter(any(event == "order"))

Upvotes: 3

Extracting multiple rows for each ID based on a condition

Answers (4)

Related Questions