stevec
stevec

Reputation: 52358

group_by, look up value for group, return value and original data.frame? (efficiently, using dplyr)

I am attempting to group_by, locate the first appearance of a row that meets a criterion, and create a new column with a value based that selected row from each group.

Example

This is easy to demonstrate with an example. A new column should be generated by

  1. grouping by transaction_id
  2. looking for the first appearance of icecream_bool ( = 1) (note that the second transaction has two rows that meet this criteria, so the first should be taken)
  3. creating a new column using the value located in the 'item' column

We start with this data.frame

df <- data.frame(
  transaction_id=as.integer(c(1,1,1,2,2,2,2,2,3,3,3)),
  item=as.character(c("crisps", "magnum", "gum",
                      "jerky", "cheese", "snickers", "ben&jerry", "magnum",
                      "halo", "crisps", "mars")),
  icecream_bool=as.integer(c(0,1,0,
                             0,0,0,1,1,
                             1,0,0)),
  stringsAsFactors = F
)

#    transaction_id      item icecream_bool
# 1               1    crisps             0
# 2               1    magnum             1
# 3               1       gum             0
# 4               2     jerky             0
# 5               2    cheese             0
# 6               2  snickers             0
# 7               2 ben&jerry             1
# 8               2    magnum             1
# 9               3      halo             1
# 10              3    crisps             0
# 11              3      mars             0

Desired output

Generate ice_cream column like so

   transaction_id      item icecream_bool ice_cream
1               1    crisps             0    magnum
2               1    magnum             1    magnum
3               1       gum             0    magnum
4               2     jerky             0 ben&jerry
5               2    cheese             0 ben&jerry
6               2  snickers             0 ben&jerry
7               2 ben&jerry             1 ben&jerry
8               2    magnum             1 ben&jerry
9               3      halo             1      halo
10              3    crisps             0      halo
11              3      mars             0      halo

Upvotes: 0

Views: 49

Answers (4)

Ronak Shah
Ronak Shah

Reputation: 389055

Another way if you only have 1's and 0's in icecream_bool is by using which.max

library(dplyr)
df %>%
  group_by(transaction_id) %>%
  mutate(ice_cream = item[which.max(icecream_bool)])


#   transaction_id item      icecream_bool ice-cream
#            <int> <chr>             <int> <chr>    
# 1              1 crisps                0 magnum   
# 2              1 magnum                1 magnum   
# 3              1 gum                   0 magnum   
# 4              2 jerky                 0 ben&jerry
# 5              2 cheese                0 ben&jerry
# 6              2 snickers              0 ben&jerry
# 7              2 ben&jerry             1 ben&jerry
# 8              2 magnum                1 ben&jerry
# 9              3 halo                  1 halo     
#10              3 crisps                0 halo     
#11              3 mars                  0 halo     

This can also be written in data.table

library(data.table)
setDT(df)[, ice_cream := item[which.max(icecream_bool)], transaction_id]

If there are values which are greater than 1 and we need to compare with 1 we can use match

df %>%
  group_by(transaction_id) %>%
  mutate(ice_cream = item[match(1,icecream_bool)])

Upvotes: 2

lroha
lroha

Reputation: 34556

You can do:

library(dplyr)

df %>%
  group_by(transaction_id) %>%
  mutate(icecream = first(item[icecream_bool == 1]))

# A tibble: 11 x 4
# Groups:   transaction_id [3]
   transaction_id item      icecream_bool ice      
            <int> <chr>             <int> <chr>    
 1              1 crisps                0 magnum   
 2              1 magnum                1 magnum   
 3              1 gum                   0 magnum   
 4              2 jerky                 0 ben&jerry
 5              2 cheese                0 ben&jerry
 6              2 snickers              0 ben&jerry
 7              2 ben&jerry             1 ben&jerry
 8              2 magnum                1 ben&jerry
 9              3 halo                  1 halo     
10              3 crisps                0 halo     
11              3 mars                  0 halo   

Upvotes: 1

Edward
Edward

Reputation: 18868

One way using dplyr

library(dplyr)
df%>%
  group_by(transaction_id, icecream_bool) %>%
  mutate(mr = row_number()) %>%
  group_by(transaction_id) %>%
  mutate(ice_cream=item[mr==1 & icecream_bool==1]) %>%
  select(-mr)

# A tibble: 11 x 4
# Groups:   transaction_id [3]
   transaction_id item      icecream_bool ice_cream
            <int> <chr>             <int> <chr>    
 1              1 crisps                0 magnum   
 2              1 magnum                1 magnum   
 3              1 gum                   0 magnum   
 4              2 jerky                 0 ben&jerry
 5              2 cheese                0 ben&jerry
 6              2 snickers              0 ben&jerry
 7              2 ben&jerry             1 ben&jerry
 8              2 magnum                1 ben&jerry
 9              3 halo                  1 halo     
10              3 crisps                0 halo     
11              3 mars                  0 halo 

Upvotes: 1

stevec
stevec

Reputation: 52358

I'm not sure if it's efficient, but I found a way

df <- data.frame(
  transaction_id=as.integer(c(1,1,1,2,2,2,2,2,3,3,3)),
  item=as.character(c("crisps", "magnum", "gum",
                      "jerky", "cheese", "snickers", "ben&jerry", "magnum",
                      "halo", "crisps", "mars")),
  icecream_bool=as.integer(c(0,1,0,
                             0,0,0,1,1,
                             1,0,0)),
  stringsAsFactors = F
)

df %>% 
  group_by(transaction_id) %>% 
  filter(icecream_bool == 1) %>% 
  distinct(transaction_id, icecream_bool, .keep_all = T) %>% 
  select(-icecream_bool) %>% 
  rename(icecream=item) %>% 
  left_join(df, ., by=c("transaction_id"="transaction_id"))


   transaction_id      item icecream_bool  icecream
1               1    crisps             0    magnum
2               1    magnum             1    magnum
3               1       gum             0    magnum
4               2     jerky             0 ben&jerry
5               2    cheese             0 ben&jerry
6               2  snickers             0 ben&jerry
7               2 ben&jerry             1 ben&jerry
8               2    magnum             1 ben&jerry
9               3      halo             1      halo
10              3    crisps             0      halo
11              3      mars             0      halo

Upvotes: 0

Related Questions