Reputation: 1484

R: get one value per group according to some rule

I have a dataframe:

test = data.frame(short_name = rep(c('a','b','c'),each = 3),full_name = c('apple','ahora','aixin','bike','beber','boai','cerrar','cat','caocao'))

which looks like:

short_name   full_name
    a         apple
    a         ahora
    a         aixin
    b         bike
    b         beber
    b         boai
    c         cerrar
    c         cat
    c         caocao

I want to group by short_name and pick one value of full_name per group. The rule can be:

1) get the first element (the first row of that group); in my case that would be:

short_name   full_name
    a         apple
    b         bike
    c         cerrar

2) get a random element from full_name

3) get an element according to some rule. In my case you may notice the names come from three languages: English, Spanish and Chinese. I could plug in a function that detects the language of each name and picks, say, the Spanish one for each group, but that function is irrelevant to this question. So here I just want the shortest name in each group, and when several names have the same length, always take the first one in that group. The result should be:

short_name   full_name
    a         apple
    b         bike
    c         cat

You can use any package (data.table, dplyr, etc.) or a hand-written method. I want to see different solutions and find the most efficient and elegant one.

Following the recent answers, my timings on a large dataset (8 million records) are:

library(tictoc)
library(dplyr)
tic("dplyr slice1")
sale_data_detail_ly_slice1<-sale_data_detail_ly %>% group_by(prod_id) %>% slice(1)
toc()
dplyr slice1: 26.966 sec elapsed

tic("data.table")
sale_data_detail_ly_slice1 = sale_data_detail_ly[,.SD[1,],by = prod_id]
toc()
data.table: 501.416 sec elapsed

There is a big difference between the two.

Upvotes: 0

Answers (3)

PavoDive

Reputation: 6496

A data.table solution slightly different from @akrun's:

library(data.table)
setDT(test)  # convert test to a data.table by reference

test[, .SD[1,], by = short_name]

test[, .SD[sample(.N, 1),], by = short_name]

test[, .SD[which.min(nchar(as.character(full_name))),], by = short_name]

Upvotes: 1

Marius

Reputation: 60180

As long as you can figure out how to calculate the value you want within each group, you can do any kind of selection with group_by and summarise. Doing them all in one go:

test %>%
    group_by(short_name) %>%
    summarise(
        first = first(full_name),
        random = sample(full_name, 1),
        # as.character needed here because full_name is currently
        #   a factor
        shortest = full_name[which.min(nchar(as.character(full_name)))]
    )
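
The which.min call is what satisfies the tie-break rule from the question: it returns the position of the first minimum, so names of equal length resolve to the earliest row in the group. A small base R check of that behaviour, using group b from the question (bike and boai are both four characters long):

```r
# Group "b" from the question; "bike" and "boai" tie at 4 characters
b_names <- c("bike", "beber", "boai")
lens <- nchar(b_names)

# which.min returns the index of the first minimum, so the tie goes to "bike"
shortest <- b_names[which.min(lens)]
print(shortest)
```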

Upvotes: 2

akrun

Reputation: 887831

We can group by 'short_name' and get the first row with slice:

library(dplyr)
test %>% 
   group_by(short_name) %>%
   slice(1)

Or, to get a random element:

test %>%
  group_by(short_name) %>%
  slice(sample(row_number(), 1))

Or, to get the shortest one:

test %>%
   group_by(short_name) %>%
   slice(which.min(nchar(as.character(full_name))))
# A tibble: 3 x 2
# Groups:   short_name [3]
#  short_name full_name
#  <fct>      <fct>    
#1 a          apple    
#2 b          bike     
#3 c          cat

Or using summarise

test %>%
    group_by(short_name) %>%
    summarise(full_name = first(full_name))

test %>%
    group_by(short_name) %>%
    summarise(full_name = sample(full_name, 1))

With data.table, the options are

library(data.table)
setDT(test)[test[, .I[1], .(short_name)]$V1]
setDT(test)[test[, .I[sample(seq_len(.N), 1)], .(short_name)]$V1]
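
The same .I pattern can presumably be extended to the shortest-name rule. The following is only a sketch along those lines, not something benchmarked here; the variable names idx and shortest are illustrative, and the data frame is rebuilt so the block runs on its own:

```r
library(data.table)

# Rebuild the example data from the question
test <- data.frame(
  short_name = rep(c("a", "b", "c"), each = 3),
  full_name = c("apple", "ahora", "aixin", "bike", "beber", "boai",
                "cerrar", "cat", "caocao")
)
setDT(test)

# .I holds the row numbers of the current group within the full table;
# which.min picks the first shortest name, so ties go to the earliest row
idx <- test[, .I[which.min(nchar(as.character(full_name)))], by = short_name]$V1
shortest <- test[idx]
print(shortest)
```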

Upvotes: 2
