Reputation: 1484
I have a dataframe:
test = data.frame(short_name = rep(c('a','b','c'),each = 3),full_name = c('apple','ahora','aixin','bike','beber','boai','cerrar','cat','caocao'))
which looks like:
short_name full_name
a apple
a ahora
a aixin
b bike
b beber
b boai
c cerrar
c cat
c caocao
Grouping by short_name, I want to pick one value of full_name per group. The selection could be:
1) the first element (the first row of each group), which in my case would be:
short_name full_name
a apple
b bike
c cerrar
2) a random element of full_name
3) an element chosen by some rule. In my case, you may notice the names come from three languages: English, Spanish, and Chinese. I could plug in a function that detects the language of each name and pick, say, the Spanish one as the full_name for each short_name. That function is irrelevant to this question, though, so here I just want the shortest name in each group, breaking ties by taking the first one in the group. The result should be:
short_name full_name
a apple
b bike
c cat
You can use any package (data.table, dplyr, etc.) or a hand-written method. I would like to see different solutions and find the most efficient and elegant one.
Following the recent answers, my timings on a large dataset (8 million records) are:
library(tictoc)
library(dplyr)
tic("dplyr slice1")
sale_data_detail_ly_slice1<-sale_data_detail_ly %>% group_by(prod_id) %>% slice(1)
toc()
dplyr slice1: 26.966 sec elapsed
tic("data.table")
sale_data_detail_ly_slice1 = sale_data_detail_ly[,.SD[1,],by = prod_id]
toc()
data.table: 501.416 sec elapsed
There is a big difference between the two.
Upvotes: 0
Views: 176
Reputation: 6496
A data.table solution slightly different from @akrun's:
library(data.table)
setDT(test)  # convert test to a data.table by reference
test[, .SD[1,], by = short_name]
test[, .SD[sample(.N, 1),], by = short_name]
test[, .SD[which.min(nchar(as.character(full_name))),], by = short_name]
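On the tie-break rule: which.min() returns the index of the first minimum, so when several names in a group share the shortest length, the earliest row wins. A minimal base R check using group a from the question (the variable names lens, idx and picked are only illustrative):

```r
# Group 'a' from the question: all three names have five characters
full_name <- c("apple", "ahora", "aixin")
lens <- nchar(full_name)

# which.min() returns the position of the first minimum,
# so a tie resolves to the earliest element
idx <- which.min(lens)
picked <- full_name[idx]
print(picked)
```

So the third line above should already satisfy the "first one wins" requirement without any extra sorting.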
Upvotes: 1
Reputation: 60180
As long as you can figure out how to calculate the value you want within each group, you can do any kind of selection with group_by and summarise. Doing them all in one go:
library(dplyr)

test %>%
  group_by(short_name) %>%
  summarise(
    first = first(full_name),
    random = sample(full_name, 1),
    # as.character is needed here because full_name is currently
    # a factor
    shortest = full_name[which.min(nchar(as.character(full_name)))]
  )
Upvotes: 2
Reputation: 887831
We can group by 'short_name' and get the first row with slice:
library(dplyr)
test %>%
  group_by(short_name) %>%
  slice(1)
Or, to get a random element:
test %>%
  group_by(short_name) %>%
  slice(sample(row_number(), 1))
For the shortest one:
test %>%
  group_by(short_name) %>%
  slice(which.min(nchar(as.character(full_name))))
# A tibble: 3 x 2
# Groups:   short_name [3]
#   short_name full_name
#   <fct>      <fct>
# 1 a          apple
# 2 b          bike
# 3 c          cat
Or using summarise:
test %>%
  group_by(short_name) %>%
  summarise(full_name = first(full_name))
test %>%
  group_by(short_name) %>%
  summarise(full_name = sample(full_name, 1))
With data.table, the options are:
library(data.table)
setDT(test)[test[, .I[1], .(short_name)]$V1]
setDT(test)[test[, .I[sample(seq_len(.N), 1)], .(short_name)]$V1]
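The same .I pattern should extend to the "shortest name" case from the question. The sketch below rebuilds the example data as a data.table so that it runs on its own; idx and shortest are illustrative names, not part of the original answer. Indexing by row number avoids building .SD for every group, which tends to be faster on large tables.

```r
library(data.table)

# Example data from the question, built directly as a data.table
test <- data.table(
  short_name = rep(c("a", "b", "c"), each = 3),
  full_name = c("apple", "ahora", "aixin", "bike", "beber", "boai",
                "cerrar", "cat", "caocao")
)

# .I holds row numbers in the whole table; within each group, keep the
# row number of the first shortest full_name (which.min breaks ties
# in favour of the earliest row)
idx <- test[, .I[which.min(nchar(as.character(full_name)))], by = short_name]$V1

# Subset the original table by those row numbers
shortest <- test[idx]
print(shortest)
```

The as.character() call is kept only so the same line also works when full_name is a factor, as in the question's data frame.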
Upvotes: 2