Reputation: 59
I've run into a bit of trouble with my code and would welcome any suggestions to make it run faster. I have a data frame that looks like this:
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","table")
d <- data.frame(Name,Category,More_info)
So I have duplicated entries in the Name column (the number of duplicates can vary). For each entry (a, b, ...) I want to count the occurrences of each corresponding element in the Category column and keep only the category that appears most often. If several categories are tied for the most appearances, I want to pick one of them at random. So in this case, the output data frame would look like this:
Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)
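For reference, the category counts per name in the input are what drive this choice; tabulating the original 10-row d (a quick illustrative check) gives:
table(d$Name, d$Category)  # using the first, 10-row data frame above
#     cat dog sea sun
#   a   1   0   1   3
#   b   1   1   1   1
#   c   1   0   0   0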
a keeps sun because it appears the most; b could end up as dog or any of its other values, since they each appear once with b; and c wouldn't be changed. My function looks like this:
my_choosing_function <- function(x){
  # subset the big table to all rows for the current SNP
  tmp <- dbSNP_hapmap[dbSNP_hapmap$refsnp_id == list_of_snps[x], ]
  # tabulate its categories and sort by descending frequency
  snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
  best_hit <- snp_freq[order(-snp_freq$Freq), ]
  best_hit$SNP <- list_of_snps[x]
  # keep only the most frequent category
  top <- best_hit[1, ]
  return(top)
}
trst <- lapply(seq_along(list_of_snps), my_choosing_function)
final <- do.call("rbind", trst)
I start from a list of unique elements (that would be Name in our case); for each element I build a table of its duplicated entries, order the table by descending frequency and keep the top element. I lapply over each element of the list of unique values, then rbind the whole thing together.
As I have 2,500,000 rows in my initial data frame and 1,500,000 unique elements, it takes forever to run: at 4 seconds for 100 lines, that works out to about 34 hours for the whole lapply.
I'm sure packages like dplyr can do this in a few minutes, but I can't find a solution. Does anyone have an idea? Thanks a lot for your help!
Upvotes: 2
Views: 484
Reputation: 4242
A couple of slight tweaks to @mt1022's solution produce a marginal speedup: nothing to write home about, but they might be of use if your data grows by another order of magnitude.
library(data.table)
library(dplyr)
d <- data.frame(
  Name = as.character(sample.int(10000, 2.5e6, replace = T)),
  Category = as.character(sample.int(5000, 2.5e6, replace = T)),
  More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
  ux <- unique(x)
  # frequency of each unique value
  fr1 <- tabulate(match(x, ux))
  # if all values are tied at the same frequency, pick one at random;
  # otherwise take the first value with the highest frequency
  if (n_distinct(fr1) == 1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
  d %>%
    group_by(Name) %>%
    slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 40.459 0.180 40.743
system.time({
  dt <- as.data.table(d)
  dt.max <- dt[, .N, by = .(Name, Category)]
  dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
  dt.max <- dt.max[r == 1, .(Name, Category)]
  dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 4.196 0.052 4.267
Tweaks include:

- setDT() instead of as.data.table(), to avoid making a copy (a small sketch of the copy-versus-reference difference follows the timings below)
- stats::runif() to generate the random tiebreaker directly, which is similar to what data.table does internally for the 'random' option of frank()
- setkey() to sort the table
- subsetting with .I, taking the row whose position within each group equals the number of observations .N in that group (since the table is sorted, this returns the last, i.e. winning, row of each Name group)

Results:
system.time({
  # setDT() converts d by reference (no copy) and counts each (Name, Category) pair
  dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
  # random tiebreaker column
  dt.max[, rand := stats::runif(.N)]
  # sort so that, within each Name, the winning row (highest Count, random tiebreak) comes last
  setkey(dt.max, Name, Count, rand)
  # .I[.N] is the index of the last row in each Name group
  dt.max[dt.max[, .I[.N], by = .(Name)]$V1, .(Name, Category, Count)]
})
# user system elapsed
# 1.722 0.057 1.750
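For the first tweak, here is a minimal sketch of the copy-versus-reference difference (toy data, separate from the benchmark above):
df <- data.frame(a = 1:3)
dt1 <- as.data.table(df)  # allocates a new copy; df is still a plain data.frame
setDT(df)                 # converts df to a data.table in place, without copying
class(df)                 # "data.table" "data.frame"
This is also why the tweaked benchmark can use d directly: after setDT(d), d itself is a data.table.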
Upvotes: 0
Reputation: 17289
I suggest using data.table instead of dplyr because it runs faster. And in the data.table way shown below, a tie is broken by a random choice, not by always taking the first category.
library(data.table)
library(dplyr)
library(microbenchmark)
d <- data.frame(
  Name = as.character(sample.int(10000, 2.5e6, replace = T)),
  Category = as.character(sample.int(10000, 2.5e6, replace = T)),
  More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
  ux <- unique(x)
  fr1 <- tabulate(match(x, ux))
  if (n_distinct(fr1) == 1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
  d %>%
    group_by(Name) %>%
    slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 45.932 0.808 46.745
system.time({
  dt <- as.data.table(d)
  # count occurrences of each (Name, Category) pair
  dt.max <- dt[, .N, by = .(Name, Category)]
  # rank counts within each Name, breaking ties at random
  dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
  # keep the top-ranked category per Name
  dt.max <- dt.max[r == 1, .(Name, Category)]
  # join back to recover More_info, taking the first matching row
  dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 2.424 0.004 2.426
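To see the random tie-breaking in isolation, here is a toy sketch (made-up counts; the ranks assigned to the tied values change from run to run):
n <- c(3, 3, 1, 2)                 # two categories tied for the maximum count
frank(-n, ties.method = 'random')  # e.g. 1 2 4 3 on one run, 2 1 4 3 on the next
Whichever tied count receives rank 1 is the category kept for that Name.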
Upvotes: 3
Reputation: 886948
We can modify the Mode function from here and then do a grouped slice, keeping the first row in each group whose Category equals the mode:
library(dplyr)
Mode <- function(x) {
  ux <- unique(x)
  fr1 <- tabulate(match(x, ux))
  if (n_distinct(fr1) == 1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
d %>%
  group_by(Name) %>%
  slice(which(Category == Mode(Category))[1])
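Applied to the example data from the question, this gives something like the following (a sketch: b's category is drawn at random from its four tied values, so it varies between runs):
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- rep("table", 10)
d <- data.frame(Name, Category, More_info)

d %>%
  group_by(Name) %>%
  slice(which(Category == Mode(Category))[1])
#   Name  Category More_info
# 1 a     sun      table
# 2 b     dog      table      (or sun/sea/cat, chosen at random)
# 3 c     cat      table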
Upvotes: 1