Vbokito

Reputation: 59

R: Suggestion to speed up a function (remove duplicates in data frame)

I've run into a bit of trouble with my code and would welcome any suggestions to make it run faster. I have a data frame that looks like this:

Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)

So I have duplicated entries for each value in the Name column (the number of duplicates can vary). For each entry (a, b, ...) I want to count how many times each element appears in the Category column and keep only the category that appears the most. If several categories are tied for the most occurrences, I want to pick one of them at random. So in this case, the output data frame would look like this:

Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)

a keeps sun because it appears the most; b could be dog or any of its other values, since they each appear once; and c wouldn't be changed. My function looks like this:

    my_choosing_function <- function(x){
      # all rows for the x-th SNP (the equivalent of one Name in the toy example)
      tmp <- dbSNP_hapmap[dbSNP_hapmap$refsnp_id == list_of_snps[x], ]
      # frequency table of its categories, sorted by descending count
      snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
      best_hit <- snp_freq[order(-snp_freq$Freq), ]
      best_hit$SNP <- list_of_snps[x]
      # keep only the most frequent category
      top <- best_hit[1, ]
      return(top)
    }
    trst <- lapply(seq_along(list_of_snps), my_choosing_function)
    final <- do.call("rbind", trst)

I start from a list of unique elements (that would be Name in our case); for each element I build a table of its duplicated entries, order the table by descending count and keep the top element. I run an lapply over the list of unique values, then rbind the whole thing together.
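On the toy data above, my approach is equivalent to something like this (names such as list_of_names and choose_top are just for illustration; Name plays the role of refsnp_id and Category the role of consequence_type_tv):

    list_of_names <- unique(d$Name)
    choose_top <- function(x) {
      tmp <- d[d$Name == list_of_names[x], ]      # rows for one Name
      freq <- as.data.frame(table(tmp$Category))  # count each Category
      best <- freq[order(-freq$Freq), ]           # sort by descending count
      best$Name <- list_of_names[x]
      best[1, ]                                   # keep the most frequent one
    }
    do.call("rbind", lapply(seq_along(list_of_names), choose_top))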

As I have 2,500,000 rows in my initial data frame and 1,500,000 unique elements, it takes forever to run: at 4 seconds per 100 elements, that would be a total of about 34 hours for the lapply.

I'm sure packages like dplyr could do this in a few minutes, but I can't find a way to do it. Does anyone have an idea? Thanks a lot for your help!

Upvotes: 2

Views: 484

Answers (3)

Matt Summersgill

Reputation: 4242

A couple of slight tweaks on @mt1022's solution can produce a marginal speedup, nothing to write home about, but they might be of use if your data grows by another order of magnitude.

library(data.table)
library(dplyr)

d <- data.frame(
 Name = as.character(sample.int(10000, 2.5e6, replace = T)),
 Category = as.character(sample.int(5000, 2.5e6, replace = T)),
 More_info = rep('table', 2.5e6)
)

Mode <- function(x) {
 ux <- unique(x)
 fr1 <- tabulate(match(x, ux))
 if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}

system.time({
 d %>%
   group_by(Name) %>%
   slice(which(Category == Mode(Category))[1])
})

# user   system elapsed 
# 40.459   0.180  40.743 

system.time({
 dt <- as.data.table(d)
 dt.max <- dt[, .N, by = .(Name, Category)]
 dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
 dt.max <- dt.max[r == 1, .(Name, Category)]

 dt[dt.max, on = .(Name, Category), mult = 'first']
})

# user  system elapsed 
# 4.196   0.052   4.267 

Tweaks include

  • Using setDT() instead of as.data.table() to avoid making a copy
  • Using stats::runif() to generate the random tiebreaker directly, which is what data.table does internally for the 'random' ties method of frank()
  • Using setkey() to sort the table
  • Subsetting the table by the row indices, .I, where the row within each group is equal to the number of observations, .N, in that group (this returns the last row of each group)

Results:

system.time({
  # count rows per (Name, Category); setDT() converts d in place, avoiding a copy
  dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
  # random tiebreaker column
  dt.max[, rand := stats::runif(.N)]
  # sort so that, within each Name, the highest Count (random tiebreak) ends up last
  setkey(dt.max, Name, Count, rand)
  # take the last row (.I[.N]) of each Name group -- the winning Category per Name
  dt.max[dt.max[, .I[.N], by = Name]$V1, .(Name, Category, Count)]
})

# user  system elapsed 
# 1.722   0.057   1.750 
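If you also need the other columns (e.g. More_info) in the result, one option is to join the per-Name winners back to the original table, in the same spirit as @mt1022's join. A minimal sketch, assuming the objects from the timing block above (winners is just an illustrative name):

winners <- dt.max[dt.max[, .I[.N], by = Name]$V1, .(Name, Category)]  # one row per Name
d[winners, on = .(Name, Category), mult = 'first']  # d is already a data.table after setDT()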

Upvotes: 0

mt1022

Reputation: 17289

Note: this should really be a very long comment, because I use data.table instead of dplyr.

I suggest using data.table because it runs faster. In the data.table approach shown below, one category is chosen at random in case of a tie, rather than always the first one.

library(data.table)
library(dplyr)
library(microbenchmark)

d <- data.frame(
    Name = as.character(sample.int(10000, 2.5e6, replace = T)),
    Category = as.character(sample.int(10000, 2.5e6, replace = T)),
    More_info = rep('table', 2.5e6)
)

Mode <- function(x) {
    ux <- unique(x)
    fr1 <- tabulate(match(x, ux))
    if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}

system.time({
    d %>%
        group_by(Name) %>%
        slice(which(Category == Mode(Category))[1])
})
#    user  system elapsed
#  45.932   0.808  46.745

system.time({
    dt <- as.data.table(d)
    # count the occurrences of each Category within each Name
    dt.max <- dt[, .N, by = .(Name, Category)]
    # rank the counts within each Name, breaking ties at random
    dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
    # keep the top-ranked Category for each Name
    dt.max <- dt.max[r == 1, .(Name, Category)]

    # join back to the original table to recover the remaining columns
    dt[dt.max, on = .(Name, Category), mult = 'first']
})
#    user  system elapsed
#   2.424   0.004   2.426
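As a small illustration of the random tie-breaking (a toy vector, not part of the benchmark above): when two counts are tied at the top, frank() with ties.method = 'random' assigns rank 1 to one of them at random, so repeated runs can pick different categories.

# counts for three categories of one Name, two of them tied at 2
N <- c(2, 2, 1)
frank(-N, ties.method = 'random')
# e.g. 1 2 3 on one run, 2 1 3 on another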

Upvotes: 3

akrun

Reputation: 886948

We can modify the Mode function from here and then do a grouped slice:

library(dplyr)

Mode <- function(x) {
 ux <- unique(x)
 fr1 <- tabulate(match(x, ux))
  if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}

d %>% 
  group_by(Name) %>%
  slice(which(Category == Mode(Category))[1])
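A quick check of the helper on its own (toy vectors for illustration, not part of the answer):

Mode(c("sun", "cat", "sun", "sun", "sea"))  # "sun" -- it appears most often
Mode(c("sun", "sea", "cat", "dog"))         # all appear once, so one is returned at random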

Upvotes: 1
