BarneyC
BarneyC

Reputation: 529

R Plyr Sapply seems to be really slow

I thought I had a really simple dataframe transormation but for a reason I cannot fathom it seems to be taking an eternity which makes me suspect it probably isn't doing what I had hoped. Can anyone shed some light?

Part 1 - Transform source data into separate columns (actual df has 2.6M rows).

Given...

> V1 <- c("E11 2286 1", "ECAT 2286 1", "M11 2286 1", "M12 2286 1", "MCAT 2286 1", "C24 2287 1")
> df <- data.frame(V1)
> df
           V1
1  E11 2286 1
2 ECAT 2286 1
3  M11 2286 1
4  M12 2286 1
5 MCAT 2286 1
6  C24 2287 1

I want to create two new columns (itemID & topic) and populate each column with a substr of corresponding row in V1.

This I can do using;

> require(stringr)
> df$itemID <- sapply(1:nrow(df), function(i) str_split(df[i,"V1"]," ")[[1]][[2]] )
> df$topic  <- sapply(1:nrow(df), function(i) str_split(df[i,"V1"]," ")[[1]][[1]] )

BUT this takes several minutes and seems like there should be a more efficient way. So firstly I tried using sapply;

> sapply(1:nrow(df), function(i) {
                                    t <- str_split(df[i,"V1"]," ")
                                    df$itemID <- t[[1]][[2]] 
                                    df$topic  <- t[[1]][[1]] 
                                  })

After over an hour, nothing. So I bailed as this was obviously getting nowhere when individual commands would have taken under 20 minutes.

Next option was to try ddply on just a single task and this failed as well.

> require(plyr)
> require(stringr)
> df$itemID <- ddply(df, .(V1), str_split(df$V1," ")[[1]][[2]], .progress="text"  )

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object '2286' of mode 'function' was not found

So for part one of this task can anyone;


Part 2 - Gather up all topics for an itemID For bonus points... The second part of the task I need is to take the 2.6M rows (now in 3 columns) and collapse each row for an itemID so that all the topics are held in a single cell.

The output should end up looking like...

  itemID    topic
1 2286      E11,ECAT,M11,M12,MCAT
2 2287      C24

Can anyone suggest a simple method for such a gathering of rows into a single cell?

Upvotes: 2

Views: 199

Answers (2)

Arun
Arun

Reputation: 118849

How about this? Using data.table v1.9.5:

require(data.table)
cols = c("topic", "itemID", "tmp")
setDT(df)[, c(cols) := tstrsplit(V1, " ", fixed=TRUE, type.convert=TRUE)]
df[, .(topic=paste(topic, collapse=", ")), by=itemID]
#    itemID                     topic
# 1:   2286 E11, ECAT, M11, M12, MCAT
# 2:   2287                       C24

Benchmark on 2.6 million rows:

N = 2.6e6L
x = paste(rep(letters, length.out=N), sample(1e4, N, TRUE), "1", sep=" ")
dat = data.frame(x, stringsAsFactors=FALSE)
nrow(dat) # 2.6 million

# dplyr+tidyr
system.time({ans1 <- extract(dat, x, into= c('topic', 'itemID'), 
         '([^ ]+) ([^ ]+).*', convert=TRUE) %>% 
          group_by(itemID) %>% 
         summarise(topic=toString(topic))})
#    user  system elapsed 
#  45.643   0.854  46.777 

# data.table
system.time({
    cols = c("topic", "itemID", "tmp")
    setDT(dat)[, c(cols) := tstrsplit(x, " ", fixed=TRUE, type.convert=TRUE)]
    ans2 <- dat[, .(topic=paste(topic, collapse=", ")), by=itemID]
})    
#    user  system elapsed 
#   1.906   0.064   1.981 

identical(as.data.frame(ans1), setDF(ans2[order(itemID)]))
# [1] TRUE

The speedup is ~24x.


Update: Running data.table answer first and then dplyr answer results in run times of 7s and 44s, yielding a speedup of ~6.3x. It seems like there is some cache efficiency in data.table method when run after dplyr.

Upvotes: 4

akrun
akrun

Reputation: 887541

We could use a couple of options to improve the speed.

1. stringi

Functions in stringi package are usually faster. We can extract the alphanumeric characters using stri_extract_all_regex with the appropriate regex. Here, I am using [[:alnum:]]{2,} based on the example showed. rbind the list elements (do.call(rbind.data.frame,..)), change the column names with setNames, convert the 'data.frame' to 'data.table' (setDT), and paste the 'topic' elements grouped by 'itemID' (toString- is a wrapper for paste(., collapse=', ')).

library(stringi)
library(data.table)
setDT(setNames(do.call(rbind.data.frame,stri_extract_all_regex(df$V1,
       '[[:alnum:]]{2,}')), c('topic', 'itemID')))[, 
          list(topic=toString(topic)), itemID]
#   itemID                     topic
#1:   2286 E11, ECAT, M11, M12, MCAT
#2:   2287                       C24

2. dplyr/tidyr

We can use extract from tidyr to convert the single column to multiple columns by specifying the appropriate regex and paste the 'topic' elements grouped by 'itemID'

library(dplyr)
library(tidyr)
 extract(df, V1, into= c('topic', 'itemID'), '([^ ]+) ([^ ]+).*', 
                        convert=TRUE) %>% 
           group_by(itemID) %>% 
           summarise(topic=toString(topic))
#  itemID                     topic
#1   2286 E11, ECAT, M11, M12, MCAT
#2   2287                       C24

Upvotes: 6

Related Questions