R Plyr Sapply seems to be really slow

Question

I thought I had a really simple data frame transformation, but for a reason I cannot fathom it seems to be taking an eternity, which makes me suspect it isn't doing what I had hoped. Can anyone shed some light?

Part 1 - Transform source data into separate columns (actual df has 2.6M rows).

Given...

> V1 <- c("E11 2286 1", "ECAT 2286 1", "M11 2286 1", "M12 2286 1", "MCAT 2286 1", "C24 2287 1")
> df <- data.frame(V1)
> df
           V1
1  E11 2286 1
2 ECAT 2286 1
3  M11 2286 1
4  M12 2286 1
5 MCAT 2286 1
6  C24 2287 1

I want to create two new columns (itemID and topic) and populate each with a substring of the corresponding row in V1.

This I can do using:

> require(stringr)
> df$itemID <- sapply(1:nrow(df), function(i) str_split(df[i,"V1"]," ")[[1]][[2]] )
> df$topic  <- sapply(1:nrow(df), function(i) str_split(df[i,"V1"]," ")[[1]][[1]] )

But this takes several minutes, and it seems like there should be a more efficient way. So first I tried doing both assignments in a single sapply call:

> sapply(1:nrow(df), function(i) {
                                    t <- str_split(df[i,"V1"]," ")
                                    df$itemID <- t[[1]][[2]] 
                                    df$topic  <- t[[1]][[1]] 
                                  })

After more than an hour, nothing. So I bailed, as this was obviously getting nowhere when the individual commands would have taken under 20 minutes.

Next option was to try ddply on just a single task and this failed as well.

> require(plyr)
> require(stringr)
> df$itemID <- ddply(df, .(V1), str_split(df$V1," ")[[1]][[2]], .progress="text"  )

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object '2286' of mode 'function' was not found

So for part one of this task, can anyone:

i) tell me which approach is likely to be quicker (sapply or ddply), and
ii) using that approach provide a solution as to how to split column V1 into the two required component columns?

Part 2 - Gather up all topics for an itemID

For bonus points... The second part of the task is to take the 2.6M rows (now in 3 columns) and collapse the rows for each itemID so that all of its topics are held in a single cell.

The output should end up looking like...

  itemID    topic
1 2286      E11,ECAT,M11,M12,MCAT
2 2287      C24

Can anyone suggest a simple method for such a gathering of rows into a single cell?

akrun · Accepted Answer

We could use a couple of options to improve the speed.

1. stringi

Functions in the stringi package are usually faster. We can extract the alphanumeric tokens using stri_extract_all_regex with an appropriate regex. Here, I am using [[:alnum:]]{2,} based on the example shown; it keeps runs of two or more alphanumeric characters, so the trailing single-character field is dropped. Then rbind the list elements (do.call(rbind.data.frame, ..)), change the column names with setNames, convert the 'data.frame' to a 'data.table' (setDT), and paste the 'topic' elements grouped by 'itemID' (toString is a wrapper for paste(., collapse=', ')).

library(stringi)
library(data.table)
setDT(setNames(do.call(rbind.data.frame,stri_extract_all_regex(df$V1,
       '[[:alnum:]]{2,}')), c('topic', 'itemID')))[, 
          list(topic=toString(topic)), itemID]
#   itemID                     topic
#1:   2286 E11, ECAT, M11, M12, MCAT
#2:   2287                       C24
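The one-liner above can be unpacked into separate steps to make each stage easier to inspect. This is only a sketch on the six-row example: the intermediate names (parts, m, dt, res) are made up for illustration, and it stacks the pieces into a character matrix with plain rbind rather than rbind.data.frame, which is a small deviation from the code above.

```r
library(stringi)
library(data.table)

# Example data from the question
df <- data.frame(V1 = c("E11 2286 1", "ECAT 2286 1", "M11 2286 1",
                        "M12 2286 1", "MCAT 2286 1", "C24 2287 1"),
                 stringsAsFactors = FALSE)

# Step 1: a single vectorised call over the whole column; the result is a
# list holding one character vector (topic, itemID) per row
parts <- stri_extract_all_regex(df$V1, "[[:alnum:]]{2,}")

# Step 2: stack the list elements into a two-column character matrix
m <- do.call(rbind, parts)

# Step 3: name the columns while building a data.table
dt <- data.table(topic = m[, 1], itemID = m[, 2])

# Step 4: collapse the topics within each itemID
res <- dt[, list(topic = toString(topic)), by = itemID]
res
```

Note that itemID stays a character column in this version; wrap it in as.integer() if a numeric ID is needed.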

2. dplyr/tidyr

We can use extract from tidyr to convert the single column to multiple columns by specifying the appropriate regex, and then paste the 'topic' elements grouped by 'itemID'.

library(dplyr)
library(tidyr)
extract(df, V1, into = c('topic', 'itemID'), '([^ ]+) ([^ ]+).*',
        convert = TRUE) %>%
  group_by(itemID) %>%
  summarise(topic = toString(topic))
#  itemID                     topic
#1   2286 E11, ECAT, M11, M12, MCAT
#2   2287                       C24
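For comparison, the same two parts can also be sketched in base R with no extra packages. This is not one of the options above, just an illustration of the general idea behind them: call a vectorised splitter once on the whole column instead of once per row. It also uses paste(collapse = ",") rather than toString, which gives the comma-only separator shown in the question's desired output. The names parts and out are made up here, and no claim is made about how this compares in speed on 2.6M rows.

```r
# Example data from the question
df <- data.frame(V1 = c("E11 2286 1", "ECAT 2286 1", "M11 2286 1",
                        "M12 2286 1", "MCAT 2286 1", "C24 2287 1"),
                 stringsAsFactors = FALSE)

# Part 1: strsplit is vectorised, so this is one call for the whole column
parts <- strsplit(df$V1, " ", fixed = TRUE)

# Pull the first and second piece out of every element of the list
df$topic  <- vapply(parts, `[`, character(1), 1)
df$itemID <- vapply(parts, `[`, character(1), 2)

# Part 2: collapse the topics within each itemID, comma-separated, no space
out <- aggregate(topic ~ itemID, data = df, FUN = paste, collapse = ",")
out
```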


Benchmark on 2.6 million rows:
