Reputation: 529
I thought I had a really simple dataframe transormation but for a reason I cannot fathom it seems to be taking an eternity which makes me suspect it probably isn't doing what I had hoped. Can anyone shed some light?
Part 1 - Transform source data into separate columns (actual df has 2.6M rows).
Given...
> V1 <- c("E11 2286 1", "ECAT 2286 1", "M11 2286 1", "M12 2286 1", "MCAT 2286 1", "C24 2287 1")
> df <- data.frame(V1)
> df
V1
1 E11 2286 1
2 ECAT 2286 1
3 M11 2286 1
4 M12 2286 1
5 MCAT 2286 1
6 C24 2287 1
I want to create two new columns (itemID & topic) and populate each column with a substr of corresponding row in V1.
This I can do using;
> require(stringr)
> df$itemID <- sapply(1:nrow(df), function(i) str_split(df[i,"V1"]," ")[[1]][[2]] )
> df$topic <- sapply(1:nrow(df), function(i) str_split(df[i,"V1"]," ")[[1]][[1]] )
BUT this takes several minutes and seems like there should be a more efficient way. So firstly I tried using sapply;
> sapply(1:nrow(df), function(i) {
t <- str_split(df[i,"V1"]," ")
df$itemID <- t[[1]][[2]]
df$topic <- t[[1]][[1]]
})
After over an hour, nothing. So I bailed as this was obviously getting nowhere when individual commands would have taken under 20 minutes.
Next option was to try ddply on just a single task and this failed as well.
> require(plyr)
> require(stringr)
> df$itemID <- ddply(df, .(V1), str_split(df$V1," ")[[1]][[2]], .progress="text" )
Error in get(as.character(FUN), mode = "function", envir = envir) :
object '2286' of mode 'function' was not found
So for part one of this task can anyone;
Part 2 - Gather up all topics for an itemID For bonus points... The second part of the task I need is to take the 2.6M rows (now in 3 columns) and collapse each row for an itemID so that all the topics are held in a single cell.
The output should end up looking like...
itemID topic
1 2286 E11,ECAT,M11,M12,MCAT
2 2287 C24
Can anyone suggest a simple method for such a gathering of rows into a single cell?
Upvotes: 2
Views: 199
Reputation: 118849
How about this? Using data.table v1.9.5:
require(data.table)
cols = c("topic", "itemID", "tmp")
setDT(df)[, c(cols) := tstrsplit(V1, " ", fixed=TRUE, type.convert=TRUE)]
df[, .(topic=paste(topic, collapse=", ")), by=itemID]
# itemID topic
# 1: 2286 E11, ECAT, M11, M12, MCAT
# 2: 2287 C24
N = 2.6e6L
x = paste(rep(letters, length.out=N), sample(1e4, N, TRUE), "1", sep=" ")
dat = data.frame(x, stringsAsFactors=FALSE)
nrow(dat) # 2.6 million
# dplyr+tidyr
system.time({ans1 <- extract(dat, x, into= c('topic', 'itemID'),
'([^ ]+) ([^ ]+).*', convert=TRUE) %>%
group_by(itemID) %>%
summarise(topic=toString(topic))})
# user system elapsed
# 45.643 0.854 46.777
# data.table
system.time({
cols = c("topic", "itemID", "tmp")
setDT(dat)[, c(cols) := tstrsplit(x, " ", fixed=TRUE, type.convert=TRUE)]
ans2 <- dat[, .(topic=paste(topic, collapse=", ")), by=itemID]
})
# user system elapsed
# 1.906 0.064 1.981
identical(as.data.frame(ans1), setDF(ans2[order(itemID)]))
# [1] TRUE
The speedup is ~24x.
Update: Running data.table
answer first and then dplyr
answer results in run times of 7s and 44s, yielding a speedup of ~6.3x. It seems like there is some cache efficiency in data.table method when run after dplyr
.
Upvotes: 4
Reputation: 887541
We could use a couple of options to improve the speed.
1. stringi
Functions in stringi
package are usually faster. We can extract the alphanumeric characters using stri_extract_all_regex
with the appropriate regex
. Here, I am using [[:alnum:]]{2,}
based on the example showed. rbind
the list elements (do.call(rbind.data.frame,..)
), change the column names with setNames
, convert the 'data.frame' to 'data.table' (setDT
), and paste
the 'topic' elements grouped by 'itemID' (toString
- is a wrapper for paste(., collapse=', ')
).
library(stringi)
library(data.table)
setDT(setNames(do.call(rbind.data.frame,stri_extract_all_regex(df$V1,
'[[:alnum:]]{2,}')), c('topic', 'itemID')))[,
list(topic=toString(topic)), itemID]
# itemID topic
#1: 2286 E11, ECAT, M11, M12, MCAT
#2: 2287 C24
2. dplyr/tidyr
We can use extract
from tidyr
to convert the single column to multiple columns by specifying the appropriate regex and paste
the 'topic' elements grouped by 'itemID'
library(dplyr)
library(tidyr)
extract(df, V1, into= c('topic', 'itemID'), '([^ ]+) ([^ ]+).*',
convert=TRUE) %>%
group_by(itemID) %>%
summarise(topic=toString(topic))
# itemID topic
#1 2286 E11, ECAT, M11, M12, MCAT
#2 2287 C24
Upvotes: 6