Reputation: 3020
Let us assume the following as my data.table data:
library(data.table)

data <- setDT(structure(list(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c(53, 45, 54, 97, 23),
  col3 = c("aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo",
           "cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da",
           "ee ee ee ee et eh es er eg"),
  col4 = c("aa bb ff ff", "aa ff vv rr", "dd dd rr gg",
           "yy yy rr rr", "uu uu uu ee")),
  .Names = c("col1", "col2", "col3", "col4"),
  row.names = c(NA, -5L), class = "data.frame"))
col1 col2 col3 col4
1 53 aa aa aa aa ab ad af ae ar aa bb ff ff
2 45 bb bb bb bb bt by bu bi bo aa ff vv rr
3 54 cc cc cc cc cd cy ch cn cd dd dd rr gg
4 97 dd dd dd dd dt dy dj dk da yy yy rr rr
5 23 ee ee ee ee et eh es er eg uu uu uu ee
col3 contains strings of words, and I need to determine whether the most frequently occurring word in col3 also appears in col4. So the output should look as follows:
col1 col2 col3 col4 most_freq_word_in_col3 out_col
1 53 aa aa aa aa ab ad af ae ar aa bb ff ff aa 1
2 45 bb bb bb bb bt by bu bi bo aa ff vv rr bb 0
3 54 cc cc cc cc cd cy ch cn cd dd dd rr gg cc 0
4 97 dd dd dd dd dt dy dj dk da yy yy rr rr dd 0
5 23 ee ee ee ee et eh es er eg uu uu uu ee ee 1
I tried the following solution:
m_fre_word1 <- function(x) {
  string <- as.character(unlist(strsplit(x, " ")))
  freq <- sort(table(string), decreasing = TRUE)
  wr <- names(freq)[1]
  return(wr)
}
data <- data[, most_freq_word_in_col3 := apply(data[, .(col3)], 1, m_fre_word1)]
data <- data[, out_col := as.numeric(grepl(m_fre_word1(col3), col4)), by = col1]
There is nothing wrong with this solution, but it is really slow, and my data.table is huge, so I cannot use it this way. Could somebody suggest a faster alternative?
Thanks.
Upvotes: 3
Views: 190
Reputation: 92292
Here's an attempt. Instead of running the whole thing on each row, I would suggest splitting the column and operating on a long format.
I'm stealing the Mode function from here; it is defined as follows:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
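As a quick sanity check (not part of the original answer), Mode returns the first of the most frequent values, so ties go to whichever value appears first in the vector:
Mode(c("aa", "aa", "ab", "ad", "aa"))
# [1] "aa"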
Now, using the newest data.table version, we could do:
library(data.table) # v 1.9.6+
temp <- setDT(data6)[, tstrsplit(col3, " ", fixed = TRUE)]
data6[, res := melt(temp[, indx := .I], id = "indx")[, Mode(value), by = indx]$V1]
data6
# col1 col2 col3 col4 res
# 1: 1 53 aa aa aa aa ab ad af ae ar aa bb ff ff aa
# 2: 2 45 bb bb bb bb bt by bu bi bo aa ff vv rr bb
# 3: 3 54 cc cc cc cc cd cy ch cn cd dd dd rr gg cc
# 4: 4 97 dd dd dd dd dt dy dj dk da yy yy rr rr dd
# 5: 5 23 ee ee ee ee et eh es er eg uu uu uu ee ee
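For readability, here is the same pipeline broken into steps (a sketch only; the intermediate name long is just illustrative):
# one column per word: V1, V2, ..., V9
temp <- setDT(data6)[, tstrsplit(col3, " ", fixed = TRUE)]
# tag each row with its index, then reshape to long format (one word per row)
temp[, indx := .I]
long <- melt(temp, id = "indx")
# most frequent word per original row, assigned back by reference
data6[, res := long[, Mode(value), by = indx]$V1]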
The second step can be easily achieved with either
data6[, out := +grepl(res, col4, fixed = TRUE), by = res]
OR (not sure which one is faster)
library(stringi)
data6[stri_detect_fixed(col4, res), out := 1L]
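If speed matters, a quick benchmark on the real data should settle it (a rough sketch, assuming the microbenchmark package is installed; note the stringi version only fills matching rows, so initialise out to 0L first if you need 0/1 everywhere):
library(microbenchmark)
data6[, out := 0L]
microbenchmark(
  grepl   = data6[, out := +grepl(res, col4, fixed = TRUE), by = res],
  stringi = data6[stri_detect_fixed(col4, res), out := 1L],
  times = 10L
)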
As a side note, when using reference semantics there is no need to copy the whole data set and reassign it using <-; in fact, that's the whole point of reference semantics. Please read this.
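For example (flag is just an illustrative column name):
# := adds/updates the column by reference; data6 is modified in place
data6[, flag := col2 > 50]
# so this is redundant:  data6 <- data6[, flag := col2 > 50]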
Upvotes: 2
Reputation: 1677
(Edited as per comments below)
Load the library
require(data.table)
Define the data
x <-
data.table(
col1 = c(1, 2, 3, 4, 5),
col2 = c(53, 45, 54, 97, 23),
col3 = c(
"aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo",
"cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da",
"ee ee ee ee et eh es er eg"),
col4 = c(
"aa bb ff ff","aa ff vv rr","dd dd rr gg",
"yy yy rr rr","uu uu uu ee")
)
Find the most frequent element in col3:
x[, most_freq_word_in_col3 := sapply(col3, function(e) {
  names(sort(table(unlist(strsplit(e, " "))), decreasing = TRUE)[1])
})]
Check whether this element is in col4:
x[, out_col := apply(cbind(most_freq_word_in_col3, col4), 1, function(e) {
  as.numeric(e[1] %in% unlist(strsplit(e[2], " ")))
})]
Output:
> x
col1 col2 col3 col4 most_freq_word_in_col3 out_col
1: 1 53 aa aa aa aa ab ad af ae ar aa bb ff ff aa 1
2: 2 45 bb bb bb bb bt by bu bi bo aa ff vv rr bb 0
3: 3 54 cc cc cc cc cd cy ch cn cd dd dd rr gg cc 0
4: 4 97 dd dd dd dd dt dy dj dk da yy yy rr rr dd 0
5: 5 23 ee ee ee ee et eh es er eg uu uu uu ee ee 1
Upvotes: 1
Reputation: 6372
I think the apply(data[, .(col3)], 1, ...) call is what is slowing down your code. Generally, I have found that putting a subset inside a data.table call causes huge slowdowns, as the subset operation is expensive.
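To check that, you could time the row-wise apply over the one-column subset against a plain sapply over the column itself (a rough sketch, using the microbenchmark package and the m_fre_word1 function from the question):
library(microbenchmark)
microbenchmark(
  subset_apply  = apply(data[, .(col3)], 1, m_fre_word1),
  column_sapply = sapply(data$col3, m_fre_word1, USE.NAMES = FALSE),
  times = 10L
)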
You can try:
DT[, test := names(sort(table(unlist(strsplit(col3, " "))), decreasing = TRUE))[1], by = col1]
DT[, search := gsub(" ", "|", col4)]
DT[, output := grepl(search, test), by = col1]
Not sure if my or David's answer will be faster.
EDIT: Based on Frank's input, the last two lines can be replaced with:
DT[, output := mapply(grepl, gsub(" ", "|", col4), test)]
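mapply pairs each row's pattern with its own test value, so grepl runs element-wise without the by = col1 grouping. A tiny illustration with made-up values:
mapply(grepl, c("aa|bb|ff|ff", "cc|dd"), c("aa", "bb"))
# aa|bb|ff|ff       cc|dd
#        TRUE       FALSE
Wrapping the call in + (as in the answer above) would give 0/1 instead of TRUE/FALSE.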
Upvotes: 2