TMo

Reputation: 465

Count occurrences of substrings from one column in a second column (R data.table)

How do I split column a (by the space character) and count how many times any of the substrings in column a exist in column b?

library(data.table)
library(stringr)

dt = data.table(
    a = c('one', 'one two', 'one two three'),
    b = c('zero', 'none_or One?' , 'onetwothree')
)

My attempt fails with an error:

dt[ , .(
    str_count( b,
        pattern = str_split( a , pattern = ' ' )
    )
) ]
Error in UseMethod("type") : 
  no applicable method for 'type' applied to an object of class "list"

I expect:

   V1
1:  0
2:  1
3:  3

Upvotes: 1

Views: 242

Answers (4)

V. Lou

Reputation: 159

I came up with this solution:

dt[, sum(sapply(tstrsplit(a, " "), function(x) grepl(x, b))), .(a, b)]
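
One thing to be aware of: grouping by .(a, b) returns one row per distinct (a, b) pair, so duplicated rows would collapse into one. Below is a minimal sketch of a variant that groups by a row index instead, assuming one result per input row is wanted. The `row` column name and the use of `fixed = TRUE` (literal matching rather than regex) are my assumptions, not part of the answer above.

```r
library(data.table)

dt <- data.table(
    a = c('one', 'one two', 'one two three'),
    b = c('zero', 'none_or One?', 'onetwothree')
)

# Group by a row index so duplicated (a, b) pairs are not collapsed.
# fixed = TRUE is an assumption: it treats each substring literally,
# not as a regular expression.
res <- dt[, .(V1 = sum(vapply(strsplit(a, " ", fixed = TRUE)[[1]],
                              function(x) grepl(x, b, fixed = TRUE),
                              logical(1)))),
          by = .(row = seq_len(nrow(dt)))]
res$V1
```

Each substring contributes at most 1 to the count here, no matter how often it appears in b.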

Upvotes: 0

thelatemail

Reputation: 93803

Another couple of variations:

With just data.table:

dt[, mapply(\(sp,txt) sum(sapply(sp, \(x) grepl(x, txt))), strsplit(a, " "), b) ]
##[1] 0 1 3

With data.table and stringr:

dt[, mapply(\(sp,txt) sum(sapply(sp, \(x) str_detect(txt, x))), strsplit(a, " "), b) ]
##[1] 0 1 3

Upvotes: 3

Fabio Correa

Reputation: 1363

The first step is to build a regex from the substrings in each row of a, then apply that regex to the matching row of b.

Try this:

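# Split a into its space-separated substrings (list column)
dt[, parsed := str_split(a, " ")]
# Join each row's substrings into one alternation pattern, e.g. "one|two"
dt[, regex := lapply(parsed, function(x) paste0(x, collapse = "|"))]
# Count how many matches of the row's pattern are found in b
dt[, V1 := mapply(function(x, y) {str_extract_all(x, y)[[1]] |> length()}, b, regex)]
dt[, .(V1)]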
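
Note that an alternation pattern counts every match, so a substring that appears twice in b is counted twice, whereas the grepl/str_detect approaches count each substring at most once. Both give the same result on the question's data. The sketch below uses made-up strings (not from the question) to show where they differ; the variable names are hypothetical.

```r
library(stringr)

# Hypothetical inputs chosen so that a substring repeats in b
a <- "one two"
b <- "one one two"

tokens <- str_split(a, " ")[[1]]

# Alternation regex: counts every match, so the repeated "one" counts twice
n_occurrences <- str_count(b, paste(tokens, collapse = "|"))

# Per-substring detection: each substring counts at most once
n_distinct <- sum(str_detect(b, tokens))
```

Which of the two is right depends on whether "count" means total occurrences or number of distinct substrings found.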

Upvotes: 1

langtang

Reputation: 24722

dict = paste0(unique(unlist(strsplit(dt$a, " "))), collapse="|")
f <- function(s,dict) {
  res = gregexpr(dict,s)[[1]]
  return(ifelse(res[1]==-1,as.integer(0),length(res)))
}
dt[,f(b, dict), by=.(1:nrow(dt))][,.(V1)]


   V1
1:  0
2:  1
3:  3
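
A caveat on this approach: dict pools the substrings from every row of a, so each row of b is matched against all of them, not only against the substrings from its own row. That happens to give the expected result on the question's data. If per-row matching is what is wanted, a minimal sketch might look like the following; the helper name `count_matches` and the example data are hypothetical, chosen to make the difference visible.

```r
library(data.table)

# Hypothetical data where pooled and per-row matching disagree
dt <- data.table(
    a = c('one', 'two', 'three'),
    b = c('onetwo', 'onetwo', 'onetwo')
)

# Count regex matches of `pattern` in a single string `s`; 0 when none
count_matches <- function(s, pattern) {
  res <- gregexpr(pattern, s)[[1]]
  if (res[1] == -1L) 0L else length(res)
}

# One alternation pattern per row, built only from that row's a
row_dict <- vapply(strsplit(dt$a, " "), paste, character(1), collapse = "|")
per_row <- unname(mapply(count_matches, dt$b, row_dict))

# Pooled pattern, as in the code above, for comparison
pooled_dict <- paste0(unique(unlist(strsplit(dt$a, " "))), collapse = "|")
pooled <- unname(mapply(count_matches, dt$b, pooled_dict))
```

Here per_row only credits a row for its own substrings, while pooled credits every row for any substring found anywhere in column a.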

Upvotes: 1
