Reputation: 65
I have a data frame with the columns below:
country <- c("CA","IN","US")
text <- c("paint red green", "painting red", "painting blue")
word <- c("green, red, blue", "red", "red, blue")
df <- data.frame(country, text, word)
For each row, I want to find which of the words in the word column occur in the text column and put the words that were found into a new column, separated by commas. The new column should be:
df$new_col <- c("green,red","red","blue")
I am using the following lines of code, but they take too much time to run and sometimes even crash.
setDT(df)[, new_col:= paste(df$word[unlist(lapply(df$word,function(x) grepl(x, df$text,
ignore.case = T)))], collapse = ","), by = 1:nrow(df)]
Is there a way to change the code so it will be more efficient?
Thanks a lot!
Upvotes: 0
Views: 254
Reputation: 101317
Another base R solution using mapply + grep + regmatches, i.e.,
df <- within(df, newcol <- mapply(function(x,y) toString(grep(x,y,value = TRUE)),
gsub("\\W+","|",word),
regmatches(text,gregexpr("\\w+",text))))
such that
> df
  country            text             word     newcol
1      CA paint red green green, red, blue red, green
2      IN    painting red              red        red
3      US   painting blue        red, blue       blue
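To make the three steps easier to follow, here is a minimal sketch that runs them one at a time on the question's data. The intermediate names patterns and tokens are only illustrative and are not part of the answer's code.

```r
text <- c("paint red green", "painting red", "painting blue")
word <- c("green, red, blue", "red", "red, blue")

# Step 1: turn each comma-separated word list into a regex alternation,
# e.g. "green, red, blue" becomes "green|red|blue"
patterns <- gsub("\\W+", "|", word)

# Step 2: split each text into its individual words
tokens <- regmatches(text, gregexpr("\\w+", text))

# Step 3: row by row, keep the tokens that match that row's pattern
newcol <- mapply(function(p, tok) toString(grep(p, tok, value = TRUE)),
                 patterns, tokens, USE.NAMES = FALSE)
print(newcol)
```

Note that grep matches substrings inside each token, so a listed word such as "paint" would also pick up the token "painting". If only exact words should count, the pattern would need anchors or a different comparison.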
Upvotes: 1
Reputation: 79208
library(tidyverse)
df %>%
mutate(newcol = stringr::str_extract_all(text,gsub(", +","|",word)))
  country            text             word     newcol
1      CA paint red green green, red, blue red, green
2      IN    painting red              red        red
3      US   painting blue        red, blue       blue
In this case, newcol is a list. To make it a string, we can do:
df %>%
  mutate(newcol = text %>%
           str_extract_all(gsub(", +", "|", word)) %>%
           map_chr(toString))  # collapse each row's matches separately
With data.table, you could do:
library(data.table)
setDT(df)  # convert df to a data.table by reference
df[, id := .I][, newcol := do.call(toString, str_extract_all(text, gsub(', +', "|", word))),
   by = id][, id := NULL][]
   country            text             word     newcol
1:      CA paint red green green, red, blue red, green
2:      IN    painting red              red        red
3:      US   painting blue        red, blue       blue
Upvotes: 1
Reputation: 10375
Try this
mapply(function(x,y){paste(intersect(x,y),collapse=", ")},
strsplit(as.character(df$text),"\\, | "),
strsplit(as.character(df$word),"\\, | "))
[1] "red, green" "red" "blue"
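If it helps, the same idea can be wrapped in a small function and assigned straight to the new column. This is only a sketch: find_words is a made-up helper name, and the tolower() calls are an assumption on my part, added because the question's code used ignore.case.

```r
# find_words() is a hypothetical helper, not an existing function
find_words <- function(text, word) {
  # split both columns into whole words on commas and/or spaces;
  # tolower() is an assumption so the comparison ignores case
  text_tokens <- strsplit(tolower(as.character(text)), "[, ]+")
  word_tokens <- strsplit(tolower(as.character(word)), "[, ]+")
  # keep, per row, the words present in both lists (in text order)
  mapply(function(x, y) paste(intersect(x, y), collapse = ", "),
         text_tokens, word_tokens, USE.NAMES = FALSE)
}

df <- data.frame(country = c("CA", "IN", "US"),
                 text = c("paint red green", "painting red", "painting blue"),
                 word = c("green, red, blue", "red", "red, blue"),
                 stringsAsFactors = FALSE)
df$new_col <- find_words(df$text, df$word)
print(df$new_col)
```

Because intersect compares whole words, "paint" would not match "painting" here, unlike the regex-based approaches.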
Upvotes: 4