Reputation: 23
I am quite new to R - have worked on this all day but am out of ideas.
I have a dataframe with long descriptions in one column, eg:
df
:
ID Name Description
1 A ABC DEF
2 B ARS XUY
3 C ASD
And I have a vector of search terms:
ABC
ARS
XUY
DE
I would like to go through each row in the dataframe and search the Description
for any of the search terms. I then want all matches to be concatenated in a new column in the dataframe, e.g.:
ID Name Description Matches
1 A ABC DEF ABC
2 B ARS XUY ARS;XUY
3 C ASD
I would want to search ~100k rows with 1000 search terms.
Does anyone have any ideas? I was able to get a matrix with sapply
and grepl
, but I'd rather have a concatenated solution.
Upvotes: 1
Views: 1748
Reputation: 6778
Another option, which I tried to use in the comments, is to use the stringr
package. There are two potential downsides to this approach: 1) it uses regex, and 2) it returns the search term matched instead of the value found.
library(stringr)
df = data.frame(Name=LETTERS[1:3],
Description=c("ABC DEF", "ARS XUY", "ASD"),
stringsAsFactors=F)
search_terms = c("ABC", "ARS", "XUY", "DE")
regex = paste(search_terms, collapse="|")
df$Matches = sapply(str_extract_all(df$Description, regex), function(x) paste(x, collapse=";"))
df
# Name Description Matches
# (chr) (chr) (chr)
# 1 A ABC DEF ABC;DE
# 2 B ARS XUY ARS;XUY
# 3 C ASD
With that being said, I think Alistaire's solution is the better approach since it doesn't use regex.
Upvotes: 1
Reputation: 35314
Here's an alternative:
df <- data.frame(ID=c(1L,2L,3L),Name=c('A','B','C'),Description=c('ABC DEF','ARS XUY','ASD'),stringsAsFactors=F);
st <- c('ABC','ARS','XUY','DE');
df$Matches <- apply(sapply(paste0('\\b',st,'\\b'),grepl,df$Description),1L,function(m) paste(collapse=';',st[m]));
df;
## ID Name Description Matches
## 1 1 A ABC DEF ABC
## 2 2 B ARS XUY ARS;XUY
## 3 3 C ASD
Upvotes: 0
Reputation: 43334
One option using strsplit
and %in%
instead of regex:
df$Matches <- sapply(strsplit(as.character(df$Description), '\\s'),
function(x){paste(search[search %in% x], collapse = ';')})
df
# ID Name Description Matches
# 1 1 A ABC DEF ABC
# 2 2 B ARS XUY ARS;XUY
# 3 3 C ASD
search <- c("ABC", "ARS", "XUY", "DE")
df <- structure(list(ID = 1:3, Name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), Description = structure(1:3, .Label = c("ABC DEF",
"ARS XUY", "ASD"), class = "factor"), Matches = c("ABC", "ARS;XUY",
"")), .Names = c("ID", "Name", "Description", "Matches"), row.names = c(NA,
-3L), class = "data.frame")
Upvotes: 2