A. Floyd
A. Floyd

Reputation: 23

R - how to loop through a dataframe to match multiple substrings - concatenate all matches in a new column

I am quite new to R - have worked on this all day but am out of ideas.

I have a dataframe with long descriptions in one column, eg: df:

ID  Name  Description
 1  A     ABC DEF
 2  B     ARS XUY
 3  C     ASD

And I have a vector of search terms:

ABC
ARS
XUY
DE

I would like to go through each row in the dataframe and search the Description for any of the search terms. I then want all matches to be concatenated in a new column in the dataframe, e.g.:

ID  Name  Description  Matches
 1  A     ABC DEF      ABC
 2  B     ARS XUY      ARS;XUY
 3  C     ASD

I would want to search ~100k rows with 1000 search terms. Does anyone have any ideas? I was able to get a matrix with sapply and grepl, but I'd rather have a concatenated solution.

Upvotes: 1

Views: 1748

Answers (3)

tblznbits
tblznbits

Reputation: 6778

Another option, which I tried to use in the comments, is to use the stringr package. There are two potential downsides to this approach: 1) it uses regex, and 2) it returns the search term matched instead of the value found.

library(stringr)
df = data.frame(Name=LETTERS[1:3], 
                Description=c("ABC DEF", "ARS XUY", "ASD"),
                stringsAsFactors=F)
search_terms = c("ABC", "ARS", "XUY", "DE")
regex = paste(search_terms, collapse="|")
df$Matches = sapply(str_extract_all(df$Description, regex), function(x) paste(x, collapse=";"))
df
#    Name Description Matches
#   (chr)       (chr)   (chr)
# 1     A     ABC DEF  ABC;DE
# 2     B     ARS XUY ARS;XUY
# 3     C         ASD  

With that being said, I think Alistaire's solution is the better approach since it doesn't use regex.

Upvotes: 1

bgoldst
bgoldst

Reputation: 35314

Here's an alternative:

df <- data.frame(ID=c(1L,2L,3L),Name=c('A','B','C'),Description=c('ABC DEF','ARS XUY','ASD'),stringsAsFactors=F);
st <- c('ABC','ARS','XUY','DE');
df$Matches <- apply(sapply(paste0('\\b',st,'\\b'),grepl,df$Description),1L,function(m) paste(collapse=';',st[m]));
df;
##   ID Name Description Matches
## 1  1    A     ABC DEF     ABC
## 2  2    B     ARS XUY ARS;XUY
## 3  3    C         ASD

Upvotes: 0

alistaire
alistaire

Reputation: 43334

One option using strsplit and %in% instead of regex:

df$Matches <- sapply(strsplit(as.character(df$Description), '\\s'), 
                     function(x){paste(search[search %in% x], collapse = ';')})
df
#   ID Name Description Matches
# 1  1    A     ABC DEF     ABC
# 2  2    B     ARS XUY ARS;XUY
# 3  3    C         ASD  

data:

search <- c("ABC", "ARS", "XUY", "DE")
df <- structure(list(ID = 1:3, Name = structure(1:3, .Label = c("A", 
          "B", "C"), class = "factor"), Description = structure(1:3, .Label = c("ABC DEF", 
          "ARS XUY", "ASD"), class = "factor"), Matches = c("ABC", "ARS;XUY", 
          "")), .Names = c("ID", "Name", "Description", "Matches"), row.names = c(NA, 
          -3L), class = "data.frame")

Upvotes: 2

Related Questions