Chris

Reputation: 365

How do I return a data frame with logical columns which denote whether or not a string occurs in a column?

I have a dataframe with 3 columns that I would like to search over, and a list of strings to search for in each column. I would like to return a dataframe containing the original data plus one column per string in the list, where each new column indicates whether that string is found in any of that row's columns.

Here is a simplified version of something that approximates my data.

strings <- c("ape", "bear", "cat", "dog")

# A tibble: 7 x 3
                   snippet          headline       abstract
                     <chr>             <chr>          <chr>
1           this is an ape            An ape    some random
2           blah blah blah            An ape    some random
3 this is some random text  some random text some ape stuff
4           this is a bear    this is a bear      bear time
5            some cat text         bear time       dog time
6         cat and dog text         blah blah           blah
7           blah blah blah this is just text           blah

Output of dput(df):

dput(df)
structure(list(snippet = c("this is an ape", "blah blah blah", 
"this is some random text", "this is a bear", "some cat text", 
"cat and dog text", "blah blah blah"), headline = c("An ape", 
"An ape", "some random text", "this is a bear", "bear time", 
"blah blah", "this is just text"), abstract = c("some random", 
"some random", "some ape stuff", "bear time", "dog time", "blah", 
"blah")), .Names = c("snippet", "headline", "abstract"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -7L))

I would like it to return something like the following dataframe

# A tibble: 7 x 7
                   snippet          headline       abstract   ape  bear   cat   dog
                     <chr>             <chr>          <chr> <lgl> <lgl> <lgl> <lgl>
1           this is an ape            An ape    some random  TRUE FALSE FALSE FALSE
2           blah blah blah            An ape    some random  TRUE FALSE FALSE FALSE
3 this is some random text  some random text some ape stuff  TRUE FALSE FALSE FALSE
4           this is a bear    this is a bear      bear time FALSE  TRUE FALSE FALSE
5            some cat text         bear time       dog time FALSE  TRUE  TRUE  TRUE
6         cat and dog text         blah blah           blah FALSE FALSE  TRUE  TRUE
7           blah blah blah this is just text           blah FALSE FALSE FALSE FALSE

I have used grepl to return the rows needed, but there is clearly a better way to do this that also keeps track of which string matches in which row.

Thank you in advance for your help.

Upvotes: 1

Views: 91

Answers (1)

SymbolixAU

Reputation: 26258

As you don't need to know which column the string is found in, you can collapse each row into a single string column and run grepl on that.

Something like:

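strings <- c("ape", "bear", "cat", "dog")

# Collapse the three text columns into one comma-separated string per row
df$colStrings <- with(df, paste(snippet, headline, abstract, sep = ","))

# Test every row against each search string; returns a logical matrix with one column per string
sapply(strings, function(x) grepl(x, df$colStrings))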

#        ape  bear   cat   dog
# [1,]  TRUE FALSE FALSE FALSE
# [2,]  TRUE FALSE FALSE FALSE
# [3,]  TRUE FALSE FALSE FALSE
# [4,] FALSE  TRUE FALSE FALSE
# [5,] FALSE  TRUE  TRUE  TRUE
# [6,] FALSE FALSE  TRUE  TRUE
# [7,] FALSE FALSE FALSE FALSE
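
The question asks for the indicators as columns alongside the original data, so one way to finish the job is to bind the logical matrix back onto the data frame. The following is a minimal sketch rather than a definitive implementation: the names `combined`, `hits` and `result` are made up for illustration, the data is rebuilt as a plain data.frame from the question's dput output, and `fixed = TRUE` is an assumption that the search strings are literal text rather than regular expressions.

```r
# Sample data, copied from the dput() output in the question
df <- data.frame(
  snippet  = c("this is an ape", "blah blah blah", "this is some random text",
               "this is a bear", "some cat text", "cat and dog text",
               "blah blah blah"),
  headline = c("An ape", "An ape", "some random text", "this is a bear",
               "bear time", "blah blah", "this is just text"),
  abstract = c("some random", "some random", "some ape stuff", "bear time",
               "dog time", "blah", "blah"),
  stringsAsFactors = FALSE
)
strings <- c("ape", "bear", "cat", "dog")

# Collapse the text columns into one string per row, kept in a separate
# vector (hypothetical name) so no helper column is added to df
combined <- do.call(paste, c(df[c("snippet", "headline", "abstract")], sep = ","))

# One logical column per search string; fixed = TRUE (an assumption)
# matches each string literally instead of as a regular expression
hits <- sapply(strings, function(x) grepl(x, combined, fixed = TRUE))

# Bind the indicator columns onto the original data
result <- cbind(df, hits)
```

Note that this is substring matching, so "ape" would also match inside a word such as "grape". If whole-word matches are needed, a pattern with word boundaries (for example `paste0("\\b", x, "\\b")` without `fixed = TRUE`) is one option.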

Upvotes: 5
