macworthy

Reputation: 95

If any row contains a substring, then flag

I have a dataset of character variables:

col1 = c("a","b","c")
col2 = c("a","b_a","d")
df = data.frame(col1,col2)

  col1 col2
1    a    a
2    b  b_a
3    c    d

I want to create a variable a that is 1 if any value in that row contains the substring "a", and 0 otherwise.

  col1 col2 a
1    a    a 1
2    b  b_a 1
3    c    d 0

My attempt is below. It doesn't work: I believe it returns TRUE if any value anywhere in the dataframe contains the substring, rather than checking each row.

df["a"] = ifelse(any(sapply(df,function(x) str_detect(x,"a")),TRUE),1,0)

My thinking was that any function inside an ifelse statement evaluates only df[i,], where i is the row it is looking at, rather than the entire dataframe. This doesn't seem to be the case.

  1. How do I construct the data frame I'm looking for? Note that in my real dataset, there are 100+ columns, so it doesn't make sense to list them all out.

  2. Why doesn't ifelse only evaluate row i of df, rather than the whole df?

Note that previous questions only look at one variable. I am looking at all variables, so this is not a duplicate.

Upvotes: 2

Views: 1538

Answers (2)

dww

Reputation: 31452

You can use

grepl('a', paste0(df$col1, df$col2))

Or to generalise for any number of columns

grepl('a',  do.call(paste0, df))

A third option may be safer if you are searching for multi-character substrings rather than single letters. In that case you may want to avoid paste, so that, e.g., searching for 'ab' in a row holding the values c('xa', 'bx') does not give a false positive. In this situation, we can use:

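pattern = 'a'  # named pattern rather than substr, which would mask base::substr
# apply() returns one column per row of df; colSums counts the matches in each row
as.logical(colSums(apply(df, 1, function(x) grepl(pattern, x))))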
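
To see the false positive described above, here is a minimal sketch using a hypothetical one-row data frame. The names df2, pasted and row_wise are illustrative and not from the question, and fixed = TRUE is an added assumption so that the pattern is treated as a literal string rather than a regular expression.

```r
# Hypothetical one-row example: neither value contains "ab" on its own
df2 <- data.frame(col1 = "xa", col2 = "bx", stringsAsFactors = FALSE)

# Pasting the row gives "xabx", so "ab" is matched across the column boundary
pasted <- grepl("ab", do.call(paste0, df2), fixed = TRUE)

# Testing each value separately avoids the spurious match
row_wise <- as.logical(colSums(apply(df2, 1, function(x) grepl("ab", x, fixed = TRUE))))

print(pasted)    # TRUE
print(row_wise)  # FALSE
```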

Upvotes: 4

Vivek Kalyanarangan

Reputation: 9081

1) How do I construct the data frame I'm looking for?

df$a <- apply(df, 1, function(x) {
  as.numeric(length(grep("a", x)) > 0)
})

Output

  col1 col2 a
1    a    a 1
2    b  b_a 1
3    c    d 0

2) Why doesn't ifelse only evaluate row i of df, rather than the whole df?

Let's break it down -

  1. You are doing sapply(df,function(x) str_detect(x,"a")) which will give you this -

          col1  col2     a
    [1,]  TRUE  TRUE FALSE
    [2,] FALSE  TRUE FALSE
    [3,] FALSE FALSE FALSE

  2. Next you do any(sapply(df,function(x) str_detect(x,"a")),TRUE) - this is where things are going wrong. any is not being applied row-wise, so the output is a single boolean value; ifelse then returns a single value, which is recycled into every row of the new column. You have to apply the any function row-wise to get what you want.
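
The steps above can be sketched as follows. This uses base grepl in place of str_detect so that the snippet runs without stringr (an assumption made for self-containment; str_detect(x, "a") can be swapped back in), and the names m and row_flag are illustrative. Note also that in any(..., TRUE) the trailing TRUE is treated as just another value to test, so that call returns TRUE whatever the data.

```r
col1 <- c("a", "b", "c")
col2 <- c("a", "b_a", "d")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)

# One logical column per data frame column, one row per data frame row
m <- sapply(df, function(x) grepl("a", x, fixed = TRUE))

# any() collapses the whole matrix into a single value ...
any(m)
# ... and a positional TRUE is just another value to test, so this is always TRUE
any(m, TRUE)

# Applying any() to each row (MARGIN = 1) gives one flag per row
row_flag <- apply(m, 1, any)
df$a <- as.integer(row_flag)
print(df)
```

rowSums(m) > 0 gives the same flags without an explicit apply call.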

Upvotes: 1
