Reputation: 95
I have a dataset of character variables:
col1 = c("a","b","c")
col2 = c("a","b_a","d")
df = data.frame(col1,col2)
col1 col2
1 a a
2 b b_a
3 c d
I want to create a variable a that is 1 if any value in that row contains the substring "a", and 0 otherwise.
col1 col2 a
1 a a 1
2 b b_a 1
3 c d 0
My attempt is below. It doesn't quite work: I believe it returns TRUE if any value anywhere in the data frame contains the substring, rather than checking each row separately.
df["a"] = ifelse(any(sapply(df,function(x) str_detect(x,"a")),TRUE),1,0)
My thinking was that any function inside an ifelse statement would only evaluate df[i,], where i is the row it is looking at, rather than the entire data frame. This doesn't seem to be the case.
How do I construct the data frame I'm looking for? Note that in my real dataset, there are 100+ columns, so it doesn't make sense to list them all out.
Why doesn't ifelse only evaluate row i of df, rather than the whole df?
Note that previous questions only look at one variable. I am looking at all variables, so this is not a duplicate.
Upvotes: 2
Views: 1538
Reputation: 31452
You can use
grepl('a', paste0(df$col1, df$col2))
Or to generalise for any number of columns
grepl('a', do.call(paste0, df))
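As a minimal sketch of the generalised version on the question's example data (the `fixed = TRUE` argument and the `as.integer` conversion to 1/0 are additions here, not part of the answer above):

```r
# Example data from the question
df <- data.frame(col1 = c("a", "b", "c"),
                 col2 = c("a", "b_a", "d"),
                 stringsAsFactors = FALSE)

# Paste every column together row by row: "aa", "bb_a", "cd"
combined <- do.call(paste0, df)

# Search each combined string; fixed = TRUE treats "a" as a literal substring
flag <- grepl("a", combined, fixed = TRUE)

# Store the flag as 1/0, as requested in the question
df$a <- as.integer(flag)
df
```

Because `do.call` passes every column to `paste0`, this should work unchanged with 100+ columns.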
A third option may be safer if you are searching for multi-character substrings rather than single letters. In that case you may want to avoid pasting the columns together, so that e.g. searching for 'ab' in the vector c('xa', 'bx') does not give a false positive. In this situation, we can use:
# named pattern rather than substr, to avoid masking base::substr
pattern = 'a'
as.logical(colSums(apply(df, 1, function(x) grepl(pattern, x))))
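To illustrate the false positive described above, here is a small sketch. The one-row data frame `df2` is hypothetical, built from the 'xa' / 'bx' example, and `fixed = TRUE` is an added assumption so the pattern is matched literally:

```r
# Hypothetical one-row data frame in which no single cell contains "ab"
df2 <- data.frame(col1 = "xa", col2 = "bx", stringsAsFactors = FALSE)

# Pasting joins "xa" and "bx" into "xabx", which does contain "ab"
pasted <- grepl("ab", do.call(paste0, df2), fixed = TRUE)

# Checking each cell separately never looks across the cell boundary
per_cell <- as.logical(
  colSums(apply(df2, 1, function(x) grepl("ab", x, fixed = TRUE)))
)

pasted    # match reported only because of the join
per_cell  # no cell contains "ab" on its own
```

The per-cell version is slower on large data, since `apply` loops over rows, so the paste approach may still be preferable when the search string is a single character.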
Upvotes: 4
Reputation: 9081
1) How do I construct the data frame I'm looking for?
df$a <- apply(df,1,function(x) {
as.numeric( length(grep("a",x)) > 0)
})
Output
col1 col2 a
1 a a 1
2 b b_a 1
3 c d 0
2) Why doesn't ifelse only evaluate row i of df, rather than the whole df?
Let's break it down -
You are doing sapply(df,function(x) str_detect(x,"a"))
which will give you this -
      col1  col2     a
[1,]  TRUE  TRUE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE FALSE
Next you do any(sapply(df,function(x) str_detect(x,"a")),TRUE), and this is where things go wrong. any is not applied row-wise, so the output is a single boolean value for the whole data frame. You have to apply the any function row-wise to get what you want.
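A short sketch of the difference, using the question's data. Base `grepl` stands in for `str_detect` here so the example runs without stringr, and the names `hits` and `row_flag` are illustrative only:

```r
# Example data from the question
df <- data.frame(col1 = c("a", "b", "c"),
                 col2 = c("a", "b_a", "d"),
                 stringsAsFactors = FALSE)

# One logical column per data frame column (a 3 x 2 matrix)
hits <- sapply(df, function(x) grepl("a", x, fixed = TRUE))

# any() collapses the whole matrix into a single value, so ifelse() gets a
# length-one test and returns a length-one result that fills the whole column
any(hits)

# Applying any() over rows (MARGIN = 1) gives one value per row instead
row_flag <- apply(hits, 1, any)
df$a <- ifelse(row_flag, 1, 0)
df
```

Note also that the extra TRUE passed to any in the original attempt is treated as one more value to test, so that call returns TRUE regardless of the data.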
Upvotes: 1