Sylv
Sylv

Reputation: 51

Find multiple partial strings in each row and create a variable with the column the string is in

I have found similar questions but nothing that has solved my problem yet. I have a huge dataframe and I'm trying to find where there are occurrences of a range of strings. Here is my sample data:

var1 <- c("Goats","Sheep","Pigs","Dog","Zebu","Donkeys","Water buffaloes","Dromedary camel","Dog","Pig")
var2 <- c("cats","Dog","birds","rabbits","Plant","fish","guinea pigs","cat","Mouse","dog")
var3 <- c("cats","Dog","birds","rabbits","horses","Sheep","Pigs","Tree","Zebu","Donkeys")
var4 <- c("Plant","Ant","Bee","Tree","Water buffaloes","Donkeys","Water buffaloes","Dromedary camel","Dog","Pig")

df <- data.frame(var1, var2, var3, var4)

head(df)
    var1    var2    var3            var4
1   Goats    cats    cats           Plant
2   Sheep     Dog     Dog             Ant
3    Pigs   birds   birds             Bee
4     Dog rabbits rabbits            Tree
5    Zebu   Plant  horses Water buffaloes
6 Donkeys    fish   Sheep         Donkeys
> 

I can find the row and column of each occurrence of Tree and plant:

strings = c( "Tree","Plant")

which(apply(df, 1, function(x) any(grepl(paste(strings, collapse = "|"), x))))# ROW
[1] 1 4 5 8

which(apply(df, 2, function(x) any(grepl(paste(strings, collapse = "|"), x)))) # COLUMN
var2 var3 var4 
   2    3    4 

But can't figure out how to put the column name (or indice) in a new variable for each row. This is what I would like:

> head(df)
     var1    var2    var3            var4     x1
1   Goats    cats    cats           Plant     var4
2   Sheep     Dog     Dog             Ant     NA
3    Pigs   birds   birds             Bee     NA 
4     Dog rabbits rabbits            Tree     var4
5    Zebu   Plant  horses Water buffaloes     var2
6 Donkeys    fish   Sheep         Donkeys     NA

I thought something with str_detect or str_which would would, but not sure how to do it across all columns. Something like:

df <- df %>%
  mutate(words = ifelse(any_vars(str_detect(strings))))

Appreciate your time and help.

Upvotes: 0

Views: 469

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388982

We can achieve this with rowwise -

library(dplyr)

cols <- names(df)

df %>%
  rowwise() %>%
  mutate(x1 = toString(cols[c_across() %in% strings])) %>%
  ungroup

#   var1            var2        var3    var4            x1    
#   <chr>           <chr>       <chr>   <chr>           <chr> 
# 1 Goats           cats        cats    Plant           "var4"
# 2 Sheep           Dog         Dog     Ant             ""    
# 3 Pigs            birds       birds   Bee             ""    
# 4 Dog             rabbits     rabbits Tree            "var4"
# 5 Zebu            Plant       horses  Water buffaloes "var2"
# 6 Donkeys         fish        Sheep   Donkeys         ""    
# 7 Water buffaloes guinea pigs Pigs    Water buffaloes ""    
# 8 Dromedary camel cat         Tree    Dromedary camel "var3"
# 9 Dog             Mouse       Zebu    Dog             ""    
#10 Pig             dog         Donkeys Pig             ""    

In base R, with apply -

df$x1 <- apply(df, 1, function(x) toString(cols[x %in% strings]))

Upvotes: 1

Related Questions