aaron parrilla

Reputation: 97

Determine if a substring appears in a string, by row of a dataframe

I have a dataframe that is revised every day. When an error occurs, it is checked, and if it can be solved, the keyword "REVISED" is added to the beginning of the error message, like so:

ID  M1               M2                M3        
1   NA               "REVISED-error"   "error"    
2   "REVISED-error"  "REVISED-error"   NA        
3   "REVISED-error"  "REVISED-error"   "error"   
4   NA               "error"           NA         
5   NA               NA                NA           

I want to add two columns that tell me how many errors each row has and how many of them have been revised, like this:

ID  M1               M2                M3         i1   ix
1   NA               "REVISED-error"   "error"    2    1    <- 2 errors, 1 revised
2   "REVISED-error"  "REVISED-error"   NA         2    2
3   "REVISED-error"  "REVISED-error"   "error"    3    2
4   NA               "error"           NA         1    0
5   NA               NA                NA         0    0

I found this code:

df <- df %>% mutate(i1 = rowSums(!is.na(.[2:4])))

That tells me how many errors are in those specific columns. How can I find out whether any of those errors contain the keyword REVISED? I've tried a few things, but none have worked so far:

df <- df %>% mutate(i1 = rowSums(!is.na(.[2:4]))) %>% mutate(ie = rowSums(.[2:4] %in% "REVISED"))

This returns the error: 'x' must be an array of at least two dimensions

Upvotes: 1

Views: 50

Answers (2)

Ronak Shah

Reputation: 388907

You could use apply to count how many times "error" and "REVISED" appear in each row.

df[c("i1", "ix")] <- t(apply(df[-1], 1, function(x) 
                  c(sum(grepl("error", x)), sum(grepl("REVISED", x)))))


df
#  ID            M1            M2    M3 i1 ix
#1  1          <NA> REVISED-error error  2  1
#2  2 REVISED-error REVISED-error  <NA>  2  2
#3  3 REVISED-error REVISED-error error  3  2
#4  4          <NA>         error  <NA>  1  0
#5  5          <NA>          <NA>  <NA>  0  0

An alternative approach uses is.na and rowSums to calculate i1:

df$i1 <- rowSums(!is.na(df[-1]))
df$ix <- apply(df[-1], 1, function(x) sum(grepl("REVISED", x)))
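For reference, the attempt in the question fails because `%in%` applied to a data frame returns a plain logical vector (one element per column, tested by exact match), so `rowSums()` has no two-dimensional input. The following is a minimal sketch of a column-wise variant that avoids `apply`, assuming the message columns are named M1, M2 and M3 as in the question and that "REVISED" always sits at the start of the message; `cols` and `revised` are illustrative names only.

```r
# Example data from the question, stored as character columns
df <- data.frame(
  ID = 1:5,
  M1 = c(NA, "REVISED-error", "REVISED-error", NA, NA),
  M2 = c("REVISED-error", "REVISED-error", "REVISED-error", "error", NA),
  M3 = c("error", NA, "error", NA, NA),
  stringsAsFactors = FALSE
)

cols <- c("M1", "M2", "M3")  # message columns (assumed names)

# One logical column per message column; grepl() returns FALSE for NA,
# so missing messages are never counted as revised.
revised <- sapply(df[cols], function(col) grepl("^REVISED", col))

df$i1 <- rowSums(!is.na(df[cols]))  # number of errors per row
df$ix <- rowSums(revised)           # number of revised errors per row
```

Naming the columns explicitly keeps the newly added i1 and ix columns out of the counts if the code is run again.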

data

df <- structure(list(ID = 1:5, M1 = structure(c(NA, 1L, 1L, NA, NA), 
.Label = "REVISED-error", class = "factor"), 
M2 = structure(c(2L, 2L, 2L, 1L, NA), .Label = c("error", 
"REVISED-error"), class = "factor"), M3 = structure(c(1L, 
NA, 1L, NA, NA), .Label = "error", class = "factor")), row.names = c(NA, 
-5L), class = "data.frame")

Upvotes: 2

stevec

Reputation: 52268

You can use str_count() from the stringr package to count the number of times REVISED appears, like so:

df <- data.frame(M1=as.character(c(NA, "REVISED-x", "REVISED-x")),
                 M2=as.character(c("REVISED-x", "REVISED-x", "REVISED-x")), 
                 stringsAsFactors = FALSE)

library(stringr)
df$ix <- str_count(paste0(df$M1, df$M2), "REVISED")

df

#          M1        M2 ix
# 1      <NA> REVISED-x  1
# 2 REVISED-x REVISED-x  2
# 3 REVISED-x REVISED-x  2
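
The same paste-then-count idea can be sketched for any number of columns in base R, without stringr. This is only an illustration under assumptions: `cols` and `pasted` are hypothetical names, and a space separator is used so that a match cannot straddle two adjacent values. Note that this counts every occurrence, so a message containing REVISED twice would be counted twice.

```r
df <- data.frame(M1 = c(NA, "REVISED-x", "REVISED-x"),
                 M2 = c("REVISED-x", "REVISED-x", "REVISED-x"),
                 stringsAsFactors = FALSE)

cols <- c("M1", "M2")  # columns to search (assumed names)

# Paste the chosen columns row by row; NA becomes the string "NA",
# which does not contain the keyword.
pasted <- do.call(paste, c(unname(as.list(df[cols])), sep = " "))

# gregexpr() locates every occurrence, regmatches() extracts them,
# and lengths() counts the matches in each row.
df$ix <- lengths(regmatches(pasted, gregexpr("REVISED", pasted, fixed = TRUE)))
```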

Upvotes: 1
