epxlp

Reputation: 71

Generate a variable that flags any occurrence of a specific string across multiple variables

Possibly a really simple problem, but my Stata-ingrained brain just can't figure this one out.

I am trying to generate a single 'case-status' variable in R that uses conditional input from multiple variables in a df. I can get it to work conditional on one variable, but am struggling to find a method that includes all variables.

The data looks similar to this:

id var1 var2 var3 .....
1 X Y <NA>
2 Y <NA> <NA>
3 <NA> X X

I can use case <- rep(NA, nrow(df)) followed by case[df$var1 == "X"] <- 1 to return this output:

head(case)
[1] 1 NA NA 

But what I really want to know is if there are any instances of X in any of the var variables, so output that looks like this:

head(case)
[1] 1 NA 1

So how can I change case[df$var1 == "X"] <- 1 to loop over all 'var' variables (in reality there are about 400 rather than 3)?

Upvotes: 2

Views: 174

Answers (3)

RHertel

Reputation: 23788

You could try

case <- +!!rowSums(df=="X", na.rm=TRUE) 
case[case==0] <- NA
#> case
#[1]  1 NA  1
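
For readers unfamiliar with the `+!!` idiom above: `!!` turns the row counts into logicals (non-zero becomes TRUE) and the unary `+` turns those logicals into integers. A small sketch on made-up counts (the vector `counts` and the names `flag_short` and `flag_plain` are only for illustration), next to a more spelled-out equivalent:

```r
# Hypothetical per-row counts of "X", standing in for the rowSums() result
counts <- c(1, 0, 2)

flag_short <- +!!counts               # !! gives logical, unary + gives integer
flag_plain <- as.integer(counts > 0)  # same idea written out

# Turn the zeros into NA, as in the answer
flag_plain[flag_plain == 0] <- NA
flag_plain
```

Both forms give the same 0/1 flags before the NA step; which one to use is mostly a matter of readability.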

data

df <- structure(list(id = 1:3, var1 = structure(c(1L, 2L, 2L), .Label = 
c("X", "Y"), class = "factor"), var2 = structure(c(1L, NA, 1L), 
.Label = "Y", class = "factor"), var3 = structure(c(NA, NA, 1L), 
.Label = "X", class = "factor")), .Names = c("id", "var1", "var2", "var3"), 
class = "data.frame", row.names = c(NA, -3L))

Upvotes: 2

Dominic Comtois

Reputation: 10401

What about this?

myData <- data.frame(id=1:3, var1=c("X", "Y", NA), 
                     var2=c("Y", NA, "X"), var3=c(NA, NA, "X"),
                     stringsAsFactors=F)

as.numeric(rowSums(myData[2:4] == "X", na.rm=TRUE) > 0)

Result:

[1] 1 0 1

Edit

To get the exact same results as you did (having NA where no "X" is present but at least one NA is present), try this:

ifelse(rowSums(myData[2:4] == "X", na.rm=TRUE) > 0, 1,
       ifelse(rowSums(is.na(myData[2:4])) > 0, NA, 0))

Result:

[1]  1 NA  1
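
With around 400 columns, hard-coded positions such as 2:4 may be awkward to maintain. A minimal sketch, assuming the columns of interest all have names starting with "var" (an assumption taken from the question's example; the `other` column is made up here to show that it is ignored):

```r
# Toy data mirroring the question; "other" is a hypothetical non-var column
myData <- data.frame(id = 1:3,
                     var1 = c("X", "Y", NA),
                     var2 = c("Y", NA, "X"),
                     var3 = c(NA, NA, "X"),
                     other = c("X", "X", "X"),
                     stringsAsFactors = FALSE)

# Select the columns whose names start with "var"
var_cols <- grep("^var", names(myData), value = TRUE)

# 1 if any selected column equals "X", otherwise 0
case <- as.numeric(rowSums(myData[var_cols] == "X", na.rm = TRUE) > 0)
case
```

The same `var_cols` selection could be dropped into the nested ifelse version if the NA behaviour is wanted.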

Upvotes: 1

lmo

Reputation: 38500

To get a column that flags whether any column in a row has an "X", one method using any is as follows:

# set up example data
df <- data.frame(id=1:3, var1=c("X", "Y", NA), var2=c("Y", NA, "X"), var3=c(NA, NA, "X"),
                 stringsAsFactors=F)

df$newVec <- as.integer(apply(df[,-1], 1, function(i) any(i == "X", na.rm=T)))

This returns 1s and 0s. If you instead want NA for rows that contain no "X" but at least one missing value (including rows where every value is NA), drop na.rm:

df$newVec <- as.integer(apply(df[,-1], 1, function(i) any(i == "X")))
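
The difference between the two calls comes down to how any treats missing values. A small sketch on two made-up rows (the names `row_no_x` and `row_with_x` are only for illustration):

```r
row_no_x   <- c("Y", NA, NA)   # no "X", some values missing
row_with_x <- c(NA, "X", "X")  # at least one "X"

res_keep_na <- any(row_no_x == "X")                 # NA: an "X" might be hiding in the NAs
res_drop_na <- any(row_no_x == "X", na.rm = TRUE)   # FALSE: NAs are ignored
res_found   <- any(row_with_x == "X")               # TRUE: one TRUE is enough, NAs or not
```

So without na.rm, a row only comes back as NA when no "X" was found and something was missing.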

Here is one way in base R to replace all X values with 1s using a for loop:

for(i in 2:length(df)) df[df[, i] == "X" & !is.na(df[, i]), i] <- 1

You have to include !is.na in order to ignore the missing values. This should be pretty fast, since the replacement happens in place.
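
If a loop-free version is preferred, one possible sketch uses lapply over the non-id columns. This is an alternative to the loop above, not a faster or better method by any measurement here, and it assumes character columns (stringsAsFactors = FALSE) with the id in the first column:

```r
# Example data as in the question
df <- data.frame(id = 1:3,
                 var1 = c("X", "Y", NA),
                 var2 = c("Y", NA, "X"),
                 var3 = c(NA, NA, "X"),
                 stringsAsFactors = FALSE)

# which() drops the NAs, so no explicit !is.na() is needed here
df[-1] <- lapply(df[-1], function(col) replace(col, which(col == "X"), "1"))
df
```

Because the columns are character, the replacement value ends up as the string "1", which is also what the loop produces.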

If your goal is to indicate whether or not each variable (column) contains an X, you can use any and sapply:

sapply(df, function(i) any(i == "X"))

Upvotes: 1
