Reputation: 49

Comparing values in multiple columns and populating additional column based on the comparison results

I have a table with dates in column 1 (Column titled "Date") and values in columns 2 through 4 (Columns titled "A"->"C").

Date   A  B  C
1/1/16 X X Y 
1/2/16 X Y Y
1/3/16 X Y Z

etc.

I need to create additional 5th column D that contains the following:

For each row:

If 2 out of 3 values in any of the columns A, B or C = X then value in column D = X

If 2 out of 3 values in any of the columns A, B or C = Y then value in column D = Y

for anything else value in column D = Z

Desired output:

Date   A  B  C  D
1/1/16 X X Y X
1/2/16 X Y Y Y
1/3/16 X Y Z Z

I have tried the following logic recommended earlier for comparing values but it gives me a bunch of errors, apparently "=" sign is a no-no:

df$D <- NA  
df$D[df$A = df$B = X or df$B = df$C = X or df$A = df$C = Y] <- "X"
df$D[df$A = df$B = Y or df$B = df$C = Y or df$A = df$C = Y] <- "Y"
df$D[is.na(df$E)]               <- "Z"

Error: unexpected '=' in "df$D[df$A ="

What is the best way to do this?

Upvotes: 0

Answers (3)

Shenglin Chen

Reputation: 4554

Try to use ifelse:

 df$D<-ifelse(rowSums(df[,2:4]=='X')>=2,'X',ifelse(rowSums(df[,2:4]=='Y')>=2,'Y','Z'))

Upvotes: 0

lmo

Reputation: 38500

Here's a method that uses rowMeans.

# assign all of D to "Z"
df$D <- "Z"
# replace values
df$D[rowMeans(df[, 2:(length(df)-1)] == "X") > 0.5] <- "X"
df$D[rowMeans(df[, 2:(length(df)-1)] == "Y") > 0.5] <- "Y"

A nice feature is that it would grow if you add on more variables.

rowMeans(df[, 2:(length(df)-1)] == "X") will calculate, for each row, the proportion of elements that have an "X". Your threshold (2 out of 3) is 0.666666... I used 0.5 instead as it is sufficient for the given number of columns. If additional columns were added, you might increase this nearer to 2/3rds.

Upvotes: 1

Fridolin Linder

Reputation: 411

as Imo pointed out = is for assignment == is for comparison. Or is |. Here's a link to the logical operators.

There might be a nicer solution but here is the correct syntax for what you wanted to do:

df$D <- NA
df$D[(df$A == "X" & df$B == "X") | (df$A == "X" & df$C == "X") | 
         (df$B == "X" & df$C == "X")] <- "X"
df$D[(df$A == "Y" & df$B == "Y") | (df$A == "Y" & df$C == "Y") | 
         (df$B == "Y" & df$C == "Y")] <- "Y"
df$D[is.na(df$D)] <- "Z"

Upvotes: 0

Comparing values in multiple columns and populating additional column based on the comparison results

Answers (3)

Related Questions