Luiz
Luiz

Reputation: 81

Conditional sum in R – multiple columns

I'm trying to figure out how to extract some specific information from very big tables (e.g., 30'000 rows and 50 columns).

Imagine I have this data frame:

S1 <- c(1,2,1,1,3,1)
S2 <- c(2,1,3,2,1,1)
S3 <- c(1,2,2,1,3,1)
S4 <- c(3,3,4,2,3,1)
S5 <- c(3,2,5,3,2,2)
count <- c(10,5,3,1,1,1)
df <- data.frame(count,S1,S2,S3,S4,S5)

What I need is to sum the column "count" when, for instance, S1 and S3 shares the same value (it doesn't matter which value), but no other column has the same value.

In this example, it should returns the value 11, because I should only take into consideration the values of the column "count" from the rows 1 and 4.

In the rows 2, 5 and 6, S1 and S3 have a similar value, but I don't want consider them because there are also other columns with the same value. And finally, not considering row 3 simply because S1 and S3 have different values.

I know how to do it easily in excel, but I was wondering how I could do it in R. I've tried somme commands from dplyr, but I failed.

If any of you could give a help, I'll be very grateful.

Upvotes: 1

Views: 5135

Answers (3)

moodymudskipper
moodymudskipper

Reputation: 47310

Using dplyr, rowwise, filter :

library(dplyr)
df %>%
  rowwise() %>%
  filter(S1==S3 & !S1 %in% c(S2,S4,S5)) %>% 
  pull(count) %>% 
  sum() 
# [1] 11

Upvotes: 4

Patricio Moracho
Patricio Moracho

Reputation: 717

A little more complex, but it works. Using only R base. From this question take the form of comparing multiple columns in a simple way.

sum(df[df$S1==df$S3 & rowSums(sapply(df[,c(3,5,6)],`==`,e2=df$S1)) == 0,1])

[1] 11

The most complex part is how to check multiple columns. In this case we use sapply to compare the columns c(3,5,6) by equality ('==') with S1, (e2 is the second argument of the == function).

As ycw mentions, it can be a little complicated to define all the columns by a vector, so this form allows you to check all the columns except those we don't want.

sum(df[df$S1==df$S3 & rowSums(sapply(df[,!(colnames(df) %in% c("count", "S1", "S3"))],`==`,e2=df$S1)) == 0,1])

Applying the same procedure to the two comparisons and defining only the vector of the same values:

equals <- c("S1", "S3")
not_equals <- !(colnames(df) %in% c("count", equals))

sum(df[rowSums(sapply(df[,equals,drop=FALSE],`==`,e2=df[equals[1]])) == length(equals) &
           rowSums(sapply(df[,not_equals,drop=FALSE],`==`,e2=df[equals[1]])) == 0, 1])

Note: Use drop=FALSE for selecting only one column of dataframe and avoid "promotion to vector" problem or omit the , this way:

sum(df[rowSums(sapply(df[equals],`==`,e2=df[equals[1]])) == length(equals) &
           rowSums(sapply(df[not_equals],`==`,e2=df[equals[1]])) == 0, 1])

Upvotes: 2

www
www

Reputation: 39154

A solution using dplyr. There are two steps. The first filter function finds rows with S1 == S3. The second filter_at function checks columns other than S1, S3, and count all are not equal to S1, which should be the same as S3 after the first filter function.

library(dplyr)

df2 <- df %>%
  filter(S1 == S3) %>%
  filter_at(vars(-S1, -S3, -count), all_vars(. != S1))
df2
  count S1 S2 S3 S4 S5
1    10  1  2  1  3  3
2     1  1  2  1  2  3

Then the total count is as follows.

sum(df2$count)
[1] 11

Upvotes: 2

Related Questions