Reputation: 81
I'm trying to figure out how to extract some specific information from very big tables (e.g., 30'000 rows and 50 columns).
Imagine I have this data frame:
S1 <- c(1,2,1,1,3,1)
S2 <- c(2,1,3,2,1,1)
S3 <- c(1,2,2,1,3,1)
S4 <- c(3,3,4,2,3,1)
S5 <- c(3,2,5,3,2,2)
count <- c(10,5,3,1,1,1)
df <- data.frame(count,S1,S2,S3,S4,S5)
What I need is to sum the column "count" when, for instance, S1 and S3 shares the same value (it doesn't matter which value), but no other column has the same value.
In this example, it should returns the value 11, because I should only take into consideration the values of the column "count" from the rows 1 and 4.
In the rows 2, 5 and 6, S1 and S3 have a similar value, but I don't want consider them because there are also other columns with the same value. And finally, not considering row 3 simply because S1 and S3 have different values.
I know how to do it easily in excel, but I was wondering how I could do it in R. I've tried somme commands from dplyr, but I failed.
If any of you could give a help, I'll be very grateful.
Upvotes: 1
Views: 5135
Reputation: 47310
Using dplyr
, rowwise
, filter
:
library(dplyr)
df %>%
rowwise() %>%
filter(S1==S3 & !S1 %in% c(S2,S4,S5)) %>%
pull(count) %>%
sum()
# [1] 11
Upvotes: 4
Reputation: 717
A little more complex, but it works. Using only R base. From this question take the form of comparing multiple columns in a simple way.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,c(3,5,6)],`==`,e2=df$S1)) == 0,1])
[1] 11
The most complex part is how to check multiple columns. In this case we use sapply
to compare the columns c(3,5,6)
by equality ('=='
) with S1, (e2
is the second argument of the ==
function).
As ycw mentions, it can be a little complicated to define all the columns by a vector, so this form allows you to check all the columns except those we don't want.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,!(colnames(df) %in% c("count", "S1", "S3"))],`==`,e2=df$S1)) == 0,1])
Applying the same procedure to the two comparisons and defining only the vector of the same values:
equals <- c("S1", "S3")
not_equals <- !(colnames(df) %in% c("count", equals))
sum(df[rowSums(sapply(df[,equals,drop=FALSE],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[,not_equals,drop=FALSE],`==`,e2=df[equals[1]])) == 0, 1])
Note: Use drop=FALSE
for selecting only one column of dataframe and avoid "promotion to vector" problem or omit the ,
this way:
sum(df[rowSums(sapply(df[equals],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[not_equals],`==`,e2=df[equals[1]])) == 0, 1])
Upvotes: 2
Reputation: 39154
A solution using dplyr
. There are two steps. The first filter
function finds rows with S1 == S3
. The second filter_at
function checks columns other than S1
, S3
, and count
all are not equal to S1
, which should be the same as S3
after the first filter
function.
library(dplyr)
df2 <- df %>%
filter(S1 == S3) %>%
filter_at(vars(-S1, -S3, -count), all_vars(. != S1))
df2
count S1 S2 S3 S4 S5
1 10 1 2 1 3 3
2 1 1 2 1 2 3
Then the total count is as follows.
sum(df2$count)
[1] 11
Upvotes: 2