Reputation: 773
My question is similar to this one, except a bit different. In the initial question, I was trying to count (for each row) how many columns satisfied a condition. I would like to do something similar, except that the condition involves several columns with an OR condition, and my real data has many columns, so ideally, I'd like to reference the columns using a regular expression.
I have the following data:
colnames <- c(paste("col",rep(LETTERS[1:2],each=4),rep(1:4,2),sep=""),c("meh","muh"))
df <- as.data.frame(matrix(sample(c("Yes","No"),200,replace=TRUE),ncol=10))
names(df) <- colnames
df
colA1 colA2 colA3 colA4 colB1 colB2 colB3 colB4 meh muh
1 No Yes No No No Yes Yes No Yes Yes
2 No Yes Yes Yes Yes No Yes No No No
3 No No No Yes No No No No Yes No
4 Yes No Yes Yes Yes Yes Yes Yes No Yes
5 Yes No Yes No No No No Yes No Yes
6 Yes No No No Yes Yes No No No No
7 Yes No No No Yes Yes Yes No Yes No
8 Yes No Yes No Yes Yes No Yes Yes No
9 No Yes No No No Yes Yes No No No
10 Yes Yes No No Yes No Yes No Yes No
11 No Yes No No Yes No Yes Yes No No
12 No Yes Yes Yes No No Yes No No No
13 No No Yes Yes No Yes Yes Yes Yes No
14 Yes Yes No No No No Yes No No Yes
15 Yes No Yes Yes No Yes No Yes No No
16 No Yes Yes No No No Yes No No No
17 Yes No No No No Yes Yes Yes No Yes
18 Yes No Yes Yes No No No No No Yes
19 No No No No No Yes No No No Yes
20 No Yes No No Yes Yes Yes No No No
I would like to create a new column Nb
that records, for each line: the number of times at least one of colA2, colA3,colA4 is =="Yes" plus the number of times at least one of colB2, colB3,colB4 is =="Yes".
If there was not this "OR" condition implied when look at a group of columns [colA2, colA3,colA4], and I was adding the number of columns satisfying the condition, I could have used something like:
df$Nb <- rowSums(df[, grep("^col[A-B][2-4]", names(df))] == "Yes")
I would like to use regex if possible to reference the columns, as in my real data letters and numbers go further than B and 5 respectively.
Thank you!
Upvotes: 1
Views: 583
Reputation: 78600
You could adapt your rowSums
approach to just the groups of columns in each of your OR conditions, then add > 0
to make it "at least one." Thus, "at least one of the A values is Yes" would look like:
rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0
Then you can combine them using +
:
(rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0) +
(rowSums(df[, grep("^colB[2-4]", names(df))] == "Yes") > 0)
Incidentally, you would have an easier time answering questions like these if your data were in a tidy format: that is, if each column were a separate variable. Right now it looks like you're storing attributes of your data (A, B, 1-4) as parts of your column names, which is the reason operations like "using columns with the value 'A'" are very awkward. What if you instead rearranged your data, using the dplyr and tidyr packages, as:
library(dplyr)
library(tidyr)
df$index <- 1:nrow(df)
newdf <- df %>% gather(key, value, colA1:colB4) %>%
separate(key, c("col", "letter", "number"), c(-3, -2)) %>%
mutate(number = as.numeric(number))
This rearranges your data as (note that I gave each of your rows its own "index" variable):
meh muh index col letter number value
1 Yes No 1 col A 1 Yes
2 Yes No 2 col A 1 Yes
3 No No 3 col A 1 Yes
4 Yes No 4 col A 1 No
5 Yes Yes 5 col A 1 No
6 Yes Yes 6 col A 1 Yes
You can then group, summarize, filter and manipulate these observations more naturally. For example, you seem to want to drop the columns with the number 1: rather than needing a regular expression, you could simply do newdf %>% filter(number > 1)
.
Here's how you would perform the kind of OR operation you're describing:
hasyes <- newdf %>% group_by(index, letter) %>% filter(number > 1) %>%
summarize(hasyes = any(value == "Yes"))
For each of your original row+letter combinations, you now have a logical value for whether Yes
appears:
index letter hasyes
1 1 A TRUE
2 1 B TRUE
3 2 A TRUE
4 2 B TRUE
5 3 A FALSE
6 3 B TRUE
One more summarizing operation gets this into the form you want:
result <- hasyes %>% group_by(index) %>% summarize(yeses = sum(hasyes))
What's important about this solution is that it will work for any number of letters you have (that is, if it goes from A-Z instead of just A and B) equally easily.
Upvotes: 3