mcburn
mcburn

Reputation: 13

Filter rows based upon 2 or more occurrences across columns in R

I am attempting to filter out rows in which there is an occurrence only in a single column, and show only those that appear in more than one.

An example database:

col1 col2 col3 col4 col5 col6 col7
row1 1 0 0 0 0 0 0
row2 6 0 42 0 0 0 0
row3 0 0 0 0 1 1 0
row4 0 0 0 0 0 0 87

In this case, I would want my filter to only return row2 and row3. I have been attempting to use dplyr and the filter feature; although I haven't found a way to capture all 7 columns. I'm unable to use rowSums() since the actual value isn't important in this case, only the frequency alongside another column.

I have thus far only accurately captured the output I am looking for when exclusively comparing two columns for values >0:

mydata.df %>%
  filter(col1>0 & col2>0)

Though it's easy to see why this approach isn't feasible, especially with 7 columns to compare. I have tried stringing multiple conditions to compare amongst multiple columns without success. Short of writing a separate filter for each of the combinations of columns independently, I'm sure I am missing an obvious and simple solution?

Upvotes: 1

Views: 1013

Answers (1)

akrun
akrun

Reputation: 887981

We can use filter with rowSums

library(dplyr)
mydata.df %>% 
   filter(rowSums(. > 0) >=2)
     col1 col2 col3 col4 col5 col6 col7
row2    6    0   42    0    0    0    0
row3    0    0    0    0    1    1    0

Or if it is pairwise combination of columns that we are checking

library(purrr)
mydata.df %>% 
    filter(combn(., 2, FUN = function(x) rowSums(x > 0) == 2, 
         simplify = FALSE) %>% 
      reduce(`|`))
     col1 col2 col3 col4 col5 col6 col7
row2    6    0   42    0    0    0    0
row3    0    0    0    0    1    1    0

data

mydata.df <- structure(list(col1 = c(1L, 6L, 0L, 0L), col2 = c(0L, 0L, 0L, 
0L), col3 = c(0L, 42L, 0L, 0L), col4 = c(0L, 0L, 0L, 0L), col5 = c(0L, 
0L, 1L, 0L), col6 = c(0L, 0L, 1L, 0L), col7 = c(0L, 0L, 0L, 87L
)), class = "data.frame", row.names = c("row1", "row2", "row3", 
"row4"))

Upvotes: 1

Related Questions