user1658170
user1658170

Reputation: 858

extracting rows depending on the value of multiple columns

I have a large dataframe that is simplified below. Given the following data frame structure, I need to collapse to return two distinct rows where col3 has different values but col1 and col2 have unique values.

dat <- data.frame("col1" = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1), 
       "col2" = c( "A","A", "A", "A", "A", "A", "A", "A", "A", "A"," A", "A", "A", "A", "A"),
       "col3" = c( "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Y", "Y", "Y", "Y", "Y", "Y"))

   col1 col2 col3
1     1    A    Z
2     1    A    Z
3     1    A    Z
4     1    A    Z
5     1    A    Z
6     1    A    Z
7     1    A    Z
8     1    A    Z
9     1    A    Z
10    1    A    Y
11    1    A    Y
12    1    A    Y
13    1    A    Y
14    1    A    Y
15    1    A    Y

So in this case I would need to return just the following:

   col1 col2 col3
      1    A    Z
      1    A    Y

If however, col3 was only z's I would return no rows. I can get counts of these data with the table function but I need to see the actual rows. Any ideas?

Thanks

Upvotes: 0

Views: 72

Answers (2)

Gopala
Gopala

Reputation: 10483

Try this with:

library(dplyr)
dat %>%
  group_by(col1, col2) %>%
  filter(length(unique(col3)) > 1) %>%
  distinct()

If dat is as follows:

dat <- structure(list(col1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1), col2 = c("A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A"), col3 = c("Z", "Z", "Z", "Z", "Z", 
"Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z")), class = "data.frame", row.names = c(NA, 
-15L))

You get no rows as follows:

# A tibble: 0 x 3
# Groups:   col1, col2 [0]
# ... with 3 variables: col1 <dbl>, col2 <chr>, col3 <chr>

If dat is as you provided in the original post, you get the output as you needed:

# A tibble: 2 x 3
# Groups:   col1, col2 [1]
   col1 col2  col3 
  <dbl> <chr> <chr>
1     1 A     Z    
2     1 A     Y    

Notice that I am using length(unique()) in the filter instead of n_distinct because there is a dplyr bug that makes n_distinct in a filter of a grouped data frame run extremely slowly.

Upvotes: 1

Mako212
Mako212

Reputation: 7292

Nice easy dplyr solution:

dat <- structure(list(col1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1), col2 = c("A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A"), col3 = c("Z", "Z", "Z", "Z", "Z", 
"Z", "Z", "Z", "Z", "Y", "Y", "Y", "Y", "Y", "Y")), class = "data.frame", row.names = c(NA, 
-15L))

library(dplyr)

dat %>% group_by(col1,col2) %>% distinct()

# A tibble: 2 x 3
# Groups:   col1, col2 [1]
   col1 col2  col3 
  <dbl> <chr> <chr>
1     1 A     Z    
2     1 A     Y

In your version of dat you have a space in one of your col2 values, if that's not a typo, you'd need to fix that first so that distinct() aggregates correctly:

dat %>% mutate(col2 = trimws(col2)) %>% group_by(col1,col2) %>% distinct()

Upvotes: 0

Related Questions