Reputation: 858
I have a large dataframe that is simplified below. Given the following data frame structure, I need to collapse to return two distinct rows where col3 has different values but col1 and col2 have unique values.
dat <- data.frame("col1" = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
"col2" = c( "A","A", "A", "A", "A", "A", "A", "A", "A", "A"," A", "A", "A", "A", "A"),
"col3" = c( "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Y", "Y", "Y", "Y", "Y", "Y"))
col1 col2 col3
1 1 A Z
2 1 A Z
3 1 A Z
4 1 A Z
5 1 A Z
6 1 A Z
7 1 A Z
8 1 A Z
9 1 A Z
10 1 A Y
11 1 A Y
12 1 A Y
13 1 A Y
14 1 A Y
15 1 A Y
So in this case I would need to return just the following:
col1 col2 col3
1 A Z
1 A Y
If however, col3 was only z's I would return no rows. I can get counts of these data with the table function but I need to see the actual rows. Any ideas?
Thanks
Upvotes: 0
Views: 72
Reputation: 10483
Try this with:
library(dplyr)
dat %>%
group_by(col1, col2) %>%
filter(length(unique(col3)) > 1) %>%
distinct()
If dat
is as follows:
dat <- structure(list(col1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1), col2 = c("A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A"), col3 = c("Z", "Z", "Z", "Z", "Z",
"Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z")), class = "data.frame", row.names = c(NA,
-15L))
You get no rows as follows:
# A tibble: 0 x 3
# Groups: col1, col2 [0]
# ... with 3 variables: col1 <dbl>, col2 <chr>, col3 <chr>
If dat
is as you provided in the original post, you get the output as you needed:
# A tibble: 2 x 3
# Groups: col1, col2 [1]
col1 col2 col3
<dbl> <chr> <chr>
1 1 A Z
2 1 A Y
Notice that I am using length(unique())
in the filter instead of n_distinct
because there is a dplyr
bug that makes n_distinct
in a filter of a grouped data frame run extremely slowly.
Upvotes: 1
Reputation: 7292
Nice easy dplyr
solution:
dat <- structure(list(col1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1), col2 = c("A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A"), col3 = c("Z", "Z", "Z", "Z", "Z",
"Z", "Z", "Z", "Z", "Y", "Y", "Y", "Y", "Y", "Y")), class = "data.frame", row.names = c(NA,
-15L))
library(dplyr)
dat %>% group_by(col1,col2) %>% distinct()
# A tibble: 2 x 3
# Groups: col1, col2 [1]
col1 col2 col3
<dbl> <chr> <chr>
1 1 A Z
2 1 A Y
In your version of dat
you have a space in one of your col2
values, if that's not a typo, you'd need to fix that first so that distinct()
aggregates correctly:
dat %>% mutate(col2 = trimws(col2)) %>% group_by(col1,col2) %>% distinct()
Upvotes: 0