Reputation: 634
Take this dataframe here:
id <- c(1,2,3,4,3)
id2 <- c(5,6,7,8,7)
area <- c("Area1","Area2","Area3","Area4","Area5")
test <- c("A","B","C","D","C")
df1 <- data.frame(id,id2,area,test)
If you were to group by id and id2, there is a partial duplicate. How is it possible to set the value to null for the column value that is different, to produce a dataframe like so:
id|id2|area |test|
1 |5 |Area1|A |
2 |6 |Area2|B |
3 |7 |NA |C |
4 |8 |Area4|D |
Upvotes: 2
Views: 203
Reputation: 5788
Base R using subset
, duplicated
, and transform
:
subset(transform(df1, area = ifelse(duplicated(paste0(id, id2), fromLast = TRUE), NA, area)),
!(duplicated(paste0(id, id2))))
Upvotes: 0
Reputation: 101327
Here is a base R option using aggregate
+ replace
within(
aggregate(. ~ id + id2, df1, unique),
area <- replace(area, lengths(area) > 1, NA)
)
which gives
id id2 area test
1 1 5 Area1 A
2 2 6 Area2 B
3 3 7 NA C
4 4 8 Area4 D
Upvotes: 0
Reputation: 887088
Another option would be to create a frequency column with add_count
, get the distinct
rows, replace
the 'area' where 'n' is greater than 1 to NA
library(dplyr)
library(tibble)
df1 %>%
add_count(id, id2) %>%
distinct(id, id2, .keep_all = TRUE) %>%
mutate(area = replace(area, n > 1, NA)) %>%
select(-n)
# id id2 area test
#1 1 5 Area1 A
#2 2 6 Area2 B
#3 3 7 <NA> C
#4 4 8 Area4 D
Upvotes: 1
Reputation: 5956
I think best option, if you want to work across a flexible number of columns, is to use dplyr::summarize
to get unique values for each columns and return NA if more than 1 unique value.
library(dplyr)
get_unique <- function(x) {
ux <- unique(x)
if(length(ux) > 1) {
NA
} else {
ux
}
}
group_by(df1, id, id2) %>%
summarize(across(.fns = get_unique), .groups = "drop")
#> # A tibble: 4 x 4
#> id id2 area test
#> <dbl> <dbl> <chr> <chr>
#> 1 1 5 Area1 A
#> 2 2 6 Area2 B
#> 3 3 7 <NA> C
#> 4 4 8 Area4 D
Upvotes: 1
Reputation: 388972
One way would be to replace area
value with NA
when there are more than 1 row for id
and id2
. Ultimately, you want to select only one of the duplicates so you can select the first row.
library(dplyr)
df1 %>%
group_by(id, id2) %>%
mutate(area = if(n() > 1) NA else area) %>%
slice(1L)
# id id2 area test
# <dbl> <dbl> <chr> <chr>
#1 1 5 Area1 A
#2 2 6 Area2 B
#3 3 7 NA C
#4 4 8 Area4 D
Upvotes: 0