MGJ-123
MGJ-123

Reputation: 634

How to set partial duplicates in a dataframe to null

Take this dataframe here:

id <- c(1,2,3,4,3)
id2 <- c(5,6,7,8,7)
area <- c("Area1","Area2","Area3","Area4","Area5")
test <- c("A","B","C","D","C")
df1 <- data.frame(id,id2,area,test)

If you were to group by id and id2, there is a partial duplicate. How is it possible to set the value to null for the column value that is different, to produce a dataframe like so:

id|id2|area |test|
1 |5  |Area1|A   |
2 |6  |Area2|B   |
3 |7  |NA   |C   |
4 |8  |Area4|D   |

Upvotes: 2

Views: 203

Answers (5)

hello_friend
hello_friend

Reputation: 5788

Base R using subset, duplicated, and transform:

subset(transform(df1, area = ifelse(duplicated(paste0(id, id2), fromLast = TRUE), NA, area)), 
       !(duplicated(paste0(id, id2))))

Upvotes: 0

ThomasIsCoding
ThomasIsCoding

Reputation: 101327

Here is a base R option using aggregate + replace

within(
  aggregate(. ~ id + id2, df1, unique),
  area <- replace(area, lengths(area) > 1, NA)
)

which gives

  id id2  area test
1  1   5 Area1    A
2  2   6 Area2    B
3  3   7    NA    C
4  4   8 Area4    D

Upvotes: 0

akrun
akrun

Reputation: 887088

Another option would be to create a frequency column with add_count, get the distinct rows, replace the 'area' where 'n' is greater than 1 to NA

library(dplyr)
library(tibble)
df1 %>%
     add_count(id, id2) %>% 
     distinct(id, id2, .keep_all = TRUE) %>%
     mutate(area = replace(area, n > 1, NA)) %>% 
     select(-n)
#  id id2  area test
#1  1   5 Area1    A
#2  2   6 Area2    B
#3  3   7  <NA>    C
#4  4   8 Area4    D

Upvotes: 1

caldwellst
caldwellst

Reputation: 5956

I think best option, if you want to work across a flexible number of columns, is to use dplyr::summarize to get unique values for each columns and return NA if more than 1 unique value.

library(dplyr)

get_unique <- function(x) {
  ux <- unique(x)
  if(length(ux) > 1) {
    NA
  } else {
    ux
  }
}

group_by(df1, id, id2) %>%
  summarize(across(.fns = get_unique), .groups = "drop")
#> # A tibble: 4 x 4
#>      id   id2 area  test 
#>   <dbl> <dbl> <chr> <chr>
#> 1     1     5 Area1 A    
#> 2     2     6 Area2 B    
#> 3     3     7 <NA>  C    
#> 4     4     8 Area4 D

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388972

One way would be to replace area value with NA when there are more than 1 row for id and id2. Ultimately, you want to select only one of the duplicates so you can select the first row.

library(dplyr)
df1 %>%
  group_by(id, id2) %>%
  mutate(area = if(n() > 1) NA else area) %>%
  slice(1L)

#     id   id2 area  test 
#  <dbl> <dbl> <chr> <chr>
#1     1     5 Area1 A    
#2     2     6 Area2 B    
#3     3     7 NA    C    
#4     4     8 Area4 D    

Upvotes: 0

Related Questions