replace duplicate values with NA in time series data using dplyr

Question

My data seems a bit different than other similar kind of posts.

box_num      date       x        y
1-Q      2018-11-18   20.2      8
1-Q      2018-11-25   21.23     7.2
1-Q      2018-12-2    21.23     23
98-L     2018-11-25   0.134     9.3
98-L     2018-12-2    0.134     4
76-GI    2018-12-2    22.734    4.562
76-GI    2018-12-9    28        4.562

Here I would like to replace the repeated values with NA in both x and y columns. The code I have tried using dplyr :

(1)df <- df %>% group_by(box_num) %>% arrange(box_num,date) %>%
  mutate(df$x[duplicated(df$x),] <- NA)

It creates a new column with all NA's instead of just replacing a repeated value with NA

 (2)df <- df %>% group_by(box_num) %>% arrange(box_num,date) %>%  
distinct(x,.keep_all = TRUE)

The second one just gives the rows that are not duplicated(we are missing the time series) Desired Output :

box_num      date       x        y
    1-Q      2018-11-18   20.2      8
    1-Q      2018-11-25   21.23     7.2
    1-Q      2018-12-2    NA        23
    98-L     2018-11-25   0.134     9.3
    98-L     2018-12-2    NA        4
    76-GI    2018-12-2    22.734    4.562
    76-GI    2018-12-9    28        NA

Ronak Shah · Accepted Answer

Using dplyr we can group_by box_num and use mutate_at x and y column and replace the duplicated value by NA.

library(dplyr)

df %>%
  group_by(box_num) %>%
  mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))


# box_num date          x     y
#            
#1 1-Q     2018-11-18 20.2    8   
#2 1-Q     2018-11-25 21.2    7.2 
#3 1-Q     2018-12-2  NA     23   
#4 98-L    2018-11-25  0.134  9.3 
#5 98-L    2018-12-2  NA      4   
#6 76-GI   2018-12-2  22.7    4.56
#7 76-GI   2018-12-9  28     NA

A base R option (which might not be the best in this case) would be :

cols <- c("x", "y")
df[cols] <- sapply(df[cols], function(x) 
            ave(x, df$box_num, FUN = function(x) replace(x, duplicated(x), NA)))

replace duplicate values with NA in time series data using dplyr

Answers (2)

data

Related Questions