Rafael Díaz
Rafael Díaz

Reputation: 2289

NA value by Unique id

I have a dataframe with two columns. I want the "id" column to be unique, and the value for the non-repeated id put the same value, and for the repeated id that the value is NA.

library(data.table)

DT <- data.table(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))
DT
   id value
1:  1    17
2:  2    13
3:  3     8
4:  3    NA
5:  4     9
6:  5    NA
7:  5    11

Expected output

   id value
1:  1    17
2:  2    13
3:  3    NA
4:  4     9
5:  5    NA

Upvotes: 3

Views: 100

Answers (3)

tmfmnk
tmfmnk

Reputation: 39858

I see that you are mainly interested in data.table solutions, but for completeness, one dplyr possibility could be:

DT %>%
 group_by(id) %>%
 slice(which.max(is.na(value)))

     id value
  <dbl> <dbl>
1     1    17
2     2    13
3     3    NA
4     4     9
5     5    NA

Upvotes: 1

fmarm
fmarm

Reputation: 4284

This should do the trick, get the min value by id, if there is a NA, NA will be returned

DT[,.(value=min(value)),.(id)]

Edit: timed this solution and @mt1022 solution on a 100 million rows datatable, times are similar

library(data.table)
set.seed(1)
DT <- data.table(id = sample(1:1e6,1e8,replace=TRUE), 
                  value = ifelse(runif(1e7) < 0.99,
                                 sample(1:1e6,1e8,replace=TRUE),
                                        NA))

 # my solution with min
 ptm <- proc.time()
 DT[,.(value=min(value)),.(id)]
 proc.time() - ptm

 #   user  system elapsed 
 #   6.34    1.67    2.89 

 # mt1022's solution
 ptm <- proc.time()
 DT[, .(value = if(.N == 1) value else NA_real_), by = .(id)]
 proc.time() - ptm


 #   user  system elapsed 
 #   6.61    1.35    4.61 

Upvotes: 4

mt1022
mt1022

Reputation: 17289

Here is one option:

> DT[, .(value = if(.N == 1) value else NA_real_), by = .(id)]
   id value
1:  1    17
2:  2    13
3:  3    NA
4:  4     9
5:  5    NA

Upvotes: 6

Related Questions