Reputation: 960
I have a data.table, and I need to generate another data.table that lists only unique values of each column. An example:
From
> sourceDT <- data.table(ID = c(1,2,3,4), date = c(ymd("20110101"),ymd("20110101"),ymd("20130101"),ymd("20150101")), text = c("A","B","C","C"))
> sourceDT
ID date text
1: 1 2011-01-01 A
2: 2 2011-01-01 B
3: 3 2013-01-01 C
4: 4 2015-01-01 C
To
> outputDT <- data.table(ID = c(1,2,3,4),date = c(ymd("20110101"),ymd("20130101"),ymd("20150101"),NA), text = c("A","B","C",NA))
> outputDT
ID date text
1: 1 2011-01-01 A
2: 2 2013-01-01 B
3: 3 2015-01-01 C
4: 4 <NA> NA
What's the most efficient way of achieving this?
Upvotes: 1
Views: 383
Reputation: 79228
one way you can do: (Though I guess not very Efficient:)
sourceDT[,lapply(.SD, function(x)sort(`is.na<-`(x,duplicated(x)),na.last = T))]
ID date text
1: 1 2011-01-01 A
2: 2 2013-01-01 B
3: 3 2015-01-01 C
4: 4 <NA> NA
Upvotes: 2
Reputation: 4671
This is what I've understood your question to be, the demo you have doesn't quite match what you explained in response to my comment.
This finds duplicates in each column and sets any to NA, this preserves the data.table
size.
library(data.table)
library(lubridate)
sourceDT <- data.table(ID = c(1,2,3,4), date = c(ymd("20110101"),ymd("20110101"),ymd("20130101"),ymd("20150101")), text = c("A","B","C","C"))
for (i in seq_along(sourceDT)) {
dupes <- which(duplicated(sourceDT[[i]]))
if (length(dupes > 0)) {
set(sourceDT, dupes, i, NA)
}
}
# ID date text
# 1: 1 2011-01-01 A
# 2: 2 <NA> B
# 3: 3 2013-01-01 C
# 4: 4 2015-01-01 NA
Upvotes: 1