Connor Harris

Reputation: 431

How to get unique() to work on data.tables with character columns?

If I create an R data.table with string columns without setting stringsAsFactors=TRUE and then call unique() on it, the strings are stripped from the resulting table, even though they are still taken into account when determining which rows are unique.

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=FALSE)
> unique(dt)
   x y
1:   1
2:   2
3:   2
> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=TRUE)
> unique(dt)
   x y
1: a 1
2: b 2
3: c 2

Is this correct behavior? I'm on Cygwin and have uncovered a few mysterious Cygwin-specific issues in the R internals before. Here's the readout of sessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-unknown-cygwin (64-bit)
Running under: CYGWIN_NT-6.1 INT-3A02 2.8.1(0.312/5/3) 2017-07-03 14:11 x86_64 Cygwin

Matrix products: default
LAPACK: /usr/lib/R/modules/lapack.dll

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] bit_1.1-12     compiler_3.4.0 bit64_0.9-7

Upvotes: 2

Views: 227

Answers (1)

Damian

Reputation: 1433

The duplicated() function may provide a workaround: dt[!duplicated(dt), ] returns the same result as unique(dt) in both cases on my system (Ubuntu Linux, kernel 3.13.0-121-generic).

> library(data.table)
> dt <- data.table(x=factor(c('a', 'a', 'b', 'c')), y=c(1, 1, 2, 2))
> all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2))
> all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE
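
If it helps, here is a minimal sketch of two ways to deduplicate while checking that the character column actually survives. The first is the duplicated() approach above. The second is only an assumption based on the question's observation that factor columns come through unique() intact: convert character columns to factor, deduplicate, then convert back. The names dedup, chr_cols, dt_fac and dedup_fac are made up for illustration, and I have not been able to try either on Cygwin.

```r
library(data.table)

dt <- data.table(x = c("a", "a", "b", "c"), y = c(1, 1, 2, 2))

# Option 1: keep only rows that do not duplicate an earlier row.
dedup <- dt[!duplicated(dt)]

# Option 2 (assumption): round-trip character columns through factor,
# since the factor version deduplicated correctly in the question.
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt_fac <- copy(dt)  # copy so the original table is left untouched
dt_fac[, (chr_cols) := lapply(.SD, factor), .SDcols = chr_cols]
dedup_fac <- unique(dt_fac)
dedup_fac[, (chr_cols) := lapply(.SD, as.character), .SDcols = chr_cols]

# Sanity checks: stop with an error if the strings were stripped.
stopifnot(is.character(dedup$x), all(nzchar(dedup$x)))
stopifnot(is.character(dedup_fac$x), all(nzchar(dedup_fac$x)))
```

If either stopifnot() call fails on the affected system, that would suggest the problem is not specific to unique() itself.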

Related post: Finding ALL duplicate rows, including "elements with smaller subscripts"

Upvotes: 1
