Arthur

Reputation: 1228

unique.data.table does not handle keys properly

I have this data.table, called A:

    kom eje               gad num     enc
 1: 101   1 A.C. Meyers Vænge   1   UTF-8
 2: 101   2 A.C. Meyers Vænge   1 unkwown
 3: 101   3 A.C. Meyers Vænge   1 unkwown
 4: 101   4 A.C. Meyers Vænge   1   UTF-8
 5: 101   5 A.C. Meyers Vænge   1 unkwown
 6: 101   6 A.C. Meyers Vænge   1   UTF-8
 7: 101   7 A.C. Meyers Vænge   1 unkwown
 8: 101   8 A.C. Meyers Vænge   1 unkwown
 9: 101   9 A.C. Meyers Vænge   1   UTF-8
10: 101  10 A.C. Meyers Vænge   1 unkwown
11: 101  11 A.C. Meyers Vænge  10 unkwown
12: 101  12 A.C. Meyers Vænge  11 unkwown
13: 101  13 A.C. Meyers Vænge  11   UTF-8
14: 101  14 A.C. Meyers Vænge 11A unkwown
15: 101  15 A.C. Meyers Vænge 11A   UTF-8
16: 101  16 A.C. Meyers Vænge 11A   UTF-8
17: 101  17 A.C. Meyers Vænge 11A unkwown
18: 101  18 A.C. Meyers Vænge 11A unkwown
19: 101  19 A.C. Meyers Vænge 11A   UTF-8
20: 101  20 A.C. Meyers Vænge 11A   UTF-8

A is keyed by kom, gad and num.

setkey(A,kom,gad,num)

However, unique(A) returns the wrong result (and without any warning):

    kom eje               gad num     enc
 1: 101   1 A.C. Meyers Vænge   1   UTF-8
 2: 101   2 A.C. Meyers Vænge   1 unkwown
 3: 101   4 A.C. Meyers Vænge   1   UTF-8
 4: 101   5 A.C. Meyers Vænge   1 unkwown
 5: 101   6 A.C. Meyers Vænge   1   UTF-8
 6: 101   7 A.C. Meyers Vænge   1 unkwown
 7: 101   9 A.C. Meyers Vænge   1   UTF-8
 8: 101  10 A.C. Meyers Vænge   1 unkwown
 9: 101  11 A.C. Meyers Vænge  10 unkwown
10: 101  12 A.C. Meyers Vænge  11 unkwown
11: 101  13 A.C. Meyers Vænge  11   UTF-8
12: 101  14 A.C. Meyers Vænge 11A unkwown
13: 101  15 A.C. Meyers Vænge 11A   UTF-8
14: 101  17 A.C. Meyers Vænge 11A unkwown
15: 101  19 A.C. Meyers Vænge 11A   UTF-8

Since A is keyed, I would expect unique() to consider only the key columns, as specified in the unique.data.table documentation. Rows 1 and 2, and rows 4 and 5, for example, share the same key values, so keeping both is clearly wrong. This appears to give the correct answer:

B <- A[.(101,'A.C. Meyers Vænge')] # warning about encoding
unique(B)
   kom               gad eje num     enc
1: 101 A.C. Meyers Vænge   9   1   UTF-8
2: 101 A.C. Meyers Vænge  11  10 unkwown
3: 101 A.C. Meyers Vænge  12  11 unkwown
4: 101 A.C. Meyers Vænge  14 11A unkwown

but only by chance, since B differs from A (which should not happen, because A contains only observations with kom==101 and gad=='A.C. Meyers Vænge'):

   kom               gad eje num     enc
1: 101 A.C. Meyers Vænge   9   1   UTF-8
2: 101 A.C. Meyers Vænge  10   1 unkwown
3: 101 A.C. Meyers Vænge  11  10 unkwown
4: 101 A.C. Meyers Vænge  12  11 unkwown
5: 101 A.C. Meyers Vænge  13  11   UTF-8
6: 101 A.C. Meyers Vænge  14 11A unkwown
7: 101 A.C. Meyers Vænge  15 11A   UTF-8
8: 101 A.C. Meyers Vænge  16 11A   UTF-8

What is happening here?

EDIT: code to generate data like A:

A <- data.table(
     kom = rep(101L,20),
     eje = 1L:20L,
     gad = rep("A.C. Meyers Vænge",20),
     num = rep(c('1','10','11','11A'),times=c(10,1,2,7)),
     enc = sample(c('unkwown','UTF-8'), 20, replace=TRUE)
)
Encoding(A$gad) <- A$enc

Upvotes: 0

Views: 100

Answers (2)

Arun

Reputation: 118889

With this recent commit, data.table now takes care of these mixed encodings implicitly by ensuring proper encodings while creating data.tables, as well as by ensuring proper encodings in functions like unique() and duplicated().

See news item (23) under bugs for v1.9.7 in README.md.

Please test and write back if you face any further issues.


Now I get this:

require(data.table) # v1.9.7, commit 2096+
set.seed(2L)
A <- data.table(
     kom = rep(101L,20),
     eje = 1L:20L,
     gad = rep("A.C. Meyers Vænge",20),
     num = rep(c('1','10','11','11A'),times=c(10,1,2,7)),
     enc = sample(c('unkwown','UTF-8'), 20, replace=TRUE)
)
Encoding(A$gad) <- A$enc
setkey(A,kom,gad,num)
unique(A)
#    kom eje               gad num     enc
# 1: 101   1 A.C. Meyers Vænge   1 unkwown
# 2: 101  11 A.C. Meyers Vænge  10   UTF-8
# 3: 101  12 A.C. Meyers Vænge  11 unkwown
# 4: 101  14 A.C. Meyers Vænge 11A unkwown
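
For versions that predate this fix, one possible workaround is to bring the key column to a single encoding before keying. The following is only a sketch under stated assumptions: it builds its own A-like table (no enc column) in which gad genuinely alternates between a UTF-8 and a latin1 representation of the same street name, and it passes by = key(A) explicitly so that the result does not depend on the default of unique() in the installed data.table version. The names utf8, latin1 and res are illustrative.

```r
library(data.table)

# Same text in two byte representations: UTF-8 and latin1
utf8   <- "A.C. Meyers V\u00e6nge"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")

# A-like table whose gad column alternates between the two encodings
A <- data.table(
  kom = rep(101L, 20),
  eje = 1L:20L,
  gad = rep(c(utf8, latin1), 10),
  num = rep(c("1", "10", "11", "11A"), times = c(10, 1, 2, 7))
)

# Normalise the key column to UTF-8 before setting the key
A[, gad := enc2utf8(gad)]
setkey(A, kom, gad, num)

# Deduplicate on the key columns only
res <- unique(A, by = key(A))
nrow(res)  # 4, one row per distinct value of num
```

enc2utf8() is used here rather than enc2native() because its result does not depend on the locale of the session.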

Upvotes: 1

Arthur

Reputation: 1228

As Arun noted, this is a mixed-encoding issue. Indeed, converting the column to a single encoding with enc2native() makes unique() work properly:

A$gad2 <- enc2native(A$gad)
setkey(A,kom,gad2,num)
unique(A)

But I never wanted mixed encodings (and never asked for them in my code). In fact, even with gad2 in native encoding, A[.(101,'A.C. Meyers Vænge')] still gives a warning! The reason is simply:

Encoding(c('a','æ'))
[1] "unknown" "UTF-8"

As a consequence, A[.(101,'A.C. Meyers Vænge')] raises a warning (and returns a wrong answer), since a UTF-8 string, A.C. Meyers Vænge, is compared to a native-encoded column, gad2. One could force the string to native encoding with A[.(101,enc2native('A.C. Meyers Vænge'))], but this seems rather contrived to me.

I really do not see the point of having mixed encodings by default!
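
To make the encoding marks discussed above concrete, here is a small base R sketch. It is an illustration rather than a description of what data.table does internally: it uses \u escapes so that the marks do not depend on the session locale, and a latin1 copy made with iconv() to stand in for "the same text stored differently". The variable names are hypothetical.

```r
# ASCII-only strings are never marked; \u escapes are marked as UTF-8
x <- c("a", "\u00e6")
Encoding(x)                     # "unknown" "UTF-8"

# The same street name stored as UTF-8 and as latin1
utf8   <- "A.C. Meyers V\u00e6nge"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
Encoding(c(utf8, latin1))       # "UTF-8" "latin1"

# Base R translates before comparing, so the strings compare equal ...
utf8 == latin1                  # TRUE
# ... although the underlying bytes differ (18 bytes versus 17)
nchar(c(utf8, latin1), type = "bytes")

# Normalising both sides with the same function removes the difference
normalized <- enc2utf8(c(utf8, latin1))
Encoding(normalized)            # "UTF-8" "UTF-8"
identical(charToRaw(normalized[1]), charToRaw(normalized[2]))  # TRUE
```

Code that compares bytes without translating first can therefore see two different values where base R sees one, which is why applying the same conversion to both the column and the lookup value is a reasonable precaution.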

Upvotes: 1
