Reputation: 1228
I have this data.table, called A:
kom eje gad num enc
1: 101 1 A.C. Meyers Vænge 1 UTF-8
2: 101 2 A.C. Meyers Vænge 1 unkwown
3: 101 3 A.C. Meyers Vænge 1 unkwown
4: 101 4 A.C. Meyers Vænge 1 UTF-8
5: 101 5 A.C. Meyers Vænge 1 unkwown
6: 101 6 A.C. Meyers Vænge 1 UTF-8
7: 101 7 A.C. Meyers Vænge 1 unkwown
8: 101 8 A.C. Meyers Vænge 1 unkwown
9: 101 9 A.C. Meyers Vænge 1 UTF-8
10: 101 10 A.C. Meyers Vænge 1 unkwown
11: 101 11 A.C. Meyers Vænge 10 unkwown
12: 101 12 A.C. Meyers Vænge 11 unkwown
13: 101 13 A.C. Meyers Vænge 11 UTF-8
14: 101 14 A.C. Meyers Vænge 11A unkwown
15: 101 15 A.C. Meyers Vænge 11A UTF-8
16: 101 16 A.C. Meyers Vænge 11A UTF-8
17: 101 17 A.C. Meyers Vænge 11A unkwown
18: 101 18 A.C. Meyers Vænge 11A unkwown
19: 101 19 A.C. Meyers Vænge 11A UTF-8
20: 101 20 A.C. Meyers Vænge 11A UTF-8
A is keyed by kom, gad and num.
setkey(A,kom,gad,num)
However, unique(A) returns the wrong result (and without any warning):
kom eje gad num enc
1: 101 1 A.C. Meyers Vænge 1 UTF-8
2: 101 2 A.C. Meyers Vænge 1 unkwown
3: 101 4 A.C. Meyers Vænge 1 UTF-8
4: 101 5 A.C. Meyers Vænge 1 unkwown
5: 101 6 A.C. Meyers Vænge 1 UTF-8
6: 101 7 A.C. Meyers Vænge 1 unkwown
7: 101 9 A.C. Meyers Vænge 1 UTF-8
8: 101 10 A.C. Meyers Vænge 1 unkwown
9: 101 11 A.C. Meyers Vænge 10 unkwown
10: 101 12 A.C. Meyers Vænge 11 unkwown
11: 101 13 A.C. Meyers Vænge 11 UTF-8
12: 101 14 A.C. Meyers Vænge 11A unkwown
13: 101 15 A.C. Meyers Vænge 11A UTF-8
14: 101 17 A.C. Meyers Vænge 11A unkwown
15: 101 19 A.C. Meyers Vænge 11A UTF-8
Since A is keyed, I would expect unique to consider only the key columns, as specified in the unique.data.table documentation. Clearly, rows 1 and 2, and rows 4 and 5, are errors. This appears to give the correct answer:
B <- A[.(101,'A.C. Meyers Vænge')] # warning about encoding
unique(B)
kom gad eje num enc
1: 101 A.C. Meyers Vænge 9 1 UTF-8
2: 101 A.C. Meyers Vænge 11 10 unkwown
3: 101 A.C. Meyers Vænge 12 11 unkwown
4: 101 A.C. Meyers Vænge 14 11A unkwown
but this is only by chance, since B differs from A (which should not be the case, since A contains only observations with kom==101 and gad=='A.C. Meyers Vænge'):
kom gad eje num enc
1: 101 A.C. Meyers Vænge 9 1 UTF-8
2: 101 A.C. Meyers Vænge 10 1 unkwown
3: 101 A.C. Meyers Vænge 11 10 unkwown
4: 101 A.C. Meyers Vænge 12 11 unkwown
5: 101 A.C. Meyers Vænge 13 11 UTF-8
6: 101 A.C. Meyers Vænge 14 11A unkwown
7: 101 A.C. Meyers Vænge 15 11A UTF-8
8: 101 A.C. Meyers Vænge 16 11A UTF-8
What is happening here?
EDIT: code to generate A-like data:
A <- data.table(
kom = rep(101L,20),
eje = 1L:20L,
gad = rep("A.C. Meyers Vænge",20),
num = rep(c('1','10','11','11A'),times=c(10,1,2,7)),
enc = sample(c('unkwown','UTF-8'), 20, replace=TRUE)
)
Encoding(A$gad) <- A$enc
Upvotes: 0
Reputation: 118889
With this recent commit, data.table now takes care of these mixed encodings implicitly by ensuring proper encodings while creating data.tables, as well as by ensuring proper encodings in functions like unique() and duplicated().
See news item (23) under bugs for v1.9.7 in README.md.
Please test and write back if you face any further issues.
Now I get this:
require(data.table) # v1.9.7, commit 2096+
set.seed(2L)
A <- data.table(
kom = rep(101L,20),
eje = 1L:20L,
gad = rep("A.C. Meyers Vænge",20),
num = rep(c('1','10','11','11A'),times=c(10,1,2,7)),
enc = sample(c('unkwown','UTF-8'), 20, replace=TRUE)
)
Encoding(A$gad) <- A$enc
setkey(A,kom,gad,num)
> unique(A)
# kom eje gad num enc
# 1: 101 1 A.C. Meyers Vænge 1 unkwown
# 2: 101 11 A.C. Meyers Vænge 10 UTF-8
# 3: 101 12 A.C. Meyers Vænge 11 unkwown
# 4: 101 14 A.C. Meyers Vænge 11A unkwown
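For readers who want the key-based deduplication to be explicit, here is a minimal sketch that passes the key columns through the `by` argument of unique(). It assumes a uniformly encoded gad column (built with a `\u00e6` escape, which is my own simplification and not part of the original data), and the name U is only for illustration. Spelling out `by` avoids depending on the default, which I believe has not stayed the same across data.table versions.

```r
library(data.table)

# Same shape as the example data, but with one uniformly encoded street name.
# The \u00e6 escape produces a UTF-8-marked string regardless of the locale.
A <- data.table(
  kom = rep(101L, 20),
  eje = 1:20,
  gad = rep("A.C. Meyers V\u00e6nge", 20),
  num = rep(c("1", "10", "11", "11A"), times = c(10, 1, 2, 7))
)
setkey(A, kom, gad, num)

# Deduplicate on the key columns explicitly instead of relying on the default
U <- unique(A, by = key(A))
nrow(U)
```

With a single encoding in gad this should leave one row per distinct num value, that is, four rows.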
Upvotes: 1
Reputation: 1228
As Arun noted, this is a mixed-encoding issue. Indeed, converting the column to a single encoding with enc2native() makes unique() work properly:
A$gad2 <- enc2native(A$gad)
setkey(A,kom,gad2,num)
unique(A)
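The effect of normalizing a mixed-encoding vector can be seen in base R alone. This is only a sketch: it uses enc2utf8() rather than enc2native() so that the result does not depend on the session locale, and it builds the mixed vector from a UTF-8 and a latin1 copy of the same word (the variable names are made up for the illustration).

```r
# The same word stored under two different encoding marks
s_utf8   <- "V\u00e6nge"                       # marked UTF-8 by the \u escape
s_latin1 <- iconv(s_utf8, "UTF-8", "latin1")   # same characters, latin1 bytes

mixed <- c(s_utf8, s_latin1)
Encoding(mixed)        # one mark per element: "UTF-8" and "latin1"

# enc2utf8() converts every element to UTF-8, giving a single encoding
normalized <- enc2utf8(mixed)
Encoding(normalized)   # both elements are now marked "UTF-8"
```

After the conversion both elements hold the same bytes and the same mark, which is the property a keyed column needs.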
BUT I never wanted mixed encodings (and never specified such a thing in my code). Actually, even with native encoding on gad2, A[.(101,'A.C. Meyers Vænge')] still gives a warning! The reason is simply:
Encoding(c('a','æ'))
[1] "unknown" "UTF-8"
As a consequence, A[.(101,'A.C. Meyers Vænge')] raises a warning (and gives a wrong answer), since a UTF-8 string, A.C. Meyers Vænge, is compared to a native-encoded column, gad2. One might force the string to native encoding with A[.(101,enc2native('A.C. Meyers Vænge'))] but this seems pretty far-fetched to me.
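As far as I understand, the mixed marks are not really a default one can switch off: R never attaches an encoding mark to ASCII-only strings, so any vector that mixes ASCII and non-ASCII values reports "unknown" next to "UTF-8" (or "latin1"). A small base R sketch, with variable names invented for the illustration:

```r
# ASCII-only strings never carry an encoding mark, even when one is requested
x <- "abc"
Encoding(x) <- "UTF-8"
Encoding(x)                 # still "unknown"

# A non-ASCII string written with a \u escape is marked UTF-8
s <- "V\u00e6nge"
Encoding(s)                 # "UTF-8"

# So a vector holding both reports two different marks
Encoding(c(x, s))
```

The mark only matters for the non-ASCII elements, which is why a column like gad, where every value contains the same non-ASCII character, should end up with a single mark once it is normalized.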
I really do not get the point of this mixed encoding as a default!
Upvotes: 1