Reputation: 1840
I ran the following program:
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
System.out.println(t.getLength());
System.out.println(t.find("\u0041"));
System.out.println(t.find("\u00DF"));
System.out.println(t.find("\u6771"));
System.out.println(t.find("\uD801"));
System.out.println(t.find("\uD801\uDC00"));
Output
10
0
1
3
-1
6
From my understanding, find returns the byte offset in the Text.
0041 -> 01000001, 00DF -> 11011111, 6771 -> 0110011101110001
I am not able to understand the output. Also, why is t.find("\uD801") equal to -1?
Upvotes: 1
Views: 55
Reputation: 6343
This example is explained in the book Hadoop: The Definitive Guide.
The Text class stores data using UTF-8 encoding. Because of that, indexing inside a Text is based on the byte offset into the UTF-8 encoded data (unlike a Java String, where the index counts UTF-16 char code units, not bytes).
For more on the difference between Text and String in Hadoop, see this answer: Difference between Text and String in Hadoop
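To make the three ways of counting concrete, here is a minimal sketch that uses only the JDK (no Hadoop dependency). The class and method names are made up for illustration. It assumes, as described above, that the length Text reports corresponds to the number of UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class TextVsStringLength {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println(utf16Units(s)); // 5
        System.out.println(codePoints(s)); // 4
        System.out.println(utf8Bytes(s));  // 10
    }
}
```

The last number matches the 10 printed by t.getLength() in the question, while a plain String would report 5.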
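The text "\u0041\u00DF\u6771\uD801\uDC00" is interpreted as follows, with the UTF-8 bytes of each character:
\u0041 (LATIN CAPITAL LETTER A): 41 (1 byte)
\u00DF (LATIN SMALL LETTER SHARP S): c3 9f (2 bytes)
\u6771 (a CJK ideograph): e6 9d b1 (3 bytes)
\uD801\uDC00 (DESERET CAPITAL LETTER LONG I): f0 90 90 80 (4 bytes)
These are the byte offsets when it is stored in Text (which is UTF-8 encoded):
"\u0041" is at offset 0.
"\u00DF" is at offset 1, since the previous character occupies 1 byte (41).
"\u6771" is at offset 3, since the previous two characters occupy 3 bytes (41, c3 9f).
"\uD801\uDC00" is at offset 6, since the previous three characters occupy 6 bytes (41, c3 9f, e6 9d b1).
Finally, the last character (DESERET CAPITAL LETTER LONG I) occupies 4 bytes (f0 90 90 80).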
So the total length is 1 + 2 + 3 + 4 = 10.
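The arithmetic above can be reproduced with a small JDK-only sketch. It walks the string one code point at a time and records the UTF-8 byte offset at which each one starts. The class and helper names (Utf8Offsets, startOffsets, utf8, hex) are hypothetical and not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Offsets {
    // UTF-8 bytes of a single code point.
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // Lowercase hex dump with spaces between bytes, e.g. "c3 9f".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Byte offset at which each code point of s starts in its UTF-8 encoding.
    static int[] startOffsets(String s) {
        int[] offsets = new int[s.codePointCount(0, s.length())];
        int bytePos = 0;
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            offsets[n++] = bytePos;
            bytePos += utf8(cp).length;    // advance by 1 to 4 bytes
            i += Character.charCount(cp);  // advance by 1 or 2 chars
        }
        return offsets;
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        int[] offsets = startOffsets(s);
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println(String.format("U+%04X offset %d bytes %s",
                    cp, offsets[n++], hex(utf8(cp))));
            i += Character.charCount(cp);
        }
    }
}
```

The offsets come out as 0, 1, 3 and 6, which are the values the question's find calls printed for the four characters.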
When you call t.find("\uD801"), you get -1. A lone \uD801 is an unpaired high surrogate, not a complete character, so it has no valid UTF-8 encoding and nothing in the Text's bytes corresponds to it.
"\uD801\uDC00" is a surrogate pair that represents a single character (DESERET CAPITAL LETTER LONG I). Hence, when you query for the offset of "\uD801\uDC00", you get the expected answer of 6.
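The same behaviour can be imitated with the JDK alone. The sketch below is not Hadoop's implementation of find. The helper findBytes is a hypothetical stand-in that encodes both strings as UTF-8 and searches the bytes, and it assumes that a search string which cannot be encoded should simply report -1.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateSearch {
    // True if s can be encoded as UTF-8; an unpaired surrogate cannot.
    static boolean encodable(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Naive byte search: first index of needle in haystack, or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) return i;
        }
        return -1;
    }

    // Hypothetical stand-in for a byte-offset search over UTF-8 data.
    static int findBytes(String text, String what) {
        if (!encodable(what)) return -1; // assumption: unencodable input never matches
        return indexOf(text.getBytes(StandardCharsets.UTF_8),
                       what.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String text = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println(encodable("\uD801"));               // false
        System.out.println(findBytes(text, "\uD801"));         // -1
        System.out.println(findBytes(text, "\uD801\uDC00"));   // 6
    }
}
```

The full surrogate pair encodes to f0 90 90 80, which starts at byte 6, while the lone high surrogate has no UTF-8 bytes to search for.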
Upvotes: 1