Reputation: 1003
I have a text file with the following content:
🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]
I have Python code running on the server that finds the indices I want to send along with the text, so that the client can highlight the matches. This is the code:
import re
f = open('data.json', 'r')
text = f.readline().strip().decode('UTF-8').encode('UTF-8')
f.close()
for m in re.finditer(r'emailaddress', text, flags=re.IGNORECASE):
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]
The output is:
123 135
emailaddress
Now on the client side, I have the Java code (on Android). However, these indices don't work at all.
public class HelloWorld {
    public static void main(String[] args) {
        String text = "🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]";
        System.out.println(text.substring(115));
    }
}
And the output is:
l.com
I am sure I am making some mistake in the encoding of the strings. Can someone help me with that?
Upvotes: 0
Views: 613
Reputation: 1122142
The Python side indexes into UTF-8 encoded bytes (where each character takes a variable number of bytes), while the Java code indexes into UTF-16 codeunits*. Indices into one do not map onto the other.
You can see the issue when applying the index to your sample string, both as Unicode string and encoded to UTF-8, in a Python 2.7 UCS-2 build (which uses UTF-16 surrogate pairs like Java does):
>>> u"🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]"[115:]
u'l.com'
>>> u"🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]".encode('utf8')[115:]
'\[email protected]'
UTF-8 encodes each Unicode codepoint to between 1 and 4 codeunits (bytes); how many codeunits are used depends on the text:
>>> len(u'abc'.encode('utf8'))
3
>>> len(u'åßç'.encode('utf8'))
6
When text is instead decoded to an internal UTF-16 representation (as Java does, and as Python 2.7 does in the default narrow UCS-2 build), most characters take a single codeunit, while characters outside the BMP (such as emoji) take 2:
>>> u"🔴📌✅"
u'\U0001f534\U0001f4cc\u2705'
>>> len(u"🔴📌✅")
5
>>> u"🔴📌✅".encode('utf8')
'\xf0\x9f\x94\xb4\xf0\x9f\x93\x8c\xe2\x9c\x85'
>>> len(u"🔴📌✅".encode('utf8'))
11
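To make the three ways of counting concrete, here is a minimal sketch that reports the length of a string in codepoints, UTF-16 codeunits and UTF-8 bytes. It assumes Python 3 (where `len()` on a `str` always counts codepoints), not the Python 2.7 narrow build used in the sessions above, and `unit_counts` is a hypothetical helper name.

```python
def unit_counts(text):
    """Return the length of text measured in three different units."""
    return {
        # Python 3 str length is always a count of codepoints.
        "code_points": len(text),
        # UTF-16-LE uses 2 bytes per codeunit and writes no BOM.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 uses 1 to 4 bytes per codepoint.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Same three characters as in the session above: two emoji outside the BMP
# and one symbol inside it.
counts = unit_counts("\U0001F534\U0001F4CC\u2705")
print(counts)
```

The UTF-16 count is the one that matches Java's `String.length()`, and the UTF-8 count is the one the byte-string regex in the question was producing offsets in.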
Either run your regex on a Unicode value in Python (e.g. decode from UTF-8) or alter the Java code to operate on UTF-8 bytes rather than UTF-16 codeunits.
If you are using Unicode in Python, do take into account that you can also build the Python binary using UCS-4 for Unicode codepoints; you'd never see surrogates and the length of the string in Python will differ from that of the Java representation. Python 3.3 and up use a flexible storage where the internal representation will never use surrogates but instead scales to meet the requirements for each individual string.
In that case you may need to use JSR-204 methods to access codepoints on the Java side; I suspect that String.offsetByCodePoints() would be helpful here but I am not a Java developer.
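An alternative to converting on the Java side is to have the server send offsets that are already in UTF-16 codeunits. The following is a rough sketch of that idea, assuming Python 3 and text that decodes cleanly from UTF-8; the helper names (`utf16_offset`, `find_utf16_spans`, `utf8_to_utf16_offset`) and the sample string are made up for illustration and are not from the question.

```python
import re


def utf16_offset(text, index):
    """Count the UTF-16 codeunits in text[:index] (index is in codepoints)."""
    return len(text[:index].encode("utf-16-le")) // 2


def find_utf16_spans(pattern, text):
    """Return (start, end) match positions measured in UTF-16 codeunits."""
    return [
        (utf16_offset(text, m.start()), utf16_offset(text, m.end()))
        for m in re.finditer(pattern, text, flags=re.IGNORECASE)
    ]


def utf8_to_utf16_offset(data, byte_index):
    """Convert an offset into UTF-8 bytes to an offset in UTF-16 codeunits.

    byte_index must fall on a character boundary, otherwise decoding
    the prefix raises UnicodeDecodeError.
    """
    prefix = data[:byte_index].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


# Hypothetical sample: two emoji outside the BMP followed by ASCII text.
sample = "\U0001F534\U0001F534 hello emailaddress"
spans = find_utf16_spans(r"emailaddress", sample)
print(spans)

# The same position, starting from a byte offset into the UTF-8 data.
data = sample.encode("utf-8")
print(utf8_to_utf16_offset(data, data.index(b"emailaddress")))
```

Offsets produced this way should be usable directly with `String.substring()` on the client, since both sides then count UTF-16 codeunits. Re-encoding a prefix for every match is quadratic in the worst case, which is probably fine for short messages but worth revisiting for large texts.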
You may want to brush up on Unicode and codecs; I recommend you read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
* Java's String type uses UTF-16 words, which are 2 bytes per codeunit. For characters outside the BMP, that means two codeunits are used per character using surrogate pairs.
Upvotes: 3