Pankaj Garg

Reputation: 1003

utf-8 string indices in python not compatible in java

I have a text file with the following content:

 🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]

I have Python code running on the server that finds the indices, which I want to pass along with the text for highlighting on the client. The code is as follows:

import re
f = open('data.json', 'r')
text = f.readline().strip().decode('UTF-8').encode('UTF-8')
f.close()

for m in re.finditer(r'emailaddress', text, flags=re.IGNORECASE): 
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]

The output is:

123 135
emailaddress

On the client side I have Java code (on Android). However, these indices don't work at all.

public class HelloWorld {
    public static void main(String[] args) {
        String text = "🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]";
        System.out.println(text.substring(115));
    }
}

And the output is:

l.com

I am sure I am making some mistake in the encoding of the strings. Can someone help me with that?

Upvotes: 0

Views: 613

Answers (1)

Martijn Pieters

Reputation: 1122142

The Python side works with UTF-8 encoded bytes (where the number of bytes per character varies), while the Java code works with UTF-16 codeunits*. Indices into one do not map onto the other.

You can see the issue when applying the index to your sample string, both as a Unicode string and encoded to UTF-8, in a Python 2.7 UCS-2 build (which uses UTF-16 surrogate pairs like Java does):

>>> u"🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]"[115:]
u'l.com'
>>> u"🔴🔴🔴🔴🔴\n==================\0No. 4♨ ==\n📌 \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\[email protected]".encode('utf8')[115:]
'\[email protected]'

UTF-8 encodes each Unicode codepoint as between 1 and 4 codeunits (bytes); how many are used depends on the text:

>>> len(u'abc'.encode('utf8'))
3
>>> len(u'åßç'.encode('utf8'))
6

When text is instead decoded to an internal UTF-16 representation (as Java does, and as Python 2.7 does in a narrow UCS-2 build), most characters use just one codeunit, while characters outside the BMP (such as emoji) use 2:

>>> u"🔴📌✅"
u'\U0001f534\U0001f4cc\u2705'
>>> len(u"🔴📌✅")
5
>>> u"🔴📌✅".encode('utf8')
'\xf0\x9f\x94\xb4\xf0\x9f\x93\x8c\xe2\x9c\x85'
>>> len(u"🔴📌✅".encode('utf8'))
11

Either run your regex on a Unicode value in Python (e.g. decode from UTF-8) or alter the Java code to operate on UTF-8 bytes rather than UTF-16 codeunits.
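As a minimal sketch of the first option, assuming Python 3 (where a str is indexed by codepoint rather than by byte), the match positions can be converted to UTF-16 codeunit offsets before they are sent to the Java client. The helper names utf16_offset and utf16_spans and the sample string are made up for illustration:

```python
import re


def utf16_offset(text, index):
    """Return the number of UTF-16 codeunits needed for text[:index]."""
    # 'utf-16-le' writes no byte-order mark, and every codeunit is 2 bytes.
    return len(text[:index].encode('utf-16-le')) // 2


def utf16_spans(pattern, text):
    """Yield (start, end) of each match, measured in UTF-16 codeunits."""
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        yield utf16_offset(text, m.start()), utf16_offset(text, m.end())


# Hypothetical sample: two emoji outside the BMP, one BMP symbol, then the target.
sample = "🔴🔴 ✅ emailaddress"
print(list(utf16_spans(r'emailaddress', sample)))
```

Re-encoding the prefix for every match is linear in the length of the text, which should be acceptable for short messages; for long documents a single pass that keeps a running codeunit count would be the more economical design.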

If you are using Unicode in Python, do take into account that you can also build the Python binary using UCS-4 for Unicode codepoints; you'd never see surrogates and the length of the string in Python will differ from that of the Java representation. Python 3.3 and up use a flexible storage where the internal representation will never use surrogates but instead scales to meet the requirements for each individual string.

In that case you may need to use JSR-204 methods to access codepoints on the Java side; I suspect that String.offsetByCodePoints() would be helpful here but I am not a Java developer.
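If the offsets have already been computed against UTF-8 bytes, as in the question, a sketch along these lines (again assuming Python 3; byte_to_codepoint_offset is a hypothetical helper name) turns a byte offset into a codepoint offset. A codepoint count is the unit that String.offsetByCodePoints(int index, int codePointOffset) takes on the Java side, for example text.offsetByCodePoints(0, n):

```python
def byte_to_codepoint_offset(data, byte_offset):
    """Map an offset into UTF-8 bytes to an offset in codepoints."""
    # Decoding the prefix counts the codepoints that precede the offset.
    # A UnicodeDecodeError is raised if byte_offset falls inside a
    # multi-byte sequence, which signals an invalid offset.
    return len(data[:byte_offset].decode('utf-8'))


# Hypothetical sample: 4 + 4 + 3 bytes of symbols, one space, then the target.
data = "🔴📌✅ emailaddress".encode('utf-8')
start = data.index(b'emailaddress')
print(start, byte_to_codepoint_offset(data, start))
```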

You may want to brush up on Unicode and codecs.


* Java's String type uses UTF-16 words, which are 2 bytes per codeunit. For characters outside the BMP, that means two codeunits are used per character using surrogate pairs.

Upvotes: 3
