Reputation: 566
Take the string:
"📙 @train1 hello there"
I have the location of @train1 in the string as an offset and a length.
{
offset: 3
length: 7
}
I try to get the substring from the original string with:
sub_str = msg[offset: offset + length]
However, the emoji seems to be counted as 2 characters by the offset but as 1 by Python, so I am getting:
"train1 "
instead of
"@train1"
Is there a way to get sub-strings with multi-byte characters?
Upvotes: 4
Views: 590
Reputation: 9452
Okay, here is a slightly dirty way, but maybe it will help you find a better solution.
Let's suppose that we have
string = "📙 123"
Then:
JavaScript: string[3] → 1
Python: string[3] → 2
Why does this happen?
Python 3 treats the emoji as one character (one code point), while JavaScript treats it as two (a UTF-16 surrogate pair).
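To make the difference concrete, here is a small sketch (not from the original answer) that compares the two ways of counting. The emoji is written as an escape sequence, and utf16_len is a hypothetical helper name chosen for illustration.

```python
# U+1F4D9 written as an escape so the source stays ASCII-only.
string = "\U0001F4D9 123"

def utf16_len(s):
    """Count UTF-16 code units, which is what JavaScript's .length reports."""
    return len(s.encode("utf-16-le")) // 2

print(len(string))           # code points, as Python 3 counts them
print(utf16_len(string))     # UTF-16 code units, as JavaScript counts them
print(utf16_len(string[0]))  # the emoji alone takes two code units
```

The one extra code unit taken by the emoji is exactly the shift seen in string[3].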
Let's see how this string looks to JavaScript, in escaped form:
import json
print(json.dumps(string).strip('"'))
And output will be:
# As a Python string literal this is '\\ud83d\\udcd9 123'. The \\ (an escaped \) means that \u is not a Unicode escape here, just ordinary text starting with \u
\ud83d\udcd9 123
If you paste this line into a browser's console, you will get the emoji.
So if we replace each \uXXXX escape with X, for example, the string length will match what JavaScript counts.
Let's do it with a regex:
import json
import re
# Replace every \uXXXX escape (exactly four hex digits) with a single X
new_string = re.sub(r'\\u[0-9a-f]{4}', 'X', json.dumps(string).strip('"'))
print(new_string)
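And the output will be XX 123, and voilà: new_string[3] will be 1, same as in JavaScript.
But be careful: this solution replaces every non-ASCII character with X, so only the ASCII characters can be read back this way. json.dumps also escapes quotes, backslashes and control characters, which would shift the positions in strings that contain them.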
If you are able to edit the JavaScript side, I recommend using var chars = Array.from(string). That generates the correct sequence of characters: [ "📙", " ", "1", "2", "3" ]
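If the offsets really are UTF-16 code units (as the JavaScript comparison suggests, though the question does not say where they come from), a less lossy alternative to the X trick is to slice the UTF-16 encoding directly. This is only a sketch under that assumption, and slice_utf16 is a made-up helper name:

```python
def slice_utf16(text, offset, length):
    """Slice text using an offset and length measured in UTF-16 code units."""
    data = text.encode("utf-16-le")  # 2 bytes per code unit, no BOM
    chunk = data[offset * 2:(offset + length) * 2]
    return chunk.decode("utf-16-le")

# Assumed message from the question, with the emoji written as an escape.
msg = "\U0001F4D9 @train1 hello there"
print(slice_utf16(msg, 3, 7))
```

Unlike the regex approach, this keeps the non-ASCII characters intact. If a boundary falls in the middle of a surrogate pair, the decode step raises UnicodeDecodeError, which is probably what you want for bad offsets.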
Upvotes: 0
Reputation: 1097
If your data is coming from a program that counts Unicode graphemes, you could use the regex library (a third-party module, not the built-in re) to split the string into graphemes, which are matched by \X, and then use your offsets on the resulting list of graphemes:
import regex
msg = "📙 @train1 hello there"
graphemes = regex.findall(r'\X', msg)
print(graphemes)
# ['📙', ' ', '@', 't', 'r', 'a', 'i', 'n', '1', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']
msg = "👩‍👩‍👧‍👧 @train1 hello there"
graphemes = regex.findall(r'\X', msg)
print(graphemes)
# ['👩‍👩‍👧‍👧', ' ', '@', 't', 'r', 'a', 'i', 'n', '1', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']
Upvotes: 0
Reputation: 1894
In its string form, 📙 will be one character.
In : msg[0]
Out: '📙'
In : msg[1]
Out: ' '
In : msg[2]
Out: '@'
In : msg[3]
Out: 't'
In : msg[3:3+7]
Out: 'train1 '
You might have an off-by-one error in your slicing then, as your token starts with the caret at index 2, between the space and the @. If your offset and length data are static, you might want to subtract 1 from the offset.
Some discussion after a comment:
It seems like you get the indexes from another source and the message does not necessarily contain exactly one emoji. In that case this can be very non-trivial, considering there are multi-character emojis when modifiers are active (e.g. 👩‍👩‍👧‍👧, which is 7 code points and 25 bytes in UTF-8), symbols that use non-ASCII characters, etc. And then it depends again on how your data source interprets those.
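For what it's worth, those numbers are easy to check. A quick sketch, with the family emoji spelled out as escapes (four emoji joined by three zero-width joiners, U+200D):

```python
# woman + ZWJ + woman + ZWJ + girl + ZWJ + girl
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"

print(len(family))                           # code points
print(len(family.encode("utf-8")))           # bytes in UTF-8
print(len(family.encode("utf-16-le")) // 2)  # UTF-16 code units
```

This gives three different lengths for what is displayed as a single symbol, so the right offset depends entirely on which unit the data source counts in.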
You could get a list of emojis (e.g. from the emoji
module), look up whether characters in your message are emojis and, if so, duplicate them so your indexes fit. This will, however, cause trouble if such an emoji is in the part you want to slice out.
On the other hand, if it's the token @train1 you want, and in other messages you want other tokens like @token, you could discard the offset information and just look for words that start with @.
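A minimal sketch of that last idea, using the built-in re module. The pattern is an assumption about what a token looks like (an @ followed by letters, digits or underscores), so adjust it to your data:

```python
import re

# Assumed message from the question, with the emoji written as an escape.
msg = "\U0001F4D9 @train1 hello there"

# Find every @-prefixed word, regardless of what precedes it.
mentions = re.findall(r"@\w+", msg)
print(mentions)
```

Note that this pattern would also pick up the part after the @ in an e-mail address, so it may need tightening if your messages can contain those.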
Upvotes: 1