Aaron Nebbs

Reputation: 566

How to get a substring of content containing emoji (in Python)

Take the string:

"📙 @train1 hello there"

I have the location of the @train1 in the string with an offset and length.

{
 offset: 3
 length: 7 
}

I try to get the substring from the original string with:

sub_str = msg[offset: offset + length]

However, the emoji appears to count as 2 chars, so in Python I am getting:

"train1 "

instead of

"@train1"

Is there a way to get sub-strings with multi-byte characters?

Upvotes: 4

Views: 590

Answers (3)

rzlvmp

Reputation: 9452

Okay, here is a slightly dirty way, but maybe it will help you find a better solution:

Let's suppose that we have

string = "📙 123"

Where
JavaScript output is: string[3] → 1
Python output is: string[3] → 2

Why does this happen?

Python treats the emoji as one character (one code point), while JavaScript treats it as two (a UTF-16 surrogate pair).

Let's see how this string looks in JavaScript in escaped form:

import json

print(json.dumps(string).strip('"'))

And the output will be:

# The raw string looks like '\\ud83d\\udcd9 123'. The \\ (an escaped \) means that \u is not a Unicode escape but an ordinary string starting with \u
\ud83d\udcd9 123

If you paste this line into a browser's console, you will get the emoji.

So if we replace each escape of the form \u1234 with X, for example, the string length will match JavaScript's count. Let's do it with a regex:

import json
import re

# Replace each \uXXXX escape (one UTF-16 code unit) with a single placeholder character
new_string = re.sub(r'\\u[0-9a-f]{4}', 'X', json.dumps(string).strip('"'))
print(new_string)

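And the output will be XX 123, and voilà, new_string[3] will be 1, the same as in JavaScript.

But be careful: this solution replaces every non-ASCII character with X, so only the ASCII characters can be read back this way. json.dumps also escapes quotes, backslashes and control characters as multi-character sequences, which shifts the count for strings that contain them.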


If you are able to edit the JavaScript side, I recommend using var chars = Array.from(string). That generates the correct sequence of characters: [ "📙", " ", "1", "2", "3" ]
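The JavaScript-style counting described above corresponds to UTF-16 code units. Assuming the offset and length really are UTF-16 code unit counts (an assumption to verify against the data source), a minimal sketch is to encode the string as UTF-16, slice the bytes at two bytes per unit, and decode again. The helper name utf16_slice is made up for this illustration.

```python
def utf16_slice(text: str, offset: int, length: int) -> str:
    """Slice text using an offset and length counted in UTF-16 code units."""
    # utf-16-le uses exactly 2 bytes per code unit and adds no byte order mark
    encoded = text.encode("utf-16-le")
    start = offset * 2
    end = (offset + length) * 2
    # Decoding raises UnicodeDecodeError if the range cuts a surrogate pair in half
    return encoded[start:end].decode("utf-16-le")


msg = "\U0001F4D9 @train1 hello there"  # U+1F4D9 is the orange book emoji
print(utf16_slice(msg, 3, 7))
```

With the offset 3 and length 7 from the question this gives the mention itself, and the emoji never has to be special-cased.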

Upvotes: 0

cuzi

Reputation: 1097

If your data is coming from a program that counts Unicode graphemes, you could use the third-party regex library to split the string into graphemes, which are matched by \X, and then use your offsets on the resulting list of graphemes:

import regex

msg = "📙 @train1 hello there"
graphemes = regex.findall(r'\X', msg)
print(graphemes)
# ['📙', ' ', '@', 't', 'r', 'a', 'i', 'n', '1', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']

msg = "👩‍👩‍👧‍👧 @train1 hello there"
graphemes = regex.findall(r'\X', msg)
print(graphemes)
# ['👩‍👩‍👧‍👧', ' ', '@', 't', 'r', 'a', 'i', 'n', '1', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']

Upvotes: 0

Talon

Reputation: 1894

In its string form, 📙 will be one character.

In : msg[0]
Out: '📙'
In : msg[1]
Out: ' '
In : msg[2]
Out: '@'
In : msg[3]
Out: 't'
In : msg[3:3+7]
Out: 'train1 '

You might have an off-by-one error in your slicing then, as your token starts with the caret at index 2, between the space and the @. If your offset and length data are static, you might want to subtract 1 from the offset.


Some discussion after a comment:

It seems like you get the indexes from another source and the message does not necessarily contain exactly one emoji. In that case this can be very non-trivial, considering there are multi-character emojis when modifiers are active (e.g. 👩‍👩‍👧‍👧, which is 7 code points and 25 bytes in UTF-8), symbols that use non-ASCII characters, etc. And then it depends again on how your data source interprets those.
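To see how far the counts can diverge for a single visible symbol, here is a small sketch that measures the family emoji mentioned above in three different units. Which of these units the data source actually uses is something to check against that source.

```python
# Woman, ZWJ, woman, ZWJ, girl, ZWJ, girl: rendered as one family emoji
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"

code_points = len(family)                           # what Python's len() and slicing count
utf16_units = len(family.encode("utf-16-le")) // 2  # what JavaScript's string.length counts
utf8_bytes = len(family.encode("utf-8"))            # size on the wire in UTF-8

print(code_points, utf16_units, utf8_bytes)  # 7 11 25
```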

You could get a list of emojis (e.g. from the emoji module), look up whether the characters in your message are emojis and, if so, duplicate them so your indexes fit. This will, however, cause trouble if that emoji is in the part you want to slice out.
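The duplication idea can be sketched without an emoji list, under the assumption that the offsets count UTF-16 code units: in that case every character above U+FFFF is the one that needs doubling. This swaps the emoji lookup for a code point check, and the helper name pad_astral is made up for the example.

```python
def pad_astral(text: str) -> str:
    """Double every code point above U+FFFF so indexes match UTF-16 counting."""
    return "".join(ch * 2 if ord(ch) > 0xFFFF else ch for ch in text)


msg = "\U0001F4D9 @train1 hello there"  # U+1F4D9 is the orange book emoji
padded = pad_astral(msg)

# The original offset and length now line up with the mention
print(padded[3:3 + 7])
# A slice that covers the emoji contains it twice, which is the trouble noted above
print(repr(padded[0:3]))
```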

On the other hand, if it's the token @train1 you want, and in other messages you want other tokens like @token, you could discard the offset information and just look for words that start with @.
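A minimal sketch of that last idea, assuming a token is an @ followed by letters, digits or underscores (the real token rules may differ, and an email address in the message would also produce a match). The function name find_mentions is made up for the example.

```python
import re


def find_mentions(text: str) -> list:
    """Return every @token in text, ignoring any offset information."""
    # \w+ matches one or more letters, digits or underscores after the @
    return re.findall(r"@\w+", text)


msg = "\U0001F4D9 @train1 hello there"  # U+1F4D9 is the orange book emoji
print(find_mentions(msg))  # ['@train1']
```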

Upvotes: 1
