Reputation: 966

Split word containing unicode character

I am working on an NLP project involving emojis in tweets.

An example of the tweets is given here:
"sometimes i wish i wa an octopus so i could slap 8 people at once🐙"

My problem is that once🐙 is treated as a single word, so I would like to split it in two so that my tweet looks like this:
"sometimes i wish i wa an octopus so i could slap 8 people at once 🐙"

Note that I already have a compiled regexp that matches every emoji!

I am looking for an efficient way of doing this, since I have hundreds of thousands of tweets, but I can't figure out where to start.

Thank you

Upvotes: 3

Answers (2)

heemayl

Reputation: 42017

You can use re.sub to introduce a space:

re.sub(r'(\W+)(?= |$)', r' \1', string)

Example:

>>> string
'sometimes i wish i wa an octopus so i could slap 8 people at once\xf0\x9f\x90\x99'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99'

>>> string = 'sometimes i wish i wa an octopus so i could slap 8 people at once🐙 foobar'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99 foobar'
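The transcript above shows byte-string output (Python 2 style). As a minimal sketch, assuming Python 3 and `str` input, the same pattern can be precompiled once and reused across many tweets. The names `TRAILING_NON_WORD` and `split_trailing_symbols` are made up for illustration. Note that `\W` also matches ordinary punctuation, so a comma at the end of a word gets split off as well, and an emoji sitting in the middle of a word (not followed by a space or the end of the string) is left alone.

```python
import re

# Same pattern as in the answer: a run of non-word characters that is
# followed by a space or the end of the string gets a space put before it.
# The names below are illustrative, not part of the original answer.
TRAILING_NON_WORD = re.compile(r'(\W+)(?= |$)')

def split_trailing_symbols(text):
    """Insert a space before trailing non-word runs such as emojis."""
    return TRAILING_NON_WORD.sub(r' \1', text)

tweet = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
print(split_trailing_symbols(tweet))
```

If punctuation should stay attached to its word, `\W` would probably need to be swapped for a narrower emoji-only pattern.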

Upvotes: 1

L3viathan

Reputation: 27283

Can't you just do something like this?

>>> import re
>>> s = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'i', 'wish', 'i', 'wa', 'an', 'octopus', 'so', 'i', 'could', 'slap', '8', 'people', 'at', 'once', '🐙']

If you need them as a single space-delimited string again, just join them:

>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes i wish i wa an octopus so i could slap 8 people at once 🐙'

edit: fixed.
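Since the question mentions an existing compiled emoji regexp, another option is to pad only the emoji matches and leave other punctuation attached to its word. This is a rough sketch under assumptions: `EMOJI` below is a placeholder standing in for the asker's own pattern (its ranges do not cover every emoji), and `pad_emojis` is a hypothetical helper name.

```python
import re

# Placeholder for the asker's own compiled emoji pattern; these ranges
# are an assumption for illustration and do not cover every emoji.
EMOJI = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]+')

def pad_emojis(text, emoji_pattern=EMOJI):
    """Put spaces around each emoji run, then collapse repeated whitespace."""
    padded = emoji_pattern.sub(lambda m: ' ' + m.group(0) + ' ', text)
    # split() with no arguments splits on any whitespace, so tabs and
    # newlines are normalized to single spaces as a side effect.
    return ' '.join(padded.split())

tweets = [
    "sometimes i wish i wa an octopus so i could slap 8 people at once🐙",
    "🐙octopus, again",
]
print([pad_emojis(t) for t in tweets])
```

Both steps are a single pass over each tweet, so this should scale reasonably to hundreds of thousands of tweets, though that is not measured here.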

Upvotes: 2
