Reputation: 966
I am working on an NLP project involving emojis in tweets.
Here is an example tweet:
"sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
My problem is that once🐙 is treated as a single word. I would like to split it in two so that the tweet looks like this:
"sometimes i wish i wa an octopus so i could slap 8 people at once 🐙"
Note that I already have a compiled regexp that matches every emoji.
I am looking for an efficient way to do this, since I have hundreds of thousands of tweets, but I can't figure out where to start.
Thank you
Upvotes: 3
Views: 252
Reputation: 42017
You can use re.sub to introduce a space:
re.sub(r'(\W+)(?= |$)', r' \1', string)
Example:
>>> string
'sometimes i wish i wa an octopus so i could slap 8 people at once\xf0\x9f\x90\x99'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99'
>>> string = 'sometimes i wish i wa an octopus so i could slap 8 people at once🐙 foobar'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99 foobar'
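The session above works on byte strings. If the tweets are Python 3 str objects, the same idea can be sketched with the emoji pattern the question says is already compiled. This is only a minimal sketch: EMOJI_RE and space_out_emojis are hypothetical names, and the character ranges are a rough stand-in for the asker's real pattern, not a complete emoji list.

```python
import re

# Hypothetical stand-in for the asker's compiled emoji pattern.
# It covers only a few common emoji ranges, not the full set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

def space_out_emojis(text):
    """Surround each run of emojis with spaces, then collapse repeated whitespace."""
    spaced = EMOJI_RE.sub(lambda m: " " + m.group(0) + " ", text)
    # split() with no arguments drops leading/trailing and repeated whitespace.
    return " ".join(spaced.split())

tweet = "sometimes i wish i wa an octopus so i could slap 8 people at once\U0001F419"
print(space_out_emojis(tweet))
```

Because the pattern is compiled once and reused, the same function can be mapped over a large list of tweets without recompiling on every call.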
Upvotes: 1
Reputation: 27283
Can't you just do something like this:
>>> import re
>>> s = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'i', 'wish', 'i', 'wa', 'an', 'octopus', 'so', 'i', 'could', 'slap', '8', 'people', 'at', 'once', '🐙']
If you need them as a single space-delimited string again, just join them:
>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes i wish i wa an octopus so i could slap 8 people at once 🐙'
edit: fixed.
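One caveat with this pattern, shown as a small sketch (TOKEN_RE and tokenize are hypothetical names): every run of characters that are neither word characters nor spaces becomes a token, so punctuation is split off as well, and an emoji directly followed by punctuation stays fused with it.

```python
import re

# Same pattern as above, compiled once for reuse.
TOKEN_RE = re.compile(r"\w+|[^\w ]+")

def tokenize(text):
    """Return runs of word characters and runs of non-word, non-space characters."""
    return TOKEN_RE.findall(text)

tokens = tokenize("can't wait\U0001F419!")
print(tokens)
```

Here the apostrophe splits "can't" into three tokens and the emoji ends up in the same token as the exclamation mark, which may or may not be acceptable depending on how the tweets were cleaned beforehand.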
Upvotes: 2