Reputation: 1690
I am trying to split/parse comments that contain strings, numbers and emojis. I want to write generic code that splits a comment into parts wherever an emoji appears.
For example:
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
The output should be something like:
output_1 = ["This is", "my comment"]
output_2 = ["Another comment to", "parse"]
I have been thinking that I could split on special characters only, but that might leave behind the "O" in ":O" or the "v" in ":v".
Upvotes: 1
Views: 224
Reputation: 36390
When splitting, re.split is often useful. I would do:
import re
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
output_1 = re.split(r'\s*\S?:\S+\s*', comment_1)
output_2 = re.split(r'\s*\S?:\S+\s*', comment_2)
print(output_1)
print(output_2)
output
['This is', 'my comment', '']
['', 'Another comment to', 'parse']
Note that this differs from your required output, as there are empty strs in the outputs, but these can easily be removed using a list comprehension, e.g. [i for i in output_1 if i].
The pattern r'\s*\S?:\S+\s*' can be read as zero or one non-whitespace character (\S?), followed by a colon (:) and one or more non-whitespace characters (\S+), together with any leading and trailing whitespace (\s*).
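Putting the split and the list comprehension together, a minimal sketch might look like the following. The helper name split_on_emoticons and the constant EMOTICON are made up for illustration; the pattern is the one from this answer, unchanged.

```python
import re

# Pattern from the answer: optional leading non-whitespace character,
# a colon, one or more non-whitespace characters, plus surrounding whitespace.
EMOTICON = re.compile(r'\s*\S?:\S+\s*')

def split_on_emoticons(comment):  # hypothetical helper name
    """Split a comment on emoticons and drop the empty pieces."""
    return [part for part in EMOTICON.split(comment) if part]

print(split_on_emoticons("This is :) my comment :O"))
# ['This is', 'my comment']
print(split_on_emoticons(">:O Another comment to :v parse"))
# ['Another comment to', 'parse']
```

Keep in mind that this pattern treats any colon followed by non-whitespace as an emoticon, so text such as a time like 10:30 would be split as well.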
Upvotes: 1
Reputation: 521093
You may try matching on the pattern (?<!\S)\w+\S?(?: \w+\S?)*, which finds each run of space-separated words, where each word may end in one optional non-whitespace character (such as a punctuation character).
import re

inp = ["This is :) my comment :O", ">:O Another comment to :v parse"]
for i in inp:
    # collect every run of plain words, skipping the emoticons
    matches = re.findall(r'(?<!\S)\w+\S?(?: \w+\S?)*', i)
    print(matches)
This prints:
['This is', 'my comment']
['Another comment to', 'parse']
Here is an explanation of the regex pattern being used:
(?<!\S)          assert that what precedes the word is either whitespace
                 or the start of the string
\w+              match a word
\S?              followed by zero or one non-whitespace character
                 (such as punctuation symbols)
(?: \w+\S?)*     zero or more word/symbol sequences following,
                 each preceded by a single space
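The same breakdown can be embedded in the pattern itself with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch: the name PHRASE is made up, and the literal space is written as [ ] because verbose mode would otherwise discard it.

```python
import re

# Verbose form of (?<!\S)\w+\S?(?: \w+\S?)*
PHRASE = re.compile(r"""
    (?<!\S)      # preceded by whitespace or the start of the string
    \w+          # a word
    \S?          # optionally one non-whitespace character
    (?:          # then zero or more repetitions of:
        [ ]\w+   #   a single space and another word
        \S?      #   optionally one non-whitespace character
    )*
""", re.VERBOSE)

for comment in ["This is :) my comment :O", ">:O Another comment to :v parse"]:
    print(PHRASE.findall(comment))
# ['This is', 'my comment']
# ['Another comment to', 'parse']
```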
Upvotes: 3