The Dan

Reputation: 1690

Python parse comment by non string characters

I am trying to split comments that contain text, numbers, and emojis. I want generic code that splits a comment into separate parts wherever an emoji appears.

For example:

comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"

The output should be something like:

output_1 = ["This is", "my comment"]
output_2 = ["Another comment to", "parse"]

I have been thinking that I could split on special characters only, but that might leave behind the "O" in ":O" or the "v" in ":v".
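To illustrate the concern, here is a rough sketch of that naive approach. The pattern `[^\w\s]+` is only an assumption about what "special characters only" would mean in practice; it splits on runs of characters that are neither word characters nor whitespace.

```python
import re

comment_1 = "This is :) my comment :O"

# Naive attempt: split on runs of characters that are neither
# word characters nor whitespace (an assumed reading of
# "special characters only").
parts = re.split(r'[^\w\s]+', comment_1)
print(parts)  # ['This is ', ' my comment ', 'O']
```

The ":" of ":O" is removed but the "O" survives as its own piece, and the surrounding spaces are kept as well.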

Upvotes: 1

Views: 224

Answers (2)

Daweo

Reputation: 36390

re.split is often useful for this kind of splitting. I would do:

import re
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
output_1 = re.split(r'\s*\S?:\S+\s*', comment_1)
output_2 = re.split(r'\s*\S?:\S+\s*', comment_2)
print(output_1)
print(output_2)

Output:

['This is', 'my comment', '']
['', 'Another comment to', 'parse']

Note that this differs from your required output because there are empty strings in the results, but these can easily be removed with a list comprehension, e.g. [i for i in output_1 if i]. The pattern r'\s*\S?:\S+\s*' can be read as zero or one non-whitespace character (\S?), followed by a colon (:) and one or more non-whitespace characters (\S+), together with any leading and trailing whitespace (\s*).
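As a minimal sketch of that cleanup, the split and the filter can be combined in one helper. The name split_on_emoticons is made up here for illustration; the pattern is the same one used above.

```python
import re

# Same pattern as above: optional leading non-whitespace character,
# a colon, one or more non-whitespace characters, plus any
# surrounding whitespace.
EMOTICON = re.compile(r'\s*\S?:\S+\s*')

def split_on_emoticons(comment):
    # Hypothetical helper name (not part of the re module).
    # Split on emoticon-like tokens and drop the empty strings
    # that re.split leaves at the start or end.
    return [part for part in EMOTICON.split(comment) if part]

print(split_on_emoticons("This is :) my comment :O"))
print(split_on_emoticons(">:O Another comment to :v parse"))
```

Keep in mind that the pattern treats any token containing a colon as an emoticon, so a time such as 10:30 inside a comment would also be split on.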

Upvotes: 1

Tim Biegeleisen

Reputation: 521093

You may try matching on the pattern (?<!\S)\w+\S?(?: \w+\S?)*, which finds each run of space-separated words, where every word may end in one optional non-whitespace character (such as a punctuation mark).

import re

inp = ["This is :) my comment :O", ">:O Another comment to :v parse"]
for i in inp:
    matches = re.findall(r'(?<!\S)\w+\S?(?: \w+\S?)*', i)
    print(matches)

This prints:

['This is', 'my comment']
['Another comment to', 'parse']

Here is an explanation of the regex pattern being used:

(?<!\S)       assert that what precedes the word is whitespace
              or the start of the string
\w+           match a word
\S?           followed by zero or one non-whitespace character
              (such as a punctuation symbol)
(?: \w+\S?)*  zero or more further words, each preceded by a single space
              and optionally followed by one non-whitespace character
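To see the optional \S? at work, here is a small sketch on a comment that contains ordinary punctuation. The helper name text_runs and the sample comment are made up for illustration.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'(?<!\S)\w+\S?(?: \w+\S?)*')

def text_runs(comment):
    # Hypothetical helper name: return every run of
    # space-separated words found in the comment.
    return PATTERN.findall(comment)

print(text_runs("Great job! :D thanks"))  # ['Great job!', 'thanks']
```

The "!" is kept because \S? allows one trailing non-whitespace character after a word, while the "D" of ":D" is skipped because the lookbehind (?<!\S) rejects a word that directly follows ":".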

Upvotes: 3
