Hermion

Reputation: 25

Divide a sentence into words using regex

I want to divide a sentence into words using a regex. I'm using this code:

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
sentence = re.split(r'\s|,|>|<|\[|\]:', sentence)

But I'm not getting the result I expect.

Expected output:

['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']

What I actually get:

['', '30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', '', 'tester-test.service:', 'activation', 'successfully.']

I tried to ignore the whitespace, but it should be ignored only in the last long "word", and I have no idea how to do that. Any suggestions or help would be appreciated. Thank you in advance.

Upvotes: 2

Views: 484

Answers (2)

Cary Swoveland

Reputation: 110685

Judging from the "expected output" for your example, as soon as a character preceded by ': ' is encountered, the substring running from that character to the end of the string is to be returned as a single item. I assume that is one of the rules.

That suggests to me that you want to return matches (rather than the result of splitting) and that the regular expression to be matched should be a two-part alternation (that is, having the form ...|...) with the first part being

(?<=: ).+

That reads, "match one or more characters, greedily, the first being preceded by a colon followed by a space". (?<=: ) is a positive lookbehind.
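As a quick check of that first alternative on its own, here is a minimal sketch (the variable names `m` and `tail` are mine, not part of the question). `re.search` stops at the leftmost position that is preceded by ': ', and `.+` then takes the rest of the line:

```python
import re

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'

# The lookbehind only asserts that ': ' comes right before the match;
# those two characters are not part of the matched text.
m = re.search(r'(?<=: ).+', sentence)
tail = m.group(0)
print(tail)
# tester-test.service: activation successfully.
```

The colons inside 11:45:50 are not followed by a space, so they do not trigger this alternative.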

Before reaching the first character that is preceded by a colon followed by a space, we need to match strings composed of digits, letters and hyphens, plus colons that are both preceded and followed by a digit. The needed regular expression is therefore

rgx = r'(?<=: ).+|(?:[\da-zA-Z-]|(?<=\d):(?=\d))+'

You therefore may write

import re

# named `sentence` rather than `str` to avoid shadowing the built-in type
sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."
re.findall(rgx, sentence)
  #=> ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd',
  #    '1', 'tester-test.service: activation successfully.']

The components of the regular expression are as follows.

(?<=: )        # the preceding string must be ': '
.+             # match one or more characters (greedily)
|              # or
(?:            # begin a non-capture group
  [\da-zA-Z-]  # match one character in the character class
  |            # or
  (?<=\d)      # the previous character must be a digit
  :            # match a literal colon
  (?=\d)       # the next character must be a digit
)+             # end the non-capture group and execute one or more times

(?=\d) is a positive lookahead.
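If you would like to keep those comments inside the program, the same expression can probably be written with `re.VERBOSE`. This is only a sketch, and the names `RGX_VERBOSE` and `words` are mine. One thing to watch: in verbose mode a bare space in the pattern is ignored, so the space in the lookbehind has to be written as `[ ]` (or escaped).

```python
import re

RGX_VERBOSE = re.compile(r"""
    (?<=:[ ])            # preceded by colon + space; [ ] because VERBOSE drops bare spaces
    .+                   # the rest of the line, greedily
    |                    # or
    (?:                  # begin a non-capture group
        [\da-zA-Z-]      # a digit, an ASCII letter or a hyphen
        |                # or
        (?<=\d):(?=\d)   # a colon sitting between two digits
    )+                   # one or more times
""", re.VERBOSE)

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
words = RGX_VERBOSE.findall(sentence)
print(words)
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1',
#  'tester-test.service: activation successfully.']
```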

Upvotes: 0

Wiktor Stribiżew

Reputation: 626929

You can use

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
chunks = sentence.split(': ', 1)
result = re.findall(r'[^][\s,<>]+', chunks[0])
result.append(chunks[1])
print(result)
# => ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']

See the Python demo

Here,

  • chunks = sentence.split(': ', 1) - splits the sentence into two chunks at the first ': ' substring (a colon followed by a space)
  • result = re.findall(r'[^][\s,<>]+', chunks[0]) - extracts from the first chunk all substrings consisting of one or more chars other than ], [, whitespace, a comma, < and >
  • result.append(chunks[1]) - appends the second chunk to the result list.
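Note that chunks[1] raises an IndexError when the line contains no ': ' at all. If that can happen in your logs, one possible variant is to use `str.partition`, which always returns three parts. This is a sketch under that assumption; the helper name `split_log_line` is hypothetical, not something from the question:

```python
import re

def split_log_line(line):
    """Tokenize the part before the first ': ' and keep the remainder whole."""
    # partition() returns (head, separator, tail); the separator is ''
    # when ': ' does not occur in the line.
    head, sep, tail = line.partition(': ')
    tokens = re.findall(r'[^][\s,<>]+', head)
    if sep:
        tokens.append(tail)
    return tokens

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
print(split_log_line(sentence))
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1',
#  'tester-test.service: activation successfully.']
print(split_log_line('<30>Jan 11 no-colon'))
# ['30', 'Jan', '11', 'no-colon']
```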

Upvotes: 1
