Reputation: 25
I want to divide a sentence into words using regex. I'm using this code:
import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
sentence = re.split(r'\s|,|>|<|\[|\]:', sentence)
but I'm not getting the result I expect.
The expected output is:
['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']
but what I get is:
['', '30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', '', 'tester-test.service:', 'activation', 'successfully.']
I tried ignoring the whitespace, but it should be ignored only in the last long "word", and I have no idea how to do that. Any suggestions or help? Thank you in advance.
Upvotes: 2
Views: 484
Reputation: 110685
It appears from the "expected output" for your example that as soon as a character is encountered that is preceded by ': ', the string made up of that character and all that follow (to the end of the string) is to be returned. I assume that is one of the rules.
That suggests to me that you want to return matches (rather than the result of splitting) and that the regular expression to be matched should be a two-part alternation (that is, having the form ...|...), with the first part being
(?<=: ).+
That reads, "match one or more characters, greedily, the first being preceded by a colon followed by a space". (?<=: ) is a positive lookbehind.
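To see the first alternative on its own, here is a minimal sketch that applies only the lookbehind part to the sentence from the question. The variable name tail is just an illustrative choice.

```python
import re

sentence = ('<30>Jan 11 11:45:50 test-tt systemd[1]: '
            'tester-test.service: activation successfully.')

# The lookbehind asserts that ': ' sits immediately before the match
# without consuming it, so the match starts right after the first ': '.
# The colons in 11:45:50 do not qualify because no space follows them.
tail = re.search(r'(?<=: ).+', sentence)
print(tail.group())
# tester-test.service: activation successfully.
```

Because .+ is greedy and there is no newline in the sentence, the match runs to the end of the string, which is why the second ': ' does not start a new item.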
Before reaching the first character that is preceded by a colon followed by a space we need to match strings comprised of digits, letters, and hyphens, and colons preceded and followed by a digit. The needed regular expression is therefore
rgx = r'(?<=: ).+|(?:[\da-zA-Z-]|(?<=\d):(?=\d))+'
You therefore may write
import re
sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."
re.findall(rgx, sentence)
#=> ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd',
# '1', 'tester-test.service: activation successfully.']
The components of the regular expression are as follows.
(?<=: )          # the preceding string must be ': '
.+               # match one or more characters (greedily)
|                # or
(?:              # begin a non-capture group
  [\da-zA-Z-]    # match one character in the character class
  |              # or
  (?<=\d)        # the previous character must be a digit
  :              # match literal
  (?=\d)         # the next character must be a digit
)+               # end the non-capture group and execute one or more times
(?=\d) is a positive lookahead.
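If you want to keep that commentary next to the pattern in real code, one option is re.VERBOSE. This is only a sketch of the same expression, with one change that verbose mode forces: unescaped whitespace in the pattern is ignored, so the space inside the lookbehind is written as [ ]. The name words is an illustrative choice.

```python
import re

rgx = re.compile(r"""
    (?<=:[ ])        # the preceding text must be ': ' ([ ] keeps the space in verbose mode)
    .+               # one or more characters, greedily
    |                # or
    (?:              # begin a non-capture group
        [\da-zA-Z-]  # a digit, an ASCII letter, or a hyphen
        |            # or
        (?<=\d)      # the previous character must be a digit
        :            # a literal colon
        (?=\d)       # the next character must be a digit
    )+               # end the group and repeat it one or more times
""", re.VERBOSE)

sentence = ('<30>Jan 11 11:45:50 test-tt systemd[1]: '
            'tester-test.service: activation successfully.')

words = rgx.findall(sentence)
print(words)
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1',
#  'tester-test.service: activation successfully.']
```

Since the pattern has no capturing groups, findall returns the full text of each match.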
Upvotes: 0
Reputation: 626929
You can use
import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
chunks = sentence.split(': ', 1)
result = re.findall(r'[^][\s,<>]+', chunks[0])
result.append(chunks[1])
print(result)
# => ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']
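One caveat with the snippet above: it assumes the sentence contains ': '. If it does not, chunks has a single element and chunks[1] raises IndexError. As a hedged variant, str.partition could be used instead of split. The function name split_log_line is hypothetical and not part of the original answer.

```python
import re

def split_log_line(sentence):
    """Tokenize the part before the first ': ' and keep the rest whole."""
    # partition never raises: if ': ' is absent, sep and tail are empty strings
    head, sep, tail = sentence.partition(': ')
    tokens = re.findall(r'[^][\s,<>]+', head)
    if sep:
        # only append the trailing message when a ': ' separator was found
        tokens.append(tail)
    return tokens

print(split_log_line('<30>Jan 11 11:45:50 test-tt systemd[1]: '
                     'tester-test.service: activation successfully.'))
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1',
#  'tester-test.service: activation successfully.']

print(split_log_line('<30>Jan 11 no-colon-here'))
# ['30', 'Jan', '11', 'no-colon-here']
```

For lines that do contain ': ', this returns the same list as the original split-based code.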
Here,
chunks = sentence.split(': ', 1)
- splits the sentence into two chunks with the first ': ' substring
result = re.findall(r'[^][\s,<>]+', chunks[0])
- extracts all substrings consisting of one or more chars other than ], [, whitespace, ,, < and > chars from the first chunk
result.append(chunks[1])
- append the second chunk to the result list.
Upvotes: 1