JunKim

Reputation: 787

Strip feature using python regex

I am a Python beginner and have just learned about regex.
I am trying to implement a strip feature (like strip()) using regex.
Below is the code I wrote:

import regex

stripRegex = regex.compile(r"(\s*)((\S*\s*\S)*)(\s*)")
text = '              Hello World This is me Speaking                                    '
check = stripRegex.search(text)
print(check)
print('group 1 :', stripRegex.search(text).group(1))
print('group 2 :', stripRegex.search(text).group(2))
print('group 3 :', stripRegex.search(text).group(3))
print('group 4 :', stripRegex.search(text).group(4))

And the result is,

group 1 :
group 2 : Hello World This is me Speaking
group 3 : peaking
group 4 :

I have two questions.
1) Why does group 3 return 'peaking'?
2) Does Python number the groups in the order in which each '(' appears?
So in this code, (\s*)((\S*\s*\S)*)(\s*)
the first (\s*) - is the first group,
((\S*\s*\S)*) - the second,
(\S*\s*\S) - the third,
the second (\s*) - the fourth.

Am I right?

Upvotes: 1

Views: 96

Answers (2)

CodeBoy

Reputation: 3300

Q2: You are right. From left to right, the first ( is the start of group 1, the second ( is the start of group 2, etc.
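To check the numbering yourself, here is a minimal sketch. It assumes the standard-library re module instead of the third-party regex module from the question (this pattern behaves the same in both), and the short sample string is made up for illustration.

```python
import re

# Same pattern as in the question; groups are numbered by the position
# of their opening parenthesis, counting from the left.
pattern = re.compile(r"(\s*)((\S*\s*\S)*)(\s*)")

# Number of capturing groups in the pattern.
print(pattern.groups)

# Hypothetical short input, chosen only to keep the output readable.
m = pattern.search("  ab cd  ")
for number in range(1, pattern.groups + 1):
    print(number, repr(m.group(number)))
```

Group 3 sits inside group 2, so it gets the number after it, and the trailing (\s*) becomes group 4.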

Q1: Group 3 matches repeatedly because of the * after it. Its final value is whatever the last repetition matched. The successive matches for group 3 are:

"Hello W" where \S*="Hello"   \s*=" "   \S="W"
"orld T"  where \S*="orld"    \s*=" "   \S="T" 
"his i"   where \S*="his"     \s*=" "   \S="i"
"s m"     where \S*="s"       \s*=" "   \S="m"
"e S"     where \S*="e"       \s*=" "   \S="S"
"peaking" where \S*="peakin"  \s*=""    \S="g"
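The repetitions listed above can be reproduced with a small sketch. It assumes the standard-library re module and applies the body of group 3 on its own, one step at a time, starting after the leading whitespace; the names step, pos and iterations are only for illustration.

```python
import re

text = '              Hello World This is me Speaking                                    '

# The body of group 3, applied on its own one repetition at a time.
step = re.compile(r"\S*\s*\S")

# Start where group 1 (the leading whitespace) would have stopped.
pos = re.match(r"\s*", text).end()

iterations = []
while True:
    m = step.match(text, pos)
    if m is None:
        # Only trailing whitespace is left, so \S cannot match any more.
        break
    iterations.append(m.group())
    pos = m.end()

print(iterations)

# The full pattern keeps only the last repetition in group 3.
full = re.search(r"(\s*)((\S*\s*\S)*)(\s*)", text)
print(repr(full.group(3)))
```

The last entry of the list is the same string that the full pattern reports for group 3.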

Here is a fantastic tool for understanding your regexes: https://regex101.com/r/MmYOPT/1 (although it doesn't help as much with this repeating match).

Upvotes: 1

Kind Stranger

Reputation: 1761

You are correct. \S*\s*\S matches:

\S* - zero or more non-whitespace characters
\s* - zero or more whitespace characters
\S  - exactly one non-whitespace character

Group 3 (\S*\s*\S) is repeated to build group 2 ((\S*\s*\S)*), so group 3 ends up holding only the last match it contributed to group 2. For your text, the last possible match for zero or more non-whitespace followed by zero or more whitespace followed by one non-whitespace is 'peaking'. The first match shows why:

'Hello W'
\S* matches 'Hello'
\s* matches ' '
\S  matches 'W'

Each repetition takes the first letter of the next word, so the following repetition starts one letter into that word:

'orld T'
\S* matches 'orld'
\s* matches ' '
\S  matches 'T'

And so on, until...

The final match starts after the first letter of the last word, needs no whitespace, and must end with one non-whitespace character:

'peaking'
\S* matches 'peakin'
\s* matches ''      (zero or more whitespace, so none here)
\S  matches 'g'
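If the goal is only to reproduce strip(), capturing the middle part is not strictly needed. Here is a minimal sketch of one alternative, assuming the standard-library re module; the function name regex_strip is hypothetical and not from the question.

```python
import re

def regex_strip(text):
    # Delete whitespace anchored at the very start (\A\s+)
    # or at the very end (\s+\Z) of the string.
    return re.sub(r"\A\s+|\s+\Z", "", text)

text = '              Hello World This is me Speaking                                    '
print(repr(regex_strip(text)))
```

This sidesteps the repeated group entirely, so there is no "last repetition" to worry about.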

Upvotes: 1
