Reputation: 1653
I'm trying to use Python's re module to capture specific digits from a string, such as the '03' in ' video [720P] [DHR] _sp03.mp4 '.
What confuses me is this: when I use '.*\D+(\d+).*mp4', it captures both digits, 03, but when I use '.*\D*(\d+).*mp4', it captures only the last digit, 3.
I know Python's quantifiers are greedy by default, which means they try to match as much text as possible. Given that, I expected * and + after the \D to behave the same way. So where am I wrong? What causes this difference? Can anyone explain it?
BTW: I used an online regex tester for Python: https://regex101.com/#python
Upvotes: 2
Views: 245
Reputation: 1863
The difference comes from \D*. '+' means one or more, while '*' means zero or more.
Because the pattern starts with '.*', which is greedy, it consumes everything up to ' video [720P] [DHR] _sp0' when \D* is allowed to match nothing. In the '\D+' case it has to stop at ' video [720P] [DHR] _s', leaving 'p' for \D+.
>>> import re
>>> a = " video [720P] [DHR] _sp03.mp4 "
>>> # raw strings keep \D and \d from being treated as string escapes
>>> p1 = re.compile(r'.*\D+(\d+).*mp4')
>>> p2 = re.compile(r'.*\D*(\d+).*mp4')
>>> re.findall(p1, a)
['03']
>>> re.findall(p2, a)
['3']
>>> a
' video [720P] [DHR] _sp03.mp4 '
>>> # same patterns, with every piece wrapped in a group to see what it consumed
>>> p3 = re.compile(r'(.*)(\D*)(\d+)(.*)mp4')
>>> re.findall(p3, a)
[(' video [720P] [DHR] _sp0', '', '3', '.')]
>>> p4 = re.compile(r'(.*)(\D+)(\d+)(.*)mp4')
>>> re.findall(p4, a)
[(' video [720P] [DHR] _s', 'p', '03', '.')]
Upvotes: 1
Reputation: 26667
What makes the difference is not \D+ on its own but how it interacts with the first .*
In a regex, .* is greedy and tries to match as many characters as it can.
So when you write
.*\D*(\d+).*mp4
the .* will match as much as it can while still letting the rest of the pattern succeed. If we break it down, it looks like this:
video [720P] [DHR] _sp03.mp4
|
.*
video [720P] [DHR] _sp03.mp4
 |
 .*
.....
video [720P] [DHR] _sp03.mp4
                      |
                      .* That is, the 0 is also consumed by .*
video [720P] [DHR] _sp03.mp4
                       |
                       \D* Since the quantifier is zero or more, it matches the empty string here, just before the 3
video [720P] [DHR] _sp03.mp4
                       |
                       (\d+)
video [720P] [DHR] _sp03.mp4
                        |
                        .*
video [720P] [DHR] _sp03.mp4
                         |
                         mp4
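To check the breakdown above, here is a small sketch that wraps each piece of the pattern in a named group and prints what it ended up matching. The group names (lead, nondigits, digits, rest) are my own labels added for inspection; they are not part of the original pattern.

```python
import re

text = " video [720P] [DHR] _sp03.mp4 "

# Same pattern as in the question, with illustrative named groups
# added around each piece so we can see what it consumed.
pattern = re.compile(
    r"(?P<lead>.*)(?P<nondigits>\D*)(?P<digits>\d+)(?P<rest>.*)mp4"
)

m = pattern.search(text)
for name in ("lead", "nondigits", "digits", "rest"):
    # span() gives the (start, end) offsets of the group in `text`
    print(f"{name:>9}: {m.group(name)!r} span={m.span(name)}")
```

The nondigits group should come back as an empty string sitting right before the 3, which is why only 3 is left for the digits group.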
Now when we use \D+, the matching changes a bit, because the regex engine is forced to match at least one non-digit (\D+) before the digits ((\d+)). This consumes the p, which is the last non-digit before the digits.
That is, .* matches as much as it can up to, but not including, the p, so that \D+ can match at least one non-digit (the p) and \d+ matches the 03 part:
video [720P] [DHR] _sp03.mp4
|
.*
video [720P] [DHR] _sp03.mp4
 |
 .*
.....
video [720P] [DHR] _sp03.mp4
                     |
                     \D+ The last non-digit before the digits. Forced to match at least once.
video [720P] [DHR] _sp03.mp4
                      |
                      (\d+)
video [720P] [DHR] _sp03.mp4
                       |
                       (\d+)
video [720P] [DHR] _sp03.mp4
                        |
                        .*
video [720P] [DHR] _sp03.mp4
                         |
                         mp4
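If you want to see the "give back one character at a time" behaviour for yourself, the sketch below emulates the leading greedy .* by hand: it tries the longest prefix first, then shorter ones, and stops at the first position where the rest of the pattern matches. The helper first_working_split is a hypothetical name of mine, and this is only a model of what the engine does, not how the re module is implemented internally.

```python
import re

text = " video [720P] [DHR] _sp03.mp4 "

def first_working_split(tail_pattern, text):
    """Hypothetical helper: emulate a leading greedy `.*` by hand.

    Try the longest prefix first, then give back one character at a
    time, and return (prefix, captured digits) for the first position
    where the remaining pattern matches.
    """
    tail = re.compile(tail_pattern)
    for end in range(len(text), -1, -1):
        # Pattern.match(string, pos) tries to match starting exactly at pos
        m = tail.match(text, end)
        if m:
            return text[:end], m.group(1)
    return None

star = first_working_split(r"\D*(\d+).*mp4", text)
plus = first_working_split(r"\D+(\d+).*mp4", text)

print("with \\D*:", star)
print("with \\D+:", plus)
```

With \D* the prefix can keep the 0, because \D* is happy to match nothing. With \D+ the prefix has to give back both the 0 and the p before the rest can match, so the capture group gets 03.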
Upvotes: 8