lincr

Reputation: 1653

Why does this python re only capture one digit?

I'm trying to use Python's re module to capture specific digits in a string, such as the '03' in ' video [720P] [DHR] _sp03.mp4 '.

What confuses me is this:

When I use '.*\D+(\d+).*mp4', it succeeds in capturing both digits, 03, but when I use '.*\D*(\d+).*mp4', it captures only the last digit, 3.

I know Python's quantifiers are greedy by default, which means they try to match as much text as possible. Given that, I would expect * and + after the \D to behave the same way. So where am I wrong? What causes this difference? Can anyone help explain it?

BTW: I used an online regex tester for Python: https://regex101.com/#python

Upvotes: 2

Views: 245

Answers (2)

Somendra Joshi

Reputation: 1863

The problem is with \D*. '+' means one or more, and '*' means zero or more.

Because the pattern starts with '.*', which is greedy, it consumes everything up to and including ' video [720P] [DHR] _sp0', and \D* is then free to match nothing. In the '\D+' case, '.*' has to stop at ' video [720P] [DHR] _s', leaving 'p' for \D+.

>>> import re
>>> a = " video [720P] [DHR] _sp03.mp4 "
>>> p1 = re.compile(r'.*\D+(\d+).*mp4')  # raw strings avoid invalid-escape warnings
>>> p2 = re.compile(r'.*\D*(\d+).*mp4')
>>> re.findall(p1,a)
['03']
>>> re.findall(p2,a)
['3']
>>> a
' video [720P] [DHR] _sp03.mp4 '
>>> p3 = re.compile(r'(.*)(\D*)(\d+)(.*)mp4')
>>> re.findall(p3,a)
[(' video [720P] [DHR] _sp0', '', '3', '.')]
>>> p4 = re.compile(r'(.*)(\D+)(\d+)(.*)mp4')
>>> re.findall(p4,a)
[(' video [720P] [DHR] _s', 'p', '03', '.')]

Upvotes: 1

nu11p01n73R

Reputation: 26667

What makes the difference is not the \D+ but the first .*

In regex, .* is greedy and tries to match as many characters as it can.

So when you write

.*\D*(\d+).*mp4

The .* will match as much as it can. Broken down step by step (showing where each piece ends up once the engine has finished backtracking), it looks like this:

video [720P] [DHR] _sp03.mp4
|
.*

video [720P] [DHR] _sp03.mp4
 |
 .*
.....

video [720P] [DHR] _sp03.mp4
                      |
                      .* That is, the 0 is also matched by the .*

video [720P] [DHR] _sp03.mp4
                      |
                      \D* Since the quantifier is zero or more, it matches nothing here and does not advance to the 3

video [720P] [DHR] _sp03.mp4
                       |
                      (\d+)

video [720P] [DHR] _sp03.mp4
                        |
                        .*

video [720P] [DHR] _sp03.mp4
                          |
                         mp4
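
One way to check this trace, as a minimal sketch: wrap every piece of the pattern in its own group and print what each group matched and where. The variable names here are only illustrative.

```python
import re

text = " video [720P] [DHR] _sp03.mp4 "

# The \D* pattern from the question, with every piece wrapped in a group
# so that each part of the match can be inspected separately.
pattern = re.compile(r"(.*)(\D*)(\d+)(.*)mp4")

m = pattern.search(text)

# Collect (matched text, (start, end)) for groups 1 through 4.
pieces = [(m.group(i), m.span(i)) for i in range(1, 5)]
for piece, span in pieces:
    print(repr(piece), span)
```

The second group comes back as an empty string with equal start and end offsets, which is the "matches nothing" step in the diagram.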

Now when we use \D+, the matching changes a bit, because the regex engine is forced to match at least one non-digit (\D+) before the digits ((\d+)). This consumes the p, which is the last non-digit before the digits.

That is,

.* will match as much as it can while stopping short of the p, so that \D+ can match at least one non-digit (the p) and \d+ can match the 03 part.

video [720P] [DHR] _sp03.mp4
|
.*

video [720P] [DHR] _sp03.mp4
 |
 .*
.....

video [720P] [DHR] _sp03.mp4
                     |
                     \D+ The last non-digit before the digits. Forced to match at least once.

video [720P] [DHR] _sp03.mp4
                      |
                      (\d+) 

video [720P] [DHR] _sp03.mp4
                       |
                      (\d+)

video [720P] [DHR] _sp03.mp4
                        |
                        .*

video [720P] [DHR] _sp03.mp4
                          |
                         mp4
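
To see the two behaviours side by side, here is a small sketch. The helper `capture` and the third pattern are my own additions, not something from the question: the third pattern simply drops the leading .* and assumes the digits you want always sit directly before ".mp4".

```python
import re

text = " video [720P] [DHR] _sp03.mp4 "

def capture(pattern, s):
    """Return group 1 of the first match, or None if nothing matches."""
    m = re.search(pattern, s)
    return m.group(1) if m else None

# Leading .* is greedy and \D* may match nothing, so .* swallows the 0.
greedy_star = capture(r".*\D*(\d+).*mp4", text)

# \D+ must match at least one non-digit, so .* has to leave the "p" for it.
greedy_plus = capture(r".*\D+(\d+).*mp4", text)

# Hypothetical alternative: no leading .* at all, digits anchored to ".mp4".
anchored = capture(r"(\d+)\.mp4", text)

print(greedy_star, greedy_plus, anchored)
```

If the file names can have other text between the digits and the extension, the anchored pattern would need adjusting; it is only meant to show that the leading .* is what eats the first digit.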

Upvotes: 8
