Why does this python re only capture one digit?

Question

I'm trying to use Python's re module to capture specific digits from strings, such as the '03' in ' video [720P] [DHR] _sp03.mp4 '.

What confuses me is this:

When I use '.*\D+(\d+).*mp4', it captures both digits, 03, but when I use '.*\D*(\d+).*mp4', it captures only the last digit, 3.

I know Python's regexes are greedy by default, which means they try to match as much text as possible. Given that, I would expect * and + after the \D to behave the same way. So where am I wrong? What causes this difference? Can anyone help explain it?

BTW: I used an online regex tester for Python: https://regex101.com/#python

nu11p01n73R · Accepted Answer

What makes the difference is not the \D+ by itself but the first .*

In a regex, .* is greedy and tries to match as many characters as it can.
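To see the greediness in isolation, here is a minimal sketch. The string "abc123" and the lazy `.*?` variant are my own illustration and are not part of the question:

```python
import re

# Hypothetical short string, chosen only to illustrate greediness.
text = "abc123"

# Greedy: .* takes as much as it can and gives back only the single
# digit that (\d+) needs in order to match.
greedy = re.match(r".*(\d+)", text)
print(greedy.group(1))  # 3

# Lazy: .*? takes as little as it can, so (\d+) gets all the digits.
lazy = re.match(r".*?(\d+)", text)
print(lazy.group(1))  # 123
```

The greedy version shows the same effect as in the question: the leading .* swallows every digit except the last one.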

So when you write

.*\D*(\d+).*mp4

The .* will match as much as it can while still letting the rest of the pattern match. If we break it down, it looks like this:

video [720P] [DHR] _sp03.mp4
|
.*

video [720P] [DHR] _sp03.mp4
 |
 .*
.....

video [720P] [DHR] _sp03.mp4
                      |
                      .* That is, the 0 is also matched by the .*

video [720P] [DHR] _sp03.mp4
                      |
                      \D* Since the quantifier means zero or more, it matches nothing here and stops before the 3

video [720P] [DHR] _sp03.mp4
                       |
                      (\d+)

video [720P] [DHR] _sp03.mp4
                        |
                        .*

video [720P] [DHR] _sp03.mp4
                          |
                         mp4
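One way to check this breakdown yourself is to wrap each piece of the pattern in its own group. This is only a sketch: the extra parentheses are added for illustration and do not change what is matched, and I dropped the surrounding spaces from the sample string.

```python
import re

text = "video [720P] [DHR] _sp03.mp4"

# Same pattern as above, with extra groups (added for illustration)
# around each piece so we can see what it consumed.
m = re.match(r"(.*)(\D*)(\d+)(.*)mp4", text)
print(m.groups())
# ('video [720P] [DHR] _sp0', '', '3', '.')
```

The second group is empty, which is the \D* matching nothing, and the first group ends with the 0.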

Now, when we use \D+, the matching changes a bit, because the regex engine is forced to match at least one non-digit (\D+) before the digits ((\d+)). This consumes the p, which is the last non-digit before the digits.

That is, .* will match as much as it can up to the p, so that \D+ can match at least one non-digit, which is the p, and \d+ will then match the 03 part:

video [720P] [DHR] _sp03.mp4
|
.*

video [720P] [DHR] _sp03.mp4
 |
 .*
.....

video [720P] [DHR] _sp03.mp4
                     |
                     \D+ The last non-digit before the digits. Forced to match at least once.

video [720P] [DHR] _sp03.mp4
                      |
                      (\d+) matches the 0

video [720P] [DHR] _sp03.mp4
                       |
                      (\d+) continues on to the 3

video [720P] [DHR] _sp03.mp4
                        |
                        .*

video [720P] [DHR] _sp03.mp4
                          |
                         mp4
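You can compare the two patterns side by side in Python. A small sketch, using the sample string from the question without the surrounding spaces:

```python
import re

text = "video [720P] [DHR] _sp03.mp4"

# \D* may match nothing, so the greedy .* is free to swallow the 0.
star = re.match(r".*\D*(\d+).*mp4", text)

# \D+ must match at least one non-digit (the p), so .* has to stop
# before it and both digits are left for the capture group.
plus = re.match(r".*\D+(\d+).*mp4", text)

print(star.group(1), star.span(1))  # 3 (23, 24)
print(plus.group(1), plus.span(1))  # 03 (22, 24)
```

The spans show that the capture group starts one character earlier with \D+, because .* was forced to give up the 0.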
