Sergey Kritskiy

Reputation: 2269

regex match special case

I have a cinematic scenario with a bunch of strings like this:

80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla

My goal is to match all the numbers up to the : in the selected sequence (the selection is 80101_ in this example, i.e. strings #2, #3, #5 and #6). The pattern should also match strings that have no number (like 80101_:Blablab, string #4), but it should not match the string with _intertitle (string #1).

My current regex looks like this (code in Python):

selection = "80101"  # I'm getting this from elsewhere
pattern = selection + r"_\d*"

This matches all the strings with or without numbers, but it also matches the string with _intertitle. If I modify my pattern to "\d[^:]*", it no longer matches _intertitle, but it also stops matching the string without numbers. I can't get the pattern right; could anyone please point me in the right direction? Thanks.
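For reference, a minimal reproduction of both attempts on three of the sample strings. The second pattern is assumed to be selection + "_" + "\d[^:]*", since the full expression isn't spelled out above; re.match is also an assumption about how the pattern is applied.

```python
import re

selection = "80101"  # obtained elsewhere in the real code

lines = [
    "80101_intertitle:Blablabla",
    "80101_1:BlablablaBlablabla",
    "80101_:BlablablaBlablablaBlablabla",
]

# First attempt: zero or more digits after the underscore
first = re.compile(selection + r"_\d*")
# Second attempt (assumed form): one digit, then anything up to ':'
second = re.compile(selection + r"_\d[^:]*")

first_hits = [bool(first.match(line)) for line in lines]
second_hits = [bool(second.match(line)) for line in lines]

print(first_hits)   # the _intertitle line is matched as well
print(second_hits)  # the line without a number is no longer matched
```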

Upvotes: 1

Views: 280

Answers (4)

suit

Reputation: 581

I think you should add "(?=:)" at the end of your pattern:

r"80101_\d*(?=:)"

This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.
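A quick sketch of how this could look with the selection variable from the question. The use of re.escape and re.match here is my own assumption about how the pattern is applied, not something stated in the question.

```python
import re

selection = "80101"  # assumed to come from elsewhere, as in the question

# re.escape is a precaution in case the selection ever contains
# regex metacharacters; for plain digits it changes nothing.
pattern = re.compile(re.escape(selection) + r"_\d*(?=:)")

lines = [
    "80101_intertitle:Blablabla",
    "80101_1:BlablablaBlablabla",
    "80101_2:Blablabla",
    "80101_:BlablablaBlablablaBlablabla",
    "80101_3:BlablablaBlablabla",
    "80101_11:Blablabla",
    "801_1:Blablabla",
    "801_2:Blablabla",
]

matched = []
for line in lines:
    m = pattern.match(line)  # match() only tries the start of the line
    if m:
        matched.append(m.group(0))

print(matched)
# ['80101_1', '80101_2', '80101_', '80101_3', '80101_11']
```

The lookahead does not consume the ":", so the match itself ends right after the digits.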

Upvotes: 1

mrCarnivore

Reputation: 5078

Yes, that is easily done:

import re

s = '''80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla'''

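# First pass: lines that have at least one digit after the underscore
matches = re.findall(r'(80101_\d+:.*)', s)
for match in matches:
    print(match)
# Second pass: lines with no digits between the underscore and the colon
# (results are therefore grouped by pattern, not in input order)
matches = re.findall(r'(80101_:.*)', s)
for match in matches:
    print(match)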

Upvotes: 0

Jerry

Reputation: 71548

You could use a negative lookahead:

80101_\d*(?!intertitle)

That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used.

regex101 demo

Your pattern could be written as:

pattern = selection + r"_\d*(?!intertitle)"
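As a rough illustration, here is how that pattern behaves on a few inputs. The "80101_outro" line is a hypothetical label that does not appear in the question; it is only there to show that the lookahead excludes the literal text "intertitle" and nothing else.

```python
import re

selection = "80101"  # assumed to come from elsewhere, as in the question
pattern = re.compile(re.escape(selection) + r"_\d*(?!intertitle)")

samples = [
    "80101_intertitle:Blablabla",
    "80101_:Blablabla",
    "80101_11:Blablabla",
    "80101_outro:Blablabla",  # hypothetical label, not from the question
]

results = {}
for sample in samples:
    m = pattern.match(sample)
    results[sample] = m.group(0) if m else None

for sample, matched in results.items():
    print(sample, "->", matched)
# 80101_intertitle:Blablabla -> None
# 80101_:Blablabla -> 80101_
# 80101_11:Blablabla -> 80101_11
# 80101_outro:Blablabla -> 80101_
```

If other non-numeric labels can occur, requiring the ":" right after the digits may be the safer choice.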

Upvotes: 1

Kasravnd

Reputation: 107297

You need anchors and the multiline flag. You should also add :.* at the end of the regex to match the whole line.

^80101_\d*:.*$

See the Demo: https://regex101.com/r/yqGgrv/1

Here is the corresponding Python code as well:

In [1]: s = """80101_intertitle:Blablabla
   ...: 80101_1:BlablablaBlablabla
   ...: 80101_2:Blablabla
   ...: 80101_:BlablablaBlablablaBlablabla
   ...: 80101_3:BlablablaBlablabla
   ...: 80101_11:Blablabla
   ...: 801_1:Blablabla
   ...: 801_2:Blablabla"""

In [2]: import re
In [4]: re.findall(r'^80101_\d*:.*$', s, re.M)
Out[4]: 
['80101_1:BlablablaBlablabla',
 '80101_2:Blablabla',
 '80101_:BlablablaBlablablaBlablabla',
 '80101_3:BlablablaBlablabla',
 '80101_11:Blablabla']

Upvotes: 0
