Reputation: 15010

Explanation for a complicated regular expression

I have some text data as follows.

{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}

From the text above, I am extracting only the title (Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance) using the following regular expression:

appcodename = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

I am trying to understand how the above regular expression works.

(?!<br>) is a negative lookahead for <br>.

((?:(?!<br>).)+) - what does this mean? Can someone break it down for me? Also, how many capture groups are there in the regular expression?

Upvotes: 1

Answers (3)

Wiktor Stribiżew

Reputation: 627093

You do not need such a complicated regex to get the title. Use

Title:\s*(.*?)(?=\s*<br/?>)

See demo

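We match Title:, then optional whitespace with \s*, then any characters up to <br> (or <br/>) with (.*?)(?=\s*<br/?>). The lazy (.*?) captures the title, and the lookahead (?=\s*<br/?>) stops it before any whitespace that precedes the tag, without consuming the tag.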
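
As a quick check, here is a minimal sketch that runs this pattern from Python. The `message` variable is assumed to hold the sample text from the question as a plain string; the variable names are only illustrative.

```python
import re

# Sample text from the question, assumed to be available as a plain string.
message = ('{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: '
           "India's Latest News, Business, Sport, Weather, Travel, Technology, "
           'Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}')

# Lazy capture, stopped by a lookahead for optional whitespace plus <br> or <br/>.
match = re.search(r'Title:\s*(.*?)(?=\s*<br/?>)', message)
title = match.group(1)
print(repr(title))
```

Because the lookahead also accounts for the whitespace before the tag, the captured title should not end with a trailing space.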

As for (?:(?!<br>).)+, it matches 1 or more characters, each of which is not the start of a <br> sequence (the group itself is non-capturing). There is an SO post where this construction is explained in detail.
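
To see the effect of that construction, the following sketch runs the pattern from the question and compares its result with a plain string slice that ends at the first <br>. The `message` string is assumed to be the sample text from the question.

```python
import re

# Sample text from the question, assumed to be available as a plain string.
message = ('{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: '
           "India's Latest News, Business, Sport, Weather, Travel, Technology, "
           'Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}')

# Pattern from the question: consume characters one at a time,
# as long as "<br>" does not start at the current position.
tempered = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

# The same span, taken with plain string operations for comparison.
start = message.index('Title: ') + len('Title: ')
sliced = message[start:message.index('<br>')]

print(repr(tempered))
print(tempered == sliced)
```

Note that this pattern keeps the space that precedes the first <br>, since nothing in it excludes trailing whitespace.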

The regex101 Regex Debugger (go to the Regex Debugger tab, then click + on the right) visualizes what that construction is doing: it checks whether the text at the current position starts with <br>, and if not, consumes one character, backtracks where needed, and so on.

As for the question regarding how many capture groups there are in the regular expression, Title: ((?:(?!<br>).)+) has 1 capturing group, ((?:(?!<br>).)+), and 1 non-capturing group, (?:(?!<br>).).
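
The group count can also be checked from Python. This is a small sketch using the `groups` attribute of a compiled pattern; the short sample string is made up for illustration.

```python
import re

pattern = re.compile(r'Title: ((?:(?!<br>).)+)')

# Number of capturing groups in the pattern; (?:...) and (?!...) are not counted.
group_count = pattern.groups
print(group_count)

# Hypothetical short input, used only to show what the single group holds.
found = pattern.search('Title: abc<br>def')
print(found.groups())
```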

Upvotes: 3

Andie2302

Reputation: 4897

What ((?:(?!<br>).)+) means is:

((?:(?!<br>).)+)
^... Match the regex and capture its match into backreference 1

((?:(?!<br>).)+)
 ^... Match the regex (non capturing group)

((?:(?!<br>).)+)
    ^... Assert that it is not possible to match the regex <br>

((?:(?!<br>).)+)
            ^... Match a single character that is not a line break character

((?:(?!<br>).)+)
              ^... Between one and unlimited times
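
The same breakdown can be written into the pattern itself with `re.VERBOSE`, which lets a regex carry comments. This is only a sketch: the `Title: ` prefix from the question is included, and the short sample string is made up for illustration.

```python
import re

commented = re.compile(r'''
    Title:[ ]        # literal prefix; the space sits in a class because VERBOSE ignores bare spaces
    (                # capturing group 1
      (?:            # non-capturing group, repeated by the + below
        (?!<br>)     # negative lookahead: <br> must not start at this position
        .            # any single character except a line break
      )+             # one or more times
    )
''', re.VERBOSE)

compact = re.compile(r'Title: ((?:(?!<br>).)+)')

# Hypothetical short input to compare the two spellings of the pattern.
sample = 'Title: abc <br>def'
verbose_result = commented.search(sample).group(1)
compact_result = compact.search(sample).group(1)
print(repr(verbose_result), verbose_result == compact_result)
```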

Upvotes: 1

anubhava

Reputation: 785581

First of all, you don't need a lookahead here. What you're doing can also be done with this simpler regex:

>>> re.search(r'Title: *(.+?) *<br>', message).group(1)
"Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance"

btw your regex:

Title: ((?:(?!<br>).)+)

is using a negative lookahead (?!<br>) which checks that <br> does not start at the current position before matching each character after the literal text Title:.
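
To make the check-then-consume behaviour concrete, here is a rough emulation of it in plain Python. The helper name `take_until` and the short sample string are hypothetical, and this is not how the regex engine is implemented; it only mirrors the logic of the lookahead followed by the dot.

```python
import re

def take_until(text, start, stop='<br>'):
    """Hypothetical helper mirroring (?:(?!<br>).)+ from a given start index."""
    end = start
    # Keep consuming while the stop string does not begin at the current
    # position and the current character is not a newline (which "." skips).
    while end < len(text) and text[end] != '\n' and not text.startswith(stop, end):
        end += 1
    # Unlike the regex "+", this loop may return an empty string.
    return text[start:end]

# Made-up short input, not the data from the question.
sample = 'Title: abc <br>def'
prefix = 'Title: '
emulated = take_until(sample, sample.index(prefix) + len(prefix))
from_regex = re.search(r'Title: ((?:(?!<br>).)+)', sample).group(1)
print(repr(emulated), emulated == from_regex)
```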

Upvotes: 2
