liv2hak

Reputation: 15010

Explanation for a complicated regular expression

I have some text data as follows.

{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}

From the above text, I am extracting only the title (Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance) using the following regular expression:

appcodename = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

I am trying to understand how the above regular expression works.

(?!<br>) is a negative lookahead for <br>

((?:(?!<br>).)+) - what does this mean? Can someone break it down for me? Also, how many capture groups are there in the regular expression?

Upvotes: 1

Views: 80

Answers (3)

Wiktor Stribiżew

Reputation: 627093

You do not need such a complicated regex to get the title. Use

Title:\s*(.*?)(?=\s*<br/?>)

See demo

We match Title:, then optional whitespace with \s*, then any characters up to <br> or <br/> (and any whitespace right before it) with (.*?)(?=\s*<br/?>).
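A minimal sketch of how this pattern could be used from Python, assuming the sample text from the question is held in a variable named message (the name comes from the question's own snippet; the string literal below is just the sample retyped):

```python
import re

# Sample text from the question, split across lines only for readability.
message = (
    '{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",'
    "Title: Indian Herald: India's Latest News, Business, Sport, Weather, "
    "Travel, Technology, Entertainment, Politics, Finance "
    '<br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}'
)

# Lazy (.*?) grows one character at a time until the lookahead
# (?=\s*<br/?>) succeeds, so whitespace before <br> is left out.
match = re.search(r'Title:\s*(.*?)(?=\s*<br/?>)', message)
title = match.group(1) if match else None
print(repr(title))
```

Because the lookahead swallows the optional whitespace, the captured title should end at "Finance" without the trailing space that sits before the first <br> in the sample.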

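As for (?:(?!<br>).)+, it matches 1 or more characters, checking before each one that <br> does not begin at that position, so the match stops right before the first <br>. This construction is often called a tempered greedy token. There is an SO post where this construction is explained in detail.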

Here is an image from regex101 (go to the Regex Debugger tab, then click + on the right) visualizing what that construction does (at each position it checks whether <br> starts there and, if not, consumes one character, and so on):

[Image: regex101 Regex Debugger steps for (?:(?!<br>).)+]

As for the question regarding how many capture groups are there in the regular expression, Title: ((?:(?!<br>).)+) has 1 capturing (((?:(?!<br>).)+)) and 1 non-capturing ((?:(?!<br>).)) groups.
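If you want to check the group count yourself, a compiled pattern exposes it through its groups attribute. A small sketch, using a shortened, made-up sample string rather than the full text from the question:

```python
import re

pattern = re.compile(r'Title: ((?:(?!<br>).)+)')

# .groups counts capturing groups only; (?:...) and (?!...) are not counted.
print(pattern.groups)

# Hypothetical shortened sample, not the full text from the question.
sample = 'Title: Some Title <br>Product: Gecko'
m = pattern.search(sample)
print(m.groups())  # a tuple with one entry per capturing group
```

Note that with this pattern the captured text keeps the space that precedes <br>, since nothing in the pattern excludes it.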

Upvotes: 3

Andie2302

Reputation: 4897

What ((?:(?!<br>).)+) means is:

((?:(?!<br>).)+)
^... Match the regex and capture its match into backreference 1

((?:(?!<br>).)+)
 ^... Match the regex (non capturing group)

((?:(?!<br>).)+)
    ^... Assert that it is not possible to match the regex <br>

((?:(?!<br>).)+)
            ^... Match a single character, that is not a line break character 

((?:(?!<br>).)+)
              ^... Between one and unlimited times
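The same breakdown can be written into the pattern itself with re.VERBOSE, which lets you put a comment next to each piece. This is only a sketch: the Title: prefix is taken from the question, and the sample string is a shortened, made-up one.

```python
import re

pattern = re.compile(r"""
    Title:[ ]      # literal 'Title: ' (space kept in a class, VERBOSE ignores bare spaces)
    (              # capturing group 1 starts
      (?:          # non-capturing group: one step of the loop
        (?!<br>)   # assert that '<br>' does not start at this position
        .          # consume a single character that is not a line break
      )+           # repeat the step between one and unlimited times
    )              # capturing group 1 ends
""", re.VERBOSE)

# Hypothetical shortened sample, not the full text from the question.
sample = 'Title: Some Title <br>Product: Gecko'
captured = pattern.search(sample).group(1)
print(repr(captured))
```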

Upvotes: 1

anubhava

Reputation: 785581

First of all, you don't need a lookahead here. The same thing can be done with this simpler regex:

>>> re.search(r'Title: *(.+?) *<br>', message).group(1)
"Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance"

btw your regex:

Title: ((?:(?!<br>).)+)

uses a negative lookahead (?!<br>) which, before matching each character after the literal text Title: , checks that <br> does not start at that position.
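To make that check-then-consume behaviour concrete, here is a rough hand-written loop that mimics what (?:(?!<br>).)+ does in this particular pattern. The helper name tempered_match and the sample string are made up for illustration, and the loop only mirrors the regex in this simple case where nothing follows the group:

```python
import re

def tempered_match(text, start, stop="<br>"):
    """Hypothetical helper: emulate (?:(?!<br>).)+ starting at index start."""
    i = start
    # At each position: fail if stop begins here (the negative lookahead),
    # otherwise consume one non-newline character (the dot).
    while i < len(text) and text[i] != "\n" and not text.startswith(stop, i):
        i += 1
    return text[start:i] if i > start else None  # '+' needs at least one char

# Hypothetical shortened sample, not the full text from the question.
sample = "Title: Some Title <br>Product: Gecko"
prefix = "Title: "
start = sample.index(prefix) + len(prefix)

by_hand = tempered_match(sample, start)
by_regex = re.search(r"Title: ((?:(?!<br>).)+)", sample).group(1)
print(repr(by_hand), repr(by_regex))
```

Both results include the space before <br>, which is the one visible difference from the simpler Title: *(.+?) *<br> pattern above.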

Upvotes: 2
