Reputation: 15010
I have some text data as follows.
{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}
From the above text, I am extracting only the title (Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance) using the following regular expression:
appcodename = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)
I am trying to understand how the above regular expression works.
(?!<br>)
is a negative lookahead for <br>
(?:(?!<br>).)+
- what does this mean? Can someone break it down for me?
Also, how many capture groups are there in the regular expression?
Upvotes: 1
Views: 80
Reputation: 627093
You do not need such a complicated regex to get the title. Use
Title:\s*(.*?)(?=\s*<br/?>)
We match Title:, then optional whitespace with \s*, then as few characters as possible up to <br> or <br/> (and any whitespace before it) with (.*?)(?=\s*<br/?>).
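A minimal sketch of how that pattern could be used from Python, assuming the sample text from the question is stored in a variable named message (the variable names here are illustrative only):

```python
import re

# Sample input copied from the question.
message = ('{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: '
           "India's Latest News, Business, Sport, Weather, Travel, Technology, "
           'Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8'
           '<br>Language: en-GB"}')

# Lazy capture that stops right before optional whitespace plus <br> or <br/>.
pattern = re.compile(r'Title:\s*(.*?)(?=\s*<br/?>)')

match = pattern.search(message)
title = match.group(1) if match else None  # None when no Title/<br> pair exists
print(title)
```

Because the lookahead includes \s*, the space between "Finance" and the first <br> is left out of the captured title.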
As for (?:(?!<br>).)+, it matches (without capturing) 1 or more characters, checking before each one that <br> does not start at that position. There is an SO post where this construction is explained in detail.
The Regex Debugger tab on regex101 (click + on the right) visualizes what that construction is doing: at each position it checks whether <br> starts there and, if not, consumes one character and repeats.
As for the question of how many capture groups there are in the regular expression: Title: ((?:(?!<br>).)+) has 1 capturing group, ((?:(?!<br>).)+), and 1 non-capturing group, (?:(?!<br>).).
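The group count can also be checked from Python. This is a small sketch using a shortened, made-up sample string rather than the full text from the question:

```python
import re

pattern = re.compile(r'Title: ((?:(?!<br>).)+)')

# Pattern.groups reports the number of capturing groups only;
# the (?:...) group and the (?!...) lookahead are not counted.
group_count = pattern.groups
print(group_count)

# Hypothetical short sample, used only for illustration.
sample = 'Title: Foo bar<br>baz'
captured = pattern.search(sample).groups()
print(captured)
```

The tuple returned by groups() has a single element, which is what group(1) returns in the question's code.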
Upvotes: 3
Reputation: 4897
What ((?:(?!<br>).)+) means is:
( ... ) - the outer parentheses: match the enclosed regex and capture its match into backreference 1
(?: ... ) - match the enclosed regex (non-capturing group)
(?!<br>) - assert that it is not possible to match the regex <br> starting at this position
. - match a single character that is not a line break character
+ - repeat the non-capturing group between one and unlimited times
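To make the breakdown concrete, here is a rough emulation of (?:(?!<br>).)+ as a plain loop. The helper name tempered_match and the short sample string are made up for this illustration; the loop only mirrors the simple case where nothing follows the group in the pattern, so no backtracking is needed:

```python
import re

def tempered_match(text, start, stop='<br>'):
    """Hypothetical helper: emulate (?:(?!<br>).)+ beginning at index start."""
    i = start
    # (?!<br>) : stop as soon as the stop string begins at position i
    # .        : otherwise consume one character, but never a newline
    while i < len(text) and text[i] != '\n' and not text.startswith(stop, i):
        i += 1
    # + requires at least one consumed character
    return text[start:i] if i > start else None

sample = 'Title: Some title <br>Product: Gecko'
manual = tempered_match(sample, len('Title: '))
via_regex = re.search(r'Title: ((?:(?!<br>).)+)', sample).group(1)
print(repr(manual), repr(via_regex))
```

Both approaches stop at the first <br> and keep the space in front of it, since nothing in the pattern trims trailing whitespace.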
Upvotes: 1
Reputation: 785581
First of all, you don't need a lookahead here. What you're doing can also be done using this simpler regex:
>>> re.search(r'Title: *(.+?) *<br>', message).group(1)
"Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance"
btw your regex Title: ((?:(?!<br>).)+) is using a negative lookahead, (?!<br>), which checks that <br> does not start at the current position before matching each character after the literal text Title: .
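The two patterns are not fully interchangeable, which the following sketch tries to show. The sample strings are invented for the comparison and are not taken from the question:

```python
import re

lookahead_re = re.compile(r'Title: ((?:(?!<br>).)+)')  # regex from the question
simple_re = re.compile(r'Title: *(.+?) *<br>')         # simpler alternative

# Case 1: a space sits between the title and <br>.
with_br = 'Title: Some title <br><br>Product: Gecko'
kept_space = lookahead_re.search(with_br).group(1)
trimmed = simple_re.search(with_br).group(1)
print(repr(kept_space), repr(trimmed))

# Case 2: no <br> at all after the title.
without_br = 'Title: No break here'
still_matches = lookahead_re.search(without_br).group(1)
no_match = simple_re.search(without_br)
print(repr(still_matches), no_match)
```

The simpler pattern drops the trailing space but needs a literal <br> to be present, while the lookahead version keeps the space and still matches up to the end of the line when there is no <br>.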
Upvotes: 2