Reputation: 1437
Given the following HTML content (limited to the absolute minimum I require):
How would I be able to extract Page Title using Regex?
Upvotes: 0
Views: 833
Reputation: 39394
As others have commented, regular expressions may not be suitable for a bullet-proof method. E.g. using regex, it would be difficult to check whether the <title> tag is part of a quoted string within the HTML. That's a recurring response on StackOverflow for questions like this. But personally, I think you've got a point that a parser would be overkill for such a simple extraction. If you're looking for a method that works most of the time, one of the following should suffice.
Option 1: Lookbehind / lookahead
(?<=<title[\s\n]*>[\s\n]*)((?![\s\n]*</title[\s\n]*>).)*
This uses lookbehind and lookahead for the tags, so the tags themselves are not part of the match. .NET has a sophisticated regex engine that allows unbounded quantifiers such as * inside a lookbehind, so you can even allow for whitespace/return characters between the tag name and the closing angle bracket (see this answer).
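For readers outside .NET, here is a minimal sketch of the same lookaround idea in Python. It is not the pattern above: Python's built-in `re` module only accepts fixed-width lookbehind, so this variant assumes the opening tag is written exactly as `<title>` with no inner whitespace. The names `TITLE_LOOKAROUND` and `title_via_lookaround` are made up for illustration.

```python
import re

# Fixed-width lookbehind: Python's re rejects \s* inside (?<=...),
# so this sketch assumes the opening tag is exactly "<title>".
# The negative lookahead is tested before each character is consumed,
# so matching stops just before optional whitespace plus the closing tag.
TITLE_LOOKAROUND = re.compile(
    r"(?<=<title>)(?:(?!\s*</title\s*>).)*",
    re.DOTALL | re.IGNORECASE,
)

def title_via_lookaround(html):
    """Return the text after <title> up to the closing tag, or None."""
    match = TITLE_LOOKAROUND.search(html)
    return match.group(0) if match else None

print(title_via_lookaround("<html><head><title>Page Title</title></head></html>"))
```

Because the tags sit inside lookarounds, the whole match is the title text itself. Leading whitespace after the opening tag is kept, so a caller may still want to call `.strip()` on the result.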
Option 2: Capturing group
<title[\s\n]*>[\s\n]*(.*)[\s\n]*</title[\s\n]*>
Similar but slightly simpler: the whole regex match includes the start and end tags, and the first (and only) capturing group (.*) captures the bit that is of interest in between.
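As a rough illustration of the capturing-group approach, the sketch below uses Python's `re` module rather than .NET. It deviates from the pattern above in two labeled ways: the group is lazy (`.*?`) so that it stops at the first closing tag and leaves trailing whitespace outside the group, and `re.DOTALL` is set so a title spread over several lines still matches. `TITLE_CAPTURE` and `title_via_group` are hypothetical names.

```python
import re

# Assumption: a lazy group (.*?) instead of the greedy (.*) shown above,
# so the match ends at the first </title> and the surrounding \s* absorbs
# leading and trailing whitespace instead of the group.
TITLE_CAPTURE = re.compile(
    r"<title\s*>\s*(.*?)\s*</title\s*>",
    re.DOTALL | re.IGNORECASE,
)

def title_via_group(html):
    """Return the contents of the first <title> element, or None."""
    match = TITLE_CAPTURE.search(html)
    # group(0) would be the whole match including the tags;
    # group(1) is only the text between them.
    return match.group(1) if match else None

print(title_via_group("<head>\n<title >\n  Page Title\n</title>\n</head>"))
```

This still has the limitation mentioned at the start: a `<title>` that appears inside a comment or a script string would be picked up just like the real one.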
Upvotes: 1