Daan

Reputation: 1437

Extract title from HTML content

Given the following HTML content (limited to the absolute minimum I require):

[Image of the HTML snippet; not reproduced here]

How would I be able to extract Page Title using Regex?

Upvotes: 0

Views: 833

Answers (1)

Steve Chambers

Reputation: 39394

As others have commented, regular expressions may not be suitable if you need a bulletproof method. For example, a regex would struggle to tell whether a <title> tag is part of a quoted string within the HTML. That's a recurring response on Stack Overflow for questions like this. Personally, though, I think you've got a point that a parser would be overkill for such a simple extraction. If you're looking for a method that works most of the time, one of the following should suffice.

Option 1: Lookbehind / lookahead

(?<=<title[\s\n]*>[\s\n]*)((?![\s\n]*</title[\s\n]*>).)*

This uses a lookbehind for the opening tag and a lookahead for the closing tag, so the match itself contains only the title text. .NET's regex engine allows unbounded repetition inside a lookbehind, so you can even allow for whitespace/newline characters between the tag name and the closing angle bracket (see this answer).
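
Variable-width lookbehind is engine-specific, so the lookaround approach does not port everywhere. As a rough illustration (not part of the original answer, which targets .NET), the Python sketch below shows that the built-in `re` module rejects a lookbehind containing `\s*`, and falls back to a fixed-width lookbehind plus `strip()`. The names `compiles` and `title_via_lookaround`, the lazy `.*?`, and the `re.DOTALL` flag are my own assumptions for the example.

```python
import re

# A pattern in the spirit of Option 1: the lookbehind contains `\s*`,
# which makes its width variable.
VARIABLE_LOOKBEHIND = r"(?<=<title\s*>\s*).*?(?=\s*</title\s*>)"

# Fixed-width alternative (assumption): the lookbehind is exactly "<title>",
# so whitespace inside the tags is not tolerated and surrounding whitespace
# in the text is trimmed afterwards instead.
FIXED_LOOKBEHIND = re.compile(r"(?<=<title>).*?(?=</title>)", re.DOTALL)


def compiles(pattern):
    """Return True if Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def title_via_lookaround(html):
    """Return the trimmed title text, or None if no <title> pair is found."""
    match = FIXED_LOOKBEHIND.search(html)
    return match.group(0).strip() if match else None
```

Here `compiles(VARIABLE_LOOKBEHIND)` is false in Python, whereas .NET accepts that kind of pattern; `title_via_lookaround` only handles the plain `<title>` spelling.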

Option 2: Capturing group

<title[\s\n]*>[\s\n]*(.*)[\s\n]*</title[\s\n]*>

Similar but slightly simpler - the whole regex match includes the start and end tags. The first (and only) capturing group (.*) captures the bit that is of interest in between.
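
The capturing-group approach works in most regex engines. Here is a minimal sketch in Python (the answer itself is about .NET, so treat this as an illustration only). The function name `extract_title` is hypothetical, and I've made the group lazy (`.*?`) and added the `re.IGNORECASE` and `re.DOTALL` flags, which are my assumptions rather than part of the pattern above.

```python
import re

# Option 2 with `[\s\n]*` shortened to `\s*` (equivalent, since `\s`
# already matches newlines). The lazy group and the flags are assumptions
# added for this sketch.
TITLE_PATTERN = re.compile(
    r"<title\s*>\s*(.*?)\s*</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_PATTERN.search(html)
    # Group 0 is the whole match including the tags; group 1 is the text.
    return match.group(1) if match else None


if __name__ == "__main__":
    page = "<html><head><title> Page Title </title></head></html>"
    print(extract_title(page))
```

It shares the weakness mentioned at the top: given `<!-- <title>Old</title> --><title>Real</title>`, it returns `Old`, because a regex cannot tell that the first tag sits inside a comment.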


Upvotes: 1