webpointz

Reputation: 39

RegEx that will find quoted strings but NOT inside HTML tags

I have been looking for a regular expression that will identify a quoted string in the content of an HTML page but NOT if the quotes are part of attributes of HTML tags.

Example:

<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>

In the line above, I want to find the "quoted text" string but not id="123" or class="test".

I have tried a few expressions, but none of them work.

The following regex matches the HTML tags in the example above and excludes the sentence content, but I want it to do the opposite:

<[^>]+>

Upvotes: 1

Views: 261

Answers (2)

Kenneth K.

Reputation: 3039

In this particular context, I don't think you're going to have many guarantees. There are too many options for how quoted strings can be put together within a snippet of HTML. However, based on the specific example you gave above, the following expression would find "quoted text":

(?<=(?:^|>)[^<>]*)"[^"]+"(?=[^<>]*(?:<|$))
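Note that this expression depends on a variable-length lookbehind, which the .NET regex engine accepts but some others reject (Python's `re` module, for instance, requires fixed-width lookbehinds). As a rough alternative for such engines, here is a minimal Python sketch that reuses the tag pattern from the question to cut the tags out first and then searches only the text that is left. The function name `quoted_outside_tags` is made up for illustration, and the sketch assumes that no attribute value contains `>` and that a quoted string never spans an inline tag.

```python
import re

TAG = re.compile(r"<[^>]+>")      # the tag pattern from the question
QUOTED = re.compile(r'"[^"]+"')   # a run of text between two double quotes

def quoted_outside_tags(html):
    """Return double-quoted strings found in the text between tags."""
    found = []
    # Splitting on the tag pattern leaves only the text segments,
    # so attribute values never reach the quote search.
    for text in TAG.split(html):
        found.extend(QUOTED.findall(text))
    return found

sample = '<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>'
print(quoted_outside_tags(sample))  # ['"quoted text"']
```

Like the expression above, this only holds for simple markup such as the example in the question.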

Upvotes: 0

PhonicUK

Reputation: 13864

If you want to parse HTML to get useful things out of it, use HTMLAgilityPack - it makes it fairly straightforward to do things like this.

See also: You can't use Regex'es to parse HTML
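To illustrate the parser-first idea without guessing at HTMLAgilityPack's API, here is a minimal sketch that swaps in Python's standard-library `html.parser` instead. It lets the parser separate tags from text and applies a simple quote pattern to the text nodes only. The class name `QuotedTextFinder` and the helper `find_quoted` are hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser

QUOTED = re.compile(r'"[^"]+"')  # a run of text between two double quotes

class QuotedTextFinder(HTMLParser):
    """Collects double-quoted strings that appear in text nodes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_data(self, data):
        # handle_data receives text content only; attributes are passed
        # to handle_starttag, so they are never searched here.
        self.found.extend(QUOTED.findall(data))

def find_quoted(html):
    parser = QuotedTextFinder()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.found

sample = '<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>'
print(find_quoted(sample))  # ['"quoted text"']
```

One caveat with this sketch: the contents of script and style elements are also delivered to `handle_data`, so they would need to be skipped if the page contains them.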

Upvotes: 3
