Reputation: 39
I have been looking for a regular expression that will identify a quoted string in the content of an HTML page but NOT if the quotes are part of attributes of HTML tags.
Example:
<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>
In the above line, I want to find the "quoted text" string but not id="123" or class="test".
I have tried a few expressions, but none of them work.
The following regex matches the HTML tags in the above example and excludes the sentence content, but I want it to do the opposite:
<[^>]+>
Upvotes: 1
Views: 261
Reputation: 3039
In this particular context, I don't think a regex can give you many guarantees: there are too many ways quoted strings can appear within a snippet of HTML. However, based on the specific example you gave above, the following expression would find "quoted text":
(?<=(?:^|>)[^<>]*)"[^"]+"(?=[^<>]*(?:<|$))
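The expression above depends on a variable-length lookbehind, which .NET accepts but some other engines (Python's built-in re module, for example) reject. As a rough sketch of the same idea without lookbehind, one option is to match either a whole tag or a quoted string and keep only the quoted matches. The names TAG_OR_QUOTE and quoted_text_outside_tags are made up for illustration, and the pattern assumes attribute values never contain a ">" character:

```python
import re

# Alternation: the first branch consumes a whole tag (attributes included),
# the second branch captures a quoted string found outside any tag.
TAG_OR_QUOTE = re.compile(r'<[^>]+>|"([^"<>]+)"')

def quoted_text_outside_tags(html):
    # Group 1 is None whenever the tag branch matched, so those are skipped.
    return [m.group(1) for m in TAG_OR_QUOTE.finditer(html)
            if m.group(1) is not None]

sample = ('<p id="123">This is some "quoted text" in a '
          '<span class="test">sentence.</span></p>')
print(quoted_text_outside_tags(sample))  # ['quoted text']
```

Because each tag is consumed as a single match, the quotes inside id="123" and class="test" are never examined on their own. It carries the same lack of guarantees as the original expression on messier HTML.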
Upvotes: 0
Reputation: 13864
If you want to parse HTML to get useful things out of it, use HTMLAgilityPack - it makes it fairly straightforward to do things like this.
See also: You can't use Regex'es to parse HTML
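To illustrate the parser-based approach in general terms, here is a minimal sketch using Python's standard html.parser module rather than HTMLAgilityPack (a swapped-in library, chosen only to keep the example self-contained). The class QuotedTextFinder and the function find_quoted_text are hypothetical names. The parser hands over text nodes separately from tags, so attribute values never reach the quote search:

```python
import re
from html.parser import HTMLParser

class QuotedTextFinder(HTMLParser):
    """Collects double-quoted strings that appear in text nodes only."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_data(self, data):
        # Called for text between tags; attributes arrive elsewhere
        # (handle_starttag), so they are never searched here.
        self.found.extend(re.findall(r'"([^"]+)"', data))

def find_quoted_text(html):
    finder = QuotedTextFinder()
    finder.feed(html)
    finder.close()
    return finder.found

sample = ('<p id="123">This is some "quoted text" in a '
          '<span class="test">sentence.</span></p>')
print(find_quoted_text(sample))  # ['quoted text']
```

One limitation of this sketch: a quoted string that spans an inline tag, such as "quoted <b>text</b>", is split across several text nodes and will not be found.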
Upvotes: 3