SLeepdepD

Reputation: 91

Problems searching HTML with Regular Expressions

I'm trying to use regular expressions for the first time and having some trouble, maybe with my syntax.

Here's a sample string contained in my source file I'd like to find:

Type = Creature / Animal / Elephant

"Type = " is static; however, the three values separated by the forward slashes can change.

The search string I'm using is:

\bType = .*/.*/.*\b

My search string works fine on plain text. However, my source file is HTML and some of the strings have HTML code embedded:

Type = Creature / Animal / Elephant 
Type = Creature / Animal / Elephant<br />
Type = Creature / Animal / Elephant</span></span></strong>

Stuff like that (it's not very good HTML; maybe it was copy-pasted from Microsoft Word?).

For my search expression, this is one of the results:

Type = Creature / Many&nbsp;Fish&nbsp;/ Tuna&nbsp; </span></span></li

I don't understand why the result isn't stopping at "&" or "<" after Tuna.

Any thoughts on how my expression has to be changed to handle these variants?

I'm working in VBA in Microsoft Excel, using the Microsoft VBScript Regular Expressions 5.5 library. Thank you.

Upvotes: 0

Views: 119

Answers (1)

femtoRgon

Reputation: 33351

Your regex:

.*/.*/.*\b

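Is consuming too much, since .* matches greedily: each .* takes as many characters as it can, slashes included, so the match runs on to the last slashes on the line (here, the ones inside the closing tags) and stops at the final word boundary. You could make each quantifier reluctant (.*?) instead, but it is a bit unclear exactly where you want the match to end, which makes that hard to get right. So, instead, this will specify more precisely what should be matched: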

[^/]*/[^/]*/ \w+

Rather than .*, this uses [^/]*, meaning any run of characters other than "/", so it cannot greedily consume past a slash. That matters when there are more slashes later on the line, as in the closing tags in a couple of your examples. The final " \w+" is a space followed by one or more word characters (letters, digits, underscores). It will not consume whitespace or &, so the match ends after the first word of the last value, which sounds like the intent.
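To see the difference between the two patterns, here is a small sketch. It is written in Python's re module rather than VBScript, on the assumption that the constructs involved (.*, [^/]*, \w+, \b) behave the same way in both engines. The "Type = " prefix from the question is added back to the second pattern, the last sample line is an assumed reconstruction of the source behind the bad result in the question, and first_match is a helper name made up for this example.

```python
import re

# Sample lines from the question. The last one is an assumption about what
# the source looked like for the problematic result shown in the question.
samples = [
    "Type = Creature / Animal / Elephant",
    "Type = Creature / Animal / Elephant<br />",
    "Type = Creature / Animal / Elephant</span></span></strong>",
    "Type = Creature / Many&nbsp;Fish&nbsp;/ Tuna&nbsp; </span></span></li>",
]

# Original pattern: every .* is greedy and is allowed to cross slashes.
greedy = re.compile(r"\bType = .*/.*/.*\b")

# Suggested pattern: [^/]* cannot cross a slash, and the last value is
# limited to a space plus one run of word characters.
bounded = re.compile(r"\bType = [^/]*/[^/]*/ \w+")


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for line in samples:
    print("greedy :", first_match(greedy, line))
    print("bounded:", first_match(bounded, line))
    print()
```

On the last sample, the greedy pattern returns the same over-long string shown in the question (ending in "</li"), while the bounded pattern stops at "Tuna". One limitation to keep in mind: because the final part is a single \w+, a last value made of several words would be cut off after its first word.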

Really though, I suspect the better solution for you is to not use regex for this at all.
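The answer does not say what to use in place of regex. One possible reading is to let an HTML parser strip the tags and decode the entities first, and then split the remaining text on "/". Below is a minimal sketch of that idea in Python using the standard library's html.parser; it is an illustration only, not VBA, and the names TextExtractor and parse_type_line are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parse_type_line(fragment):
    """Return the slash-separated values after 'Type = ', or None if absent."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser

    # &nbsp; decodes to a non-breaking space; treat it as a normal space.
    text = "".join(parser.parts).replace("\xa0", " ")

    prefix = "Type = "
    start = text.find(prefix)
    if start == -1:
        return None

    # Keep only the rest of that line, then split it on the slashes.
    rest = text[start + len(prefix):].split("\n", 1)[0]
    return [value.strip() for value in rest.split("/")]


print(parse_type_line(
    "Type = Creature / Many&nbsp;Fish&nbsp;/ Tuna&nbsp; </span></span></li>"
))
```

Because the tags and entities are gone before the split, multi-word values such as "Many Fish" come through whole, which the \w+ pattern cannot do.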

Upvotes: 1
