Problems searching HTML with Regular Expressions

Question

I'm trying to use regular expressions for the first time and having some trouble, maybe with my syntax.

Here's a sample string contained in my source file I'd like to find:

Type = Creature / Animal / Elephant

"Type = " is static, however the three values between the forward slashes can change.

The search string I'm using is:

\bType = .*/.*/.*\b

My search string works fine; however, my source file is HTML, and some of the strings have HTML code embedded:

Type = Creature / Animal / Elephant 
Type = Creature / Animal / Elephant

Type = Creature / Animal / Elephant

Stuff like that (it's not very good HTML, maybe copy-pasted from Microsoft Word?)

For my search expression, this is one of the results:

Type = Creature / Many Fish / Tuna



I don't understand why the result isn't stopping at "&" or "<" after Tuna.

Any thoughts on how my expression has to be changed to handle these variants?

I'm working in VBA in Microsoft Excel, using the Microsoft VBScript Regular Expressions 5.5 library. Thank you.

femtoRgon · Accepted Answer

Your regex:

.*/.*/.*\b

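is consuming too much, because .* matches greedily: each .* grabs as much of the line as it can, and the engine only backs off as far as it must for the rest of the pattern to match. You could make them all reluctant (.*?), but from your description it's a bit unclear where the match should then stop. So, instead, this specifies more precisely what should be matched: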

[^/]*/[^/]*/ \w+

Instead of .*, this uses [^/]*, which matches any run of characters other than "/", so a value can never greedily consume past a slash. That matters particularly when there are trailing slashes later in the line, as in a couple of your examples. The final " \w+" is a space followed by one or more word characters (letters, digits, underscores). It will not consume whitespace or "&", but it sounds like that is the intent.
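To see the difference side by side, here is a minimal sketch. It uses Python's re module purely for illustration, on the assumption that greedy quantifiers, negated character classes, \w and \b behave the same way in the VBScript RegExp engine. The sample line and the first_match helper are made up, since the question's exact markup isn't shown.

```python
import re

# Hypothetical sample line: the question does not show its exact markup.
SAMPLE = "Type = Creature / Many Fish / Tuna&nbsp;</span></p>"

# The pattern from the question: every .* is greedy.
GREEDY = re.compile(r"\bType = .*/.*/.*\b")

# The pattern suggested above, with the static "Type = " prefix added.
PRECISE = re.compile(r"Type = [^/]*/[^/]*/ \w+")


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


greedy_result = first_match(GREEDY, SAMPLE)
precise_result = first_match(PRECISE, SAMPLE)

print(greedy_result)   # runs on past "Tuna" into the markup
print(precise_result)  # stops right after "Tuna"
```

The greedy version reaches into the closing tags because each of them contains a "/" that .* can settle on. One thing to watch with the precise version: the final \w+ matches a single word only, so a third value such as "African Elephant" would be cut off after "African".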

Really though, I suspect the better solution for you is to not use regex for this at all.
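As a rough sketch of what that could look like: strip the markup with an HTML parser first, then split the remaining text on "/". This is Python rather than VBA, and TextExtractor and parse_type_line are hypothetical names invented for the example. It also assumes each fragment holds a single "Type = ..." entry with exactly three values.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text of an HTML fragment, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_type_line(html_fragment):
    """Return the three values after 'Type = ' as a list, or None."""
    extractor = TextExtractor()
    extractor.feed(html_fragment)
    extractor.close()  # flush any text still buffered by the parser

    # &nbsp; decodes to U+00A0; treat it as an ordinary space.
    text = "".join(extractor.chunks).replace("\u00a0", " ")

    prefix = "Type = "
    start = text.find(prefix)
    if start == -1:
        return None

    values = [part.strip() for part in text[start + len(prefix):].split("/")]
    return values if len(values) == 3 else None


print(parse_type_line("<p>Type = <b>Creature</b> / Many Fish / Tuna&nbsp;</p>"))
```

Because the tags and entities are gone before any matching happens, it no longer matters where the markup sits inside the line, and multi-word values survive intact.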
