Reputation: 219
I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((\\?\\!endText).)*
) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText
and endText
are \\@\\@ASSET_ID\\@\\@_<YYYY_MM_DD>
. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .*
after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!
Upvotes: 1
Views: 136
Reputation: 6248
First of all, why is there a .*
Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText
. You don't have to do the .*
.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx)
is a huge no-no, and can almost always be replaced with this: xxx
. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
<
. You consume it.<table border=2>
, then you have, at minimum, so far consumed <t
, if not more.table
, you'll never find it, because you have consumed the t
. So replace that +
with a *
.?
.The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new Regex writers, but not suggested for seasoned, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
Upvotes: 3