Crash

Reputation: 219

JavaScript + RegEx Complications- Searching Strings Not Containing SubString

I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:

matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');

data.replace(matcher, "$1");

The strangeness around the middle ( ((?!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?

EDIT: I understand that parsing HTML with RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say exactly what the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate values of startText and endText are @@ASSET_ID@@_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).

EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!
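For anyone hitting the same symptom, here is a minimal sketch of why the trailing .* matters: replace() only substitutes the portion of the string that the regex matched, so anything after the match stays in the result. The sample data and the startText and endText values below are made up for illustration and are not the real markup.

```javascript
// Hypothetical single-line sample; the real markup is generated on the fly.
const data = '<tr><td id="A_2020_01_01">1</td><td id="A_2020_01_02">2</td><td id="A_2020_01_03">3</td></tr>';
const startText = 'A_2020_01_01';
const endText = 'A_2020_01_02';

// The capture group from the question, built once so both variants share it.
const body = '(<[^>]+' + startText + '((?!' + endText + ').)*' + endText + ')';

// Without a trailing .*, the text after the match is left untouched by replace().
const withoutTail = data.replace(new RegExp('.*' + body), '$1');

// With a trailing .*, the rest of the line is matched too and replaced by group 1 only.
const withTail = data.replace(new RegExp('.*' + body + '.*'), '$1');

console.log(withoutTail);
console.log(withTail);
```

In this sketch withoutTail still ends with the remaining cells, while withTail contains only the captured range.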

Upvotes: 1

Views: 136

Answers (1)

Suamere

Reputation: 6248

First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:

This is my Text

And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.

That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:

<[^>]+xxx((?!zzz).)*zzz

From there I examine what it's doing.

  1. You are looking for an HTML opening delimiter <. You consume it.
  2. You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, already consumed <t, if not more.
  3. You are now looking for the StartText. If that StartText is table, you'll never find it, because you have already consumed the t. So replace that + with a *.
  4. The group ((?!zzz).)* then keeps consuming characters for as long as the closing text does NOT start at the current position. The asterisk is greedy; I suggest making it lazy by adding a ?.
  5. Once the lookahead fails, because the closing text comes next, the regex looks for the closing text and consumes it successfully.
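Steps 2 and 3 can be checked with a quick sketch. The tag below is the one from the example above, and the start text is assumed to be the tag name itself, purely for illustration.

```javascript
const tag = '<table border=2>';

// [^>]+ must consume at least one character after "<",
// so "table" directly after "<" can never be the start text.
const plusMatches = /<[^>]+table/.test(tag);

// [^>]* may consume zero characters, so the tag name itself can match.
const starMatches = /<[^>]*table/.test(tag);

console.log(plusMatches, starMatches);
```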

The result of that logic:

<[^>]*xxx((?!zzz).)*?zzz

If you're going to use a dot anyway, which is okay for new Regex writers but not recommended for seasoned ones, I'd go with this:

<[^>]*xxx.*?zzz
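As a rough comparison, the sketch below runs both patterns, plus a greedy dot for contrast, against a made-up string containing two zzz markers. The sample string is hypothetical; xxx and zzz are the same placeholders used above.

```javascript
// Hypothetical sample with one start marker and two end markers.
const sample = '<td id="xxx">a</td><td id="zzz">b</td><td id="zzz">c</td>';

// Lookahead version and lazy-dot version: both stop at the first zzz.
const tempered = sample.match(/<[^>]*xxx((?!zzz).)*?zzz/)[0];
const lazyDot = sample.match(/<[^>]*xxx.*?zzz/)[0];

// A greedy dot without the lookahead runs on to the last zzz instead.
const greedyDot = sample.match(/<[^>]*xxx.*zzz/)[0];

console.log(tempered === lazyDot);
console.log(greedyDot.length > lazyDot.length);
```

On this input the first two patterns return the same text, so the simpler one is enough.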

So for Javascript, your code would say:

matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');

I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
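A slightly fuller sketch of how that might be used is below. The escapeRegExp and findRange helpers, the sample html, and the marker values are all assumptions for illustration; escaping is only needed if your startText or endText can contain regex metacharacters.

```javascript
// Hypothetical helper: escapes regex metacharacters so the markers match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical wrapper: returns every range from a start tag to the end text.
function findRange(data, startText, endText) {
  const matcher = new RegExp(
    '<[^>]*' + escapeRegExp(startText) + '.*?' + escapeRegExp(endText),
    'gi'
  );
  // match() with the g flag returns null when nothing is found.
  return data.match(matcher) || [];
}

// Made-up single-line markup using the marker format from the question.
const html = '<td id="@@A1@@_2020_01_01">1</td><td id="@@A1@@_2020_01_02">2</td>';
const found = findRange(html, '@@A1@@_2020_01_01', '@@A1@@_2020_01_02');

console.log(found);
```

Note that in JavaScript the dot does not match line terminators unless the s flag is set, so if your markup spans several lines you may need [\s\S]*? in place of .*? here.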

Upvotes: 3
