NotAgain
NotAgain

Reputation: 1977

Can this regex be made memory efficient

I get XML as a plain, unformatted text blob. I have to make some replacements, and I use regex find-and-replace. For example:

<MeasureValue><Text value="StartCalibration" /></MeasureValue>

has to be converted to

<MeasureValue type="Text" value="StartCalibration"/>

The regex I wrote was

<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>

And the replacement part was:

<MeasureValue type="$2" value="$3"/>

Here is a link showing the same.
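For reference, the same find-and-replace can be reproduced outside Logstash. This is a minimal sketch using Python's `re` module (an assumption, since the question does not name a language); the pattern and replacement are the ones from the question, with `$2`/`$3` translated to `\2`/`\3` backreferences:

```python
import re

# The pattern from the question: captures the inner tag name (\2)
# and its value attribute (\3).
pattern = re.compile(r'<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>')

blob = '<MeasureValue><Text value="StartCalibration" /></MeasureValue>'
result = pattern.sub(r'<MeasureValue type="\2" value="\3"/>', blob)
print(result)  # <MeasureValue type="Text" value="StartCalibration"/>
```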

The issue is that in a file with 370 such occurrences, I get an out-of-memory error. I have heard of so-called greedy regex patterns and am wondering if that is what is plaguing me. If this is already memory efficient, then I will leave it as it is and try to increase the server memory. I have to process thousands of such documents.

EDIT: This is part of a script for Logstash from Elasticsearch. Per the documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps.

Upvotes: 0

Views: 794

Answers (1)

Caio Oliveira

Reputation: 1243

As a rule of thumb, specificity is positively correlated with efficiency in regex. So, know your data and build something to surgically match it.

The more specific you make your regex, even if that means literally writing down the pattern (and usually ending up with a monstrous regex), the fewer resources it will take, because there are fewer "possibilities" it can match in your data.

To be more precise, imagine we are trying to match a string

2014-08-26 app[web.1]: 50.0.134.125

Approaches such as

(.*) (.*) (.*)

leave it too open and prone to matching MANY different patterns and combinations, and thus take a LOT longer to process all the possibilities. Check here: https://regex101.com/r/GvmPOC/1

On the other hand, you could spend a little more time building a more elaborate expression such as:

^[0-9]{4}-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$

I agree it is horrible, but it is much more precise. It won't waste your precious resources finding unnecessary matches. Check here: https://regex101.com/r/quz7fo/1

Another thing to keep in mind: quantifiers such as * or + force the engine to scan ahead (and backtrack), which, depending on the size of your string, can take some time. Also, whenever possible, anchoring with ^ and $ helps the engine avoid attempting matches at every position within the string.
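To illustrate the difference, here is a small sketch (again assuming Python's `re` module) matching the sample log line with both the loose and the specific pattern. Both succeed, but the three greedy `.*` groups in the loose one force the engine to try several split points before settling:

```python
import re

line = "2014-08-26 app[web.1]: 50.0.134.125"

# Loose: three greedy groups separated by spaces -- the engine must
# backtrack to decide where each group ends.
loose = re.compile(r"(.*) (.*) (.*)")

# Specific: every position is pinned down by a character class,
# a fixed count, or an anchor, so there is almost nothing to try.
strict = re.compile(
    r"^[0-9]{4}-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]: "
    r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$"
)

m1 = loose.match(line)
m2 = strict.match(line)
print(m1.groups())  # ('2014-08-26', 'app[web.1]:', '50.0.134.125')
print(bool(m2))     # True
```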


Bringing it to your reality...

If we have to use regex.

The million-dollar question is, how can we turn your regex into something more precise?

Since there is no limit on tag name length in XML, there is no way to make it fully specific :(

  • We could specify exactly which characters to match and avoid . and \w; substituting something like a-zA-Z is preferable. Making use of negated character classes such as [^"] also helps narrow down the range of possibilities.

  • Avoid * and ? and try to use a bounded quantifier {m,n} instead (although I don't know your data well enough to pick the bounds). And as I stated above, XML imposes no length limit here.

  • I didn't understand precisely what the ? is doing in your code, so removing it is one thing less to process.

I ended up with something like:

<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>

Not many changes though. You can try to measure it to see if there was any improvement.
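As a quick sanity check, here is the tightened pattern applied to the sample from the question (a Python sketch; the `<MeasureValue>` wrapper is restored around the pattern, which is an assumption about how it would be deployed):

```python
import re

# The tightened pattern: explicit character classes instead of . and \w,
# and [^<>]* instead of a lazy .*? for the trailing attributes.
pattern = re.compile(
    r'<MeasureValue><(([a-zA-Z]+) value="([^"]*)"[^<>]*)></MeasureValue>'
)

blob = '<MeasureValue><Text value="StartCalibration" /></MeasureValue>'
result = pattern.sub(r'<MeasureValue type="\2" value="\3"/>', blob)
print(result)  # <MeasureValue type="Text" value="StartCalibration"/>
```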

But perhaps the best approach is not to use regex at all :( I don't know the language you are working with, but if processing time is becoming a problem, I would suggest not using regex and trying some alternative.

If there is any possibility of using an XML parser, that would be preferable.
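For example, with Python's standard-library `xml.etree.ElementTree` (a hedged sketch: it assumes each fragment is well-formed XML, which the regex route does not require):

```python
import xml.etree.ElementTree as ET

blob = '<MeasureValue><Text value="StartCalibration" /></MeasureValue>'

# Parse the fragment, read the inner tag's name and value attribute,
# and build the replacement element -- no backtracking involved.
outer = ET.fromstring(blob)
inner = outer[0]  # the single child element, e.g. <Text .../>
new = ET.Element("MeasureValue", {"type": inner.tag, "value": inner.get("value")})
print(ET.tostring(new, encoding="unicode"))
```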

https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions

Sorry if this wasn't as conclusive as you might have expected, but the space for working on this problem is likewise really open.

Upvotes: 1
