Chris

Reputation: 23

Regex using several repeatable capture groups

I have a very uniform set of data from RADIUS messages that I need to add to our log management solution. The product can use regular expressions to extract the individual fields, in one of two forms:

1) Individual regular expressions for each piece of data you wish to pull out

    <data 1 = regex statement>
    <data 2 = different regex statement>    
    <data 3 = yet another regex statement>

2) A singular regular expression using capture groups

    <group = regex statement with capture groups>
        <data 1 = capture group[X]>
        <data 2 = capture group[Y]>
        <data 3 = capture group[Z]>
    </group>

Here is a sample message:

<158>Jul 6 14:33:00 radius/10.10.100.12 radius: 07/06/2010 14:33:00 AP1A-BLAH (10.10.10.10) - 6191 / Wireless - IEEE 802.11: abc1234 - Access-Accept (AP: 000102030405 / SSID: bork / Client: 050403020100)

I want to pull out several bits of data, all of them between spaces. Something along the lines of the following doesn't seem efficient:

(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s

So, given the data above, what's the most efficient Java regex that will grab each space-delimited field and put it into its own capture group?

Upvotes: 2

Views: 497

Answers (2)

Tim Pietzcker

Reputation: 336128

I just thought of something else - why not simply split the string on whitespace?

String[] splitArray = subjectString.split("\\s");
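As a rough sketch of how that could look in a complete program: the class and method names below (`SplitFields`, `fields`) are made up for illustration, and the use of `trim()` plus `"\\s+"` instead of `"\\s"` is my own assumption, so that leading, trailing, or doubled spaces don't produce empty strings in the result.

```java
import java.util.Arrays;

class SplitFields {
    // Hypothetical helper: splits a log line on runs of whitespace.
    // "\\s+" (rather than "\\s") avoids empty strings between adjacent separators;
    // trim() avoids an empty first element when the line starts with a space.
    static String[] fields(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "Jul 6 14:33:00 radius/10.10.100.12 radius: 07/06/2010";
        // Prints the six space-delimited fields as an array.
        System.out.println(Arrays.toString(fields(line)));
    }
}
```

This only helps if the fields can be handled in Java code; if the log management product accepts nothing but a regex with capture groups, splitting isn't an option.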

Upvotes: 1

Tim Pietzcker

Reputation: 336128

You could be more specific:

(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s

\S matches any non-whitespace character. This makes the regex more efficient by reducing backtracking, and it allows the regex to fail faster if the input doesn't fit the pattern.

For example, when your regex is applied to the string Jul 6 14:33:00 radius/10.10.100.12 radius: 07/06/2010, it takes the regex engine 2116 steps to find out that it can't match. The regex above fails in 168 steps.

Alan Moore's suggestion to use (\S*+)\s(\S*+)\s(\S*+)\s(\S*+)\s(\S*+)\s(\S*+)\s results in another improvement - now the regex fails within 24 steps (nearly a hundred times faster than the initial regex).

If the match is successful, Alan's solution and mine are equivalent; your regex is about ten times slower.
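For reference, here is a minimal sketch of the possessive version in plain Java. The class and method names (`LeadingFields`, `firstSix`) are invented for the example, and using `lookingAt()` assumes the six fields are wanted from the start of the line; your product may apply the pattern differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingFields {
    // Six possessive groups, each followed by exactly one whitespace character.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern SIX_FIELDS = Pattern.compile(
        "(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s");

    // Hypothetical helper: returns the first six fields, or null if the line
    // does not begin with six whitespace-terminated fields.
    static String[] firstSix(String line) {
        Matcher m = SIX_FIELDS.matcher(line);
        if (!m.lookingAt()) {          // match must start at index 0
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);   // capture groups are numbered from 1
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "<158>Jul 6 14:33:00 radius/10.10.100.12 radius: "
                    + "07/06/2010 14:33:00 AP1A-BLAH (10.10.10.10)";
        String[] f = firstSix(line);
        if (f != null) {
            for (String field : f) {
                System.out.println(field);
            }
        }
    }
}
```

With the shorter test string from above (no space after 07/06/2010), `firstSix` returns null, because the final `\s` has nothing to match and the possessive groups give nothing back.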

Upvotes: 2
