DeanHyde

Reputation: 107

Trying to figure out this regular expression pattern

I have a string I'm trying to extract some values from. I've been using this regex tester to work it out, to no avail: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

This is the string I'm trying to parse:

9   2.27.8.18:2304        63   9dd0e5e7344adac5cf49b7882329df25(OK) Any number of characters follow here

The basic format goes:

INT IP:PORT INT MD5-HASH(OK) STRING

This is as far as I've got so far:

(?<line_id>[0-9]{1,3})(?<ip>.+):(?<port>[0-9]{1,5})(?<guid>.+)\(OK\)(?<name>.+)

And these are the values I've been able to strip so far:

9 (line_id)
2.27.8.18 (ip)
2304 (port)
63   9dd0e5e7344adac5cf49b7882329df25 (guid)
Any number of characters follow here (name)

If you try the sample text and pattern I posted above, you can see that I get everything except the integer between the port number and the MD5 hash (guid). I'm probably making some amateur mistake, as I'm not very experienced with regex patterns, so any input would be greatly appreciated.

Upvotes: 0

Views: 189

Answers (5)

Steve Mayne

Reputation: 22868

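.+ is generally a bad idea, as it greedily matches any characters in your string. In your pattern, (?<guid>.+) swallows the whitespace, the integer, and the hash in one go. Use specific character classes and match the whitespace between fields explicitly: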

(?<line_id>[0-9]{1,3})[\s]+(?<ip>[0-9\.]+):(?<port>[0-9]{1,5})[\s]+(?<int>[0-9]{1,5})[\s]+(?<guid>[a-z0-9]+)\(OK\)(?<name>.+)

This yields:

9 (line_id)
2.27.8.18 (ip)
2304 (port)
63 (int)
9dd0e5e7344adac5cf49b7882329df25 (guid)
 Any number of characters follow here (name)
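
For anyone who wants to check this outside the .NET tester, here is a minimal sketch in Python. Python is my choice, not something from the question, and it spells named groups (?P<name>...) instead of (?<name>...); the variable names are also mine. It runs the question's original pattern and the pattern above against the sample line.

```python
import re

# Sample line from the question (spacing preserved).
LINE = (
    "9   2.27.8.18:2304        63   "
    "9dd0e5e7344adac5cf49b7882329df25(OK) Any number of characters follow here"
)

# The question's original pattern, ported to Python's (?P<name>...) syntax.
ORIGINAL = re.compile(
    r"(?P<line_id>[0-9]{1,3})(?P<ip>.+):(?P<port>[0-9]{1,5})"
    r"(?P<guid>.+)\(OK\)(?P<name>.+)"
)

# The pattern from this answer, ported the same way.
FIXED = re.compile(
    r"(?P<line_id>[0-9]{1,3})[\s]+(?P<ip>[0-9\.]+):(?P<port>[0-9]{1,5})[\s]+"
    r"(?P<int>[0-9]{1,5})[\s]+(?P<guid>[a-z0-9]+)\(OK\)(?P<name>.+)"
)

original = ORIGINAL.match(LINE).groupdict()
fixed = FIXED.match(LINE).groupdict()

# With .+ the guid group holds the whitespace, the integer and the hash.
print(repr(original["guid"]))

# With explicit classes and separators each field lands in its own group.
for key, value in fixed.items():
    print(f"{key}: {value!r}")
```

The name group still starts with the space that follows (OK), which matches the output listed above.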

Upvotes: 2

freedev

Reputation: 30237

The capturing group for the integer is missing.
I have added a new named group called int.

Try this:

(?<line_id>[0-9]{1,3})(?<ip>.+):(?<port>[0-9]{1,5})\s+(?<int>[0-9]+)\s+(?<guid>.+)\(OK\)(?<name>.+)

Now you have the following 6 capturing groups:

line_id group 1: (?<line_id>[0-9]{1,3})
ip group 2: (?<ip>.+)
port group 3: (?<port>[0-9]{1,5})
int group 4: (?<int>[0-9]+)
guid group 5: (?<guid>.+)
name group 6: (?<name>.+)

IMHO, the last two groups are too greedy. Instead of using .+, I suggest narrowing each group to the set of characters you actually need to capture.
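
As a rough sketch of what narrower groups could look like, here is a Python version (Python writes named groups as (?P<name>...)). The field formats are assumptions on my part: a dotted-quad IPv4 address, a 32-character lowercase hex hash, and a name that may be empty. TIGHT and fields are just illustrative names.

```python
import re

# Sample line from the question (spacing preserved).
LINE = (
    "9   2.27.8.18:2304        63   "
    "9dd0e5e7344adac5cf49b7882329df25(OK) Any number of characters follow here"
)

# Hypothetical tightened pattern; every class below encodes an assumption
# about the field format rather than something the question guarantees.
TIGHT = re.compile(
    r"(?P<line_id>[0-9]{1,3})\s+"
    r"(?P<ip>[0-9]{1,3}(?:\.[0-9]{1,3}){3}):(?P<port>[0-9]{1,5})\s+"
    r"(?P<int>[0-9]+)\s+"
    r"(?P<guid>[0-9a-f]{32})\(OK\)\s*"
    r"(?P<name>.*)"
)

# fullmatch requires the whole line to fit the pattern.
match = TIGHT.fullmatch(LINE)
fields = match.groupdict() if match else None
print(fields)

# A line whose hash is not 32 hex characters is rejected instead of
# being absorbed by a greedy group.
bad = TIGHT.fullmatch("9 2.27.8.18:2304 63 nothex(OK) some name")
print(bad)
```

Because \s* sits between (OK) and the name group, the name no longer carries a leading space.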

Upvotes: 1

Wolf

Reputation: 2150

Try this one:

(?<line_id>[0-9]{1,3})\s+(?<ip>.+):(?<port>[0-9]{1,5})\s+(?<number>[0-9]+)\s+(?<guid>.+)\(OK\)(?<name>.+)

I got this result on the test page you provided, which reports 6 groups:

9 (line_id)
2.27.8.18 (ip)
2304 (port)
63 (number)
9dd0e5e7344adac5cf49b7882329df25 (guid)
Any number of characters follow here (name)

Note that the whitespace (\s+) is what separates 63 from the fields around it.

Upvotes: 1

roeland

Reputation: 6336

Maybe it is easier to check the separator:

(?<line_id>[0-9]{1,3})(?<ip>.+):(?<port>[0-9]{1,5})\s+(?<nr>.*)\s+(?<guid>.+)\(OK\)(?<name>.+)

Here is an example: http://rubular.com/r/qhS7TdTFmn
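
One thing that may be worth checking with this approach is what .* does inside the nr group. The sketch below is a Python port (named groups become (?P<name>...)), and the \S+ variant is my own suggestion, not part of the pattern above. Since .* is greedy and also matches spaces, nr appears to keep some of the trailing whitespace, while \S+ stops at the first space.

```python
import re

# Sample line from the question (spacing preserved).
LINE = (
    "9   2.27.8.18:2304        63   "
    "9dd0e5e7344adac5cf49b7882329df25(OK) Any number of characters follow here"
)

# The separator-based pattern from this answer, ported to Python.
LOOSE = re.compile(
    r"(?P<line_id>[0-9]{1,3})(?P<ip>.+):(?P<port>[0-9]{1,5})\s+"
    r"(?P<nr>.*)\s+(?P<guid>.+)\(OK\)(?P<name>.+)"
)

# Hypothetical variant: \S+ cannot match whitespace, so nr stays clean.
STRICT = re.compile(
    r"(?P<line_id>[0-9]{1,3})(?P<ip>.+):(?P<port>[0-9]{1,5})\s+"
    r"(?P<nr>\S+)\s+(?P<guid>.+)\(OK\)(?P<name>.+)"
)

loose = LOOSE.match(LINE).groupdict()
strict = STRICT.match(LINE).groupdict()

# repr() makes any trailing spaces in the captured value visible.
print(repr(loose["nr"]))
print(repr(strict["nr"]))
```

If you keep .*, trimming the captured value afterwards gives the same result.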

Upvotes: 0

e_ne

Reputation: 8469

You didn't set up a capturing group for that number (63 in your case), which was captured along with the guid. I've edited your pattern a little:

(?<line_id>\d{1,3})\s*(?<ip>.+):(?<port>\d{1,5})\s*(?<number>\d+?)\s*(?<guid>[\da-f]+)\(OK\)(?<name>.+)

Note that I've changed the [0-9] sets to \d and the guid's set to [\da-f] (in case it only uses lowercase hexadecimal characters).
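
A small caveat on that swap, sketched in Python rather than .NET (my choice of language; the variable names are illustrative): \d is not always identical to [0-9]. In Python 3, \d on text also matches non-ASCII decimal digits unless re.ASCII is set, and as far as I know .NET behaves similarly unless RegexOptions.ECMAScript is used. The second half shows that [\da-f] rejects uppercase hex unless case is ignored.

```python
import re

# Arabic-Indic digits for "2304", written as escapes to keep the source ASCII.
ARABIC_INDIC = "\u0662\u0663\u0660\u0664"

# \d accepts any Unicode decimal digit by default; [0-9] does not.
unicode_digits = re.fullmatch(r"\d+", ARABIC_INDIC) is not None
ascii_class = re.fullmatch(r"[0-9]+", ARABIC_INDIC) is not None
# re.ASCII restricts \d to the ASCII digits 0-9.
ascii_flag = re.fullmatch(r"\d+", ARABIC_INDIC, re.ASCII) is not None
print(unicode_digits, ascii_class, ascii_flag)

# The hex class only accepts lowercase letters unless case is ignored.
HEX = r"[\da-f]+"
lower_ok = re.fullmatch(HEX, "9dd0e5e7") is not None
upper_ok = re.fullmatch(HEX, "9DD0E5E7") is not None
upper_ok_ignorecase = re.fullmatch(HEX, "9DD0E5E7", re.IGNORECASE) is not None
print(lower_ok, upper_ok, upper_ok_ignorecase)
```

For input like the sample line this makes no practical difference; it only matters if the data can contain other digit scripts or uppercase hashes.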

Upvotes: 0
