blaha

Reputation: 2745

Regex returning weird arrays

I want to make an array of results from a string like this one, using a regular expression:

results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday

Here’s my regex as it stands. It works in Sublime Text’s regex search but not in Ruby:

(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))

and this would be the desired result:

1. results|foofoofoo
2. results|barbarbarbar
3. results|googoogoo

Instead I’m getting these weird returns, and I can’t understand it. Why does this not select the result lines?

Match 1
1. results
2. results|
3. results|
4.  

Match 2
1. results
2. results|
3. results|
4.   

Match 3
1. results
2. timestamps||
3.  
4. timestamps||

Here’s the actual code using the regex:

# Create a new Line for each matched body, with that body set as the raw attribute
host_scan.raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end

Upvotes: 0

Views: 147

Answers (4)

blaha

Reputation: 2745

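The answer turned out to lie in the parentheses. When a pattern contains capturing groups, scan returns the groups rather than the whole match, and the only capturing group in my code was the one inside the lookahead, so I got just the tail delimiter. Wrapping the whole pattern in a capturing group made it return the entire match instead.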

host_scan.raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end
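
To see the difference side by side, here is a minimal sketch that runs both patterns on the sample string from the question. It assumes the separators in the data are a literal backslash followed by n (two characters), which is what the \\n in the patterns implies; the variable names are only illustrative.

```ruby
# Assumption: the raw data holds a literal backslash + "n", not a real newline.
# Single quotes keep the backslash as an ordinary character.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# Pattern from the question's code: the only capturing group is inside the lookahead,
# so scan returns just the delimiter that follows each match.
tail_only = raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/)

# Pattern from this answer: one capturing group around the consumed text.
full = raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/)

p tail_only  # => [["results|"], ["results|"], ["timestamps||"]]
p full       # one-element arrays, each holding a whole result line

# With any capturing group present, scan yields arrays, so each body passed to
# the block is an Array. Flatten (or use body.first) to get plain strings.
lines = full.flatten
p lines
```

Note that each matched line still ends with the literal \n separator, because the pattern consumes it.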

Upvotes: 0

the Tin Man

Reputation: 160551

Rather than jump to a regex, which is a much more complicated way to get at the data, use split("\n").

text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"
ary = text.split("\n")

ary is:

[
  "results|foofoofoo",
  "results|barbarbarbar",
  "results|googoogoo",
  "timestamps||friday"
]

Slice that and you can get:

ary[0..2]
=> ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
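
If the number of result lines is not known in advance, slicing by index is fragile. A possible variation, assuming every wanted line starts with the "results|" prefix, is to filter on that prefix instead:

```ruby
text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"

# Keep only the lines that begin with the "results|" prefix,
# however many of them there are.
results = text.split("\n").select { |line| line.start_with?('results|') }

p results
# => ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
```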

EDIT:

Based on the comment that the strings contain additional newlines and more complex characters:

require 'awesome_print'

text = "results|foofoofoo\nmorefoo\nandevenmorefoo\nresults|barbarbarbar\nandmorebar\nandyetagainmorebar\nresults|googoogoo\ntimestamps||friday"
ap text.sub(/\|\|friday$/, '').split('results')[1..-1].map{ |l| 'results' << l }

Which outputs:

[
  [0] "results|foofoofoo\nmorefoo\nandevenmorefoo\n",
  [1] "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n",
  [2] "results|googoogoo\ntimestamps"
]

Upvotes: 0

kopischke

Reputation: 3413

As Kendall Frey already stated, you are creating too many capture groups. There is no need to group the first literal “results|”, and no need to wrap each element of your alternation in its own non-capturing group. The regex you are after is:

/results\|.*?(?=\\n(?:results\||timestamps\|\|))/

or, if you don’t mind repeating the \\n part, you can do away with the non-capturing subgroup:

/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/

Both will return an array of matched values as specified in your question.
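
As a quick check, the sketch below runs both patterns against the sample string from the question. It assumes the separators are a literal backslash followed by n, as the \\n in the patterns implies; the variable names are illustrative.

```ruby
# Assumption: the data contains a literal backslash + "n" between records.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# Separator and delimiter both live inside the lookahead, so neither is
# consumed and neither ends up in the match.
with_subgroup    = raw.scan(/results\|.*?(?=\\n(?:results\||timestamps\|\|))/)
without_subgroup = raw.scan(/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/)

p with_subgroup
# => ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
p with_subgroup == without_subgroup
# => true
```

Because neither pattern contains a capturing group, scan returns plain strings rather than nested arrays.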

Upvotes: 1

Kendall Frey

Reputation: 44326

I'm guessing it has something to do with capturing groups. If you change all your (...) to (?:...) it will eliminate capturing groups.
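
A small sketch of that change, applied to the regex from the question and run on its sample string. It assumes the separators are a literal backslash followed by n; the variable names are illustrative.

```ruby
# Assumption: the data contains a literal backslash + "n" between records.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# Original pattern: four capturing groups, so scan returns four-element
# arrays of group contents (nil for a group that did not take part).
capturing = raw.scan(/(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))/)

# Same pattern with every (...) changed to (?:...): no capturing groups,
# so scan returns the whole matches as strings.
non_capturing = raw.scan(/(?:results)\|.*?\\n(?=(?:(?:results\|)|(?:timestamps\|\|)))/)

p capturing.first  # => ["results", "results|", "results|", nil]
p capturing.last   # => ["results", "timestamps||", nil, "timestamps||"]
p non_capturing    # three strings, each ending with the literal \n separator
```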

Upvotes: 0
