Jeff Escalante

Reputation: 3167

What's wrong with this RegEx?

I'm trying to implement this in a small Ruby script, and tested it on http://www.rubular.com/, where it worked perfectly. I'm not sure why it's not working in the actual script.

The RegEx: /(motion|links|sound|button|symbol)|(0\.\d{8})|(\s\d{1}\s)|(\d{10}\s)/

The Text it's Against:

Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732

Trial ID: 7 | Trial Type: button | Trick? 0 Click Time: 0.19817800 1302987043

etc. etc.

What I am trying to grab: Only the numbers, and the single word after "Trial Type". So for the first line of the example, I would only want " 1 motion 1 0.87913100 1302969732" to be returned. I also want to keep the space before the first number in each trial.

My short ruby script:

File.open('log.txt', 'r') do |file|
  contents = file.readlines.to_s
  regex = Regexp.new(/(motion|links|sound|button|symbol)|(0\.\d{8})|(\s\d{1}\s)|(\d{10}\s)/)
  matchdata = regex.match(contents).to_a
  matchdata.each do |match|
    if match != nil
      puts match
    end
  end
end

It only outputs two "1"s though. Hmm... I know it's reading the file contents right, and when I tried an alternate, simpler regex it worked fine.

Thanks for any help I get here!! : )

Upvotes: 1

Views: 130

Answers (4)

the Tin Man

Reputation: 160631

This is one of those times when trying to do everything in one big regex makes you work too hard. Simplify things:

ary = [
  'Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732',
  'Trial ID: 7 | Trial Type: button | Trick? 0 Click Time: 0.19817800 1302987043'
]

ary.each do |li|
  numbers = li.scan(/[\d.]+/)
  trial_type = li[/Trial Type: (\w+)/, 1]

  puts "%d %s %d %f %d\n" % [numbers.first, trial_type, *numbers[1 .. -1]]
end
# >> 1 motion 1 0.879131 1302969732
# >> 7 button 0 0.198178 1302987043

Regex patterns are powerful, but people think it's macho to do everything in one big line. You have to weigh that against the extra work needed to put the regex together in the first place, plus the work of maintaining it if the text being parsed changes later.

Upvotes: 1

sawa

Reputation: 168269

If you know that the data follows a particular pattern, you can just follow that pattern in the regex, and pick up the portions you want with ( ).

/Trial ID: (\d+) \| Trial Type: (\w+) \| Trick\? (\d+) Click Time: ([\.\d]+) ([\.\d]+)/

The more you know in advance about the data, the more specific you can make the regex. If you see variations in the data and the regex fails to match, just relax the pattern:

  • If the Trial ID may include a decimal point, use [\.\d]+ instead of \d+.
  • If there can be more than one space, replace the space with [ ]+
  • If the space can be a tab, or can be absent, use \s* or [ \t]*.
  • If the Trial ID: part may appear as a different phrase, replace it with .*?,

and so on.

If you are not sure how many spaces/tabs appear, use this:

/Trial\s*ID:\s*(\d+)\s*\|\s*Trial\s*Type:\s*(\w+)\s*\|\s*Trick\?\s*(\d+)\s*Click\s*Time:\s*([\.\d]+)\s+([\.\d]+)/
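
As a rough sketch of how the whitespace-tolerant pattern above might be used, the snippet below pulls the five fields out of a line with `match` and `captures`. The `parse_trial` lambda and the irregularly spaced sample line are made up for illustration; only the pattern and the first sample line come from this thread.

```ruby
# Whitespace-tolerant pattern from the answer; each ( ) group is one field.
pattern = /Trial\s*ID:\s*(\d+)\s*\|\s*Trial\s*Type:\s*(\w+)\s*\|\s*Trick\?\s*(\d+)\s*Click\s*Time:\s*([\.\d]+)\s+([\.\d]+)/

# Hypothetical helper: returns the five captured fields as strings,
# or nil when the line does not fit the pattern.
parse_trial = lambda do |line|
  m = pattern.match(line)
  m && m.captures
end

# A line exactly as shown in the question.
clean = parse_trial.call('Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732')

# An invented line with doubled spaces, a tab, and a missing space.
messy = parse_trial.call("Trial ID:  7 |Trial Type:\tbutton | Trick?0 Click Time: 0.19817800   1302987043")

# A line that is not a trial record at all.
junk = parse_trial.call('etc. etc.')

p clean
p messy
p junk
```

Because the whole line has to fit the pattern, a non-matching line yields nil instead of a partial result, which makes malformed log lines easy to spot.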

Upvotes: 2

Mike Pennington

Reputation: 43097

You need to escape the literal pipes inside the regex, fill in the other missing literals (like Trick\? and Click\sTime:), remove some of the spaces, and insert regex whitespace (\s) where appropriate, i.e.

regex = Regexp.new(/(motion|links|sound|button|symbol)\s\|\sTrick\?\s*\d\s*Click\s+Time:\s+(0\.\d{,8})\s(\d{10})/)

EDIT: fixed parenthesis nesting in the original
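
A small sketch of how a version of this pattern with balanced parentheses could be run over the whole log, assuming the two sample lines from the question. Note that, as written, it captures only three fields (trial type, click time, timestamp); the Trial ID and the Trick digit are matched or skipped but not captured.

```ruby
# The two sample lines from the question, joined as they would appear in a file.
log = [
  'Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732',
  'Trial ID: 7 | Trial Type: button | Trick? 0 Click Time: 0.19817800 1302987043'
].join("\n")

# Same pattern as above, with one closing parenthesis per opening one.
regex = /(motion|links|sound|button|symbol)\s\|\sTrick\?\s*\d\s*Click\s+Time:\s+(0\.\d{,8})\s(\d{10})/

# With capture groups present, scan returns one array of groups per match.
rows = log.scan(regex)

rows.each do |type, click_time, timestamp|
  puts "#{type} #{click_time} #{timestamp}"
end
```

To get the Trial ID and the Trick digit back as well, they would need their own ( ) groups.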

Upvotes: 3

Mike Lewis

Reputation: 64177

You want to use String#scan

 matchdata = contents.scan(regex)

Also, @Mike Pennington is correct: you shouldn't have to do the if match != nil check if you do it right. You have to clean up your regex as well. The pipe character is a special character in a regex, meaning "match the left side OR the right side", and your text contains literal pipe characters, which you must escape to match.
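
To make the difference concrete, here is a sketch comparing `match` and `scan` using the question's original regex on one sample line. The trailing "\n" is an assumption that mimics a line read from a file, and the variable names are illustrative.

```ruby
# One line from the question, with the newline it would have in the file.
line = "Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732\n"

regex = /(motion|links|sound|button|symbol)|(0\.\d{8})|(\s\d{1}\s)|(\d{10}\s)/

# match stops at the first hit. to_a gives the whole match followed by the
# four groups, so " 1 " appears twice and the unused groups are nil.
first_only = regex.match(line).to_a

# scan keeps going and returns one array of groups per match; only one
# group is non-nil each time, so compact.first picks it out.
every_match = line.scan(regex).map { |groups| groups.compact.first }

p first_only
p every_match
```

This would explain the two "1"s in the question: `match` finds only the first " 1 ", and it shows up once as the whole match and once as the third group. The last alternative, (\d{10}\s), needs whitespace after the timestamp, so it matches here only because of the newline.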

Upvotes: 4
