Richard Burton

Reputation: 2240

Multiple line regex in ruby

I am trying to strip some repeated text out of my Kindle clippings that look like this:

 The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
 ==========
 Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
 - Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time

 commentators (a euphemism for prolific writers with little experience
 ==========
 Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
 - Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time

I am trying to strip out everything between "Essentials" and "Time". The regexp I am playing with right now looks like this:

Essentials([^,]+)Time

But obviously it is not working:

http://rubular.com/r/gwSJFgOQai

Any help for this nuby would be massively appreciated!

Upvotes: 2

Views: 282

Answers (3)

the Tin Man

Reputation: 160551

Regexes are powerful, but they often add needless complexity to a problem.

This is how I'd go about the problem:

text = <<EOT
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time

commentators (a euphemism for prolific writers with little experience
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
EOT

text.each_line do |l|
  l.chomp!
  # Flip-flop: true from a line starting with "Essentials" through a line ending in "Time".
  next if ((l =~ /\AEssentials/) .. (l =~ /Time\z/))

  puts l
end

Which outputs:

The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========

commentators (a euphemism for prolific writers with little experience
==========

This works because the .., AKA range operator, behaves differently when used as the condition of an if: it becomes what is called the flip-flop operator. In operation, ((l =~ /\AEssentials/) .. (l =~ /Time\z/)) returns false until (l =~ /\AEssentials/) matches. From that line through the line where (l =~ /Time\z/) matches, it returns true. After the final regex has matched, it goes back to returning false.
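To make the on/off behavior easier to see, here is a minimal sketch that records the flip-flop's value for each element of a short list. The list contents and the START/END markers are made up for illustration and are not part of the clippings format.

```ruby
# Hypothetical input: "START" and "END" stand in for the two regex matches.
lines = %w[keep START drop END keep_again]

states = lines.map do |l|
  # Used as a condition, the range acts as a flip-flop: it switches on at
  # "START" and stays on through "END", then switches off again.
  if (l == "START")..(l == "END")
    [l, true]
  else
    [l, false]
  end
end

states.each { |l, on| puts "#{l}: #{on}" }
```

Only the elements from START through END, inclusive, are reported as true.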

This behavior works really well for extracting sections from text.

If you are aggregating text for later output, replace puts l with code that appends l to a buffer, then output the buffer at the end of the run.
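A rough sketch of that buffered variant might look like the following. The shortened sample text and the names kept and cleaned are assumptions made for the example, not something from the original code.

```ruby
# Hypothetical, shortened sample in the same shape as the clippings above.
text = <<~EOT
  first highlight
  ==========
  Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
  - Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time

  second highlight
  ==========
  Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
  - Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
EOT

kept = [] # buffer for the lines we want to keep

text.each_line do |line|
  line = line.chomp
  # Skip everything from a line starting with "Essentials" through a line ending in "Time".
  next if (line =~ /\AEssentials/)..(line =~ /Time\z/)

  kept << line
end

# Output the whole buffer once, at the end of the run.
cleaned = kept.join("\n")
puts cleaned
```

Collecting into an array and joining once keeps the filtering separate from the output, so cleaned can just as easily be written to a file.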

Upvotes: 1

bozdoz

Reputation: 12860

Why don't you use this:

/Essentials(.*?)Time/m

Updated. Forgot the m for multiline.

Upvotes: 3

Mark Thomas

Reputation: 37517

You need the /m modifier, which makes . match a newline:

/Essentials(.*?)Time/m

See it working here: http://rubular.com/r/qgmkWnLzW6
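As a sketch of how that pattern could be applied in a script: the pattern from the question fails because [^,]+ cannot cross a comma, and the author list and the date both contain one. With /m, the lazy .*? runs across the line break and stops at the first "Time". The sample text and the variable names clippings and stripped are assumptions for the example.

```ruby
# Hypothetical, shortened sample in the same shape as the clippings in the question.
clippings = <<~EOT
  first highlight
  ==========
  Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
  - Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time

  second highlight
  ==========
  Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
  - Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
EOT

# /m lets . match newlines; .*? is lazy, so each match ends at the first "Time".
stripped = clippings.gsub(/Essentials.*?Time/m, "")
puts stripped
```

Note that the lazy match ends at the first "Time" it finds, so a book title or author line containing that word would cut the match short. The newline after each removed block is also left in place.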

Upvotes: 7
