Reputation: 2240
I am trying to strip some repeated text out of my Kindle clippings that look like this:
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time
commentators (a euphemism for prolific writers with little experience
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
I am trying to strip out everything between "Essentials" and "Time". The regexp I am playing with right now looks like this:
Essentials([^,]+)Time
But obviously it is not working:
http://rubular.com/r/gwSJFgOQai
Any help for this nuby would be massively appreciated!
Upvotes: 2
Views: 282
Reputation: 160551
Regular expressions are powerful, but you'll find they often add needless complexity to a problem.
This is how I'd go about the problem:
text = <<EOT
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time
commentators (a euphemism for prolific writers with little experience
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
EOT
text.each_line do |l|
  l.chomp!
  next if ((l =~ /\AEssentials/) .. (l =~ /Time\z/))
  puts l
end
Which outputs:
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
commentators (a euphemism for prolific writers with little experience
==========
This works because the .. operator, AKA the range operator, gains new capability when used in the condition of an if, and turns into what we call the flip-flop operator. In operation, ((l =~ /\AEssentials/) .. (l =~ /Time\z/)) returns false until (l =~ /\AEssentials/) matches. From that line through the line where (l =~ /Time\z/) matches, it returns true. After that final match it goes back to returning false.
This behavior works really well for extracting sections from text.
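To make the on/off behavior visible, here is a small sketch that records the flip-flop's value for every line. The helper name flip_flop_trace and the shortened sample text are made up for illustration; only the two regexes come from the code above.

```ruby
# Shortened, made-up sample in the same shape as the Kindle clippings.
TRACE_SAMPLE = <<EOT
first highlight
==========
Essentials of Licensing (Author)
- Highlight on Page 25 | Added on Friday Greenwich Mean Time
second highlight
==========
EOT

# Hypothetical helper: returns [state, line] pairs, where state is the
# value the flip-flop produced for that line.
def flip_flop_trace(text)
  text.each_line.map do |line|
    line = line.chomp
    # Used as a condition (here, of a ternary), `..` acts as a flip-flop:
    # it turns on when the left test matches and off after the right one does.
    on = ((line =~ /\AEssentials/) .. (line =~ /Time\z/)) ? true : false
    [on, line]
  end
end

flip_flop_trace(TRACE_SAMPLE).each do |on, line|
  puts "#{on ? 'skip' : 'keep'}  #{line}"
end
```

The two lines from "Essentials" through "Time" are marked skip; everything else is marked keep.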
If you are aggregating text, for subsequent output, replace the puts l with something to append l to a buffer, then output that buffer at the end of your run.
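A minimal sketch of that buffered variant might look like the following. The method name strip_clipping_headers and the shortened sample are assumptions for illustration, not part of the original code.

```ruby
# Shortened, made-up sample in the same shape as the Kindle clippings.
BUFFER_SAMPLE = <<EOT
first highlight
==========
Essentials of Licensing (Author)
- Highlight on Page 25 | Added on Friday Greenwich Mean Time
second highlight
==========
EOT

# Hypothetical helper: same flip-flop as above, but kept lines are
# appended to an array instead of being printed immediately.
def strip_clipping_headers(text)
  buffer = []
  text.each_line do |line|
    line = line.chomp
    # Skip every line from "Essentials..." through the line ending in "Time".
    next if ((line =~ /\AEssentials/) .. (line =~ /Time\z/))
    buffer << line
  end
  buffer.join("\n")
end

# Output the whole buffer at the end of the run.
puts strip_clipping_headers(BUFFER_SAMPLE)
```

Collecting into an array and joining once at the end keeps the loop simple and lets you write the result to a file instead of the console if you prefer.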
Upvotes: 1
Reputation: 12860
Why don't you use this:
/Essentials(.*?)Time/m
Updated. Forgot the m for multiline.
Upvotes: 3
Reputation: 37517
You need the /m modifier, which makes . match a newline:
/Essentials(.*?)Time/m
See it working here: http://rubular.com/r/qgmkWnLzW6
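To actually remove the matched text in Ruby, one option is to pass that pattern to gsub. This is only a sketch: the method name strip_with_regex and the shortened sample are made up, and the trailing \n? is my own addition so the removed block does not leave an empty line behind.

```ruby
# Shortened, made-up sample in the same shape as the Kindle clippings.
REGEX_SAMPLE = <<EOT
first highlight
==========
Essentials of Licensing (Author)
- Highlight on Page 25 | Added on Friday Greenwich Mean Time
second highlight
==========
EOT

# Hypothetical helper: delete everything from "Essentials" up to the
# next "Time". With /m, `.` also matches newlines, so the match can
# span both header lines; `.*?` is lazy, so it stops at the first "Time".
def strip_with_regex(text)
  text.gsub(/Essentials.*?Time\n?/m, "")
end

puts strip_with_regex(REGEX_SAMPLE)
```

Note that this assumes neither "Essentials" nor "Time" appears inside a highlight itself; if they can, anchoring the pattern to line starts and ends is safer.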
Upvotes: 7