Ben

Reputation: 563

Parsing through text to find html tags in Ruby 1.9.x

I want to be able to match the text between two tags, starting at an opening tag and ending at its closing tag.

Say I have this block of text in a variable called 'text':

some text some text some text some text some text
<some_tag>
  some text some text some text some text some text
</some_tag>
some text some text some text some text some text

I want to parse the contents of 'text', doing nothing until it finds an opening tag (in this case 'some_tag'). Once it finds an opening tag, I want it to capture everything until that tag closes.

I've been fooling around with blocks and regular expressions for about an hour now and cannot seem to figure out a good way to work this out.

I'd appreciate any and all pointers, thanks!

Upvotes: 1

Views: 1615

Answers (1)

the Tin Man

Reputation: 160551

You should use a parser for HTML. Regex and HTML tend to make a volatile mix that leads to insanity in large doses.

Using Nokogiri:

require 'nokogiri'

html = <<EOT
some text some text some text some text some text
<p>
  some text some text some text some text some text
</p>
some text some text some text some text some text
EOT

doc = Nokogiri::HTML::DocumentFragment.parse(html)

puts doc.search('p').map { |n| n.inner_text }

>>   some text some text some text some text some text

This searches the HTML fragment for <p> tags and extracts the inner text of each one it finds.

I'm using Nokogiri's CSS selector support by passing "p" to search. I could use XPath instead, but CSS is understood by more people.

Upvotes: 5
