John Dough

Reputation: 125

Extract string from HTML tags using RegExp (Ruby)

I would like to extract "toast" from the string <h1>test</h1><div>toast</div>. What regular expression would isolate it?

Edit: Thanks to the user who corrected the formatting.

More Info: There will only ever be one div tag in the string. Its contents may change, but no other div tag will appear. (The real string is longer than the sample given.)

Thanks!

Upvotes: 1

Views: 6985

Answers (3)

Smern

Reputation: 19066

This is not something that is typically done with regex, and for good reason. But if you must, and since you said there will never be more than a single div in the string, this should work for you:

(?<=<div>).*(?=</div>)
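As a rough sketch of how the pattern above could be applied in Ruby: the variable names here are illustrative, and the /m flag is an added assumption so that "." also matches newlines if the div content spans several lines. Writing the pattern as %r{...} avoids escaping the "/" in "</div>".

```ruby
# Sample input taken from the question.
html = "<h1>test</h1><div>toast</div>"

# Lookbehind and lookahead keep the tags themselves out of the match.
# /m (an assumption, not in the original pattern) lets "." match newlines.
pattern = %r{(?<=<div>).*(?=</div>)}m

# String#[] with a Regexp returns the matched text, or nil if nothing matches.
div_text = html[pattern]
puts div_text
```

Because ".*" is greedy, this relies on the stated guarantee that the string contains exactly one div.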

Upvotes: 1

James Lim

Reputation: 13054

We need more information. If the string is exactly "<h1>test</h1><div>toast</div>", then something naïve like

regex = /<h1>test<\/h1><div>([^<]*)<\/div>/
found = "<h1>test</h1><div>toast</div>".match(regex)[1]
# => "toast"

would work. If, as I am guessing, you are expecting input of the form

<h1>*</h1><div>*</div>

then use this:

regex = /<h1>[^<]*<\/h1><div>([^<]*)<\/div>/
found = "<h1>any string can go here</h1><div>toast</div>".match(regex)[1]
# => "toast"

Note that this breaks if either tag contains nested elements, because [^<]* stops at the first "<" it meets. A more robust solution is to use Nokogiri. Talk to your boss.
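To make the nested-element caveat concrete, here is a small sketch. The `nested` sample string is made up for illustration and is not from the question. When the pattern fails, `match` returns nil, so calling `[1]` on the result would raise NoMethodError. `String#[]` with a capture index returns nil instead.

```ruby
regex = /<h1>[^<]*<\/h1><div>([^<]*)<\/div>/

plain  = "<h1>title</h1><div>toast</div>"
nested = "<h1>title</h1><div><b>toast</b></div>"  # hypothetical nested input

plain_match  = plain.match(regex)
nested_match = nested.match(regex)

puts plain_match[1]        # the captured div text
puts nested_match.inspect  # nil: [^<]* cannot get past the nested <b> tag

# Safer extraction: returns the first capture group, or nil when there is no match.
safe = nested[regex, 1]
puts safe.inspect
```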

Upvotes: 1

Arup Rakshit

Reputation: 118261

You can use Nokogiri.

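require 'nokogiri'

# css returns every matching node, so this collects the text of each div.
doc = Nokogiri::HTML::Document.parse("<div> test </div> <div> toast </div>")
doc.css('div').map(&:text)
# => [" test ", " toast "]

# at_css returns only the first match, which suits the single-div string from the question.
doc = Nokogiri::HTML::Document.parse("<h1>test</h1><div>toast</div>")
doc.at_css('div').text
# => "toast"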

Upvotes: 6
