Reputation: 4632
I have some HTML like this:
<div class="content">
<h1>Title 1</h1>
Lorem ipsum 1
<h2>Title 2</h2>
Lorem ipsum 2
<h3>Title 3</h3>
<b>Lorem ipsum 3</b>
<h1>Title 4</h1>
Lorem ipsum 4
<h2>Title 5</h2>
Lorem ipsum 5
</div>
I want to extract the content under each heading and collect it into an array like so:
[
"Lorem ipsum 1",
"Lorem ipsum 2",
"<b>Lorem ipsum 3</b>",
"Lorem ipsum 4",
"Lorem ipsum 5"
]
How would I do that using regex and/or Ruby? I tried playing around with the split method, like html_body.split(">"), but I still can't figure out how to do it correctly.
Upvotes: 0
Views: 106
Reputation: 168091
You shouldn't reinvent the wheel. Using an HTML parser such as Nokogiri is more robust than writing the parsing from scratch.
require "nokogiri"
html = <<_
<div class="content">
<h1>Title 1</h1>
Lorem ipsum 1
<h2>Title 2</h2>
Lorem ipsum 2
<h3>Title 3</h3>
<b>Lorem ipsum 3</b>
<h1>Title 4</h1>
Lorem ipsum 4
<h2>Title 5</h2>
Lorem ipsum 5
</div>
_
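# Take the children of the content div, drop the heading elements,
# and keep whatever non-blank markup remains.
Nokogiri::HTML(html)
  .css("div.content")
  .children
  .reject { |e| e.name =~ /\Ah[1-6]\z/ }
  .map { |e| e.to_html.strip }
  .reject(&:empty?)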
Result:
[
"Lorem ipsum 1",
"Lorem ipsum 2",
"<b>Lorem ipsum 3</b>",
"Lorem ipsum 4",
"Lorem ipsum 5"
]
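Since the question also asks about regex, here is a minimal sketch of a regex-only split, for the case where adding a parser is not an option. It assumes a single, non-nested <div class="content"> and well-formed h1 to h6 tags; the helper name sections_between_headings is made up for illustration. On messier real-world HTML it will likely break, which is the reason to prefer a parser.

```ruby
# Hypothetical helper (not from any library): returns the chunks of markup
# that sit between heading elements inside the content div.
# Assumes one non-nested <div class="content"> and well-formed <h1>..<h6> tags.
def sections_between_headings(html)
  # Capture everything inside the content div; the m flag lets . match newlines.
  inner = html[/<div class="content">(.*)<\/div>/m, 1]
  return [] if inner.nil?

  inner
    .split(/<h[1-6]\b[^>]*>.*?<\/h[1-6]>/m) # cut at each complete heading element
    .map(&:strip)                            # drop surrounding newlines
    .reject(&:empty?)                        # drop the blank chunk before the first heading
end

html = <<~HTML
  <div class="content">
  <h1>Title 1</h1>
  Lorem ipsum 1
  <h2>Title 2</h2>
  Lorem ipsum 2
  <h3>Title 3</h3>
  <b>Lorem ipsum 3</b>
  <h1>Title 4</h1>
  Lorem ipsum 4
  <h2>Title 5</h2>
  Lorem ipsum 5
  </div>
HTML

p sections_between_headings(html)
# => ["Lorem ipsum 1", "Lorem ipsum 2", "<b>Lorem ipsum 3</b>", "Lorem ipsum 4", "Lorem ipsum 5"]
```

One difference in behavior: because this splits on headings instead of walking individual nodes, everything between two headings stays together as one string, even when it mixes plain text and tags.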
Upvotes: 4