sjsc

Reputation: 4632

How do you extract content under header tags?

I have HTML like this:

<div class="content">
  <h1>Title 1</h1>
  Lorem ipsum 1

  <h2>Title 2</h2>
  Lorem ipsum 2

  <h3>Title 3</h3>
  <b>Lorem ipsum 3</b>

  <h1>Title 4</h1>
  Lorem ipsum 4

  <h2>Title 5</h2>
  Lorem ipsum 5
</div>

I want to extract the content under each header and put it into an array like this:

[
  "Lorem ipsum 1",
  "Lorem ipsum 2",
  "<b>Lorem ipsum 3</b>",
  "Lorem ipsum 4",
  "Lorem ipsum 5"
]

How would I do that using regex and/or Ruby? I tried the split method, e.g. html_body.split(">"), but I can't figure out how to do it correctly.

Upvotes: 0

Views: 106

Answers (2)

sawa

Reputation: 168091

You shouldn't reinvent the wheel. Parsing the HTML with Nokogiri is more robust than handling the markup by hand with string splitting or regex.

require "nokogiri"

html = <<_
<div class="content">
  <h1>Title 1</h1>
  Lorem ipsum 1

  <h2>Title 2</h2>
  Lorem ipsum 2

  <h3>Title 3</h3>
  <b>Lorem ipsum 3</b>

  <h1>Title 4</h1>
  Lorem ipsum 4

  <h2>Title 5</h2>
  Lorem ipsum 5
</div>
_

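# Take the children of the div, drop the header elements (h1, h2, ...),
# then serialize and trim whatever is left.
Nokogiri::HTML(html)
  .css("div")
  .children
  .reject { |e| e.name =~ /\Ah\d\z/ }  # skip header elements
  .map { |e| e.to_html.strip }         # serialize each remaining node
  .reject(&:empty?)                    # discard whitespace-only text nodes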

result:

[
  "Lorem ipsum 1",
  "Lorem ipsum 2",
  "<b>Lorem ipsum 3</b>",
  "Lorem ipsum 4",
  "Lorem ipsum 5"
]

Upvotes: 4

Amit Joki

Reputation: 59232

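You can use the regex

/(?<=<\/h\d>\n).*/

with String#scan and strip each match to get the desired output. Ruby regex literals have no g flag; scan already returns every match, and without the m flag the dot stops at the end of each line. This assumes each section's content sits on the single line directly after its header, as in the sample.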
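
As a rough sketch of how this pattern might be applied in plain Ruby: String#scan collects every match (so no g flag is involved) and strip does the trimming. The variable names html and sections are illustrative, and the sketch assumes each header's content is on the single line right after the closing header tag, as in the question's sample.

```ruby
# Sample input copied from the question.
html = <<~HTML
  <div class="content">
    <h1>Title 1</h1>
    Lorem ipsum 1

    <h2>Title 2</h2>
    Lorem ipsum 2

    <h3>Title 3</h3>
    <b>Lorem ipsum 3</b>

    <h1>Title 4</h1>
    Lorem ipsum 4

    <h2>Title 5</h2>
    Lorem ipsum 5
  </div>
HTML

# The lookbehind starts each match just after a closing header tag and its
# newline; without the m option, ".*" stops at the end of that line.
sections = html.scan(/(?<=<\/h\d>\n).*/).map(&:strip)

p sections
```

If a section can span several lines or contain nested markup, this pattern only captures its first line, which is where a real HTML parser becomes the safer choice.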

Upvotes: 1
