chuckfinley

Reputation: 763

Extract text between HTML tags with nokogiri

I have HTML like this:

<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>

I have a basic Nokogiri CSS node search that returns <p> content, but I can't find examples of how to target all the text between the Nth closing H2 and the next opening H2. I'm creating a CSV from the output, so I would also like to read in a list of files and put the URL as the first value in each row.

Upvotes: 5

Views: 5924

Answers (5)

pguardiario

Reputation: 54984

You can sometimes use NodeSet's & operator to get information between nodes:

doc.xpath('//h2[1]/following-sibling::p') & doc.xpath('//h2[2]/preceding-sibling::p')

Upvotes: 3

Phrogz

Reputation: 303234

Instead of an XPath solution, here's a simple (naïve) implementation that assumes that the start and stop elements share the same parent and allows the XPaths for start and stop to be specified independently:

HTML = "<h1>Header is here</h1>
  <h2>Header 2 is here</h2>
     <p>Extract me!</p>
     <p>Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p>Extract me three!</p>
     <p>Extract me four!</p>"

require 'nokogiri'
class Nokogiri::XML::Node
  # Naive implementation; assumes found elements will share the same parent
  def content_between( start_xpath, stop_xpath=nil )
    node = at_xpath(start_xpath).next_element
    stop = stop_xpath && at_xpath(stop_xpath)
    [].tap do |content|
      while node && node!=stop
        content << node
        node = node.next_element
      end
    end
  end
end

html = Nokogiri::HTML(HTML)
puts html.content_between('//h2[1]','//h2[2]').map(&:content)
#=> Extract me!
#=> Extract me too!
puts html.content_between('//h2[3]').map(&:content)
#=> Extract me three!
#=> Extract me four!

Upvotes: 2

Phrogz

Reputation: 303234

If the start and stop elements have the same parent, this is as simple as a single XPath. First I'll show it with a simplified document for clarity, and then with your sample document:

XML = "<root>
  <a/><a1/><a2/>
  <b/><b1/><b2/>
  <c/><c1/><c2/>
</root>"

require 'nokogiri'
xml = Nokogiri::XML(XML)

# Find all elements between 'a' and 'c'
p xml.xpath('//*[preceding-sibling::a][following-sibling::c]').map(&:name)
#=> ["a1", "a2", "b", "b1", "b2"]

# Find all elements between 'a' and 'b'
p xml.xpath('//*[preceding-sibling::a][following-sibling::b]').map(&:name)
#=> ["a1", "a2"]

# Find all elements after 'c'
p xml.xpath('//*[preceding-sibling::c]').map(&:name)
#=> ["c1", "c2"]

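Now, here it is with your use case (finding by index). Note that each positional predicate is evaluated relative to the candidate element, so [following-sibling::h2[2]] means "has at least two h2 siblings after it" and [preceding-sibling::h2[3]] means "has at least three h2 siblings before it". The first expression returns the right nodes for this three-heading sample, but it would select more if the document had additional h2 elements: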

HTML = "<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p>Extract me!</p>
     <p>Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p>Extract me three!</p>
     <p>Extract me four!</p>"

require 'nokogiri'
html = Nokogiri::HTML(HTML)

# Find all elements between the first and second h2s
p html.xpath('//*[preceding-sibling::h2[1]][following-sibling::h2[2]]').map(&:content)
#=> ["Extract me!", "Extract me too!"]

# Find all elements between the third h2 and the end
p html.xpath('//*[preceding-sibling::h2[3]]').map(&:content)
#=> ["Extract me three!", "Extract me four!"]

Upvotes: 2

Dan Healy

Reputation: 757

require 'rubygems'
require 'nokogiri'

h = '<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
'

doc = Nokogiri::HTML(h)

# Specify the range between delimiter tags that you want to extract
# triple dot is used to exclude the end point
# 1...2 means 1 and not 2
EXTRACT_RANGES = [
  2...3,
  4...5
]

# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2"
]

extracted_text = []

i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|

  if DELIMITER_TAGS.include?(el.name)
    # Each delimiter tag starts a new numbered section
    i += 1
  elsif EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }
    # Keep non-empty text from elements inside a wanted section
    s = el.inner_text.strip
    extracted_text << s unless s.empty?
  end

end

# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")

Upvotes: 4

user973254

This code may help you, but it still needs more information about where the tags are located (it would be better if the content you want to extract were enclosed between some specific tags).

require 'rubygems'
require 'nokogiri'
require 'pp'

html = '<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
'

doc = Nokogiri::HTML(html)

doc.xpath("//p").each do |el|
  pp el
end

Upvotes: 1
