JXU

Reputation: 67

Nokogiri extract nodes from html

I need to extract nodes from HTML (not just the inner text, so that I can preserve the formatting for further manual investigation). I wrote the code below, but because of how traverse works, I get duplicates in the new HTML file.

This is the real html to parse. http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm

Basically, I need to extract Item 10 and the part from "Executive Officers of the Registrant" up to the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not. I need the nodes rather than just the text because I want to preserve the tables, so that in my next step I can parse any tables in these sections.

Sample html:

html = "
<BODY>
<P>Dont need this </P>  
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"

I want to get:

html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"

Extraction should start when the start_keyword appears and stop when the end_keyword appears.

There are multiple sections I need to extract from one html. The keywords can appear in nodes with different names.

doc.at_css('body').traverse do |node|
    inMySection  = false

    if node.text.match(/#{start_keyword}/)
        inMySection = true
    elsif node.text.match(/#{end_keyword}/)
        inMySection = false
    end
    if inMySection
        #Extract the nodes
    end
end

I also tried to use xpath to achieve this without success after referring to these posts:

XPath axis, get all following nodes until

XPath to find all following siblings up until the next sibling of a particular type

Upvotes: 2

Views: 453

Answers (1)

toch

Reputation: 3945

The problem is not Nokogiri but your algorithm. You initialize the flag inMySection inside the block, so it is reset to false on every node and any true value set on a previous node is lost. Declare it once, before the traversal.
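To see the scoping issue in isolation, here is a small sketch in plain Ruby with no Nokogiri involved. The word list and the variable names (`words`, `reset_each_time`, `remembered`, `flag`) are made up for illustration; only the placement of the flag differs between the two loops.

```ruby
words = %w[skip Start a b End skip]

# Flag created inside the block: it starts as false again for every element,
# so only the element that matches the start keyword itself is selected.
reset_each_time = words.select do |w|
  flag_inside = false
  if w == 'Start'
    flag_inside = true
  elsif w == 'End'
    flag_inside = false
  end
  flag_inside
end

# Flag created once, outside the block: its value carries over between elements.
flag = false
remembered = words.select do |w|
  if w == 'Start'
    flag = true
  elsif w == 'End'
    flag = false
  end
  flag
end

p reset_each_time  # ["Start"]
p remembered       # ["Start", "a", "b"]
```

The second loop is the shape the traversal needs: the block only updates the flag, it never re-initializes it.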

Based on your sample HTML input and output, the following snippet works:

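require 'nokogiri'

nodes = Nokogiri::HTML(html)
inMySection = false  # declared outside the block so it survives between nodes
nodes.at_xpath('//body').traverse do |node|
  atEnd = node.text.match(/End/)
  if node.text.match(/Start/)
    inMySection = true
  elsif atEnd
    inMySection = false
  end
  # traverse yields children before their parent, so the End node
  # has to be kept explicitly or it would be removed with the rest
  node.remove unless inMySection || atEnd
end
print nodes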

Upvotes: 1
