Reputation: 67
I need to extract nodes from HTML (nodes rather than inner text, so that I preserve the formatting for further manual investigation). I wrote the code below, but because of how traverse works, I get duplicates in the new HTML file.
This is the real html to parse. http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm
Basically I need to extract Item 10 and the part from "Executive Officers of the Registrant" to the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not. I need the nodes rather than just the text because I want to preserve the tables, so that in my next step I can parse any tables in these sections.
Sample html:
html = "
<BODY>
<P>Dont need this </P>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
I want to get:
html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
Extraction should start when the start_keyword appears and stop when the end_keyword appears.
There are multiple sections I need to extract from one HTML file, and the keywords can appear in nodes with different names.
doc.at_css('body').traverse do |node|
  inMySection = false
  if node.text.match(/#{start_keyword}/)
    inMySection = true
  elsif node.text.match(/#{end_keyword}/)
    inMySection = false
  end
  if inMySection
    # Extract the nodes
  end
end
I also tried to use xpath to achieve this without success after referring to these posts:
XPath axis, get all following nodes until
XPath to find all following siblings up until the next sibling of a particular type
Upvotes: 2
Views: 453
Reputation: 3945
It's not a problem with Nokogiri but with your algorithm. You've put your flag inMySection
inside the loop, which means it is reset to false at every step,
so you lose the value if it was previously set to true.
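To make the difference concrete, here is a minimal sketch in plain Ruby with no Nokogiri involved. The array of strings is a hypothetical stand-in for the node texts, and the two method names are invented for illustration only. It only shows how the position of the flag changes what gets kept:

```ruby
# Hypothetical stand-in for the text of each traversed node.
NODE_TEXTS = ["Dont need this", "Start", "Text To Extract 1",
              "Text to Extract 2", "End"].freeze

# Flag declared inside the block: it is reset to false on every iteration,
# so only the item that matches the start keyword is ever kept.
def collect_with_flag_inside(texts)
  kept = []
  texts.each do |text|
    in_section = false
    if text =~ /Start/
      in_section = true
    elsif text =~ /End/
      in_section = false
    end
    kept << text if in_section
  end
  kept
end

# Flag declared outside the block: its value carries over between iterations,
# so everything from the start keyword up to (not including) the end keyword is kept.
def collect_with_flag_outside(texts)
  kept = []
  in_section = false
  texts.each do |text|
    if text =~ /Start/
      in_section = true
    elsif text =~ /End/
      in_section = false
    end
    kept << text if in_section
  end
  kept
end

p collect_with_flag_inside(NODE_TEXTS)
p collect_with_flag_outside(NODE_TEXTS)
```

With the flag inside the block only "Start" survives; with the flag outside, the items between the two keywords are kept as well.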
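Based on your sample HTML, the following snippet keeps the nodes from the Start marker up to the End marker and removes the rest. Note that the node matching End is itself removed, because the flag is cleared before the removal check runs:
require 'nokogiri'
nodes = Nokogiri::HTML(html)
# Declare the flag outside the block so its value persists across nodes
inMySection = false
nodes.at_xpath('//body').traverse do |node|
  if node.text.match(/Start/)
    inMySection = true
  elsif node.text.match(/End/)
    inMySection = false
  end
  # Drop every node visited while the flag is false
  node.remove unless inMySection
end
print nodes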
Upvotes: 1