Parsing contents of paragraph elements with Nokogiri

Question

I'd like to know the proper way to parse a block of contents with Nokogiri:

I have some documents to parse that originally used a format in which each main container was a <p>. The main pieces of information within each one are divided up, oddly, with <font> tags.

Effectively, a stock sample of <p> contents looks like the following and is a typical example (some have a lot more content, some a lot less):


  
    
      October 10, 1990 - Maybe a Title
    - 
    
      Some long text here.         
      
        [Blah Blah, October 27, 1982 p. 2
        ]
      . 
      More content. 
      [Another Source, 1971, issue 01/4]
      . 
    
    
      
        Another fantastic article. 
        [Some Source, October 4, p.6]

Essentially the "font size" attribute is what sets each component apart in the article. The main points to extract are the FIRST <font> tags (that is, the article date and main title, if a title is given), then the actual content.



Presently I have all paragraph chunks coming out with: doc.xpath('//p').each do |node|

However, I am not sure whether I should pass each chunk through Nokogiri again to parse out its contents or just run it all through a regex. I was hoping for a small example of doing this "properly", which I assume means running a nested XPath query inside the initial block to pull the elements out. I assume there is a way to pull out the subcomponents based on the font size demarcation, but I have not seen a specific example of this yet.

Michael Kohl · Accepted Answer

Does that help you get started?

>> doc.xpath('//p').each do |node|
..     puts node.xpath("font[@size='5']/font").first.content.strip
..   end #=> 0
October 10, 1990 - Maybe a Title

Build similar expressions for the other parts you need and you are done :-)
