Looping through node created by HtmlAgilityPack

Question

I need to parse this HTML using HtmlAgilityPack and C#. I can get the div class="patent_bibdata" node, but I don't know how to loop through its child nodes.

In this sample there are 6 hrefs, but I need to separate them into two groups: Inventors and Classification. I'm not interested in the last two. There can be any number of hrefs in this div.

As you can see, there is text before each of the two groups that says what the hrefs are.

code snippet

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.google.com/patents/US3748943");
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);

So how would you do this?


    Inventors: 
    
    Ronald T. Lashley
    , 
    
    Ronald T. Lashley
    

    Current U.S. Classification: 
    84/312.00P;
    84/312.00R

    

    
    View patent at USPTO

    
    Search USPTO Assignment Database

Wanted result:

InventorGroup =

    Ronald T. Lashley
    Thomas R. Lashley

ClassificationGroup =

    84/312.00P;
    84/312.00R

The page I'm trying to scrape: http://www.google.com/patents/US3748943

// Anders

PS: I know that on this page the inventors' names are the same, but on most pages they are different!

Simon Mourier · Accepted Answer

XPath is your friend! Something like this will get you the inventors' names:

HtmlWeb w = new HtmlWeb();
HtmlDocument doc = w.Load("http://www.google.com/patents/US3748943");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='patent_bibdata']/br[1]/preceding-sibling::a"))
{
    Console.WriteLine(node.InnerHtml);
}
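
If you want both groups in one pass, you can also walk the div's child nodes and let the label text decide which list each link goes into. This is only a rough sketch: the inline HTML is a hypothetical stand-in for the real page (it assumes the labels are plain text directly inside the div and that each group ends with a br), and the variable names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for the real page; the actual structure may differ.
        string html =
            "<div class='patent_bibdata'>" +
            "Inventors: <a href='#'>Ronald T. Lashley</a>, <a href='#'>Thomas R. Lashley</a><br>" +
            "Current U.S. Classification: <a href='#'>84/312.00P</a>; <a href='#'>84/312.00R</a><br>" +
            "<br><a href='#'>View patent at USPTO</a><br>" +
            "<a href='#'>Search USPTO Assignment Database</a>" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@class='patent_bibdata']");

        var inventors = new List<string>();
        var classifications = new List<string>();
        List<string> current = null; // the group the next link belongs to, if any

        foreach (HtmlNode child in div.ChildNodes)
        {
            if (child.Name == "a")
            {
                // Links seen before any label, or after a br, are skipped.
                current?.Add(HtmlEntity.DeEntitize(child.InnerText).Trim());
            }
            else if (child.Name == "br")
            {
                // Assumption: a br closes the current group.
                current = null;
            }
            else if (child.InnerText.Contains("Inventors"))
            {
                current = inventors;
            }
            else if (child.InnerText.Contains("Classification"))
            {
                current = classifications;
            }
        }

        Console.WriteLine("Inventors: " + string.Join(", ", inventors));
        Console.WriteLine("Classification: " + string.Join(", ", classifications));
    }
}
```

Under the same assumed layout, an XPath-only alternative for the classification links would be something like //div[@class='patent_bibdata']/br[2]/preceding-sibling::a[preceding-sibling::br], which selects the links that sit between the first and second br. Keep in mind that SelectNodes returns null when nothing matches, so check for that before looping.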
