Reputation: 49

Parsing text from webpage using htmlagility

I am trying to scrape some text from a web page, with little success so far. I am using HtmlAgilityPack for this.

The source of the web page looks like this:

<div class="state info">
        <h4 class="member-states parse"><span class="trim">Nebraska NE</span></h4>

I just want to retrieve the text "Nebraska NE" and leave everything else from the webpage out. Is there a simple way to go about this?

Upvotes: 1

Answers (3)

BRAHIM Kamel

Reputation: 13794

Here is a sample showing how you can get this:

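    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Xml;
    using System.Xml.Linq;
    using HtmlAgilityPack;

    // Download the page and convert it to XML; uriofhtmlPageToload is the page URL.
    HtmlWeb htmlWeb = new HtmlWeb();
    MemoryStream ms = new MemoryStream();
    XmlTextWriter xmlTxtWriter = new XmlTextWriter(ms, Encoding.ASCII);
    htmlWeb.LoadHtmlAsXml(uriofhtmlPageToload, xmlTxtWriter);
    // Make sure everything is written to the stream before reading it back.
    xmlTxtWriter.Flush();
    ms.Position = 0;

    // Load the XML into LINQ to XML and locate the <body> element.
    XDocument xdoc = XDocument.Load(ms);
    XElement xHtml = xdoc.Root;
    string nameSpace = "{" + xdoc.Root.GetDefaultNamespace().ToString() + "}";
    XElement xBody = xHtml.Element(nameSpace + "body");
    List<XElement> xBodyElts = xBody.Descendants().ToList();
    string elt = string.Empty;

    foreach (var eltPage in xBodyElts)
    {
        if (eltPage.Name == nameSpace + "div")
        {
            // "page" is the class this sample filters on; use the class of your
            // own div instead (the div in the question has class "state info").
            if (eltPage.Attribute("class") != null && eltPage.Attribute("class").Value == "page")
            {
                foreach (XElement eltBlockh4 in eltPage.Descendants(nameSpace + "h4"))
                {
                    foreach (XElement eltBlockspan in eltBlockh4.Descendants(nameSpace + "span"))
                    {
                        // Keep the text of the span whose class is exactly "trim".
                        if (eltBlockspan.Attribute("class") != null && eltBlockspan.Attribute("class").Value == "trim")
                        {
                            elt = eltBlockspan.Value;
                        }
                    }
                }
            }
        }
    }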

Upvotes: 1

Venkateshwaran Selvaraj

Reputation: 1785

With Beautiful Soup, it is very easy to traverse the markup.

Here is a simple example:

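from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup('<div class="state info"> <h4 class="member-states parse"><span class="trim">Nebraska NE</span></h4>', 'html.parser')
# print is a function in Python 3
print(soup.text)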

This prints:

 Nebraska NE

This is just a suggestion, in case you are open to other ways of web scraping.
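
Note that soup.text returns all the text in the document, not only the span. If you want just the text of the span with class "trim", here is a minimal sketch that uses only Python's standard html.parser module instead of Beautiful Soup. The class name TrimSpanText and the helper trim_span_texts are made up for this illustration, and the HTML string is the snippet from the question with its tags closed.

```python
from html.parser import HTMLParser


class TrimSpanText(HTMLParser):
    """Collects the text of every <span> whose class list includes 'trim'."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a matching span
        self.texts = []  # one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # A span nested inside a matching span: keep tracking the depth.
            self.depth += 1
        elif "trim" in classes:
            # A new matching span: start a fresh text entry.
            self.texts.append("")
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def trim_span_texts(html):
    """Hypothetical helper: returns the stripped text of each matching span."""
    parser = TrimSpanText()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


html = ('<div class="state info"> <h4 class="member-states parse">'
        '<span class="trim">Nebraska NE</span></h4></div>')
print(trim_span_texts(html))
```

The helper returns a list, so it also covers the case where the page has more than one matching span.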

Upvotes: 1

Marco

Reputation: 23945

You could do it like this:

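// requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("path/to/html");
// select every span whose class attribute contains 'trim'
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[contains(@class,'trim')]");
if (spans != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode span in spans)
    {
        // read the text via 'span.InnerText'
    }
}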

If this text appears only once, you can simply assign it to a string. If it appears more than once, store it in a collection such as a List<string>.

Upvotes: 1
