Reputation: 49
I am trying to scrape some text from a web page, with little success so far. I am using HtmlAgilityPack for this.
The source of the web page looks like this:
<div class="state info">
<h4 class="member-states parse"><span class="trim">Nebraska NE</span></h3>
I just want to retrieve the text "Nebraska NE" and leave everything else from the webpage out. Is there a simple way to go about this?
Upvotes: 1
Views: 449
Reputation: 13755
Here is a sample of how you can get this:
// Requires: using System.IO; using System.Text; using System.Xml;
// using System.Xml.Linq; using HtmlAgilityPack;
HtmlWeb htmlWeb = new HtmlWeb();
MemoryStream ms = new MemoryStream();
XmlTextWriter xmlTxtWriter = new XmlTextWriter(ms, Encoding.ASCII);
// Download the page and write it into the stream as XML
htmlWeb.LoadHtmlAsXml(uriofhtmlPageToload, xmlTxtWriter);
ms.Position = 0;
XDocument xdoc = XDocument.Load(ms);
XElement xHtml = xdoc.Root;
// XNamespace + local name also works when there is no default namespace
XNamespace ns = xHtml.GetDefaultNamespace();
XElement xBody = xHtml.Element(ns + "body");
string elt = string.Empty;
foreach (XElement eltPage in xBody.Descendants(ns + "div"))
{
    // "state info" is the class of the div shown in the question
    XAttribute divClass = eltPage.Attribute("class");
    if (divClass != null && divClass.Value == "state info")
    {
        foreach (XElement eltBlockh4 in eltPage.Descendants(ns + "h4"))
        {
            foreach (XElement eltBlockspan in eltBlockh4.Descendants(ns + "span"))
            {
                XAttribute spanClass = eltBlockspan.Attribute("class");
                if (spanClass != null && spanClass.Value == "trim")
                {
                    // text inside <span class="trim">
                    elt = eltBlockspan.Value;
                }
            }
        }
    }
}
Upvotes: 1
Reputation: 1785
With Beautiful Soup, it is very easy to traverse the markup.
Here is a simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div class="state info"> <h4 class="member-states parse"><span class="trim">Nebraska NE</span></h3>', 'html.parser')
print(soup.text.strip())
Prints
Nebraska NE
This is just a suggestion, in case you are open to other ways of web scraping.
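Note that soup.text returns the text of everything that was parsed, so on a full page it would include far more than the state name. Below is a minimal sketch of picking out only the span with class "trim". It deliberately uses Python's standard-library html.parser rather than Beautiful Soup, so nothing needs to be installed; the names TrimSpanText and extract_trim_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TrimSpanText(HTMLParser):
    """Collect the text of every <span> whose class list includes 'trim'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching span
        self.parts = []     # text chunks of the span currently being read
        self.results = []   # finished text, one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "trim" in classes:
            # Count nested spans so we know which </span> closes ours
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_trim_text(html):
    """Return a list with the text of each matching span (hypothetical helper)."""
    parser = TrimSpanText()
    parser.feed(html)
    parser.close()
    return parser.results


html = ('<div class="state info"> <h4 class="member-states parse">'
        '<span class="trim">Nebraska NE</span></h3>')
print(extract_trim_text(html))  # ['Nebraska NE']
```

With Beautiful Soup itself, soup.find("span", class_="trim") should select the same element, and its .text would then hold only the state name.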
Upvotes: 1
Reputation: 23917
You could do it like this:
HtmlDocument doc = new HtmlDocument();
doc.Load("path/to/html");
// select each span whose class contains 'trim'
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[contains(@class,'trim')]");
// SelectNodes returns null when nothing matches
if (spans != null)
{
    foreach (HtmlNode span in spans)
    {
        // read the text from span.InnerText
    }
}
If this text appears only once, you can simply assign it to a string; if it appears more than once, store it in a collection such as a List<string>.
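One caveat about the XPath above: contains(@class,'trim') is a substring test, so it would also match a class such as "trimmed". The sketch below shows the difference in Python (the language of the Beautiful Soup answer in this thread). The two helper names are hypothetical and exist only to illustrate the two kinds of match.

```python
def substring_match(class_attr, needle="trim"):
    """Mimics XPath contains(@class, 'trim'): true for any substring hit."""
    return needle in class_attr


def token_match(class_attr, token="trim"):
    """True only if the token is one of the whitespace-separated classes."""
    return token in class_attr.split()


for value in ("trim", "member trim", "trimmed"):
    print(value, substring_match(value), token_match(value))
```

If the stricter behaviour is wanted in XPath 1.0, the usual idiom is //span[contains(concat(' ', normalize-space(@class), ' '), ' trim ')]. For the markup in the question, both forms select the same span.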
Upvotes: 1