Reputation: 41
Hi, I am working on a data scraping application in C#.
I want to get all the display text but not the HTML tags.
Here's my code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(@"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;
InnerText still returns some tags and script content as well, but I only want the display text that is visible to the user. Please help. Thanks.
Upvotes: 4
Views: 8928
Reputation: 15573
I believe this will solve your problem:
Method 1 – In Memory Cut and Paste
Use WebBrowser control object to process the web page, and then copy the text from the control…
Use the following code to download the web page:
//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add a new event to process document when download is completed
wb.DocumentCompleted +=
new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage
wb.Url = urlPath;
Use the following event code to process the downloaded web page text:
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
WebBrowser wb = (WebBrowser)sender;
wb.Document.ExecCommand("SelectAll", false, null);
wb.Document.ExecCommand("Copy", false, null);
textResultsBox.Text = CleanText(Clipboard.GetText());
}
Method 2 – In Memory Selection Object
This is a second method of processing the downloaded web page text. It seems to take just a bit longer (very minimal difference). However, it avoids using the clipboard and the limitations associated with that.
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{ //Create the WebBrowser control and IHTMLDocument2
WebBrowser wb = (WebBrowser)sender;
IHTMLDocument2 htmlDocument =
wb.Document.DomDocument as IHTMLDocument2;
//Select all the text on the page and create a selection object
wb.Document.ExecCommand("SelectAll", false, null);
IHTMLSelectionObject currentSelection = htmlDocument.selection;
//Create a text range and send the range's text to your text box
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
textResultsBox.Text = range.text;
}
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.
The XmlDocument object will load and process HTML with only 3 simple lines of code, provided the markup is well-formed XML (XHTML); on ordinary, loosely formed HTML it throws an XmlException:
XmlDocument document = new XmlDocument();
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;
There you have it! Three simple ways to scrape only displayed text from web pages with no external "packages" involved.
Upvotes: 2
Reputation: 1896
For removing all html tags from a string you can use:
String output = inputString.replaceAll("<[^>]*>", "");
For removing a specific tag:
String output = inputString.replaceAll("(?i)<td[^>]*>", "");
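The two lines above use Java's `String.replaceAll`. Since the question is about C#, a rough equivalent with `System.Text.RegularExpressions.Regex` might look like the sketch below. The class and method names (`TagStripper`, `StripAllTags`, `StripOpeningTag`) are made up for illustration, and the `\b` is an addition so that the pattern does not also match longer tag names that merely start with the same letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
static class TagStripper
{
    // Removes anything that looks like a tag: "<", then any run of non-">" characters, then ">".
    public static string StripAllTags(string input) =>
        Regex.Replace(input, "<[^>]*>", "");

    // Removes only the opening tags with the given name, case-insensitively.
    // Closing tags such as </td> are left in place, as in the pattern above.
    public static string StripOpeningTag(string input, string tagName) =>
        Regex.Replace(input, "<" + Regex.Escape(tagName) + @"\b[^>]*>", "",
                      RegexOptions.IgnoreCase);
}
```

Note that a regex like this only strips the tags themselves: the text inside `<script>` and `<style>` elements stays in the output, and entities such as `&amp;` are not decoded.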
Hope it helps :)
Upvotes: 0
Reputation: 5482
To remove JavaScript and CSS:
foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
script.Remove();
foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
style.Remove();
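To remove comments (untested; Descendants() takes a tag name rather than an XPath expression, so filter on the node type instead):
foreach(var comment in doc.DocumentNode.Descendants().Where(n => n.NodeType == HtmlNodeType.Comment).ToArray())
comment.Remove();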
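Putting the removal steps together with the `InnerText` call from the question, a minimal sketch could look like this. It assumes the HtmlAgilityPack NuGet package is referenced; `VisibleText.Extract` is a hypothetical helper name, not part of the library. It does not handle elements hidden through CSS (for example `display:none`) or content that the page loads later through JavaScript.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper; not part of HtmlAgilityPack.
static class VisibleText
{
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect script, style and comment nodes first (ToArray),
        // so the tree is not modified while it is being enumerated.
        var hidden = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToArray();
        foreach (var node in hidden)
            node.Remove();

        // Decode entities such as &amp; and collapse runs of whitespace.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

To use it with the code from the question, pass the downloaded markup, for example `VisibleText.Extract(doc.DocumentNode.OuterHtml)`.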
Upvotes: 0