Guria Doll

Reputation: 41

How to get all displayed text from a web page in C#

Hi, I am working on a data scraping application in C#.

I want to get all of the displayed text, but not the HTML tags.

Here's my code:

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc =
    web.Load(@"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;

InnerText is returning some tags and script content as well, but I only want the text that is visible to the user. Please help me. Thanks.

Upvotes: 4

Views: 8928

Answers (3)

Ghasem

Reputation: 15573

I believe this will solve your problem:

Method 1 – In Memory Cut and Paste

Use WebBrowser control object to process the web page, and then copy the text from the control…

Use the following code to download the web page:

// Create the WebBrowser control
WebBrowser wb = new WebBrowser();
// Add an event handler to process the document when the download is completed
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
// Download the web page (Navigate accepts either a string or a Uri)
wb.Navigate(urlPath);

Use the following event handler to process the downloaded web page text:

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    // Select the whole page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    // CleanText is a helper that is not shown here
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
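The CleanText helper called above is not included in the answer. A minimal sketch of what such a helper might do, assuming its job is only to tidy the whitespace left over from the clipboard copy, could look like this. The class name TextCleaner and the exact clean-up rules are hypothetical, not taken from the original article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the CleanText helper referenced above.
    // It only normalizes whitespace; it does not parse or strip HTML.
    public static string CleanText(string raw)
    {
        if (string.IsNullOrEmpty(raw))
            return string.Empty;

        // Normalize Windows and old Mac line endings to "\n".
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop a space directly before or after a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Collapse consecutive line breaks into one.
        text = Regex.Replace(text, @"\n{2,}", "\n");

        return text.Trim();
    }
}
```

With this sketch, the call in DisplayText would become TextCleaner.CleanText(Clipboard.GetText()), or the method could be copied into the form class as-is.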

Method 2 – In Memory Selection Object

This is a second method of processing the downloaded web page text. It seems to take just a bit longer (the difference is minimal). However, it avoids using the clipboard and the limitations associated with that.

// Requires a reference to the Microsoft.mshtml assembly and "using mshtml;"
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Get the WebBrowser control and its underlying IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    // Select all the text on the page and get the selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    // Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.

The XmlDocument object will load / process HTML files with only 3 simple lines of code:

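// Requires "using System.Xml;"
XmlDocument document = new XmlDocument();
// Load needs a full URL including the scheme, and it throws an
// XmlException if the page is not well-formed XML (e.g. XHTML)
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;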

There you have it! Three simple ways to scrape only displayed text from web pages with no external "packages" involved.

Upvotes: 2

Avishek

Reputation: 1896

To remove all HTML tags from a string, you can use a regular expression (Regex is in System.Text.RegularExpressions):

string output = Regex.Replace(inputString, "<[^>]*>", "");

To remove one specific tag:

string output = Regex.Replace(inputString, "(?i)<td[^>]*>", "");

Hope it helps :)
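One thing to keep in mind with the patterns above: they remove the tags themselves but keep whatever sits between them, so the contents of script and style elements stay in the output. A rough sketch of how that could be handled with regular expressions alone follows. The class and method names are made up for illustration, and a regex is not a full HTML parser, so treat this as a starting point only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes anything that looks like a tag; text between tags is kept,
    // including the body of <script> and <style> elements.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", "");
    }

    // First drops <script> and <style> elements together with their
    // contents, then strips the remaining tags.
    public static string StripTagsAndScripts(string html)
    {
        string withoutBlocks = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return StripTags(withoutBlocks);
    }
}
```

For the input `<p>Hi</p><script>var x = 1;</script>`, StripTags keeps the script body next to the text, while StripTagsAndScripts returns only `Hi`.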

Upvotes: 0

junichiro

Reputation: 5482

To remove javascript and css:

foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();

To remove comments (untested; needs "using System.Linq;"):

foreach(var comment in doc.DocumentNode.Descendants().OfType<HtmlCommentNode>().ToArray())
    comment.Remove();
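Putting these steps together, a minimal sketch of a helper built on HtmlAgilityPack might look like the following. The class name VisibleTextExtractor and the method GetText are made up for illustration, and the sketch assumes the HTML has already been downloaded into a string.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleTextExtractor
{
    // Hypothetical helper: returns the text of an HTML string after
    // dropping script, style and comment nodes.
    public static string GetText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the unwanted nodes first (ToArray), because removing
        // nodes while enumerating Descendants() would modify the tree
        // that is being walked.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToArray();

        foreach (var node in unwanted)
            node.Remove();

        // InnerText still contains entities such as &amp;, so decode them.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

This only sees text that is present in the downloaded HTML. Anything the page adds later through JavaScript will not be there, and elements hidden with CSS are still included.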

Upvotes: 0
