Kermia

Reputation: 4221

How to get the "Text" of an HTML page? (WebBrowser - Delphi)

I'm using WebBrowser to get the source of HTML pages. The page source contains some text and some HTML tags, like this:

FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>

The HTML tags are arbitrary and we cannot predict them. Is there any way to get only the text, separated from the HTML tags?

Upvotes: 5

Views: 18554

Answers (5)

Alexander Sviridenkov

Reputation: 11

With the Delphi HTML Component Library, getting only the text from an HTML document is simple: the THtDocument.InnerText property returns the formatted text without tags.

Upvotes: 1

Jeroen Wiert Pluimers

Reputation: 24463

In essence: in general you can't.

HTML is a markup language with such wide use, and such mind-boggling possibilities for changing content dynamically, that it is virtually impossible to do this in general (just look at how hard the web browser vendors have to work to pass, for instance, the Acid tests). So you can only handle a subset.

For specific, well-defined subsets of HTML, you have a better chance:

First you need to get the HTML in a string, then parse that HTML.

Getting the HTML can be done, for instance, using Indy (see answers to this question).

Parsing depends heavily on your HTML and can be quite complex; you can try this question or this search.

You could use TWebBrowser as RRUZ suggests, but it depends on Internet Explorer.
Modern Windows systems no longer guarantee that Internet Explorer is installed...
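As a minimal sketch of the first step (getting the HTML into a string with Indy), something like the following could work. FetchHtml is a hypothetical helper name and the URL is only a placeholder; TIdHTTP is Indy's HTTP client class. The sketch assumes plain HTTP: an HTTPS URL would additionally need an SSL IOHandler assigned to the client, which is left out here.

```delphi
program FetchHtmlDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  IdHTTP; // Indy HTTP client

// Hypothetical helper (not from the answer): downloads a page
// and returns its source as a string.
function FetchHtml(const Url: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    Http.HandleRedirects := True; // follow 3xx redirects
    Result := Http.Get(Url);      // raises an exception on HTTP errors
  finally
    Http.Free;
  end;
end;

begin
  try
    // Placeholder URL, plain HTTP only in this sketch
    Writeln(FetchHtml('http://example.com/'));
  except
    on E: Exception do
      Writeln('Request failed: ', E.Message);
  end;
end.
```

The returned string still contains all the tags; it is the input for whatever parsing step you choose.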

--jeroen

Upvotes: 1

RRUZ

Reputation: 136381

You can use a TWebBrowser instance to parse the HTML code and extract the plain text.

See this sample:

uses
  MSHTML,
  SHDocVw,
  ActiveX,
  Variants; // VarArrayCreate (Delphi 6 and later)

function GetPlainText(const Html: string): string;
var
  DummyWebBrowser: TWebBrowser;
  Document       : IHtmlDocument2;
  DummyVar       : Variant;
begin
  Result := '';
  DummyWebBrowser := TWebBrowser.Create(nil);
  try
    // Open a blank page to create an IHtmlDocument2 instance
    DummyWebBrowser.Navigate('about:blank');
    Document := DummyWebBrowser.Document as IHtmlDocument2;
    if Assigned(Document) then // Check the document
    begin
      // Create a variant array to write the HTML code to the IHtmlDocument2
      DummyVar    := VarArrayCreate([0, 0], varVariant);
      DummyVar[0] := Html; // Assign the HTML code to the variant array
      Document.Write(PSafeArray(TVarData(DummyVar).VArray)); // Set the HTML in the document
      Document.Close;
      Result := (Document.body as IHTMLBodyElement).createTextRange.text; // Get the plain text
    end;
  finally
    DummyWebBrowser.Free;
  end;
end;

Upvotes: 9

Svisstack

Reputation: 16616

If the asterisks are always present, you can simply take every character between the ** markers. If they are not, you can rewrite the string and erase all tags (everything that starts with < and ends with >). Or you can use a DOM parser library for it.
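The "erase everything between < and >" idea can be sketched in a few lines of plain Delphi. StripTags is a hypothetical helper name, not something from a library, and the sample string is made up for illustration. It is deliberately naive: it does not decode entities such as &amp;, it keeps the contents of script and style elements, and it is confused by a > inside an attribute value or a comment, so it only suits simple, well-behaved markup.

```delphi
program StripTagsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: copies every character that lies outside
// a <...> pair and drops the rest.
function StripTags(const Html: string): string;
var
  I: Integer;
  C: Char;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
  begin
    C := Html[I];
    if C = '<' then
      InTag := True            // a tag starts: stop copying
    else if C = '>' then
      InTag := False           // a tag ends: resume copying (the '>' itself is dropped)
    else if not InTag then
      Result := Result + C;    // ordinary text outside any tag
  end;
end;

begin
  // prints: Hello There
  Writeln(StripTags('<P align=center><FONT color=#ccffcc size=3>Hello There</FONT></P>'));
end.
```

For anything beyond simple fragments, the TWebBrowser approach or a DOM parser is the safer choice.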

Upvotes: 1

irishbuzz

Reputation: 2460

You should look at using the Delphi DOM HTML parser.

Upvotes: 2
