Reputation: 4221
I'm using WebBrowser to get the source of HTML pages. Our page source contains some text and some HTML tags, like this:
FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>
The HTML tags are random and we cannot predict them. Is there any way to extract only the text and separate it from the HTML tags?
Upvotes: 5
Views: 18554
Reputation: 11
Using the Delphi HTML Component Library, getting only the text from an HTML document is simple: the THtDocument.InnerText property returns the formatted text without tags.
Upvotes: 1
Reputation: 24463
In essence: in general, you can't.
HTML is a markup language with such wide use, and such mind-boggling possibilities for changing content dynamically, that it is virtually impossible to do this for arbitrary pages (just look at how hard browser vendors have to work to pass, for instance, the Acid tests). So you can only handle a subset.
For specific, well-defined subsets of HTML you have a better chance:
First you need to get the HTML in a string, then parse that HTML.
Getting the HTML can be done for instance using Indy (see answers to this question).
Parsing depends heavily on your HTML and can be quite complex; you can try this question or this search.
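As a rough illustration of the download step, here is a minimal sketch using Indy's TIdHTTP. The function name DownloadHtml and the URL are placeholders of mine, not something from the linked answers, and the sketch assumes a plain http:// address (https would additionally need an SSL IOHandler, which is not shown).

```pascal
program DownloadHtmlDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  IdHTTP;

// Hypothetical helper: fetches the page at Url and returns its source as a string.
function DownloadHtml(const Url: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    // Get raises an exception on connection or HTTP errors
    Result := Http.Get(Url);
  finally
    Http.Free;
  end;
end;

begin
  // Placeholder URL, replace with the page you actually need
  Writeln(DownloadHtml('http://example.com/'));
end.
```

The returned string is still raw HTML; extracting the text from it is the separate parsing step.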
You could use TWebBrowser as RRuz suggests, but it depends on Internet Explorer.
Modern Windows systems do not guarantee that Internet Explorer is installed any more...
--jeroen
Upvotes: 1
Reputation: 136381
You can use a TWebBrowser instance to parse the HTML code and extract the plain text from it.
See this sample:
uses
  MSHTML,
  SHDocVw,
  ActiveX,
  Variants, // VarArrayCreate
  Forms;    // Application.ProcessMessages

function GetPlainText(const Html: string): string;
var
  DummyWebBrowser: TWebBrowser;
  Document: IHtmlDocument2;
  DummyVar: Variant;
begin
  Result := '';
  DummyWebBrowser := TWebBrowser.Create(nil);
  try
    // open a blank page to create an IHtmlDocument2 instance
    DummyWebBrowser.Navigate('about:blank');
    // Navigate is asynchronous: wait for the blank page, otherwise Document may still be nil
    while DummyWebBrowser.ReadyState <> READYSTATE_COMPLETE do
      Application.ProcessMessages;
    Document := DummyWebBrowser.Document as IHtmlDocument2;
    if Assigned(Document) then // check the document
    begin
      // create a variant array to write the html code to the IHtmlDocument2
      DummyVar := VarArrayCreate([0, 0], varVariant);
      DummyVar[0] := Html; // assign the html code to the variant array
      Document.Write(PSafeArray(TVarData(DummyVar).VArray)); // set the html in the document
      Document.Close;
      // get the plain text
      Result := (Document.body as IHTMLBodyElement).createTextRange.text;
    end;
  finally
    DummyWebBrowser.Free;
  end;
end;
Upvotes: 9
Reputation: 16616
If the asterisks are always there, you can simply take every character between the ** markers.
If the asterisks are not always there, you can scan the string and erase all tags (everything that starts with < and ends with >). Or you can use a DOM parser library for it.
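The "erase everything between < and >" idea could look roughly like the sketch below. StripHtmlTags is a name made up for this illustration, not a library function. It is a naive scanner: it does not decode entities such as &nbsp;, it keeps the contents of script and style blocks, and a stray > in the text is dropped, so treat it as a starting point for simple markup like the sample in the question rather than a general HTML parser.

```pascal
program StripTagsDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: keeps every character outside a <...> pair
// and drops the tags themselves.
function StripHtmlTags(const Html: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
  begin
    if Html[I] = '<' then
      InTag := True            // a tag starts: stop copying
    else if Html[I] = '>' then
      InTag := False           // the tag ends: resume copying after it
    else if not InTag then
      Result := Result + Html[I];
  end;
end;

begin
  // prints: Hello There
  Writeln(Trim(StripHtmlTags(
    '<P align=center><FONT color=#ccffcc size=3>Hello There</FONT></P>')));
end.
```

Because the input here begins with a complete tag, nothing but the text survives. A fragment that starts in the middle of a tag, like the one in the question, would leave the leading tag name behind, which is one more reason to prefer a real parser when the HTML is unpredictable.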
Upvotes: 1