nostriel

Reputation: 37

Fetch the content of a web page with Delphi

I am trying to retrieve the <table><tbody> section of this page:

http://www.mfinante.ro/infocodfiscal.html?captcha=null&cod=18505138

I am using Delphi XE7.

I tried using IXMLHttpRequest, WinInet (InternetOpenURL(), InternetReadFile()), TRestClient/TRestRequest/TRestResponse, TIdHTTP.Get(), but all they retrieve is some gibberish, like this:

<html><head><meta http-equiv="Pragma" content="no-cache"/>'#$D#$A'<meta http-equiv="Expires" content="-1"/>'#$D#$A'<meta http-equiv="CacheControl" content="no-cache"/>'#$D#$A'<script>'#$D#$A'(function(){p={g:"0119a4477bb90c7a81666ed6496cf13b5aad18374e35ca73f205151217be1217a93610c5877ece5575231e088ff52583c46a8e8807483e7185307ed65e",v:"87696d3d40d846a7c63fa2d10957202e",u:"1",e:"1",d:"1",a:"challenge etc.

Look at this code for example:

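program htttpget;

{$APPTYPE CONSOLE}
{$R *.res}

uses
  Windows, SysUtils, IdHTTP, ActiveX;

var
  CoResult: Integer;
  HTTP: TIdHTTP;
  Query: String;
  Buffer: String;
begin
  try
    { COM is only needed for the IXMLHttpRequest attempts; Indy does not use it. }
    CoResult := CoInitializeEx(nil, COINIT_MULTITHREADED);
    if not((CoResult = S_OK) or (CoResult = S_FALSE)) then
    begin
      Writeln('Failed to initialize COM library.');
      Exit;
    end;
    try
      HTTP := TIdHTTP.Create;
      try
        Query := 'http://www.mfinante.ro/infocodfiscal.html?captcha=null' +
                 '&cod=18505138';
        Buffer := HTTP.Get(Query);
        Writeln(Buffer);
      finally
        HTTP.Free;  { release the client even if Get raises }
      end;
    finally
      CoUninitialize;
    end;
  except
    { report the error instead of swallowing it silently }
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.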

What is wrong with this page? I have not written many "get" requests in my life, but other websites return normal responses. Can someone at least explain to me why this isn't working?

Are there other ways to get the content of this web page? Are there other programming languages (Java, scripting, etc.) that can do this without third-party software (such as using the Firefox source code to emulate a browser, fetching the page without showing a window, and then copying the content)?

Upvotes: 1

Views: 6203

Answers (2)

RaelB

Reputation: 3481

You can use TWebBrowser for this.

See this post: How can I get HTML source code from TWebBrowser

The answer by RRUZ, which you can find in many places on the internet, is not what you are looking for. It gives you the original HTML source, just as IdHttp.Get() would.

However, the answer by Mehmet Fide will give you the HTML source of the DOM, which is what you are looking for.

I offer a variation here. (It includes some hacks that were required at the time to get the full DOCTYPE. Not sure if they are still needed...)

function EndStr(const S: String; const Count: Integer): String;
var
  I: Integer;
  Index: Integer;
begin
  Result := '';
  for I := 1 to Count do
  begin
    Index := Length(S)-I+1;
    if Index > 0 then
      Result := S[Index] + Result;
  end;
end;

function GetHTMLDocumentSource(WebBrowser: TWebBrowser; var Charset: String):
    String;
var
  Element: IHTMLElement;
  Node: IHTMLDomNode;
  Document: IHTMLDocument2;
  I: Integer;
  S: String;
begin
  Result := '';
  Document := WebBrowser.Document as IHTMLDocument2;

  For I := 0 to Document.all.length -1 do
  begin
    Element := Document.all.item(I, 0) as IHTMLElement;
    If Element.tagName = '!' Then
    begin
      Node := Element as IHTMLDomNode;
      If (Node <> nil) and (Pos('CTYPE', UpperCase(Node.nodeValue)) > 0) Then
      begin
        S := VarToStr(Node.nodeValue);  { don't change case of result }
        if Copy(Uppercase(S), 1, 5) = 'CTYPE' then
          S := 'DO' + S;
        if Copy(Uppercase(S), 1, 7) = 'DOCTYPE' then
          S := '<!' + S;
        if Uppercase(S) = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//E' then
          S := S +'N">';

        if EndStr(Lowercase(S), 3) = '.dt' then
          S := S + 'd"';
        if EndStr(Lowercase(S), 5) = '.dtd"' then
          S := S + '>';

        Result := Result + S;
      end;
    end
    Else
      Result := Result + Element.outerHTML;

    If Element.tagName = 'HTML' Then
      Break;
  end;
  Charset := Document.charset;
end;

So call WebBrowser.Navigate(URL), then retrieve the HTML source in the OnDocumentComplete event handler.

However, with your URL you will see that the OnDocumentComplete event fires twice :(, so you need to get the HTML from the last time it fires.

You can refer to the post "How do I avoid the OnDocumentComplete event for embedded iframe elements?" for info on how to detect the final OnDocumentComplete event. However, I tried it and it did not work for me. You may need to use some other strategy to get the last event.
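One possible "other strategy", offered only as a rough sketch: restart a timer on every OnDocumentComplete and read the source once no further event has arrived for a while. This is a heuristic, not a guarantee that the scripts have finished. The code is a fragment of a VCL form unit and assumes a form TForm1 holding WebBrowser1 (TWebBrowser) and SettleTimer (a TTimer with Enabled = False). The SettleTimer name and the 1000 ms interval are my own assumptions, and GetHTMLDocumentSource is the function shown above. The handler signature is the one used by recent Delphi versions; older ones declare URL as var.

```delphi
// Fragment of a form unit (uses SHDocVw, ExtCtrls). Component names and the
// timer interval are assumptions made for this sketch.

procedure TForm1.FormCreate(Sender: TObject);
begin
  SettleTimer.Interval := 1000;  // assumed "quiet period" in milliseconds
  SettleTimer.Enabled := False;
  WebBrowser1.Navigate(
    'http://www.mfinante.ro/infocodfiscal.html?captcha=null&cod=18505138');
end;

procedure TForm1.WebBrowser1DocumentComplete(ASender: TObject;
  const pDisp: IDispatch; const URL: OleVariant);
begin
  // Restart the countdown on every event, so that only the last
  // OnDocumentComplete lets the timer run to completion.
  SettleTimer.Enabled := False;
  SettleTimer.Enabled := True;
end;

procedure TForm1.SettleTimerTimer(Sender: TObject);
var
  Charset: String;
  Html: String;
begin
  SettleTimer.Enabled := False;  // fire once per quiet period
  if WebBrowser1.Document = nil then
    Exit;
  Html := GetHTMLDocumentSource(WebBrowser1, Charset);
  // Html now holds the DOM as it stood when the timer fired.
end;
```

If the page takes longer than the interval between its two loads, the timer will fire too early, so the interval may need tuning for this site.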

Not sure of your needs, but you may also be able to speed this up by preventing the WebBrowser from downloading images. I believe that is possible.

Upvotes: 2

David Heffernan

Reputation: 612784

This is normal: you have indeed retrieved the content correctly. What happens in your browser is that the script is executed and the page gets built client side. If you wish to replicate that in your code, you will need to do the same: execute the script exactly as the browser would.

What you are really looking for here is what is known as a headless browser. Integrate one of those into your program, then get it to process the request, including executing scripts. When it has finished executing the scripts, read the modified content of the page.

Upvotes: 3
