user1009073

Reputation: 3238

Reading HTML content from Clipboard in Delphi

I have a webpage with various tables on it. These tables are JavaScript components, not pure HTML tables. I need to process the text of this webpage (somewhat like screen scraping) with a Delphi program (Delphi 10.3).

I do a Ctrl-A/Ctrl-C to select all the webpage and copy everything to the clipboard. If I paste this into a TMemo component in my program, I am only getting text outside the table. If I paste into MS Word, I can see all the content, including the text inside the table.

I can paste this properly into TAdvRichEditor (3rd party), but it takes forever, and I often run out of memory. This leads me to believe that I need to directly read the clipboard with an HTML clipboard format.

I set up a clipboard HTML format. When I inspect the clipboard contents, I get what looks like all Kanji characters.

How do I read the contents of the clipboard when the contents are HTML?

In a perfect world, I would like ONLY the text, not the HTML itself, but I can strip that out later. Here is what I am doing now...

On initialization (where CF_HTML is a global variable):

CF_HTML := RegisterClipboardFormat('HTML Format');

Then my routine is:

function TMain.ClipboardAsHTML: String;
var
  Data: THandle;
  Ptr: PChar;
begin
  Result := '';
  with Clipboard do
  begin
    Open;
    try
      Data := GetAsHandle(CF_HTML);
      if Data <> 0 then
      begin
        Ptr := PChar(GlobalLock(Data));
        if Ptr <> nil then
        try
          Result := Ptr;
        finally
          GlobalUnlock(Data);
        end;
      end;
    finally
      Close;
    end;
  end;
end;

Additional info: when I copy from the webpage, I can inspect the contents of the clipboard using a free tool called InsideClipBoard. It shows that the clipboard contains 1 entry with 5 formats: CF_TEXT, CF_OEMTEXT, CF_UNICODETEXT, CF_LOCALE, and 'HTML Format' (with a format ID of 49409). Only 'HTML Format' contains what I am looking for, and that is what I am trying to access with the code shown above.

Upvotes: 3

Views: 1645

Answers (1)

David Heffernan

Reputation: 612954

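The HTML clipboard format is documented by Microsoft. It is placed on the clipboard as UTF-8 encoded text, so it has to be read as bytes and then decoded. Your code casts the locked pointer to PChar, which is PWideChar in Delphi 10.3, so each pair of UTF-8 bytes is interpreted as one UTF-16 character. That is why the result looks like Kanji. You can extract the data like this.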

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  Winapi.Windows,
  Vcl.Clipbrd;

procedure Main;
var
  CF_HTML: Word;
  Data: THandle;
  Ptr: Pointer;
  Error: DWORD;
  Size: NativeUInt;
  utf8: UTF8String;
  Html: string;
begin
  CF_HTML := RegisterClipboardFormat('HTML Format');

  Clipboard.Open;
  try
    Data := Clipboard.GetAsHandle(CF_HTML);
    if Data=0 then begin
      Writeln('HTML data not found on clipboard');
      Exit;
    end;

    Ptr := GlobalLock(Data);
    if not Assigned(Ptr) then begin
      Error := GetLastError;
      Writeln('GlobalLock failed: ' + SysErrorMessage(Error));
      Exit;
    end;
    try
      Size := GlobalSize(Data);
      if Size=0 then begin
        Error := GetLastError;
        Writeln('GlobalSize failed: ' + SysErrorMessage(Error));
        Exit;
      end;

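      // The buffer holds UTF-8 bytes, so copy it as AnsiChar data, not PChar.
      // Size - 1 drops the trailing null terminator. GlobalSize can report
      // more than was allocated, so extra trailing nulls are possible.
      SetString(utf8, PAnsiChar(Ptr), Size - 1);
      // Decode the UTF-8 bytes into a native (UTF-16) Delphi string.
      Html := string(utf8);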
      Writeln(Html);
    finally
      GlobalUnlock(Data);
    end;
  finally
    Clipboard.Close;
  end;
end;

begin
  try
    Main;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.
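
The text you get back starts with a short header (Version, StartHTML, EndHTML, StartFragment, EndFragment) whose values are byte offsets into the UTF-8 data, followed by the HTML itself. If you only want the part that was actually selected, you can cut it out using StartFragment and EndFragment before decoding. The following is a minimal sketch, not a definitive implementation: HeaderOffset and ExtractHtmlFragment are hypothetical helper names, and the sample payload in the main block is made up for illustration and assumes plain ASCII content.

```delphi
program ExtractFragment;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: reads a numeric header field such as
// 'StartFragment:0000000137'. Returns -1 if the field is missing.
function HeaderOffset(const Payload, Key: UTF8String): Integer;
var
  Needle: UTF8String;
  P, Q: Integer;
begin
  Result := -1;
  Needle := Key + ':';
  P := Pos(Needle, Payload);
  if P = 0 then
    Exit;
  Inc(P, Length(Needle));
  Q := P;
  // Collect the run of decimal digits that follows the colon.
  while (Q <= Length(Payload)) and (Payload[Q] in ['0'..'9']) do
    Inc(Q);
  if Q > P then
    Result := StrToIntDef(string(Copy(Payload, P, Q - P)), -1);
end;

// Hypothetical helper: returns the selected fragment as a Delphi string,
// or '' if the header offsets are missing or inconsistent.
function ExtractHtmlFragment(const Payload: UTF8String): string;
var
  StartFrag, EndFrag: Integer;
begin
  Result := '';
  StartFrag := HeaderOffset(Payload, 'StartFragment');
  EndFrag := HeaderOffset(Payload, 'EndFragment');
  if (StartFrag < 0) or (EndFrag < StartFrag) or (EndFrag > Length(Payload)) then
    Exit;
  // Header offsets are zero-based byte positions; Delphi strings are one-based.
  // Slice the raw bytes first, then decode UTF-8.
  Result := UTF8ToString(Copy(Payload, StartFrag + 1, EndFrag - StartFrag));
end;

const
  // Made-up sample data, for illustration only.
  HeaderFmt = 'Version:0.9'#13#10'StartHTML:%.10d'#13#10'EndHTML:%.10d'#13#10 +
              'StartFragment:%.10d'#13#10'EndFragment:%.10d'#13#10;
  Prefix = '<html><body><!--StartFragment-->';
  Body = '<b>Hello</b>';
  Suffix = '<!--EndFragment--></body></html>';
var
  HeaderLen, StartFrag, EndFrag: Integer;
  Payload: UTF8String;
begin
  // The header has fixed-width numbers, so its length does not depend on them.
  HeaderLen := Length(Format(HeaderFmt, [0, 0, 0, 0]));
  StartFrag := HeaderLen + Length(Prefix);
  EndFrag := StartFrag + Length(Body);
  Payload := UTF8String(
    Format(HeaderFmt, [HeaderLen, HeaderLen + Length(Prefix + Body + Suffix),
      StartFrag, EndFrag]) + Prefix + Body + Suffix);
  Writeln(ExtractHtmlFragment(Payload)); // prints <b>Hello</b>
end.
```

In the clipboard code above you would pass the utf8 variable to ExtractHtmlFragment instead of converting the whole buffer. The slicing is done on the UTF8String on purpose: the offsets count bytes, so they would point at the wrong place in a decoded string as soon as the page contains non-ASCII characters.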

Upvotes: 11
