Arne Janning

Reputation: 21

Getting the HTML source from a WPF-WebBrowser-Control using IPersistStreamInit

I am trying to get the HTML source of a webpage that has been loaded into a WPF WebBrowser control. The only way to do this seems to be to cast WebBrowser.Document to IPersistStreamInit (which I have to define myself, as it is a COM interface) and call IPersistStreamInit.Save, passing an implementation of IStream (again, a COM interface) to which the document is persisted. Well, sort of: I always get only the first 4 kilobytes of the stream, not the entire document, and I don't know why.

Here's the code of IPersistStreamInit:

using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;
using System.Security;

namespace PayPal.SkyNet.BpiTool.Interop
{
    [ComImport, InterfaceType(ComInterfaceType.InterfaceIsIUnknown), 
        SuppressUnmanagedCodeSecurity, 
        Guid("7FD52380-4E07-101B-AE2D-08002B2EC713")]
    public interface IPersistStreamInit
    {
        void GetClassID(out Guid pClassID);
        [PreserveSig]
        int IsDirty();
        void Load([In, MarshalAs(UnmanagedType.Interface)] IStream pstm);
        void Save([In, MarshalAs(UnmanagedType.Interface)] IStream pstm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
        // pcbSize is a pointer to a single 64-bit value, so it is marshaled as an out parameter, not as an array.
        void GetSizeMax(out long pcbSize);
        void InitNew();
    }
}

Here's the code of the IStream implementation:

using System;
using System.IO;
using System.Runtime.InteropServices.ComTypes;

namespace PayPal.SkyNet.BpiTool.Interop
{
    public class ComStream : IStream
    {
        private Stream _stream;

        public ComStream(Stream stream)
        {
            this._stream = stream;
        }

        public void Commit(int grfCommitFlags)
        {
        }

        public void CopyTo(IStream pstm, long cb, IntPtr pcbRead, IntPtr pcbWritten)
        {
        }

        public void LockRegion(long libOffset, long cb, int dwLockType)
        {
        }

        public void Read(byte[] pv, int cb, IntPtr pcbRead)
        {
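            // The second argument is the offset into pv, not the stream position.
            int bytesRead = this._stream.Read(pv, 0, cb);
            // pcbRead is optional; report the count only when the caller supplied a pointer.
            if (pcbRead != IntPtr.Zero)
                System.Runtime.InteropServices.Marshal.WriteInt32(pcbRead, bytesRead);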
        }

        public void Revert()
        {
        }

        public void SetSize(long libNewSize)
        {
            this._stream.SetLength(libNewSize);
        }

        public void Stat(out System.Runtime.InteropServices.ComTypes.STATSTG pstatstg, int grfStatFlag)
        {
            pstatstg = new System.Runtime.InteropServices.ComTypes.STATSTG();
        }

        public void UnlockRegion(long libOffset, long cb, int dwLockType)
        {
        }

        public void Write(byte[] pv, int cb, IntPtr pcbWritten)
        {
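            this._stream.Write(pv, 0, cb);
            // pcbWritten is optional; report the count only when the caller supplied a pointer.
            if (pcbWritten != IntPtr.Zero)
                System.Runtime.InteropServices.Marshal.WriteInt32(pcbWritten, cb);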
        }

        public void Clone(out IStream outputStream)
        {
            outputStream = null;
        }

        public void Seek(long dlibMove, int dwOrigin, IntPtr plibNewPosition)
        {
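            long position = this._stream.Seek(dlibMove, (SeekOrigin)dwOrigin);
            // plibNewPosition is optional; report the new position only when requested.
            if (plibNewPosition != IntPtr.Zero)
                System.Runtime.InteropServices.Marshal.WriteInt64(plibNewPosition, position);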
        }
    }
}

Now I have a class to wrap it all up. As I don't want to redistribute the mshtml interop assembly, I chose late binding, and as late binding is easier in VB, I wrote the wrapper in VB. Here's the code:

Option Strict Off
Option Explicit Off

Imports System.IO

Public Class HtmlDocumentWrapper : Implements IDisposable

    Private htmlDoc As Object

    Public Sub New(ByVal htmlDoc As Object)
        Me.htmlDoc = htmlDoc
    End Sub

    Public Property Document As Object
        Get
            Return Me.htmlDoc
        End Get
        Set(value As Object)
            Me.htmlDoc = Nothing
            Me.htmlDoc = value
        End Set
    End Property

    Public ReadOnly Property DocumentStream As Stream
        Get
            Dim str As Stream = Nothing
            Dim psi As IPersistStreamInit = CType(Me.htmlDoc, IPersistStreamInit)
            If psi IsNot Nothing Then
                str = New MemoryStream
                Dim cStream As New ComStream(str)
                psi.Save(cStream, False)
                str.Position = 0
            End If
            Return str
        End Get
    End Property
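
    ' Required by Implements IDisposable; releases the reference to the document.
    Public Sub Dispose() Implements IDisposable.Dispose
        Me.htmlDoc = Nothing
    End Sub
End Class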

Now I should be able to use all this:

private void Browser_Navigated(object sender, NavigationEventArgs e)
{
    // The wrapper only defines a constructor that takes the document.
    HtmlDocumentWrapper doc = new HtmlDocumentWrapper(Browser.Document);
    using (StreamReader sr = new StreamReader(doc.DocumentStream))
    {
        using (StreamWriter sw = new StreamWriter("test.txt"))
        {
            //BOOM! Only 4kb of HTML source
            sw.WriteLine(sr.ReadToEnd());
            sw.Flush();
        }
    }
}

Does anybody know why I don't get the entire HTML source? Any help is greatly appreciated.

Regards

Arne

Upvotes: 2

Views: 3589

Answers (2)

Jesper Larsen-Ledet

Reputation: 6733

Move your code from Browser.Navigated to Browser.LoadCompleted, as Sheng Jiang correctly notes above, and it works.
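
A minimal sketch of that change, assuming the HtmlDocumentWrapper class from the question is available to the C# project. HtmlSourceDumper, Attach and the path parameter are hypothetical names used only for illustration. WebBrowser.Navigated is raised once the document has been located and has started downloading, while LoadCompleted is raised after it has finished downloading, which is probably why the earlier event only sees part of the page.

```csharp
using System.IO;
using System.Windows.Controls;

// Hypothetical helper, not part of the original post.
public static class HtmlSourceDumper
{
    // Call once after the WebBrowser has been created.
    public static void Attach(WebBrowser browser, string path)
    {
        // LoadCompleted fires after the document has finished downloading.
        browser.LoadCompleted += (sender, e) =>
        {
            // HtmlDocumentWrapper is the VB class from the question.
            var doc = new HtmlDocumentWrapper(browser.Document);
            Stream source = doc.DocumentStream;
            if (source == null)
            {
                return; // the document could not be persisted to a stream
            }

            using (var reader = new StreamReader(source))
            {
                File.WriteAllText(path, reader.ReadToEnd());
            }
        };
    }
}
```

With this in place, a call such as HtmlSourceDumper.Attach(Browser, "test.txt") would replace the Browser_Navigated handler from the question.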

Upvotes: 2

aquaherd

Reputation: 444

This is just a guess:

The stream does not have a known length, since it may still be downloading. You'll need to read it until it says EOF.
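
A small sketch of what reading until EOF could look like for a System.IO.Stream. StreamUtil and ReadToEnd are hypothetical names, not part of the original code. Stream.Read returns 0 at the end of the stream, so the loop does not rely on Length. This only helps if the data has actually been written to the stream by the time it is read.

```csharp
using System.IO;

// Hypothetical helper, not part of the original post.
public static class StreamUtil
{
    // Copies everything from input into a byte array without using Length.
    public static byte[] ReadToEnd(Stream input)
    {
        using (var buffer = new MemoryStream())
        {
            var chunk = new byte[4096];
            int read;
            // Read returns 0 once the end of the stream has been reached.
            while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
            {
                buffer.Write(chunk, 0, read);
            }
            return buffer.ToArray();
        }
    }
}
```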

Upvotes: 0
