Reputation: 897

How to get the HTML code of the page opened in a WebBrowser in the correct encoding?

I am trying to get the HTML code of the page currently open in a WebBrowser control.

public string GetHTMLCodPage()
{
    string htmlCodPage;
    htmlCodPage = webBrowser1.DocumentText;

    return htmlCodPage;
}

The text I get back has its Cyrillic characters garbled (excerpt):

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="ru">
<head>
    <title>���������, ����������� �����, ����������� �����, ����������� ������ - C# - ����������</title>
    <link rel="canonical" href="http://www.cyberforum.ru/csharp-beginners/thread2385183.html" />

    <base href="http://www.cyberforum.ru/" />
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />

<meta name="keywords" content="C#, ���������, ����������� �����, ����������� �����, ����������� ������" />
<meta name="description" content="������: ���������, ����������� �����, ����������� �����, ����������� ������ C# �����" />

Question
How can I get the HTML code of the page opened in the WebBrowser in the correct encoding?

Upvotes: 1

Answers (1)

Jimi

Reputation: 32223

The (let's call it) standard way is to read WebBrowser.DocumentStream instead of the already transcoded DocumentText.

Then read the stream using the page's own encoding (the charset from its Content-Type), which is provided by the WebBrowser.Document.Encoding property.
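To see why the choice of decoder matters, here is a minimal console sketch that does not involve the WebBrowser at all. The sample word, the class name and the choice of UTF-8 as the "wrong" decoder are assumptions made for illustration; the point is only that the same bytes give different text depending on the encoding passed to the reader.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+ this encoding is only available after
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        // on .NET Framework it is available out of the box.
        Encoding cp1251 = Encoding.GetEncoding("windows-1251");

        // Hypothetical sample: the bytes a windows-1251 page would contain for this word.
        byte[] rawBytes = cp1251.GetBytes("форум");

        // Wrong decoder: these bytes are not valid UTF-8, so the text comes out garbled.
        string garbled = Encoding.UTF8.GetString(rawBytes);

        // Right decoder: a StreamReader that is told explicitly which encoding to use.
        string decoded;
        using (var reader = new StreamReader(new MemoryStream(rawBytes), cp1251))
        {
            decoded = reader.ReadToEnd();
        }

        Console.WriteLine(garbled == "форум");  // False
        Console.WriteLine(decoded == "форум");  // True
    }
}
```

The DocumentCompleted handler further down applies the same decoding step, with DocumentStream as the source of the bytes.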

Read the stream only once the WebBrowser.Document has loaded completely: subscribe to the WebBrowser.DocumentCompleted event and wait until webBrowser1.ReadyState == WebBrowserReadyState.Complete.

In the sample code, the decoded text is sent to a TextBox control.
That is just an example; do whatever you want with it. Be aware, though, that the DocumentCompleted event may be raised multiple times.

//Somewhere...
webBrowser1.Navigate("http://www.cyberforum.ru/");


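// Requires: using System.IO; using System.Text;
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The event can fire more than once; proceed only when the whole document is ready.
    if (webBrowser1.ReadyState != WebBrowserReadyState.Complete) return;

    string decodedText = string.Empty;
    var htmlStream = webBrowser1.DocumentStream;
    // Document.Encoding holds the name of the page's charset, e.g. "windows-1251".
    var pageEncoding = Encoding.GetEncoding(webBrowser1.Document.Encoding);

    // Decode the stream with the page's own encoding.
    using (StreamReader destReader = new StreamReader(htmlStream, pageEncoding)) {
        decodedText = destReader.ReadToEnd();
    }
    textBox1.Text = decodedText;
}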

Now the text of the page content is decoded with the correct encoding:

<meta name="keywords" content="форум программистов, компьютерный форум, киберфорум,(...)" />
<meta name="description" content="КиберФорум - форум программистов (...)" />
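If you prefer a method shaped like the GetHTMLCodPage() from the question, the same steps can be wrapped in a helper. The sketch below is one possible arrangement, not the only one: the names BrowserForm, OnDocumentCompleted and GetDecodedHtml are hypothetical, and the e.Url comparison is an assumed (but common) way to skip the extra DocumentCompleted events that frames raise.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Windows.Forms;

public class BrowserForm : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser { Dock = DockStyle.Fill };

    public BrowserForm()
    {
        Controls.Add(webBrowser1);
        webBrowser1.DocumentCompleted += OnDocumentCompleted;
        webBrowser1.Navigate("http://www.cyberforum.ru/");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Assumption: events raised for frames carry the frame's URL, so
        // comparing against the browser's URL keeps only the top-level page.
        if (e.Url != webBrowser1.Url) return;
        if (webBrowser1.ReadyState != WebBrowserReadyState.Complete) return;

        string html = GetDecodedHtml(webBrowser1);
        Debug.WriteLine("Decoded " + html.Length + " characters.");
    }

    // Hypothetical helper: reads DocumentStream with the page's own encoding.
    public static string GetDecodedHtml(WebBrowser browser)
    {
        if (browser.Document == null) return string.Empty;

        Stream htmlStream = browser.DocumentStream;
        if (htmlStream == null) return string.Empty;

        // Document.Encoding is the charset name, e.g. "windows-1251".
        Encoding pageEncoding = Encoding.GetEncoding(browser.Document.Encoding);
        using (var reader = new StreamReader(htmlStream, pageEncoding))
        {
            return reader.ReadToEnd();
        }
    }

    [STAThread]
    private static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new BrowserForm());
    }
}
```

As with the handler above, call the helper only after the document has finished loading; calling it earlier may return an empty or partial page.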

Upvotes: 2
