iCantSeeSharp

Reputation: 3880

WebBrowser DocumentText encoding

I have come across something strange and I'd like your opinion.

There is a web page containing a span element whose InnerText and InnerHtml properties hold some Greek text.

The encoding of the page is Greek (Windows).

My if statement is:

if (mySpan != null && mySpan.InnerText.Contains(greekText))

This line works reliably, but my earlier code, which did not work, was:

if (mySpan != null && browser.DocumentText.Contains(greekText))

This line did not work, and when I opened the preview within the debugger I noticed that the Greek text was unreadable (strange symbols instead of Greek characters). However, all of the other elements that contained Greek text were read successfully by the application; that is, I could save their attributes in variables and use them. Is there any explanation for why DocumentText failed and InnerText succeeded?

Upvotes: 2


Answers (1)

wal

Reputation: 17719

Looking at the source for WebBrowser.DocumentText, it appears to use UTF-8 encoding by default:

public string DocumentText
{
  get
  {
    Stream documentStream = this.DocumentStream;
    if (documentStream == null)
      return "";
    StreamReader streamReader = new StreamReader(documentStream);
    documentStream.Position = 0L;
    return streamReader.ReadToEnd();
  }
}

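That is, a StreamReader created without an explicit encoding assumes UTF-8 (unless it detects a byte order mark), so the bytes of a page encoded as Greek (Windows) are decoded incorrectly.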

See this link for getting around this issue
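One way to work around this, sketched here under a few assumptions, is to read the document stream yourself and pass the page's encoding to the StreamReader. DocumentTextDecoder and ReadAll are hypothetical names, not part of WinForms. The commented usage assumes that browser.Document.Encoding returns a name Encoding.GetEncoding can resolve (probably windows-1253 for a Greek (Windows) page) and that this code page is available on your runtime.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper (not part of WinForms): reads a stream as text with an
// explicit encoding instead of relying on StreamReader's UTF-8 default.
static class DocumentTextDecoder
{
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        if (stream == null)
            return "";

        // Rewind, as the DocumentText getter does (requires a seekable stream).
        stream.Position = 0L;

        // Pass 'false' so byte-order-mark detection never overrides the
        // encoding supplied by the caller.
        using (var reader = new StreamReader(stream, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }
}

// Assumed usage with the WebBrowser control from the question:
//   Encoding pageEncoding = Encoding.GetEncoding(browser.Document.Encoding);
//   string html = DocumentTextDecoder.ReadAll(browser.DocumentStream, pageEncoding);
//   if (mySpan != null && html.Contains(greekText)) { ... }
```

The reader is disposed here, which also closes the stream passed in, so request browser.DocumentStream again if the stream is needed afterwards.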

I can only assume that browser.Document.GetElementById(mySpanId) respects the stated encoding of the page, which is why the text reads correctly when you use this call.

Upvotes: 2
