Reputation: 181
I have a problem with a WebRequest in C#. The page I am requesting is a Google page.
The header states
text/html; charset=ISO-8859-1
The website states
<meta http-equiv=content-type content="text/html; charset=utf-8">
Yet I only get the expected result, both in the debugger and in my regular expression, when I use Encoding.Default,
which resolves to System.Text.SBCSCodePageEncoding.
What do I do now? Do you have any hints as to how this could happen or how I could solve it?
The actual encoding of the page seems to be UTF-8. At least Firefox displays it correctly as UTF-8, but not as Windows-whatever or Latin-1.
The URL is this
The problem is the € sign as well as all German umlauts.
Thanks in advance for your help with this problem, which is driving me seriously crazy!
Update: When I output the string via
// create a writer and open the file
TextWriter tw = new StreamWriter("test.txt");
// write a line of text to the file
tw.WriteLine(html);
// close the stream
tw.Close();
it all works fine.
So it seems the problem is that the debugger does not show the text in the correct encoding, and neither does the regular expression.
How do I tell C# to handle the RegEx as UTF-8?
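For illustration, here is a minimal sketch of the mismatch described above. It assumes that the body really is UTF-8 while the header claims ISO-8859-1. The sample text, the pattern, and the class name EncodingSketch are made up for this example and are not taken from the actual page. A .NET Regex only ever sees an already decoded string, so the choice of encoding matters at the point where the bytes are turned into a string, not in the regular expression itself.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class EncodingSketch
{
    static void Main()
    {
        // Hypothetical sample standing in for the downloaded page body,
        // encoded as UTF-8 (the encoding the page actually seems to use).
        byte[] raw = Encoding.UTF8.GetBytes("Preis: 12,50 \u20AC f\u00FCr \u00C4pfel");

        // Decode with the charset named in the HTTP header.
        string fromHeader = Encoding.GetEncoding("iso-8859-1").GetString(raw);

        // Decode with the charset named in the meta tag.
        string fromMeta = Encoding.UTF8.GetString(raw);

        // The regex is matched against decoded strings; \u20AC is the euro sign.
        Regex price = new Regex(@"\d+,\d+ \u20AC");

        Console.WriteLine("ISO-8859-1 decode matches: " + price.IsMatch(fromHeader));
        Console.WriteLine("UTF-8 decode matches:      " + price.IsMatch(fromMeta));
    }
}
```

With a real response, the equivalent step would presumably be to pass the encoding explicitly when reading the stream, for example new StreamReader(response.GetResponseStream(), Encoding.UTF8), instead of relying on the charset from the header.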
Upvotes: 2
Views: 776
Reputation: 21898
Rather than parsing HTML, why not use the Google Query API?
BTW, before parsing HTML using regexes, read this ;-)
EDIT: In answer to your comment:
Upvotes: 1