Reputation: 2378
I wrote a program that crawls a website, extracts data, and outputs it to an Excel sheet. The program is written in C# using Microsoft Visual Studio 2010.
Most of the time I have no problem getting content from the website, parsing it, and storing the data in Excel.
However, once in a while I run into an issue saying that there are illegal characters (such as ▶) that prevent writing to the Excel file, which crashes the program.
I also went to the website manually and found other illegal characters, such as Ú.
I tried using .Replace(), but the code can't seem to find those characters.
string htmlContent = getResponse(url); //get full html from given url
string newHtml = htmlContent.Replace("▶", "?").Replace("Ú", "?");
So my question is: is there a way to strip all characters of this kind out of an HTML string (the HTML of the web page)? Below is the error message I got.
I tried Anthony's and woz's solutions, and neither of them worked...
Upvotes: 1
Views: 3458
Reputation: 2378
Thank you for the replies and for the help.
After a couple more hours of googling I found the solution to my question. The problem was that I had to "sanitize" my HTML string.
http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
Above is the helpful article I found, which also provides a code example.
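For anyone who can't reach the link, here is a rough sketch of what "sanitizing" can look like. It is not the article's code; it assumes the crash comes from characters that are not allowed in XML 1.0 (the article's title points at an "invalid character" hexadecimal value error), and the class and method names (XmlTextSanitizer, IsLegalXmlCodePoint, Sanitize) are made up for illustration.

```csharp
using System;
using System.Text;

public static class XmlTextSanitizer
{
    // True if the code point is allowed by the XML 1.0 "Char" production.
    public static bool IsLegalXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point removed.
    public static string Sanitize(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var buffer = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsSurrogatePair(input, i))
            {
                // A valid surrogate pair encodes a code point >= 0x10000, which is legal.
                buffer.Append(input[i]).Append(input[i + 1]);
                i++; // skip the low surrogate
            }
            else if (IsLegalXmlCodePoint(input[i]))
            {
                buffer.Append(input[i]);
            }
            // Anything else (control characters, lone surrogates, U+FFFE/U+FFFF) is dropped.
        }
        return buffer.ToString();
    }
}
```

Usage would be something like string newHtml = XmlTextSanitizer.Sanitize(htmlContent); before writing to the sheet. Note that under this check ▶ (U+25B6) and Ú (U+00DA) are legal and are kept; only characters such as U+0000 are removed, which may be why replacing the visible characters did not help.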
Upvotes: 1
Reputation: 9571
See System.Text.Encoding.Convert
Example usage:
var htmlText = "▶Hello World"; // replace with the text you're trying to convert
// Round-trip through ASCII: characters with no ASCII equivalent become '?'.
var convertedText = System.Text.Encoding.ASCII.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.Unicode,
        System.Text.Encoding.ASCII,
        System.Text.Encoding.Unicode.GetBytes(htmlText)));
I tested this with the string ▶Hello World and it gave me ?Hello World.
Upvotes: 2
Reputation: 10994
You could try stripping all non-ASCII characters.
string htmlContent = getResponse(url);
string newHtml = Regex.Replace(htmlContent, @"[^\u0000-\u007F]", "?");
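One caveat, offered as a guess rather than a diagnosis: the range \u0000-\u007F still lets ASCII control characters such as \u0000 through, and those can be rejected by XML-based output. If that turns out to be the case, a stricter variant could keep only tab, LF, CR, and printable ASCII. The names AsciiFilter and ToPrintableAscii below are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AsciiFilter
{
    // Matches anything that is not tab, LF, CR, or printable ASCII (space through '~').
    private static readonly Regex Unwanted =
        new Regex(@"[^\u0009\u000A\u000D\u0020-\u007E]");

    // Replaces every unwanted UTF-16 code unit with '?'.
    public static string ToPrintableAscii(string input)
    {
        return Unwanted.Replace(input, "?");
    }
}
```

With this, string newHtml = AsciiFilter.ToPrintableAscii(htmlContent); would replace both ▶ and an embedded \u0000 with ?. The trade-off is that legitimate accented characters like Ú are lost as well.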
Upvotes: 1