Gulbahar

Reputation: 5537

Converting UTF8 to ANSI?

I'd like to download a webpage using .Net's WebClient class, extract the title (i.e. what's between <title> and </title>) and save the page to a file.

The problem is that the page is encoded in UTF-8, and System.IO.StreamWriter throws an exception when the filename (which I build from the title) contains such characters.

I've googled and tried several ways to convert UTF8 to ANSI, to no avail. Does someone have working code for this?

'Using WebClient asynchronous downloading
Private Sub AlertStringDownloaded(ByVal sender As Object, 
                                  ByVal e As DownloadStringCompletedEventArgs)
    If e.Cancelled = False AndAlso e.Error Is Nothing Then
        Dim Response As String = CStr(e.Result)

        'Doesn't work               
        Dim resbytes() As Byte = Encoding.UTF8.GetBytes(Response)
        Response = Encoding.Default.GetString(Encoding.Convert(Encoding.UTF8, 
                                              Encoding.Default, resbytes))

        Dim title As Regex = New Regex("<title>(.+?) \(", 
                                       RegexOptions.Singleline)
        Dim m As Match
        m = title.Match(Response)
        If m.Success Then
            Dim MyTitle As String = m.Groups(1).Value

            'Illegal characters in path.
            Dim objWriter As New System.IO.StreamWriter("c:\" & MyTitle & ".txt")
            objWriter.Write(Response)
            objWriter.Close()
        End If
    End If
End Sub

Edit: Thanks, everyone, for the help. It turns out the error was not due to UTF-8 but to a hidden LF character in the title section of the page, which is an illegal character in a path.


Edit: Here's a simple way to remove some of the illegal characters in a filename/path:

Dim MyTitle As String = m.Groups(1).Value
Dim InvalidChars As String = New String(Path.GetInvalidFileNameChars()) + New String(Path.GetInvalidPathChars())
For Each c As Char In InvalidChars
    MyTitle = MyTitle.Replace(c.ToString(), "")
Next

Edit: And here's how to tell WebClient to expect UTF-8:

Dim webClient As New WebClient
AddHandler webClient.DownloadStringCompleted, AddressOf AlertStringDownloaded
webClient.Encoding = Encoding.UTF8
'Uri needs an absolute URL including the scheme, otherwise it throws UriFormatException
webClient.DownloadStringAsync(New Uri("http://www.acme.com"))

Upvotes: 1

Views: 3003

Answers (1)

Steve

Reputation: 7271

I don't think the problem is related to UTF-8. I think your regex will include </title> if it appears on the same line. The characters < and > are invalid in a Windows filename.

If this is not the problem it would be helpful to see some sample input and output values of MyTitle.
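To rule this out, here is a minimal sketch that stops the match at </title> and strips anything the file system rejects before the title is used as a filename. The module and function names (TitleToFileName, ExtractSafeTitle) are made up for illustration, and matching up to </title> is my assumption about what you want; your original pattern stops at the first " (" instead. Note that the set returned by Path.GetInvalidFileNameChars depends on the platform; on Windows it includes control characters such as LF as well as < and >.

```vbnet
Imports System
Imports System.IO
Imports System.Text.RegularExpressions

Module TitleToFileName
    ' Hypothetical helper (not from the question): returns the text between
    ' <title> and </title> with invalid file name characters removed,
    ' or Nothing if the page has no title element.
    Function ExtractSafeTitle(ByVal html As String) As String
        ' Lazy match bounded by </title>, so it cannot run past the title.
        ' Singleline lets "." match a line break inside the title.
        Dim m As Match = Regex.Match(html, "<title>(.*?)</title>",
                                     RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing

        Dim title As String = m.Groups(1).Value
        ' Remove every character the current platform forbids in a file name.
        For Each c As Char In Path.GetInvalidFileNameChars()
            title = title.Replace(c.ToString(), "")
        Next
        Return title.Trim()
    End Function

    Sub Main()
        ' Sample input with a hidden LF and angle brackets inside the title.
        Dim page As String = "<html><title>Acme" & vbLf & " <News></title></html>"
        Console.WriteLine(ExtractSafeTitle(page))
    End Sub
End Module
```

The returned value can then be passed to Path.Combine together with the target folder instead of concatenating strings by hand. If the function returns Nothing or an empty string, fall back to a default filename.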

Upvotes: 1
