Reputation: 8424
Hello, I am creating a simple console application in VB.NET to convert a file from any encoding to UTF-8, but I can't figure out how encoding works. I know that the source file is in Unicode, but when I convert it to the new format I get junk. Any suggestions? I am not sure if my code is correct.
This is my code:
Imports System.IO
Imports System.Text
Module Module1
    Sub Main()
        Console.Write("Please give the filepath (example:c:/tesfile.csv):")
        Dim filepath As String = Console.ReadLine()
        Dim sEncoding As String = DetermineFileType(filepath)
        Dim strContents As String
        Dim strEncodedContents As String
        Dim objReader As StreamReader
        Dim ErrInfo As String
        Dim bString As Byte()
        Try
            'Read the file
            objReader = New StreamReader(filepath)
            'Read untill the end
            strContents = objReader.ReadToEnd()
            'Close The file
            objReader.Close()
            'Write Contents on DOS
            Console.WriteLine(strContents)
            Console.WriteLine("")
            bString = EncodeString(strContents, "UTF-8")
            strEncodedContents = System.Text.Encoding.UTF8.GetString(bString)
            Dim objWriter As New System.IO.StreamWriter(filepath.Replace(".csv", "_encoded.csv"))
            objWriter.WriteLine(strEncodedContents)
            objWriter.Close()
            Console.WriteLine("Encoding Finished")
        Catch Ex As Exception
            ErrInfo = Ex.Message
            Console.WriteLine(ErrInfo)
        End Try
        Console.ReadKey()
    End Sub
    Public Function DetermineFileType(ByVal aFileName As String) As String
        Dim sEncoding As String = String.Empty
        Dim oSR As New StreamReader(aFileName, True)
        oSR.ReadToEnd()
        ' Add this line to read the file.
        sEncoding = oSR.CurrentEncoding.EncodingName
        Return sEncoding
    End Function
    Function EncodeString(ByRef SourceData As String, ByRef CharSet As String) As Byte()
        'get a byte pointer To the source data
        Dim bSourceData As Byte() = System.Text.Encoding.Unicode.GetBytes(SourceData)
        'get destination encoding
        Dim OutEncoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(CharSet)
        'Encode the data To destination code page/charset
        Return System.Text.Encoding.Convert(OutEncoding, System.Text.Encoding.UTF8, bSourceData)
    End Function
End Module
Upvotes: 0
Views: 10862
Reputation: 12028
StreamReader has a constructor that takes an Encoding. If you know the encoding of the file, you should pass it to the StreamReader constructor:
objReader = New StreamReader(filepath, Encoding.UTF32)
You say in a comment that the file is encoded as UCS-2. From Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
In that case you can try decoding it as UTF-16, which is called Unicode within System.Text.Encoding, so try:
objReader = New StreamReader(filepath, Encoding.Unicode)
FYI, Unicode is a standard that has a variety of encodings. For Microsoft to call UTF-16 "Unicode" is a little misleading but not inaccurate; UTF-16 is one possible encoding of Unicode.
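Putting that together, the whole conversion can be reduced to "decode with the source encoding, then write with UTF-8". The following is only a minimal sketch, assuming the source file really is UTF-16 little-endian; the helper name ConvertToUtf8 and the temporary sample file are hypothetical and exist only for illustration:

```vbnet
Imports System.IO
Imports System.Text

Module Utf16ToUtf8Sketch
    ' Hypothetical helper (not from the question): decode sourcePath with
    ' sourceEncoding, then re-save the text as UTF-8 without a byte order mark.
    Sub ConvertToUtf8(sourcePath As String, targetPath As String, sourceEncoding As Encoding)
        Dim contents As String
        ' Decode the bytes on disk using the encoding the file was written in.
        Using reader As New StreamReader(sourcePath, sourceEncoding)
            contents = reader.ReadToEnd()
        End Using
        ' A .NET String is already decoded text, so no manual byte
        ' conversion is needed; the writer encodes it on the way out.
        Using writer As New StreamWriter(targetPath, False, New UTF8Encoding(False))
            writer.Write(contents)
        End Using
    End Sub

    Sub Main()
        ' Assumed sample data: a small UTF-16 file created in the temp folder.
        Dim source As String = Path.GetTempFileName()
        Dim target As String = source & "_encoded.csv"
        File.WriteAllText(source, "h" & ChrW(&HE9) & "llo", Encoding.Unicode)

        ConvertToUtf8(source, target, Encoding.Unicode)
        Console.WriteLine("UTF-8 bytes written: " & File.ReadAllBytes(target).Length)
    End Sub
End Module
```

Passing New UTF8Encoding(True) instead would add a UTF-8 byte order mark, which some CSV consumers expect; which one is right depends on whatever reads the output.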
Upvotes: 1
Reputation: 941377
StreamReader already assumes UTF-8 encoding if you don't specify one in the constructor call, so re-encoding to UTF-8 cannot solve your problem. Use the StreamReader(String, Encoding) overload and specify the encoding that was used when the file was created. If you have no clue what it might be, then Encoding.Default is usually the best guess. Talk to the programmer who wrote the code for the .csv file creator to be sure. When you get it right, you don't need this code anymore either.
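If the encoding is only a guess, one option is to let StreamReader honor a byte order mark when the file has one and fall back to the guess otherwise. This is a rough sketch rather than a definitive fix; ReadWithFallback and the sample file are hypothetical names, and note that Encoding.Default is the ANSI code page on .NET Framework but UTF-8 on .NET Core and later:

```vbnet
Imports System.IO
Imports System.Text

Module DetectAndConvertSketch
    ' Hypothetical helper: use the byte order mark if the file has one,
    ' otherwise decode with the supplied fallback encoding.
    Function ReadWithFallback(filePath As String, fallback As Encoding) As String
        ' The third argument enables detection from byte order marks.
        Using reader As New StreamReader(filePath, fallback, True)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' Assumed sample data: File.WriteAllText with Encoding.Unicode
        ' writes a UTF-16 byte order mark at the start of the file.
        Dim source As String = Path.GetTempFileName()
        Dim target As String = source & "_encoded.csv"
        File.WriteAllText(source, "a;b;c", Encoding.Unicode)

        ' The BOM wins over the fallback, so the text decodes correctly.
        Dim contents As String = ReadWithFallback(source, Encoding.Default)

        ' Re-save as UTF-8; the original file is left untouched.
        File.WriteAllText(target, contents, New UTF8Encoding(False))
        Console.WriteLine(contents)
    End Sub
End Module
```

A file without a byte order mark is decoded purely with the fallback, so a wrong guess still produces junk; the only reliable answer is knowing what the producing program wrote.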
Upvotes: 1