Reputation: 4036
I have a process that reads data in the Windows-1252 code page (input from a SQL Server 2008 varchar field). I then write this data to a flat text file, which is picked up by an IBM mainframe system that uses EBCDIC code page 37. That system converts the file to its own character set. However, some characters in the extended range (char codes 128 - 255) are not converted cleanly by the mainframe. I think this is because certain characters in the Windows character set do not exist in the EBCDIC character set.
Is there a general way to determine which characters I need to filter out? A few examples are the left single quote, right single quote, left double quote, right double quote, bullet, en dash, and em dash (Windows-1252 codes 145 - 151, respectively). If so, is there an algorithm I can use to determine the closest EBCDIC equivalent (such as a plain single quote for either a left or a right single quote)?
Upvotes: 0
Views: 672
Reputation: 4036
I was looking for a general way to solve this problem instead of focusing on just EBCDIC 37, and I didn't want to visually compare two charts of codes. I wrote a short program (in VB.NET) to find all of the characters that exist in one codepage and not the other.
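' Requires: Imports System.Text, System.Linq, System.Collections.Generic
' Pick source and target codepages.
' Request Windows-1252 explicitly: Encoding.Default depends on the system's ANSI code page.
Dim sourceEncoding As Encoding = Encoding.GetEncoding(1252)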
Dim targetEncoding As Encoding = Encoding.GetEncoding("IBM037")
' Get every character in the codepage.
' VB.NET array bounds are inclusive, so (255) declares 256 elements (0 to 255).
Dim inbytes(255) As Byte
For code As Integer = 0 To 255
    inbytes(code) = Convert.ToByte(code)
Next
' Convert the bytes from the source encoding to the target, then back again.
' Those bytes that convert back to the original value exist in both codepages.
' The bytes that change do not exist in the target encoding.
Dim input As String = sourceEncoding.GetString(inbytes)
Dim outbytes As Byte() = Encoding.Convert(sourceEncoding, targetEncoding, inbytes)
Dim convertedbytes As Byte() = Encoding.Convert(targetEncoding, sourceEncoding, outbytes)
Dim output As String = sourceEncoding.GetString(convertedbytes)
Dim diffs As New List(Of Char)()
For idx As Integer = 0 To input.Length - 1
    If input(idx) <> output(idx) Then
        diffs.Add(input(idx))
    End If
Next
' Print results.
Console.WriteLine("Source: " + input)
Console.WriteLine("(Coded): " + String.Join(" ", inbytes.Select(Function (x) Convert.ToInt32(x).ToString()).ToArray()))
Console.WriteLine()
Console.WriteLine("Target: " + output)
Console.WriteLine("(Coded): " + String.Join(" ", convertedbytes.Select(Function (x) Convert.ToInt32(x).ToString()).ToArray()))
Console.WriteLine()
Console.WriteLine("Cannot convert: " + String.Join(" ", diffs.Select(Function (x) Convert.ToInt32(x).ToString()).ToArray()))
For the case of Windows 1252 to EBCDIC 37, there are 27 characters that do not map. I chose what I thought was the best equivalent for those characters.
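Choosing equivalents by hand can be expressed as a lookup table that is applied to the decoded string before it is written to the flat file. The following is a minimal sketch of that idea, not the mapping I actually used: the module name `Cp1252Substitution`, the function `ReplaceUnmappable`, and every replacement in the table are hypothetical and only cover the characters named in the question plus the ellipsis. It works on .NET strings (Unicode), so the keys are the Unicode code points that Windows-1252 codes 145 - 151 and 133 decode to.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module Cp1252Substitution
    ' Hypothetical substitution table. The replacements are illustrative
    ' choices, not the mapping selected in the answer above.
    Private ReadOnly Replacements As New Dictionary(Of Char, String) From {
        {ChrW(&H2018), "'"},
        {ChrW(&H2019), "'"},
        {ChrW(&H201C), """"},
        {ChrW(&H201D), """"},
        {ChrW(&H2022), "*"},
        {ChrW(&H2013), "-"},
        {ChrW(&H2014), "-"},
        {ChrW(&H2026), "..."}
    }

    ' Replace every character listed in the table; copy all others unchanged.
    Function ReplaceUnmappable(source As String) As String
        Dim result As New StringBuilder(source.Length)
        For Each ch As Char In source
            Dim replacement As String = Nothing
            If Replacements.TryGetValue(ch, replacement) Then
                result.Append(replacement)
            Else
                result.Append(ch)
            End If
        Next
        Return result.ToString()
    End Function

    Sub Main()
        ' Curly quotes, an en dash, and a curly apostrophe.
        Dim sample As String = ChrW(&H201C) & "Quoted" & ChrW(&H201D) & " " &
                               ChrW(&H2013) & " it" & ChrW(&H2019) & "s"
        Console.WriteLine(ReplaceUnmappable(sample)) ' prints "Quoted" - it's
    End Sub
End Module
```

The table would be extended with whichever characters the comparison program reports as unconvertible. A replacement can be longer than one character (the ellipsis becomes three periods), so fixed-width records may need their lengths rechecked after substitution.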
Upvotes: 1