Syntax Error
Syntax Error

Reputation: 1640

How to get MIME encoded base64 string back to writeable byte array?

I have a number of files (PDF, doc, docx, jpg, etc) in a database, which were added using the following method:

Attachments are sent to the database as MIME-encoded (base64-encoded) strings. The component then converts this MIME-encoded string (in unicode) to a byte stream before writing it to the database as a BLOB (Oracle) or Image (SQL Server).

There is also a 'guid' appended to the start of the string, which is 76 chars.

I'm trying to extract and save the attachments to a file on disk rather than in the database. It works about 20% of the time. The remainder of the time I get System.FormatException: Invalid character in a Base-64 string. when passing my string into FromBase64String.

I have noticed that values in the database that start like the below, save successfully:

0x7B00350030003100460032003300350046002D00370

The ones which fail always start like this:

0x7B35303146323335462D373546302D343936342D394

I haven't got enough characters here to paste a full example, so see this pastebin for an example of one that does not work. It should represent a Word document that says "Test document only". This one is the same document but converted to PDF.

This is one that will work and be converted to test font.htm. It must be inserted into an image column within a SQL database, then pulled out with my code:

 private const int guidLength = 38 * 2;
 public static byte[] GetAttachment(string folderid, string filename) {
 string queryString = string.Format("SELECT <image column> FROM AttachmentTable WHERE .....",
                      folderid, filename);
                using (SqlConnection connection = new SqlConnection("context connection=true"))
                {
                    connection.Open();
                    using (SqlCommand selectAttachment = new SqlCommand(
                        queryString,
                        connection))
                    {
                        using (SqlDataReader reader = selectAttachment.ExecuteReader())                        {
                            while (reader.Read())
                            {
                                if (reader[0] == System.DBNull.Value)
                                    return new byte[0];
                                byte[] data = (byte[])reader[0];
                                byte[] truncatedData;
                                if (data[data.Length - 2] == 0)
                                    truncatedData = new byte[data.Length - guidLength - 2];
                                else
                                    truncatedData = new byte[data.Length - guidLength];
                                Array.Copy(data, guidLength, truncatedData, 0, truncatedData.Length);
                                // base64 unencode
                                string truncatedString = Encoding.Unicode.GetString(truncatedData);
                                return Convert.FromBase64String(truncatedString);
                            }
                        }

                    }
                } 

             }

Then to save the attachment:

 public static void SaveAttachmentToFile(string file, string folderid, string fileName)
        {
            byte[] data = GetAttachment(file, folderid);
            if (data == null)
                throw new ArgumentNullException("Attachment has no data, it may have been deleted");
            using (FileStream writer = new FileStream(fileName, FileMode.Create))
            {
                writer.Write(data, 0, data.Length);
            }
        }

SQL CLR Function

   [SqlFunction(IsDeterministic = true,
                     IsPrecise = true,
                     DataAccess = DataAccessKind.Read,
                     SystemDataAccess = SystemDataAccessKind.Read)]
    public static SqlString WriteToFile(SqlString path, SqlString folderid, SqlString fileName)
    {
        try
        {
            if (!path.IsNull && !folderid.IsNull && !fileName.IsNull)
            {
                var dir = Path.GetDirectoryName(path.Value);
                if (!Directory.Exists(dir))
                    Directory.CreateDirectory(dir);
                string filename = Convert.ToString(fileName);
                string folderid = Convert.ToString(efolderid);
                string filepath = Convert.ToString(path);
                SaveAttachmentToFile(filename, folderid, filepath);
                return "Wrote file";
            }
            else
                return "No data passed to method!";
        }
        catch (IOException e)
        {
            return "Make sure the assembly has external access!\n" + e.ToString();
        }
        catch (Exception ex)
        {
            return ex.ToString();
        }
    }

Note, All the C# code above is compiled into an assembly then used as a CLR function:

CREATE FUNCTION [dbo].[WriteToFile](@path [nvarchar](max), @efolderid [nvarchar](max), @filename [nvarchar](max))
RETURNS [nvarchar](max) WITH EXECUTE AS CALLER
AS 
EXTERNAL NAME [ClassLibrary1].[CLR.UserDefinedFunctions].[WriteToFile]
GO

I think my issue might be to do with the encoding. I thought I might be able to use Encoding.MIME.GetString but it doesn't exist. I've also tried UTF-8 but get 0% success rate. Unicode seems to work but as above, ~20% success rate.

My question is, why do some of these fail to save (incorrect base64 chars.. but why?) yet others work fine? How do I determine the correct encoding to use? There is a pattern to it but I'm really not sure how to proceed from here.

Upvotes: 1

Views: 2344

Answers (1)

Nyerguds
Nyerguds

Reputation: 5639

The given data insertion method is ambiguous; "unicode" is not actually a text encoding; it is the general system of representing symbols as a number. The .Net framework indeed has an encoding called "Unicode", but this is a misnomer, this encoding is actually UTF-16.

Now, your data, as mentioned, comes in two formats; one that works, and one that doesn't work. The difference between those two formats is that one of them has 00 bytes in between each of the data bytes. This corresponds to UTF-16-LE, where all symbols are 16 bits, aka, 2 bytes, with the lowest part of the value stored in the first byte. The compact data without those 00 bytes should be plain ASCII.

This UTF-16 format is actually a pretty dumb way of saving Base64 data, since Base64, by definition, is always pure 7-bit ascii; these extra bytes will never be used, and just double the space required to save that data. In fact, when saved as bytes, there is no merit to Base64 encoding either, since the purpose of Base64 is to convert binary data to plain text so it can be handled by systems that can't handle storing/transferring binary data. Given the fact this Base64 text is then saved as Binary LOB in your database, this is clearly not the case here.

All that aside, the 00 bytes do give a solution for your problem here: as I said, for Base64 content, these in-between bytes will never be used, meaning they will always be 00. On the other hand, Base64 is always plain ASCII text and should never contain 00 bytes. This means you can check those 00 bytes and use their existence to choose the correct encoding.

Note that it is much simpler to cut off the GUID after converting the bytes to string, since then it will simply always have a length of 38, as opposed to being either 38 bytes in ASCII or 76 bytes in UTF-16.

Adapting the reader part of your first code block to this should solve the issue:

using (SqlDataReader reader = selectAttachment.ExecuteReader())
{
    // only reading one anyway; doesn't need to be a 'while'.
    if (!reader.Read())
        return new byte[0];
    if (reader[0] == System.DBNull.Value)
        return new byte[0];
    byte[] data = (byte[])reader[0];
    if (data.Length == 0)
        return new byte[0];
    String base64String
    if (data.Length > 1 && data[1] == 00)
        base64String = Encoding.Unicode.GetString(data);
    else
        base64String = Encoding.ASCII.GetString(data);
    // Cuts off the GUID, and takes care of any trailing 00 bytes.
    String truncatedString = base64String.Substring(38).TrimEnd('\0');
    return Convert.FromBase64String(truncatedString);
}

Upvotes: 1

Related Questions