ianbroad

Reputation: 111

How do I convert this XML string to binary form with Python?

First of all, I'm parsing from a text file which I saved with Notepad in UTF-8 encoding. Is this enough to make sure it's in UTF-8? I tried the chardet module, but it didn't really help me. Here are a few lines of the text file, in case someone can find out more:

CUSTOMERLOC|1|N/A|N/A|LEGACY COPPER|N/A|Existing|N/A|NRZ|NRZ|N/A|N/A
FTSMAR08|01/A|N/A|N/A|LEGACY COPPER|N/A|Existing|N/A|NRZ|NRZ|N/A|N/A
FTSMAR08|01/B|N/A|N/A|LEGACY COPPER|N/A|Existing|N/A|NRZ|NRZ|N/A|N/A

I used the lxml module to write my XML, used its tostring() method, and assigned the result to a variable called data.

I then used the a2b_qp() function of the binascii module to convert the XML string to binary and I put all of that into a bytearray.

data = bytearray(binascii.a2b_qp(ET.tostring(root, pretty_print=True)), "UTF-8")

Now in my mind, this data variable should contain my XML in binary form inside a bytearray.

So, then I used an update cursor and inserted the data into a BLOB field of the table.

row[2] = data
cursor.updateRow(row)

Everything seems to work, but then I read the BLOB field back using this code:

with arcpy.da.SearchCursor("Point", ['BlobField']) as cursor:
    for row in cursor:
        binaryRep = row[0]
        open("C:/Blob.xml, 'wb').write(binaryRep.tobytes())

When I open the Blob.xml file, I expect to see the XML string I first created in a readable form, but I get this mess with Notepad++ set to UTF-8 encoding:

[screenshot: unreadable bytes shown in Notepad++ with UTF-8 encoding]

And this mess with Notepad++ set to ANSI encoding:

[screenshot: unreadable bytes shown in Notepad++ with ANSI encoding]

I thought someone experienced might know what's going on by seeing the pictures. I've read a lot and tried to figure it out, but I've been stumped for a while now.

Upvotes: 3

Views: 13505

Answers (3)

lvc

Reputation: 35059

I'm parsing from a text file which I saved with notepad in UTF-8 encoding. Is this enough to make sure it's in UTF-8? I tried the chardet module, but it didn't really help me.

Yes, telling your editor to save it in a given encoding is enough to make sure it is saved in that encoding. If possible, this should also be recorded in the file somewhere - with XML, <?xml version="1.0" encoding="utf-8"?> is the usual way to specify this - but that's just metadata, and doesn't actually control the encoding. chardet is useful for when you don't know the encoding - but the kind of guesswork it does should be reserved as a last resort. UTF-8 is usually a good default assumption, especially for XML.
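
For illustration, lxml (which the question already uses) can write that declaration for you, while the encoding argument is what actually controls the bytes produced; root here is just a made-up element:

from lxml import etree as ET

root = ET.Element("example")
# xml_declaration=True writes the <?xml ...?> header into the output;
# encoding='utf-8' is what actually determines the bytes produced.
xml_bytes = ET.tostring(root, xml_declaration=True, encoding='utf-8')
# xml_bytes now begins with the declaration, e.g. <?xml version='1.0' encoding='utf-8'?>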

The reason this line:

data = bytearray(binascii.a2b_qp(ET.tostring(root, pretty_print=True)), "UTF-8")

gives you nonsense is that it does some nasty stuff, and ends up with mojibake.

ET.tostring() defaults to encoding in ASCII (and will therefore lose any data that isn't ASCII-range, but that's beside the point for now). So, now you have an ASCII string. binascii.a2b_qp decodes it using the quoted printable encoding. So, it turns it from something where everything is a printable ASCII character to something where that isn't necessarily the case (qp encodes any bytes that aren't in the printable ASCII range using 3 printable ASCII characters). That means, for example, if you have anything in your text saying =00, it will turn it into a null byte. The problem is that what you had was not QP-encoded, so QP-decoding it results in nonsense.
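
To make that concrete, here's what a2b_qp does to text that was never QP-encoded (the commented values are what CPython actually returns):

import binascii

# "=XX" hex escapes become raw bytes -- "=00" turns into a NUL byte:
binascii.a2b_qp(b'foo=00bar')      # -> b'foo\x00bar'
# "=" at the end of a line is a QP "soft line break" and is removed entirely:
binascii.a2b_qp(b'line1=\nline2')  # -> b'line1line2'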

Then you use bytearray to encode it again as UTF8. bytearray assumes that if you give it an encoding, then the string is a unicode string - you break this assumption, and give it raw binary data (which is already meaningless). Encoding raw binary data as UTF8 isn't something that particularly makes sense, and this bit leads me to believe that you are using Python 2. Python 3 properly throws an error when you try to do this:

>>> bytearray(b'123', 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding or errors without a string argument

Python 2 is a lot murkier about what is bytes and what is decoded characters, making this type of problem a lot easier to run into. This is a really good reason to upgrade to Python 3 if you can. But it wouldn't have helped the previous nonsense you get out of a2b_qp (since it is a bytes<->bytes encoding).


The fix is to encode it in UTF-8 from the start, and forget about quoted-printable. (If you really do want it to be QP-encoded, run it through binascii.b2a_qp after it is UTF-8 encoded).

ElementTree lets you specify an encoding:

 ET.tostring(root, encoding='utf-8')

will get you properly UTF-8 encoded XML, which will open nicely in Notepad++.
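
Putting that together, a minimal sketch of the corrected write path (reusing the root from the question):

from lxml import etree as ET

# Serialize straight to UTF-8 bytes; no quoted-printable step involved.
xml_bytes = ET.tostring(root, pretty_print=True, encoding='utf-8')
data = bytearray(xml_bytes)  # already bytes, so no encoding argument is needed

# Only if you genuinely wanted quoted-printable on top of that:
# data = bytearray(binascii.b2a_qp(xml_bytes))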

Upvotes: 4

Ivo

Reputation: 5420

Storing:

  • Have your XML data
  • serialise it as a string
  • encode that string to a UTF-8 binary string (i.e. xml_string.encode('utf-8'))
  • Save the resulting binary string in your database

Retrieving:

  • Retrieve the binary string from the database
  • Decode it from UTF-8 - xml_string.decode('utf-8')
  • Deserialize it into XML again
  • Do what you want with your XML (see the round-trip sketch below)
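
A minimal round-trip sketch of those steps, assuming Python 3 and the standard library's ElementTree (the database calls are left out; the encode/decode pair is the point):

import xml.etree.ElementTree as ET

root = ET.Element("example")
root.text = "café"

# Storing: serialize to a text string, then encode to UTF-8 bytes
xml_string = ET.tostring(root, encoding='unicode')  # str, not bytes
blob = xml_string.encode('utf-8')                   # bytes for the BLOB field

# Retrieving: decode the bytes back to text, then parse again
restored = ET.fromstring(blob.decode('utf-8'))
assert restored.text == "café"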

Upvotes: 0

Andrew Alcock

Reputation: 19651

I think you're going off-track here:

binascii.a2b_qp(ET.tostring(root, pretty_print=True))

a2b_qp assumes the input is in 'quoted printable' (similar to base64) but it's actually XML. The result is that the binary is junk.

Instead you should use bytearray. Pass it your XML string and encoding ("utf-8") and it will return your blob.
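
For example, a minimal sketch assuming Python 3, where tostring(..., encoding='unicode') returns text rather than bytes:

xml_string = ET.tostring(root, encoding='unicode')  # a text (unicode) string
blob = bytearray(xml_string, 'utf-8')               # UTF-8 bytes for the BLOB field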

Encodings are an interesting set of mental gymnastics. In summary:

  • If in Python 3, you're probably good. If you are using 2.x, then you almost certainly want to use the unicode datatype, not str
  • Unicode is a higher-level concept than an encoding. Every displayable character is one (or sometimes more than one) code point in a huge logical space of over a million characters.
  • Simplistically writing a Unicode string to disk would require 3 bytes for each character. Such files would be a lot larger than they need to be, and are incompatible with most existing ASCII files - this was unacceptable back in the 1990's when most data was ASCII and disk was oh-so-expensive, so an encoding (mapping) was used. UTF-8 is a good one because:
    • Backwards compatibility: All 7-bit ASCII files are valid UTF-8 files
    • Efficiency: Characters needing 8 to 11 bits (most of the other characters that most people use) map to 2 bytes of UTF-8. Other characters occupy 3 or 4 bytes as required
    • Compatibility: A lot of important protocols and standards use UTF-8
  • You've moved into a different kind of encoding with binascii. This is a set of routines used when you have to send binary data (for example a JPG) over a medium in which only ASCII is allowed or is safe (URLs and SMTP/email, for example). Base64 works as follows (see the sketch after this list):
    • Using A-Z, a-z, 0-9 and a couple more characters, you have 64 code points, or 6 bits of information.
    • Four of these characters carry 6x4 = 24 bits, the same as 3 bytes of data (3x8).
    • Base64 therefore takes blocks of 3 bytes and maps them onto 4 safe characters.
    • In other words, you can convert any binary into a block of safe characters at the cost of a ~33% size increase.
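
A quick illustration of that 3-bytes-in, 4-characters-out mapping (Python 3):

import base64

blob = bytes([0, 1, 2])                # three arbitrary bytes
safe = base64.b64encode(blob)          # b'AAEC' -- four ASCII-safe characters
assert base64.b64decode(safe) == blob  # lossless round trip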

I hope this helps

Upvotes: 3
