Reputation: 111
First of all, I'm parsing from a text file which I saved with Notepad in UTF-8 encoding. Is this enough to make sure it's in UTF-8? I tried the chardet module, but it didn't really help me. Here are a few lines of the text file, in case someone can tell more from them:
CUSTOMERLOC|1|N/A|N/A|LEGACY COPPER|N/A|Existing|N/A|NRZ|NRZ|N/A|N/A
FTSMAR08|01/A|N/A|N/A|LEGACY COPPER|N/A|Existing|N/A|NRZ|NRZ|N/A|N/A
FTSMAR08|01/B|N/A|N/A|LEGACY COPPER|N/A|Existing|N/A|NRZ|NRZ|N/A|N/A
I used the lxml module to write my XML, and I assigned the output of its tostring() method to a variable called data. I then used the a2b_qp() function of the binascii module to convert the XML string to binary, and I put all of that into a bytearray.
data = bytearray(binascii.a2b_qp(ET.tostring(root, pretty_print=True)), "UTF-8")
Now in my mind, this data variable should contain my XML in binary form inside a bytearray.
So, then I used an update cursor and inserted the data into a BLOB field of the table.
row[2] = data
cursor.updateRow(row)
Everything seems to work, but when I go to read the BLOB field using this code:
with arcpy.da.SearchCursor("Point", ['BlobField']) as cursor:
    for row in cursor:
        binaryRep = row[0]
        open("C:/Blob.xml", 'wb').write(binaryRep.tobytes())
When I open the Blob.xml file, I expect to see the XML string I first created in a readable form, but I get this mess with Notepad++ set to UTF-8 encoding:
And this mess with Notepad++ set to ANSI encoding:
I thought someone experienced might know what's going on by seeing the pictures. I've read a lot and tried to figure it out, but I've been stumped for a while now.
Upvotes: 3
Views: 13505
Reputation: 35059
I'm parsing from a text file which I saved with notepad in UTF-8 encoding. Is this enough to make sure it's in UTF-8? I tried the chardet module, but it didn't really help me.
Yes, telling your editor to save it in a given encoding is enough to make sure it is saved in that encoding. If possible, this should also be recorded in the file somewhere - with XML, <?xml version="1.0" encoding="utf-8"?> is a common way to specify this - but that's just metadata, and doesn't actually control the encoding. chardet is useful for when you don't know the encoding, but the kind of guesswork it does should be reserved as a last resort. UTF-8 is usually a good default assumption, especially for XML.
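As an aside, a quick way to verify a saved file really is UTF-8 is simply to try decoding it. This is a sketch, not part of the original answer, and is_valid_utf8 is a hypothetical helper name:

```python
# A file is valid UTF-8 exactly when its raw bytes decode without error.
def is_valid_utf8(path):
    try:
        with open(path, 'rb') as f:
            f.read().decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```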
The reason this line:
data = bytearray(binascii.a2b_qp(ET.tostring(root, pretty_print=True)), "UTF-8")
gives you nonsense is that it does some nasty stuff, and ends up with mojibake.
ET.tostring() defaults to encoding in ASCII (and will therefore lose any data that isn't in the ASCII range, but that's beside the point for now). So, now you have an ASCII string. binascii.a2b_qp decodes it using the quoted-printable encoding. So, it turns it from something where everything is a printable ASCII character into something where that isn't necessarily the case (QP encodes any byte that isn't in the printable ASCII range using 3 printable ASCII characters). That means, for example, if you have anything in your text saying =00, it will turn it into a null byte. The problem is that what you had was not QP-encoded, so QP-decoding it results in nonsense.
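To make the corruption concrete, here's a tiny sketch of QP-decoding text that was never QP-encoded:

```python
import binascii

# '=XX' sequences are treated as escaped bytes, so ordinary text is mangled:
# '=3D' collapses to '=' and '=00' becomes a literal null byte.
decoded = binascii.a2b_qp('<a attr="x=3D1">text=00</a>')
print(decoded)  # b'<a attr="x=1">text\x00</a>'
```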
Then you use bytearray to encode it again as UTF8. bytearray assumes that if you give it an encoding, then the string is a unicode string - you break this assumption, and give it raw binary data (which is already meaningless). Encoding raw binary data as UTF8 isn't something that particularly makes sense, and this bit leads me to believe that you are using Python 2. Python 3 properly throws an error when you try to do this:
>>> bytearray(b'123', 'utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: encoding or errors without a string argument
Python 2 is a lot murkier about what is bytes and what is decoded characters, making this type of problem a lot easier to run into. This is a really good reason to upgrade to Python 3 if you can. But it wouldn't have helped with the previous nonsense you get out of a2b_qp (since that is a bytes<->bytes transformation).
The fix is to encode it in UTF-8 from the start, and forget about quoted-printable. (If you really do want it to be QP-encoded, run it through binascii.b2a_qp after it is UTF-8-encoded.)
ElementTree lets you specify an encoding:
ET.tostring(root, encoding='utf-8')
will get you properly UTF-8 encoded XML, which will open nicely in Notepad++.
Upvotes: 4
Reputation: 5420
Storing:
xml_string.encode('utf-8')
Retrieving:
xml_string.decode('utf-8')
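Put together, the round trip looks like this (xml_string is illustrative here, standing in for the XML text being stored):

```python
xml_string = '<point name="FTSMAR08"/>'
blob = xml_string.encode('utf-8')   # store: text -> UTF-8 bytes
restored = blob.decode('utf-8')     # retrieve: bytes -> text
print(restored == xml_string)  # True
```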
Upvotes: 0
Reputation: 19651
I think you're going off-track here:
binascii.a2b_qp(ET.tostring(root, pretty_print=True))
a2b_qp assumes the input is in quoted-printable encoding (similar to base64), but it's actually XML. The result is that the binary is junk.
Instead you should use bytearray. Pass it your XML string and an encoding ("utf-8") and it will return your blob.
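For example (a Python 3 sketch; under Python 2 the string would need to be a unicode object rather than a str):

```python
# bytearray(text, encoding) encodes the string in one step -- no binascii needed
blob = bytearray('<point id="1"/>', 'utf-8')
print(bytes(blob))  # b'<point id="1"/>'
```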
Encodings are an interesting set of mental gymnastics. In summary: use the unicode datatype, not str.
I hope this helps
Upvotes: 3