Reputation: 33

Python XML CSV encoding and characters

As a follow-up to a question someone helped me with here yesterday (Lost in XML and Python), I am trying to compare two strings.

String one is read from an XML file
String two is read from a CSV file

The problem is that both are stored differently :

CSV FILE HAS : "‚"
XML FILE HAS : "&amp;#8218;"

But without the quotation marks.

Printing the strings at the time of comparison shows me why they do not match:

These are the strings it is trying to match

FROM XML : &#8218;
FROM CSV : x82

This will probably happen for many more characters than this particular one. My question is: how do I resolve this?

Read XML file differently?
Read CSV file differently?
Convert stored string before comparison?

After comparison the matching strings need to be stored and printed back in the format of the string in the XML.

Here is how I am opening and reading in my csv file :

import csv
csvdata = csv.reader(open('csvsmall.csv'))

csvfile = open(csvinput, "rb")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)

============================UPDATE============================================

OK, so going by the replies, I think it would be easiest to find a way to convert the raw characters in the CSV file to the escaped form used in the XML file.

That would mean converting:

"‚", which looks like it is being read as x82, to "&amp;#8218;"

Does anyone have any tips on how to do this on all the values of the csv that are stored in a dictionary? :

filenameToLabel = {}
for l,f in (x.strip().split(';') for x in (csvfile.readlines())[1:]):
    filenameToLabel[f] = l

Upvotes: 2

Answers (3)

John Machin

Reputation: 82992

Converting the CSV data into HTML character references is NOT a good idea. It is in general better to convert both to plain and simple Unicode.

You have &#8218; and suchlike in the output from your XML parser. This can be unescaped using the effbot's unescape function, which also handles entities and hexadecimal character references. You should do this immediately after obtaining the data from your XML parser.
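As a rough sketch of that unescaping step: in Python 3 the standard library's html.unescape does a similar job to the effbot function mentioned above. This is an assumption on my part (the answer itself refers to the effbot recipe, and the question looks like Python 2), and the wrapper name unescape_refs is made up for illustration.

```python
import html


def unescape_refs(text):
    """Turn character references and named entities into plain Unicode.

    Hypothetical helper: a thin wrapper around html.unescape (Python 3.4+).
    """
    return html.unescape(text)


# The same character written three different ways.
decimal_ref = unescape_refs("&#8218;")    # decimal character reference
hex_ref = unescape_refs("&#x201A;")       # hexadecimal character reference
named_ref = unescape_refs("&sbquo;")      # named entity

# All three now hold U+201A SINGLE LOW-9 QUOTATION MARK.
print(decimal_ref == hex_ref == named_ref == "\u201a")
```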

You should decode your CSV data using the appropriate encoding, probably one of the cp1250 etc. family. You have given us only one correspondence, "&#8218;" <-> \x82. The byte \x82 is decoded as U+201A SINGLE LOW-9 QUOTATION MARK by all the Windows encodings cp1250 to cp1258 inclusive. To help you choose which one, tell us any other correspondences that you have, plus what country the file was created in, what locale is in effect on the computer that created it, what language the text is written in, and any other background information that you have.
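A minimal sketch of the decoding step, assuming Python 3, a semicolon-delimited file, and cp1252 as the encoding (the real encoding still has to be confirmed as described above). The sample bytes and the names read_rows and to_char_refs are invented for illustration. The last helper shows one way to get back to the XML-style form for output, using the standard "xmlcharrefreplace" error handler.

```python
import csv
import io

# Made-up sample: a header row plus one row whose label is the single byte 0x82.
raw = b"label;filename\n\x82;a.xml\n"

# Decode the one known byte with each Windows code page from cp1250 to cp1258.
decoded_by_codec = {"cp%d" % n: b"\x82".decode("cp%d" % n) for n in range(1250, 1259)}


def read_rows(raw_bytes, encoding="cp1252", delimiter=";"):
    """Decode the bytes first, then hand text (not bytes) to the csv module."""
    text = raw_bytes.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


def to_char_refs(text):
    """Write non-ASCII characters as decimal character references for output."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


rows = read_rows(raw)
label = rows[1][0]            # plain Unicode, ready for comparison
print(to_char_refs(label))    # the form used in the XML file
```

With both sides held as Unicode, the comparison is an ordinary ==, and the conversion back to character references only happens when writing results out.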

Upvotes: 1

Jukka K. Korpela

Reputation: 201728

If the XML file really contains &amp;#8218;, meant to designate a single character, then you need to preprocess the data by unescaping &amp; to &. Only after this would the XML data contain a proper character reference, and then you would need to interpret the XML correctly, which includes interpreting character references.
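To illustrate the difference between the two cases, here is a small sketch using the standard library's ElementTree parser. The element names item and label are invented, and using html.unescape for the second pass is my assumption rather than something stated in the answer.

```python
import html
import xml.etree.ElementTree as ET

# Case 1: the ampersand itself is escaped, so the parser only resolves &amp;
# and the text still contains a literal character reference.
double_escaped = "<item><label>&amp;#8218;</label></item>"
text_after_parse = ET.fromstring(double_escaped).find("label").text

# A second unescaping pass turns the leftover reference into the real character.
fixed = html.unescape(text_after_parse)

# Case 2: a proper character reference is resolved by the parser on its own.
proper = "<item><label>&#8218;</label></item>"
direct = ET.fromstring(proper).find("label").text

print(repr(text_after_parse), repr(fixed), repr(direct))
```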

If the CSV data “‚” is 0x82 at the byte level, then the CSV data is in windows-1252 encoding or something similar. There is no indication of encoding in the CSV format itself, so you need to know it from other sources and to apply a suitable transcoding. This would mean transcoding to UTF-8 in practice, either upon reading the file or externally.
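A sketch of transcoding "upon reading the file", assuming Python 3 and windows-1252 (cp1252) as the source encoding. The function name transcode and the file names are placeholders, and the sample file is created in a temporary directory purely so the example runs on its own.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding="cp1252"):
    """Copy a text file, re-encoding it from src_encoding to UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input-cp1252.csv")
    dst_path = os.path.join(tmp, "output-utf8.csv")

    # Made-up sample line containing the byte 0x82.
    with open(src_path, "wb") as f:
        f.write(b"\x82;a.xml\r\n")

    transcode(src_path, dst_path)

    with open(dst_path, "rb") as f:
        utf8_bytes = f.read()

print(utf8_bytes)
```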

Upvotes: 1

cyphorious

Reputation: 829

I had a problem that seems to be the same as yours. What solved my problem was casting strings to unicode, if they weren't. I guess there is probably a more pythonic way for it, but this did the trick for me.
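Something along these lines, perhaps. This is only a sketch for Python 3, where the Unicode type is str and undecoded data is bytes; the helper name to_text and the cp1252 default are assumptions, not part of what I originally used.

```python
def to_text(value, encoding="cp1252"):
    """Return value as Unicode text, decoding it only if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Values from mixed sources: one still raw bytes, one already text.
from_csv = to_text(b"\x82")
from_xml = to_text("\u201a")

print(from_csv == from_xml)
```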

For parsing of XML files I use lxml, which has the possibility to write unicode xml files.

Upvotes: 1
