Reputation: 33
As a follow-up to a question someone helped me with here yesterday (Lost in XML and Python), I am trying to compare two strings.
The problem is that the two are stored differently:
CSV FILE HAS : "‚"
XML FILE HAS : "&#8218;"
(both without the quotation marks)
Printing the strings at the time of comparison shows me why they do not match:
These are the strings it is trying to match
FROM XML : &#8218;
FROM CSV : \x82
This will probably happen for many more characters than this particular one. My question is: how do I resolve this?
After comparison the matching strings need to be stored and printed back in the format of the string in the XML.
Here is how I am opening and reading in my csv file :
import csv
csvdata = csv.reader(open('csvsmall.csv'))
csvfile = open(csvinput, "rb")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
============================UPDATE============================================
OK, so going by the replies, I think it would be easiest to find a way to convert the escaped strings in the CSV file to the form used in the XML file.
That would mean converting:
"‚" (which looks like it is being read as \x82) to "&#8218;"
Does anyone have any tips on how to do this for all the values from the CSV, which are stored in a dictionary?
filenameToLabel = {}
for l, f in (x.strip().split(';') for x in (csvfile.readlines())[1:]):
    filenameToLabel[f] = l
Upvotes: 2
Views: 1042
Reputation: 82992
Converting the CSV data into HTML character references is NOT a good idea. It is in general better to convert both to plain and simple Unicode.
You have &#8218; and suchlike in the output from your XML parser. These can be unescaped using effbot's unescape function, which also handles named entities and hexadecimal character references. You should do this immediately after obtaining the data from your XML parser.
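As a minimal sketch of that unescaping step, assuming Python 3: the standard library's html.unescape covers the same ground (decimal and hexadecimal character references plus named entities), so it can stand in for the effbot function. The wrapper name unescape_xml_text is made up for this illustration.

```python
import html

def unescape_xml_text(text):
    """Turn character references and named entities into real characters.

    Hypothetical helper; it simply delegates to html.unescape (Python 3.4+).
    """
    return html.unescape(text)

# Decimal and hexadecimal references to the same character, U+201A.
print(repr(unescape_xml_text("&#8218;")))
print(repr(unescape_xml_text("&#x201A;")))
```

Apply it to each string as soon as it comes out of the XML parser, so that everything downstream only ever sees real characters.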
You should decode your CSV data using the appropriate encoding, probably one of the cp1250 etc. family. You have given us only one correspondence, "&#8218;" <-> \x82. The byte \x82 is decoded as U+201A SINGLE LOW-9 QUOTATION MARK by all the Windows encodings cp1250 to cp1258 inclusive. To help you choose which one, tell us any other correspondences that you have, plus what country the file was created in, what locale is in effect on the computer that created it, what language the text is written in, and any other background information that you have.
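A rough way to see why one correspondence is not enough, sketched for Python 3: decode the same byte with every code page from cp1250 to cp1258 and compare the results. The helper names, the cp1252 default and the sample bytes are assumptions for illustration; the semicolon delimiter is taken from the split(';') in the question.

```python
import csv
import io

# The nine Windows code pages mentioned above: cp1250 through cp1258.
WINDOWS_CODECS = ["cp%d" % n for n in range(1250, 1259)]

def decode_with_each(raw_bytes):
    """Decode the same bytes with every candidate code page."""
    return {name: raw_bytes.decode(name, errors="replace")
            for name in WINDOWS_CODECS}

def read_csv_rows(raw_bytes, encoding="cp1252", delimiter=";"):
    """Decode the raw CSV bytes first, then parse the resulting text.

    cp1252 is only a guess at the real encoding of the file.
    """
    text = raw_bytes.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

# \x82 decodes to the same character everywhere, so it cannot settle the choice.
print(set(decode_with_each(b"\x82").values()))
# A byte such as \xe9 does differ between code pages.
print(decode_with_each(b"\xe9"))
```

Bytes in the upper half of the range (letters with accents, Cyrillic, Greek and so on) are the ones that tell the code pages apart, which is why further correspondences from the real file matter.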
Upvotes: 1
Reputation: 201728
If the XML file really contains &amp;#8218;, meant to designate a single character, then you need to preprocess the data by unescaping &amp; to &. Only after this would the XML data contain a proper character reference, and you would then need to interpret the XML correctly, which includes interpreting character references.
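The difference between a proper character reference and a doubly escaped one can be shown with the standard library parser. This is a sketch assuming Python 3; the element name "name" is invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# A proper character reference: the parser itself yields the character.
single = ET.fromstring("<name>&#8218;</name>").text

# A doubly escaped reference: the parser only undoes the &amp; layer,
# leaving the literal text of a character reference behind.
double = ET.fromstring("<name>&amp;#8218;</name>").text

# One more unescaping pass turns that leftover text into the character.
repaired = html.unescape(double)

print(repr(single), repr(double), repr(repaired))
```

If the parsed text still shows a reference such as the one in the question, the second case is the likely explanation.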
If the CSV data “‚” is 0x82 at the byte level, then the CSV data is in windows-1252 encoding or something similar. There is no indication of encoding in the CSV format itself, so you need to know it from other sources and to apply a suitable transcoding. This would mean transcoding to UTF-8 in practice, either upon reading the file or externally.
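A small transcoding sketch in Python 3, assuming the source really is windows-1252 (cp1252 in Python's codec names). Both function names and the default encoding are assumptions to adjust once the real encoding is known.

```python
def transcode_to_utf8(raw_bytes, source_encoding="cp1252"):
    """Re-encode bytes from the assumed source encoding as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")

def transcode_file(src_path, dst_path, source_encoding="cp1252"):
    """Copy a text file line by line, writing it out as UTF-8.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# The single byte 0x82 becomes the three-byte UTF-8 form of U+201A.
print(transcode_to_utf8(b"\x82"))
```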
Upvotes: 1
Reputation: 829
I had a problem that seems to be the same as yours. What solved it for me was converting strings to unicode if they were not unicode already. I guess there is probably a more pythonic way to do it, but this did the trick for me.
For parsing XML files I use lxml, which can write unicode XML files.
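In Python 3 terms, "converting to unicode if needed" could look like the sketch below. It does not use lxml. The function names and the cp1252 default are assumptions, and the second helper covers the question's wish to print matches back in the XML file's reference form.

```python
def to_text(value, encoding="cp1252"):
    """Return value as str, decoding it only if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def to_xml_reference_form(text):
    """Write non-ASCII characters as decimal character references.

    This does not escape markup characters such as & or <.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

from_csv = to_text(b"\x82")       # raw byte as read from the CSV file
from_xml = to_text("\u201a")      # already text, returned unchanged
if from_csv == from_xml:
    print(to_xml_reference_form(from_csv))
```

Comparing on real characters and converting to references only at output time keeps the matching logic independent of how either file happens to store the text.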
Upvotes: 1