andy

Reputation: 1459

Python - read csv file of unicode substitutions

I need to replace unicode characters according to a custom set of substitutions. The substitutions are defined by someone else's API, and I basically just have to deal with it. As it stands, I have extracted all the required substitutions into a csv file. Here's a sample:

\u0020, 
\u0021,!
\u0023,#
\u0024,$
\u0025,%
\u0026,&
\u0028,(
\u0029,)
\u002a,*
\u002b,+
\u002c,","
\u002d,-
\u002e,.
\u002f,/
\u03ba,kappa
...

I generated this in MS Excel by hacking up the Java program the API owners use for themselves when they need to do conversions (and no...they won't just run the converter when the API receives input...). There are ~1500 substitutions defined.

When I generate output (from my Django application) to send to their API as input, I want to handle the substitutions. Here is how I have been trying to do it:

import csv
import os

class UTF8Converter(object):
    def __init__(self):
        #create replacement mapper
        full_file_path = os.path.join(os.path.dirname(__file__),
                                      CONVERSION_FILE)
        with open(full_file_path) as csvfile:
            reader = csv.reader(csvfile)
            mapping = []
            for row in reader:
                #remove escape-y slash
                mapping.append( (row[0], row[1]) ) # here's the problem
        self.mapping = mapping

    def replace_UTF8(self, string):
        for old, new in self.mapping:
            print new
            string.replace(old, new)
        return string

The problem is that the unicode escape codes from the csv file are coming through as literal text; for example, self.mapping[example][0] = '\\u00e0'. Ok, well that's wrong, so let's try:

mapping.append( (row[0].decode("string_escape"), row[1]) )

No change. How about:

mapping.append( (row[0].decode("unicode_escape"), row[1]) )

Ok, now self.mapping[example][0] = u'\xe0'. So yeah, that's the character that I need to replace...but the string that I need to call the replace_UTF8() function on looks like u'\u00e0'.

I have also tried row[0].decode("utf-8"), row[0].encode("utf-8"), unicode(row[0], "utf-8").

I also tried the approach from a related question, but I don't have unicode characters in the csv file; I have unicode code points (not sure if that is the correct terminology or what).

So, how do I turn the string that I read in from the csv file into a unicode string that I can use with mythingthatneedsconverted.replace(...)?

Or...do I need to do something else with the csv file to use a more sensible approach?

Upvotes: 0

Views: 539

Answers (1)

abarnert

Reputation: 365747

I don't think your problem actually exists:

Ok, now self.mapping[example][0] = u'\xe0'. So yeah, that's the character that I need to replace...but the string that I need to call the replace_UTF8() function on looks like u'\u00e0'.

Those are just different representations of the exact same string. You can test it yourself:

>>> u'\xe0' == u'\u00e0'
True
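
Both literals denote a single character with ordinal 0xe0; the u'\xe0' form is just how Python writes the repr of code points below U+0100:

    >>> len(u'\u00e0'), ord(u'\u00e0')
    (1, 224)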

The actual problem is that you're not doing any replacing. In this code:

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string.replace(old, new)
    return string

You're just calling string.replace over and over, which returns a new string, but does nothing to string itself. (It can't do anything to string itself; strings are immutable.) What you want is:

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string = string.replace(old, new)
    return string
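
A quick interactive check makes the immutability point concrete; replace returns a new string and leaves the original untouched:

    >>> s = u'a\u00e0b'
    >>> s.replace(u'\u00e0', u'X')
    u'aXb'
    >>> s
    u'a\xe0b'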

However, if string really is a UTF-8-encoded str, as the function name implies, this still won't work. When you UTF-8-encode u'\u00e0', what you get is '\xc3\xa0'. There is no \u00e0 in there to be replaced. So, what you really need to do is decode it, do the replaces, then re-encode. Like this:

def replace_UTF8(self, string):
    u = string.decode('utf-8')
    for old, new in self.mapping:
        print new
        u = u.replace(old, new)
    return u.encode('utf-8')
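
You can check the byte-level mismatch yourself; UTF-8 encodes U+00E0 as the two bytes \xc3 \xa0, and neither byte equals \xe0:

    >>> u'\u00e0'.encode('utf-8')
    '\xc3\xa0'
    >>> '\xc3\xa0'.find('\xe0')
    -1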

Or, even better, keep things as unicode instead of encoded str throughout your program except at the very edges, so you don't have to worry about this stuff.
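
For example, a unicode-in/unicode-out version of the method might look something like this (replace_unicode is just an illustrative name; it assumes self.mapping holds unicode pairs):

    def replace_unicode(self, text):
        # text is already a unicode object; encoding and decoding
        # happen at the program's I/O boundaries, not in this method
        for old, new in self.mapping:
            text = text.replace(old, new)
        return text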


Finally, this is a very slow and complicated way to do the replacing, when unicode strings have a built-in translate method that does exactly what you want.

Instead of building your table as a list of pairs of strings, build it as a dict mapping Unicode ordinals to replacements (translate accepts ordinals, unicode strings, or None as values):

mapping = {}
for row in reader:
    # key: the ordinal of the decoded escape sequence; value: the
    # replacement as a unicode string (some replacements, like
    # "kappa", are longer than one character, so ord() won't do)
    mapping[ord(row[0].decode("unicode_escape"))] = row[1].decode("utf-8")
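
For reference, here is how unicode.translate behaves with such a dict, using two entries from the sample CSV (multi-character replacements like "kappa" work because translate accepts unicode strings as values):

    >>> table = {0x3ba: u'kappa', 0x2b: u'+'}
    >>> u'x \u03ba + y'.translate(table)
    u'x kappa + y'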

And now, the whole thing is a one-liner, even with your encoding mess:

def replace_UTF8(self, string):
    return string.decode('utf-8').translate(self.mapping).encode('utf-8')
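
Usage would then look something like this (assuming the dict above was built in __init__, and given the \u03ba,kappa row from the sample CSV; '\xce\xba' is the UTF-8 encoding of U+03BA):

    >>> converter = UTF8Converter()
    >>> converter.replace_UTF8('pre \xce\xba post')
    'pre kappa post'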

Upvotes: 1
