Patrick.H

Reputation: 575

u'String' when parsing a CSV file into a dict in Python

I am reading in a CSV file and it mostly works, but some of the strings look like this:

u'Egg'

When I try to convert this to a string, I get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)

I have read various similar questions, but the solutions provided there resulted in the same error.

Strangely, when debugging (see the picture), the variable CITY has the expected value, but the code still crashes.

Debugger Output

My function is below:

def readData(filename, delimiter=";"):
    """
    Read in our data from a CSV file and create a dictionary of records,
    where the key is a unique record ID and each value is dict
    """
    data = pd.read_csv(filename, delimiter=delimiter, encoding="UTF-8")
    data.set_index("TRNUID")
    returnValue = {}
    for index, row in data.iterrows():
        if index == 0:
            print row["CITY"]
        else:
            if math.isnan(row["DUNS"]) == True:
                DUNS = ""
            else:
                DUNS = str((int(row["DUNS"])))[:-2]
            NAME = str(row["NAME"]).encode("utf-8")
            STREET = str(row["STREET"]).encode("utf-8")
            CITY = row["CITY"]
            POSTAL = str(row["POSTAL"]).encode("utf-8")
            returnValue[row["TRNUID"]] = {
                "DUNS": DUNS,
                "NAME": NAME,
                "STREET": STREET,
                "CITY": CITY,
                "POSTAL": POSTAL
            }
    return returnValue

Upvotes: 1

Views: 46

Answers (1)

Ami Tavory

Reputation: 76346

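You're trying to convert to an ASCII string something that inherently cannot be converted to one. In Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, so any character outside the ASCII range raises UnicodeEncodeError before your .encode("utf-8") call ever runs.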

If you look up the Unicode character \xfc (U+00FC), it is a "u" with an umlaut ("ü"). Indeed, your screenshot of the variables shows "Egg a.d.Guntz" with an umlaut over the "u". The problem is therefore not with "Egg" but with the rest of the string.
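To make the failure concrete, here is a minimal sketch that uses the city name from the screenshot as an assumed value. The explicit .encode("ascii") call stands in for what Python 2's str() does implicitly on a unicode object, so the snippet should behave the same on Python 2 and Python 3. The variable names are illustrative and do not come from the question's code.

```python
# Assumed sample value, taken from the screenshot described above.
city = u"Egg a.d.G\xfcntz"

# Encoding to ASCII fails because U+00FC is outside the ASCII range.
# In Python 2, str(city) performs this same encoding step implicitly.
try:
    city.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding the unicode object directly to UTF-8 works: the umlaut
# becomes the two bytes 0xC3 0xBC.
utf8_bytes = city.encode("utf-8")
```

Applied to the question's function, this suggests calling .encode("utf-8") on the unicode value itself rather than wrapping it in str() first, although that has not been tested against the original data.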

You could address this by removing all diacritics from your characters (as in this question), but you will lose information.
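One common way to remove diacritics is Unicode decomposition with the standard unicodedata module, sketched below. The helper name strip_diacritics is hypothetical, and this may not be the exact approach taken in the linked question. The sample value is again the assumed city name from the screenshot.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks (accents, umlauts) removed."""
    # NFKD splits a character such as U+00FC into a base letter "u"
    # followed by the combining diaeresis U+0308.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every combining mark and keep the base characters.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain_city = strip_diacritics(u"Egg a.d.G\xfcntz")
# The result now contains only ASCII characters, so this no longer raises.
ascii_bytes = plain_city.encode("ascii")
```

The trade-off is the one mentioned above: "ü" and "u" become indistinguishable, so records that differ only by diacritics can no longer be told apart.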

Upvotes: 1
