shabalong

Reputation: 45

Trying to write a list of dictionaries to csv in Python, running into encoding issues

So I am running into an encoding problem stemming from writing dictionaries to csv in Python.

Here is an example code:

import csv

some_list = ['jalape\xc3\xb1o']

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        output_file.writerow([item])

This works perfectly fine and gives me a csv file with "jalapeño" written in it.

However, when I create a list of dictionaries with values that contain such UTF-8 characters...

import csv

some_list = [{'main': ['4 dried ancho chile peppers, stems, veins and seeds removed']},
             {'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}]

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        output_file.writerow([item])

I just get a csv file with 2 rows with the following entries:

{'main': ['4 dried ancho chile peppers, stems, veins and seeds removed']}
{'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}

I know my data is in the right encoding, but because the items aren't strings, csv.writer writes out their representation as-is. This is frustrating. I searched for similar questions on here, and people have suggested csv.DictWriter, but that wouldn't work well for me because my dictionaries don't all have just the one key 'main'. Some have other keys such as 'toppings', 'crust', etc. On top of that, I'm still processing them: the eventual output should have the ingredients split into amount, unit, and ingredient, so I will end up with a list of dictionaries like this:

[{'main': {'amount': ['4'], 'unit': [''],
           'ingredient': ['dried ancho chile peppers']}},
 {'topping': {'amount': ['1'], 'unit': ['pump'],
              'ingredient': ['cool whip']},
  'filling': {'amount': ['2'], 'unit': ['cups'],
              'ingredient': ['strawberry jam']}}]

Seriously, any help would be greatly appreciated; otherwise I'll have to use find and replace in LibreOffice to fix all those \x** UTF-8 escape sequences.

Thank you!

Upvotes: 0

Views: 2139

Answers (2)

Martijn Pieters

Reputation: 1123850

You are writing dictionaries to the CSV file, while .writerow() expects a sequence of single values, each of which is turned into a string on writing.

Don't write dictionaries; they are turned into their string representations, as you've discovered.
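
To see why the whole dictionary ends up in one cell, here is a minimal sketch. It is written for Python 3, which is an assumption on my part since the code in the question is Python 2, so a bytes literal stands in for the Python 2 byte string and io.StringIO stands in for the output file. The name row_value is only illustrative.

```python
import csv
import io

# A dictionary like the ones in the question; the bytes literal mimics
# a Python 2 byte string holding UTF-8 encoded text.
row_value = {'main': [b'jalape\xc3\xb1o']}

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer).writerow([row_value])

# csv.writer calls str() on anything that is not already a string,
# so the cell holds the dictionary's representation, escapes included.
written = buffer.getvalue()
print(written)
```

The cell contains the text of str(row_value), so the \xc3\xb1 escape sequence is written literally rather than as ñ.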

You need to determine how the keys and / or values of each dictionary are to be turned into columns, where each column is a single primitive value.

If, for example, you only want to write the main key (if present) then do so:

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        if 'main' in item:
            output_file.writerow(item['main'])

where it is assumed that the value associated with the 'main' key is always a list of values.
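
If you want every key and not just 'main', one possible layout is one row per ingredient with the dictionary key as its own column. The following is only a sketch under several assumptions: it targets Python 3, it uses the nested shape from the end of the question, and the helper flatten_recipes, the column names, and the output path are all hypothetical choices rather than anything prescribed by the csv module.

```python
import csv
import os
import tempfile

# Sample data in the nested shape described in the question.
recipes = [
    {'main': {'amount': ['4'], 'unit': [''],
              'ingredient': ['dried ancho chile peppers']}},
    {'topping': {'amount': ['1'], 'unit': ['pump'],
                 'ingredient': ['cool whip']},
     'filling': {'amount': ['2'], 'unit': ['cups'],
                 'ingredient': ['strawberry jam']}},
]


def flatten_recipes(recipes):
    """Yield one flat row per ingredient (hypothetical helper)."""
    for recipe_number, recipe in enumerate(recipes, start=1):
        for section, fields in recipe.items():
            # The three lists are assumed to be parallel and equally long.
            triples = zip(fields['amount'], fields['unit'], fields['ingredient'])
            for amount, unit, ingredient in triples:
                yield [recipe_number, section, amount, unit, ingredient]


rows = list(flatten_recipes(recipes))

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.mkdtemp(), 'ingredients.csv')

# In Python 3 the file is opened in text mode with an explicit encoding.
with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['recipe', 'section', 'amount', 'unit', 'ingredient'])
    writer.writerows(rows)
```

Each cell is now a single primitive value, so nothing is written as a dictionary or list representation.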

If you wanted to persist dictionaries with Unicode values, then you are using the wrong tool. CSV is a flat data format, just rows and primitive columns. Use a tool that can preserve the right amount of information instead.

For dictionaries with string keys, lists, numbers and unicode text, you can use JSON, or you can use pickle if more complex and custom data types are involved. When using JSON, you do want to either decode from byte strings to Python Unicode values, or always use UTF-8-encoded byte strings, or state how the json library should handle string encoding for you with the encoding keyword:

import json

with open('data.json', 'w') as jsonfile:
    json.dump(some_list, jsonfile, encoding='utf8')

because JSON strings are always unicode values. The default for encoding is utf8 but I added it here for clarity.

Loading the data again:

with open('data.json', 'r') as jsonfile:
    some_list = json.load(jsonfile)

Note that this will return unicode strings, not strings encoded to UTF8.
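
For anyone on Python 3, where json.dump() has no encoding keyword, a rough equivalent of the round trip might look like the sketch below. The sample data is taken from the question, while the temporary path and the name restored are assumptions made for illustration. The encoding is given to open() instead, and ensure_ascii=False keeps ñ readable in the file rather than writing it as \u00f1.

```python
import json
import os
import tempfile

# In Python 3 these are text (str) values, not encoded byte strings.
some_list = [
    {'main': ['4 dried ancho chile peppers, stems, veins and seeds removed']},
    {'main': ['2 jalapeño peppers, seeded and chopped', '1 dash salt']},
]

# Hypothetical output location; any writable path works.
json_path = os.path.join(tempfile.mkdtemp(), 'data.json')

# store: the file object handles the UTF-8 encoding
with open(json_path, 'w', encoding='utf-8') as jsonfile:
    json.dump(some_list, jsonfile, ensure_ascii=False)

# load: the nested structure comes back as lists, dicts and str values
with open(json_path, 'r', encoding='utf-8') as jsonfile:
    restored = json.load(jsonfile)
```

The nested lists and dictionaries survive the round trip unchanged, which is the information a flat CSV row cannot hold.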

The pickle module works much the same way, but the data format is not human-readable:

import pickle

# store
with open('data.pickle', 'wb') as pfile:
    pickle.dump(some_list, pfile)

# load
with open('data.pickle', 'rb') as pfile:
    some_list = pickle.load(pfile)

pickle will return your data exactly as you stored it. Byte strings remain byte strings, unicode values would be restored as unicode.

Upvotes: 2

Gert-Jan Peeters

Reputation: 79

As your output shows, you are writing the dictionary itself. If you want the strings inside it to be written, pass the list stored under the key instead:

import csv

some_list = [{'main': ['4 dried ancho chile peppers, stems, veins', '\xc2\xa0\xc2\xa0\xc2\xa0 and seeds removed']}, {'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}]

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        output_file.writerow(item['main'])  # instead of [item], pass item['main']

I understand this may not be the code you want, since it requires every dictionary to use the key main, but at least the values are written out correctly now.

You might want to describe the output you are after a bit more precisely, as it is not really clear right now (at least to me). For example, do you want a CSV file that has main in the first cell and then 4 dried ...

Upvotes: 0
