Reputation: 611
I have the following Python 3 code for a Flask app. For the app (website), you upload a TXT or TSV file containing author information; this is read into memory (since it's small and the app will be deployed to a read-only file system), then the app formats it in a particular way and displays the results.
The issue I'm having is that when people upload the file with special characters in it (e.g. accents in authors' names), I get the error:
File "/Users/cdastmalchi/Desktop/author_script/main.py", line 81, in process_file
contents = csv.DictReader(file.read().decode('utf-8').splitlines(), delimiter='\t')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 201: invalid start byte
Example line with special characters:
Department of Pathology, Lariboisière Hospital, APHP and Paris Diderot University, Sorbonne Paris
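The failing byte can be reproduced in isolation: 0x96 has the bit pattern of a UTF-8 continuation byte, so it can never start a character, which is exactly the "invalid start byte" error above. In single-byte encodings such as Windows-1252, 0x96 is an en dash. A minimal sketch (the sample bytes are hypothetical, not from the actual file):

```python
raw = b"Paris \x96 Diderot"  # cp1252 bytes: 0x96 is an en dash

try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason      # 'invalid start byte'

text = raw.decode("cp1252")  # decodes cleanly: 'Paris – Diderot'
```

This is why the same file decodes fine with a Latin-family codec but blows up under UTF-8.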
Flask code:
@app.route('/process_file', methods=['POST'])
def process_file():
    # Run checks on the file
    if 'file' not in flask.request.files or not flask.request.files['file'].filename:
        return flask.jsonify({'result':'False', 'message':'no files selected'})
        return flask.redirect(url_for('home'))
    file = flask.request.files['file']
    filename = secure_filename(file.filename)
    if not allowed_file(file.filename):
        return flask.jsonify({'result':'False', 'message':'Must be TXT file!'})
        return flask.redirect(url_for('home'))
    # Stream file and check that places exist
    contents = csv.DictReader(file.read().decode('utf-8').splitlines(), delimiter='\t')
    check_places, json_data = places_exist(contents)
    if check_places is False:
        return flask.jsonify({'result':'False', 'message':'There is an affiliation missing from your Place list. Please re-try.'})
        return flask.redirect(url_for('home'))
    flask.session['filename'] = json_data
    return flask.jsonify({'result':'True'})
Update:
When I run uchardet {file.tsv} (where file.tsv is the test file with the special characters), the output is ISO-8859-9.
Update 2:
Here's my attempt at using csv.Sniffer() on a test file with special characters. But I'm not quite sure how to translate this code to work with a file in memory.
import csv
sniff_range = 4096
delimiters = ';\t,'
infile_name = 'unicode.txt'
sniffer = csv.Sniffer()
with open(infile_name, 'r') as infile:
    # Determine dialect
    dialect = sniffer.sniff(
        infile.read(sniff_range), delimiters=delimiters
    )
    infile.seek(0)
    # Sniff for header
    has_header = sniffer.has_header(infile.read(sniff_range))
    infile.seek(0)
    reader = csv.reader(infile, dialect)
    for line in reader:
        print(line)
output:
['Department of Pathology', 'Lariboisière Hospital', 'APHP and Paris Diderot University', 'Sorbonne Paris']
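For the in-memory case, the same sniffing logic works if the decoded text is wrapped in io.StringIO, which behaves like an open text file (supports read() and seek()). A sketch with hypothetical sample data:

```python
import csv
import io

# Hypothetical decoded upload contents (two consistent tab-delimited rows)
data = (
    "Department of Pathology\tLariboisière Hospital\tAPHP\n"
    "Service de Chirurgie\tHôpital Saint-Louis\tAPHP\n"
)

buf = io.StringIO(data)          # file-like wrapper around the in-memory string
sniffer = csv.Sniffer()
dialect = sniffer.sniff(buf.read(4096), delimiters=";\t,")
buf.seek(0)                      # rewind before re-reading
reader = csv.reader(buf, dialect)
rows = list(reader)
```

Note this only solves delimiter detection; the bytes must already be decoded to str before wrapping, so encoding detection is still a separate step.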
Question: How can I modify my csv.DictReader
code to handle these special characters (keeping in mind I can only read the file into memory)?
Update 3:
My question is different from the alleged dupe because I'm trying to figure out the encoding of a file stored in memory, which makes things trickier. I'm trying to implement the following method in my process_file Flask route to determine the encoding, where file in this case is a Flask file storage object (file = flask.request.files['file']). But when I try to print the lines within contents, I get nothing.
file = flask.request.files['file']
result = chardet.detect(file.read())
charenc = result['encoding']
contents = csv.DictReader(file.read().decode(charenc).splitlines(), delimiter='\t')
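The likely cause of the empty result: the first file.read() (inside chardet.detect) consumes the whole stream, so the second file.read() returns an empty bytes object. This can be reproduced with a plain io.BytesIO stand-in for the upload (a sketch):

```python
import io

file = io.BytesIO(b"col1\tcol2\nfoo\tbar\n")

first = file.read()    # consumes the entire stream
second = file.read()   # position is now at the end, so this is empty
file.seek(0)           # rewind to the start
again = file.read()    # now the full contents are readable again
```

So a file.seek(0) between the detection read and the decoding read is required.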
Upvotes: 0
Views: 819
Reputation: 55699
This version of your code successfully decodes and prints the file for me.
@app.route('/process_file', methods=['POST'])
def process_file():
    # Run checks on the file
    file = flask.request.files['file']
    result = chardet.detect(file.read())
    charenc = result['encoding']
    file.seek(0)
    # Stream file and check that places exist
    reader = csv.DictReader(file.read().decode(charenc).splitlines())
    for row in reader:
        print(row)
    return flask.jsonify({'result': charenc})
Upvotes: 1