Reputation: 611
I have the following Python 3 code for a Flask app. For the app (website), you upload a TXT or TSV file containing author information; this is read into memory (since it's small and the app will be deployed to a read-only file system), then the app formats it in a particular way and displays the results.
The issue I'm having is that when people upload the file with special characters in it (e.g. accents in authors' names), I get the error:
File "/Users/cdastmalchi/Desktop/author_script/main.py", line 81, in process_file
contents = csv.DictReader(file.read().decode('utf-8').splitlines(), delimiter='\t')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 201: invalid start byte
Example line with special characters:
Department of Pathology, Lariboisière Hospital, APHP and Paris Diderot University, Sorbonne Paris
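The failing byte can be reproduced in isolation: 0x96 has the bit pattern of a UTF-8 continuation byte, so it can never start a character, which is exactly the "invalid start byte" error above. In single-byte encodings such as Windows-1252, 0x96 is an en dash. A minimal sketch (the sample bytes are hypothetical, not from the actual file):

```python
raw = b"Paris \x96 Diderot"  # cp1252 bytes: 0x96 is an en dash

try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason      # 'invalid start byte'

text = raw.decode("cp1252")  # decodes cleanly: 'Paris – Diderot'
```

This is why the same file decodes fine with a Latin-family codec but blows up under UTF-8.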
Flask code:
@app.route('/process_file', methods=['POST'])
def process_file():
    # Run checks on the file
    if 'file' not in flask.request.files or not flask.request.files['file'].filename:
        return flask.jsonify({'result':'False', 'message':'no files selected'})
        return flask.redirect(url_for('home'))
    file = flask.request.files['file']
    filename = secure_filename(file.filename)
    if not allowed_file(file.filename):
        return flask.jsonify({'result':'False', 'message':'Must be TXT file!'})
        return flask.redirect(url_for('home'))
    # Stream file and check that places exist
    contents = csv.DictReader(file.read().decode('utf-8').splitlines(), delimiter='\t')
    check_places, json_data = places_exist(contents)
    if check_places is False:
        return flask.jsonify({'result':'False', 'message':'There is an affiliation missing from your Place list. Please re-try.'})
        return flask.redirect(url_for('home'))
    flask.session['filename'] = json_data
    return flask.jsonify({'result':'True'})
Update:
When I run uchardet {file.tsv} (where file.tsv is the test file with the special characters), the output is ISO-8859-9.
Update 2:
Here's my attempt at using csv.Sniffer() on a test file with special characters. But I'm not quite sure how to translate this code to work with a file in memory.
import csv
sniff_range = 4096
delimiters = ';\t,'
infile_name = 'unicode.txt'
sniffer = csv.Sniffer()
with open(infile_name, 'r') as infile:
    # Determine dialect
    dialect = sniffer.sniff(
        infile.read(sniff_range), delimiters=delimiters
    )
    infile.seek(0)
    # Sniff for header
    has_header = sniffer.has_header(infile.read(sniff_range))
    infile.seek(0)
    reader = csv.reader(infile, dialect)
    for line in reader:
        print(line)
output:
['Department of Pathology', 'Lariboisière Hospital', 'APHP and Paris Diderot University', 'Sorbonne Paris']
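For the in-memory case, the same sniffing logic works if the decoded text is wrapped in io.StringIO, which behaves like an open text file (supports read() and seek()). A sketch with hypothetical sample data:

```python
import csv
import io

# Hypothetical decoded upload contents (two consistent tab-delimited rows)
data = (
    "Department of Pathology\tLariboisière Hospital\tAPHP\n"
    "Service de Chirurgie\tHôpital Saint-Louis\tAPHP\n"
)

buf = io.StringIO(data)          # file-like wrapper around the in-memory string
sniffer = csv.Sniffer()
dialect = sniffer.sniff(buf.read(4096), delimiters=";\t,")
buf.seek(0)                      # rewind before re-reading
reader = csv.reader(buf, dialect)
rows = list(reader)
```

Note this only solves delimiter detection; the bytes must already be decoded to str before wrapping, so encoding detection is still a separate step.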
Question: How can I modify my csv.DictReader
code to handle these special characters (keeping in mind I can only read the file into memory)?
Update 3:
My question is different from the alleged dupe because I'm trying to figure out the encoding of a file stored in memory, which makes things trickier. I'm trying to implement the following method in my process_file Flask route to determine the encoding, where file in this case is a Flask file storage object (file = flask.request.files['file']). But when I try to print the lines within contents, I get nothing.
file = flask.request.files['file']
result = chardet.detect(file.read())
charenc = result['encoding']
contents = csv.DictReader(file.read().decode(charenc).splitlines(), delimiter='\t')
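The likely cause of the empty result: the first file.read() (inside chardet.detect) consumes the whole stream, so the second file.read() returns an empty bytes object. This can be reproduced with a plain io.BytesIO stand-in for the upload (a sketch):

```python
import io

file = io.BytesIO(b"col1\tcol2\nfoo\tbar\n")

first = file.read()    # consumes the entire stream
second = file.read()   # position is now at the end, so this is empty
file.seek(0)           # rewind to the start
again = file.read()    # now the full contents are readable again
```

So a file.seek(0) between the detection read and the decoding read is required.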
Upvotes: 0
Views: 819
Reputation: 55699
This version of your code successfully decodes and prints the file for me.
@app.route('/process_file', methods=['POST'])
def process_file():
    # Run checks on the file
    file = flask.request.files['file']
    result = chardet.detect(file.read())
    charenc = result['encoding']
    file.seek(0)
    # Stream file and check that places exist
    reader = csv.DictReader(file.read().decode(charenc).splitlines())
    for row in reader:
        print(row)
    return flask.jsonify({'result': charenc})
Upvotes: 1