Reputation: 69
i'm using postman to send an excel file which i am reading in tornado.
self.request.files['1'][0]['body'].decode()
here if i send .csv than, the above code works.
if i send .xlsx file than i am stuck with this error.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte
request.files will fetch the file but the type would be byte. so to convert byte to str i've used decode(), which works only for .csv and not for .xlsx
i tried decode('utf-8') but still no luck.
i've tried searching but didn't find any issue mentioning 0x87 problem?
Upvotes: 0
Views: 1625
Reputation: 11
I faced the same issue and this worked for me.
import io
df = pd.read_excel(io.BytesIO(self.request.files['1'][0]['body']))
Upvotes: 1
Reputation: 21854
The reason is that the .xlsx
file has a different encoding, not utf-8
. You'll need to use the original encoding to decode the file.
There's no guaranteed way of finding out the encoding of a file programmatically. I'm guessing you're making this application for general users and so you will keep encountering files with different and unexpected encodings.
A good way to deal with this is by trying to decode using multiple encodings, in case one fails. Example:
encodings = ['utf-8', 'iso-8859-1', 'windows-1251', 'windows-1252']
for encoding in encodings:
try:
decoded_file = self.request.files['1'][0]['body'].decode(encoding)
except UnicodeDecodeError:
# this will run when the current encoding fails
# just ignore the error and try the next one
pass
else:
# this will run when an encoding passes
# break the loop
# it is also a good idea to re-encode the
# decoded files to utf-8 for your purpose
decoded_file = decoded_file.encode("utf8")
break
else:
# this will run when the for loop ends
# without successfully decoding the file
# now you can return an error message
# to the user asking them to change
# the file encoding and re upload
self.write("Error: Unidentified file encoding. Re-upload with UTF-8 encoding")
return
# when the program reaches here, it means
# you have successfully decoded the file
# and you can access it from `decoded_file` variable
Here's a list of some common encodings: What is the most common encoding of each language?
Upvotes: 1