Reputation: 11
I have upgraded pandas package, new version is : 1.4.2 and xlrd package, new version is 2.0.1
Now, when I read python file with the following command :
import pandas as pd pd.read_excel('myfile.xlsx')
I got the following warning: UserWarning: Workbook contains no default style, apply openpyxl's default
And my result is :
foo | bar |
---|---|
_x0031_01259 | COMMUNAUTE_x0020_DE_x0020_COMMUNES_x0020_FIER_x0020_ET_x0020_USSES |
While it should be :
foo | bar |
---|---|
101259 | COMMUNAUTE DE COMMUNES FIER ET USSES |
So in some columns, 1 is replaced by x0031, 2 is replaced by x0033, space is replaced by x0020, etc.
I tried to add engine parameter set to openpyxl but same warning message and same dataframe result.
Before Pandas package upgrade, I already had the problem but with engine parameter set to xlrd, it was working (but I had a warning saying that newest version won't support xlrd)
Any idea how to read correctly the file?
Upvotes: 1
Views: 836
Reputation: 1054
xlrd
package is not used for .xlsx
files in pandas
.
As you can read in pandas docs, for .xlsx
it is using openpyxl
that does not handle reading xlsx
properly in some specific scenarios, causing error you have described - check this bug report
It has been working for you before, because xlrd
used to support .xlsx
files, but since version 2.0.0 it supports only "classic" .xls
files.
As a workaround, you can use xlrd
code snippet from 1.x.x version in pandas, however you have to ensure your columns does not include strings like _x0000_
that you want to preserve:
def unescape(s,
subber=re.compile(r"_x[0-9A-Fa-f]{4,4}_", re.UNICODE).sub,
repl=lambda mobj: chr(int(mobj.group(0)[2:6], 16))):
if "_" in s:
return subber(repl, s)
return s
df["column"] = df["column"].apply(unescape)
Upvotes: 1