HelloSmacl
HelloSmacl

Reputation: 11

read_excel function changes some characters to unicode ones since pandas package upgrade

I have upgraded pandas package, new version is : 1.4.2 and xlrd package, new version is 2.0.1

Now, when I read python file with the following command :

import pandas as pd pd.read_excel('myfile.xlsx')

I got the following warning: UserWarning: Workbook contains no default style, apply openpyxl's default

And my result is :

foo bar
_x0031_01259 COMMUNAUTE_x0020_DE_x0020_COMMUNES_x0020_FIER_x0020_ET_x0020_USSES

While it should be :

foo bar
101259 COMMUNAUTE DE COMMUNES FIER ET USSES

So in some columns, 1 is replaced by x0031, 2 is replaced by x0033, space is replaced by x0020, etc.

I tried to add engine parameter set to openpyxl but same warning message and same dataframe result.

Before Pandas package upgrade, I already had the problem but with engine parameter set to xlrd, it was working (but I had a warning saying that newest version won't support xlrd)

Any idea how to read correctly the file?

Upvotes: 1

Views: 836

Answers (1)

Maciej Majewski
Maciej Majewski

Reputation: 1054

xlrd package is not used for .xlsx files in pandas.

As you can read in pandas docs, for .xlsx it is using openpyxl that does not handle reading xlsx properly in some specific scenarios, causing error you have described - check this bug report

It has been working for you before, because xlrd used to support .xlsx files, but since version 2.0.0 it supports only "classic" .xls files.

As a workaround, you can use xlrd code snippet from 1.x.x version in pandas, however you have to ensure your columns does not include strings like _x0000_ that you want to preserve:

def unescape(s,
             subber=re.compile(r"_x[0-9A-Fa-f]{4,4}_", re.UNICODE).sub,
             repl=lambda mobj: chr(int(mobj.group(0)[2:6], 16))):
    if "_" in s:
        return subber(repl, s)

    return s

df["column"] = df["column"].apply(unescape)

Upvotes: 1

Related Questions