Reputation: 167

Pandas read_excel function ignoring dtype

I'm trying to read an excel file with pd.read_excel(). The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.

Example of the excel file

I've tried to specify the dtype or the converters arguments to no avail.

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

As you can see the Date column is not treated as str...

Same thing when using converters

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

I have also tried to use other engine but the result is always the same.

The dtype argument seems to work as expected when reading a csv though

What am I doing wrong here ??

Edit: I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.

Upvotes: 9

Answers (4)

Cimbali

Reputation: 11395

The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.

This means that, whatever you do with the dtype= argument, the data will be loaded as a date, and then converted to string, giving you the result you see:

>>> pd.read_excel('test.xlsx').head()
        Date      Time            Datetime
0 2020-03-08  10:00:00 2020-03-08 10:00:00
1 2020-03-09  11:00:00 2020-03-09 11:00:00
2 2020-03-10  12:00:00 2020-03-10 12:00:00
3 2020-03-11  13:00:00 2020-03-11 13:00:00
4 2020-03-12  14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date        datetime64[ns]
Time                object
Datetime    datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
                  Date      Time             Datetime
0  2020-03-08 00:00:00  10:00:00  2020-03-08 10:00:00
1  2020-03-09 00:00:00  11:00:00  2020-03-09 11:00:00
2  2020-03-10 00:00:00  12:00:00  2020-03-10 12:00:00
3  2020-03-11 00:00:00  13:00:00  2020-03-11 13:00:00
4  2020-03-12 00:00:00  14:00:00  2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date        string
Time        string
Datetime    string
dtype: object

Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()

That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:

load data
re-apply formatting

There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:

import datetime
import openpyxl
import re

date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}

time_corresp = {
    'hh': '%h',
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')

    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)

    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)

    return cell.value.strftime(dfmt + tfmt)

Which you can then use as follows:

>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'

(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)

_{* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.}

Upvotes: 4

R. Marolahy

Reputation: 1586

Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).

And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).

All dtypes can now be converted to StringDtype

You just need to specify dtype="string" when loading your data with pandas:

>>df = pd.read_excel('xls_test.xlsx', dtype="string")
>>df.dtypes
Date    string
Time    string
dtype: object

Upvotes: 1

Tamil Selvan

Reputation: 1749

here is a simple solution, even if you apply the "str" in a dtype it will return as an object only. Use the below code to read the columns as string Dtype.

df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})

To understand more about the Pandas String Dtype use the link below,

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Let me know if you have any issues on that !!

Upvotes: 3

Ossi H.

Reputation: 84

The issue just like the other comments say is most likely a bug

Although not ideal, but you could always do something like this?

import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str}) 
# this line can be then simplified to : 
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)

                     Date        Time
0   "2020-03-08 00:00:00"  "10:00:00"
1   "2020-03-09 00:00:00"  "11:00:00"
2   "2020-03-10 00:00:00"  "12:00:00"
3   "2020-03-11 00:00:00"  "13:00:00"
4   "2020-03-12 00:00:00"  "14:00:00"
5   "2020-03-13 00:00:00"  "15:00:00"
6   "2020-03-14 00:00:00"  "16:00:00"
7   "2020-03-15 00:00:00"  "17:00:00"
8   "2020-03-16 00:00:00"  "18:00:00"
9   "2020-03-17 00:00:00"  "19:00:00"
10  "2020-03-18 00:00:00"  "20:00:00"
11  "2020-03-19 00:00:00"  "21:00:00"
object
object

Upvotes: 1

Pandas read_excel function ignoring dtype

Answers (4)

Related Questions