TooLateMate
TooLateMate

Reputation: 167

Pandas read_excel function ignoring dtype

I'm trying to read an excel file with pd.read_excel(). The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.

Example of the excel file

Example of the excel file

I've tried to specify the dtype or the converters arguments to no avail.

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object
df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

As you can see the Date column is not treated as str...

Same thing when using converters

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object
df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

I have also tried to use other engine but the result is always the same.

The dtype argument seems to work as expected when reading a csv though

What am I doing wrong here ??

Edit: I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.

Upvotes: 9

Views: 8004

Answers (4)

Cimbali
Cimbali

Reputation: 11395

The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.

This means that, whatever you do with the dtype= argument, the data will be loaded as a date, and then converted to string, giving you the result you see:

>>> pd.read_excel('test.xlsx').head()
        Date      Time            Datetime
0 2020-03-08  10:00:00 2020-03-08 10:00:00
1 2020-03-09  11:00:00 2020-03-09 11:00:00
2 2020-03-10  12:00:00 2020-03-10 12:00:00
3 2020-03-11  13:00:00 2020-03-11 13:00:00
4 2020-03-12  14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date        datetime64[ns]
Time                object
Datetime    datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
                  Date      Time             Datetime
0  2020-03-08 00:00:00  10:00:00  2020-03-08 10:00:00
1  2020-03-09 00:00:00  11:00:00  2020-03-09 11:00:00
2  2020-03-10 00:00:00  12:00:00  2020-03-10 12:00:00
3  2020-03-11 00:00:00  13:00:00  2020-03-11 13:00:00
4  2020-03-12 00:00:00  14:00:00  2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date        string
Time        string
Datetime    string
dtype: object

Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()


That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:

  1. load data
  2. re-apply formatting

There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:

import datetime
import openpyxl
import re

date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}

time_corresp = {
    'hh': '%h',
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')

    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)

    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)

    return cell.value.strftime(dfmt + tfmt)

Which you can then use as follows:

>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'

(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)


* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.

Upvotes: 4

R. Marolahy
R. Marolahy

Reputation: 1586

Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).

And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).

All dtypes can now be converted to StringDtype

You just need to specify dtype="string" when loading your data with pandas:

>>df = pd.read_excel('xls_test.xlsx', dtype="string")
>>df.dtypes
Date    string
Time    string
dtype: object

Upvotes: 1

Tamil Selvan
Tamil Selvan

Reputation: 1749

here is a simple solution, even if you apply the "str" in a dtype it will return as an object only. Use the below code to read the columns as string Dtype.

df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})

To understand more about the Pandas String Dtype use the link below,

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Let me know if you have any issues on that !!

Upvotes: 3

Ossi H.
Ossi H.

Reputation: 84

The issue just like the other comments say is most likely a bug

Although not ideal, but you could always do something like this?

import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str}) 
# this line can be then simplified to : 
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)
                     Date        Time
0   "2020-03-08 00:00:00"  "10:00:00"
1   "2020-03-09 00:00:00"  "11:00:00"
2   "2020-03-10 00:00:00"  "12:00:00"
3   "2020-03-11 00:00:00"  "13:00:00"
4   "2020-03-12 00:00:00"  "14:00:00"
5   "2020-03-13 00:00:00"  "15:00:00"
6   "2020-03-14 00:00:00"  "16:00:00"
7   "2020-03-15 00:00:00"  "17:00:00"
8   "2020-03-16 00:00:00"  "18:00:00"
9   "2020-03-17 00:00:00"  "19:00:00"
10  "2020-03-18 00:00:00"  "20:00:00"
11  "2020-03-19 00:00:00"  "21:00:00"
object
object

Upvotes: 1

Related Questions