Reputation: 167
I'm trying to read an excel file with pd.read_excel(). The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.
Example of the excel file
I've tried to specify the dtype or the converters arguments to no avail.
df = pd.read_excel('xls_test.xlsx',
dtype={'Date':str,'Time':str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
As you can see the Date column is not treated as str...
Same thing when using converters
df = pd.read_excel('xls_test.xlsx',
converters={'Date':str,'Time':str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
I have also tried to use other engine but the result is always the same.
The dtype argument seems to work as expected when reading a csv though
What am I doing wrong here ??
Edit: I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.
Upvotes: 9
Views: 8004
Reputation: 11395
The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.
This means that, whatever you do with the dtype=
argument, the data will be loaded as a date, and then converted to string, giving you the result you see:
>>> pd.read_excel('test.xlsx').head()
Date Time Datetime
0 2020-03-08 10:00:00 2020-03-08 10:00:00
1 2020-03-09 11:00:00 2020-03-09 11:00:00
2 2020-03-10 12:00:00 2020-03-10 12:00:00
3 2020-03-11 13:00:00 2020-03-11 13:00:00
4 2020-03-12 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date datetime64[ns]
Time object
Datetime datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
Date Time Datetime
0 2020-03-08 00:00:00 10:00:00 2020-03-08 10:00:00
1 2020-03-09 00:00:00 11:00:00 2020-03-09 11:00:00
2 2020-03-10 00:00:00 12:00:00 2020-03-10 12:00:00
3 2020-03-11 00:00:00 13:00:00 2020-03-11 13:00:00
4 2020-03-12 00:00:00 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date string
Time string
Datetime string
dtype: object
Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()
That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:
There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:
import datetime
import openpyxl
import re
date_corresp = {
'dd': '%d',
'mm': '%m',
'yy': '%y',
'yyyy': '%Y',
}
time_corresp = {
'hh': '%h',
'mm': '%M',
'ss': '%S',
}
def datecell_as_formatted(cell):
if isinstance(cell.value, datetime.time):
dfmt, tfmt = '', cell.number_format
elif isinstance(cell.value, (datetime.date, datetime.datetime)):
dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
else:
raise ValueError('Not a datetime cell')
for fmt in re.split(r'\W', dfmt):
if fmt:
dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)
for fmt in re.split(r'\W', tfmt):
if fmt:
tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)
return cell.value.strftime(dfmt + tfmt)
Which you can then use as follows:
>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'
(You can also complete the _corresp
dictionaries with more date/time formatting items if they are incomplete)
* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.
Upvotes: 4
Reputation: 1586
Since version 1.0.0
, there are two ways to store text data in pandas: object
or StringDtype
(source).
And since version 1.1.0
, StringDtype now works in all situations where astype(str)
or dtype=str
work (source).
All dtypes can now be converted to StringDtype
You just need to specify dtype="string"
when loading your data with pandas:
>>df = pd.read_excel('xls_test.xlsx', dtype="string")
>>df.dtypes
Date string
Time string
dtype: object
Upvotes: 1
Reputation: 1749
here is a simple solution, even if you apply the "str" in a dtype it will return as an object only. Use the below code to read the columns as string Dtype.
df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})
To understand more about the Pandas String Dtype use the link below,
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
Let me know if you have any issues on that !!
Upvotes: 3
Reputation: 84
The issue just like the other comments say is most likely a bug
Although not ideal, but you could always do something like this?
import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str})
# this line can be then simplified to :
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)
Date Time
0 "2020-03-08 00:00:00" "10:00:00"
1 "2020-03-09 00:00:00" "11:00:00"
2 "2020-03-10 00:00:00" "12:00:00"
3 "2020-03-11 00:00:00" "13:00:00"
4 "2020-03-12 00:00:00" "14:00:00"
5 "2020-03-13 00:00:00" "15:00:00"
6 "2020-03-14 00:00:00" "16:00:00"
7 "2020-03-15 00:00:00" "17:00:00"
8 "2020-03-16 00:00:00" "18:00:00"
9 "2020-03-17 00:00:00" "19:00:00"
10 "2020-03-18 00:00:00" "20:00:00"
11 "2020-03-19 00:00:00" "21:00:00"
object
object
Upvotes: 1