Reputation: 141
Python 2.7.10
Tried pandas 0.17.1 -- function read_excel
Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()
When using pandas in Python is it possible to read excel files (formats: xls, xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime
or timestamp
types?
If this is not possible using pandas can someone suggest an alternate method/library to read xls, xlsx files and leave date column values as strings?
For the pandas solution attempts the df.info()
and resulting date column types are shown below:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117 entries, 0 to 116
Columns: 176 entries, Mine to Index
dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
memory usage: 161.8+ KB
>>> type(df['Start Date'][0])
Out[6]: pandas.tslib.Timestamp
>>> type(df['End Date'][0])
Out[7]: pandas.tslib.Timestamp
Attempt/Approach 1:
def read_as_dataframe(filename, ext):
import pandas as pd
if ext in ('xls', 'xlsx'):
# problem: date columns auto converted to datetime.datetime or timestamp!
df = pd.read_excel(filename) # unwanted - date columns converted!
return df, name, ext
Attempt/Approach 2:
import pandas as pd
# import datetime as datetime
# parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
parse_date = lambda x: x
elif ext in ('xls', 'xlsx', ):
df = pd.read_excel(filename, parse_dates=False)
date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
# problem: date columns auto converted to datetime.datetime or timestamp!
df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)
And have also tried pyexcel library but it does the same auto-magic convert behavior:
Attempt/Approach 3:
import pyexcel as pe
import pyexcel.ext.xls
import pyexcel.ext.xlsx
t0 = time.time()
if ext == 'xlsx':
records = pe.get_records(file_name=filename)
for record in records:
print("start date = %s (type=%s), end date = %s (type=%s)" %
(record['Start Date'],
str(type(record['Start Date'])),
record['End Date'],
str(type(record['End Date'])))
)
Upvotes: 14
Views: 27178
Reputation: 1933
The docs of pd.read_excel
say under the parameter parse_dates
:
If you don't want to parse some cells as date just change their type in Excel to "Text".
This means one could preprocess the cells in Excel to the "Text" number format before using pd.read_excel
. However, switching to "Text" number format alone changes the dates to numbers in Excel, e.g., 1900-12-31 becomes 366 (the cryptic number is the number of days since 1900-01-01 (it is 366 here because Excel falsely presumes that 1900 was a leap year)). This means that the instruction of just changing the number format to "Text" isn’t useful.
Instead, one has to iterate over the date columns and set the Excel VBA .Text
as value and change the number format to "Text" to return the string how it displays, which can be done with xlwings:
from pathlib import Path
import xlwings as xw
import pandas as pd
path = Path(r"foo/test.xlsx")
with xw.App(visible=False):
wb = xw.Book(path)
ws = wb.sheets[0]
for cell in ws.used_range:
temp_str = cell.api.Text
cell.number_format = "@" # @ is the number format code of Excel for the "Text" number format.
cell.value = temp_str
wb.save(path.with_stem(f"{path.stem}_interim"))
# Then the Excel file can be read with xlwings.
df = ws.used_range.options(pd.DataFrame, index=False).value
wb.close()
# Or it can be read with pandas.
df = pd.read_excel(path.with_stem(f"{path.stem}_interim"))
print(df)
Note the prerequisites for the installation of xlwings.
Unfortunately the .Text
property can only be used on a range of one cell, which is the reason why a loop over the range of cells in question is required. Note that I used used_range in this example, which means that all cells with contents on the worksheet are formatted to the "Text" number format.
Upvotes: 0
Reputation: 99
I have had the same issue when extracting Dates from excel. My columns had the MM/DD/YYYY format, but reading it in Pyhton and converting to CSV the format was then converted to MM/DD/YYYY 00:00:00. Fortunately figured out a solution. Using
excel = pd.read_excel(file, dtype=object)
Instead of
excel = pd.read_excel(file, dtype=str)
Works.
I have no idea why and would appreciate some experienced python programmer to explain.
Upvotes: 1
Reputation: 206
I tried saving the file in a CSV UTF-8 format
(manually) and used pd.read_csv()
and worked fine.
I tried a bunch of things to figure the same thing with read_excel
. Did not work anything for me. So, I am guessing read_excel
is probably updating your string in a datetime object which you can not control.
Upvotes: 6
Reputation: 2757
I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:
def undate(x):
if pd.isnull(x):
return x
try:
return x.strftime('%d/%m/%Y')
except AttributeError:
return x
except Exception:
raise
for i in list_of_possible_date_columns:
df[i] = df[i].apply(undate)
Upvotes: 9
Reputation: 135
pandas.read_excel(xlsx, sheet, converters={'Date': str})
df['Date'][0].strftime('%Y/%m/%d')
Upvotes: 5