MattB
MattB

Reputation: 141

Leave dates as strings using read_excel function from pandas in python

Python 2.7.10
Tried pandas 0.17.1 -- function read_excel
Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()

When using pandas in Python is it possible to read excel files (formats: xls, xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime or timestamp types?

If this is not possible using pandas can someone suggest an alternate method/library to read xls, xlsx files and leave date column values as strings?

For the pandas solution attempts the df.info() and resulting date column types are shown below:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117 entries, 0 to 116
Columns: 176 entries, Mine to Index
dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
memory usage: 161.8+ KB
>>> type(df['Start Date'][0])
Out[6]: pandas.tslib.Timestamp
>>> type(df['End Date'][0])
Out[7]: pandas.tslib.Timestamp

Attempt/Approach 1:

def read_as_dataframe(filename, ext):
   import pandas as pd
   if ext in ('xls', 'xlsx'):
      # problem: date columns auto converted to datetime.datetime or timestamp!
      df = pd.read_excel(filename) # unwanted - date columns converted!

   return df, name, ext

Attempt/Approach 2:

import pandas as pd
# import datetime as datetime
# parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
parse_date = lambda x: x
elif ext in ('xls', 'xlsx', ):
    df = pd.read_excel(filename, parse_dates=False)
    date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
    # problem: date columns auto converted to datetime.datetime or timestamp!
    df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)

And have also tried pyexcel library but it does the same auto-magic convert behavior:

Attempt/Approach 3:

import pyexcel as pe
import pyexcel.ext.xls
import pyexcel.ext.xlsx

t0 = time.time()
if ext == 'xlsx':
    records = pe.get_records(file_name=filename)
    for record in records:
        print("start date = %s (type=%s), end date = %s (type=%s)" %
              (record['Start Date'],
               str(type(record['Start Date'])),
               record['End Date'],
               str(type(record['End Date'])))
              )

Upvotes: 14

Views: 27178

Answers (5)

mouwsy
mouwsy

Reputation: 1933

The docs of pd.read_excel say under the parameter parse_dates:

If you don't want to parse some cells as date just change their type in Excel to "Text".

This means one could preprocess the cells in Excel to the "Text" number format before using pd.read_excel. However, switching to "Text" number format alone changes the dates to numbers in Excel, e.g., 1900-12-31 becomes 366 (the cryptic number is the number of days since 1900-01-01 (it is 366 here because Excel falsely presumes that 1900 was a leap year)). This means that the instruction of just changing the number format to "Text" isn’t useful.

Instead, one has to iterate over the date columns and set the Excel VBA .Text as value and change the number format to "Text" to return the string how it displays, which can be done with xlwings:

from pathlib import Path
import xlwings as xw
import pandas as pd

path = Path(r"foo/test.xlsx")

with xw.App(visible=False):
    wb = xw.Book(path)
    ws = wb.sheets[0]
    for cell in ws.used_range:
        temp_str = cell.api.Text
        cell.number_format = "@"  # @ is the number format code of Excel for the "Text" number format.
        cell.value = temp_str
    wb.save(path.with_stem(f"{path.stem}_interim"))
    # Then the Excel file can be read with xlwings.
    df = ws.used_range.options(pd.DataFrame, index=False).value
    wb.close()

# Or it can be read with pandas.
df = pd.read_excel(path.with_stem(f"{path.stem}_interim"))

print(df)

Note the prerequisites for the installation of xlwings.
Unfortunately the .Text property can only be used on a range of one cell, which is the reason why a loop over the range of cells in question is required. Note that I used used_range in this example, which means that all cells with contents on the worksheet are formatted to the "Text" number format.

Upvotes: 0

Oliveira
Oliveira

Reputation: 99

I have had the same issue when extracting Dates from excel. My columns had the MM/DD/YYYY format, but reading it in Pyhton and converting to CSV the format was then converted to MM/DD/YYYY 00:00:00. Fortunately figured out a solution. Using

excel = pd.read_excel(file, dtype=object)

Instead of

excel = pd.read_excel(file, dtype=str)

Works.

I have no idea why and would appreciate some experienced python programmer to explain.

Upvotes: 1

Sriram Veturi
Sriram Veturi

Reputation: 206

I tried saving the file in a CSV UTF-8 format (manually) and used pd.read_csv() and worked fine.

I tried a bunch of things to figure the same thing with read_excel. Did not work anything for me. So, I am guessing read_excel is probably updating your string in a datetime object which you can not control.

Upvotes: 6

Nolan Conaway
Nolan Conaway

Reputation: 2757

I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:

def undate(x):
    if pd.isnull(x):
        return x
    try:
        return x.strftime('%d/%m/%Y')
    except AttributeError:
        return x
    except Exception:
        raise

for i in list_of_possible_date_columns:
    df[i] = df[i].apply(undate)

Upvotes: 9

YDD9
YDD9

Reputation: 135

  • Using converters{'Date': str} option inside the pandas.read_excel which helps. pandas.read_excel(xlsx, sheet, converters={'Date': str})
  • you can try convert your timestamp back to the original format
    df['Date'][0].strftime('%Y/%m/%d')

Upvotes: 5

Related Questions