John
John

Reputation: 1273

python pandas incorrectly reading excel dates

I have an excel file with dates formatted as such:

22.10.07 16:00
22.10.07 17:00
22.10.07 18:00
22.10.07 19:00

After using the parse method of pandas to read the data, the dates are read almost correctly:

In [55]: nts.data['Tid'][10000:10005]
Out[55]:
10000    2007-10-22 15:59:59.997905
10001    2007-10-22 16:59:59.997904
10002    2007-10-22 17:59:59.997904
10003    2007-10-22 18:59:59.997904

What do I need to do to either a) get it to work correctly, or b) is there a trick to fix this easily? (e.g. some kind of 'round' function for datetime)

Upvotes: 4

Views: 6538

Answers (2)

Ben
Ben

Reputation: 1688

I encountered the same issue and got around it by not parsing the dates using Pandas, but rather applying my own function (shown below) to the relevant column(s) of the dataframe:

def ExcelDateToDateTime(xlDate):
    epoch = dt.datetime(1899, 12, 30)
    delta = dt.timedelta(hours = round(xlDate*24))
    return epoch + delta

df = pd.DataFrame.from_csv('path')

df['Date'] = df['Date'].apply(ExcelDateToDateTime)

Note: This will ignore any time granularity below the hour level, but that's all I need, and it looks from your example that this could be the case for you too.

Upvotes: 3

Raffaele
Raffaele

Reputation: 20885

Excel serializes datetimes with a ddddd.tttttt format, where the d part is an integer number representing the offset from a reference day (like Dec 31st, 1899), and the t part is a fraction between 0.0 and 1.0 which stands for the part of the day at the given time (for example at 12:00 it's 0.5, at 18:00 it's 0.75 and so on).

I asked you to upload a file with sample data. .xlsx files are really ZIP archives which contains your XML-serialized worksheets. This are the dates I extracted from the relevant column. Excerpt:

38961.666666666628
38961.708333333292
38961.749999999956

When you try to manually deserialize you get the same datetimes as Panda. Unfortunately, the way Excel stores times makes it impossible to represent some values exactly, so you have to round them for displaying purposes. I'm not sure if rounded data is needed for analysis, though.

This is the script I used to test that deserialized datetimes are really the same ones as Panda:

from datetime import date, datetime, time, timedelta
from urllib2 import urlopen

def deserialize(text):
  tokens = text.split(".")
  date_tok = tokens[0]
  time_tok = tokens[1] if len(tokens) == 2 else "0"
  d = date(1899, 12, 31) + timedelta(int(date_tok))
  t = time(*helper(float("0." + time_tok), (24, 60, 60, 1000000)))
  return datetime.combine(d, t)

def helper(factor, units):
  result = list()
  for unit  in units:
    value, factor = divmod(factor * unit, 1)
    result.append(int(value))
  return result

url = "https://gist.github.com/RaffaeleSgarro/877d7449bd19722b44cb/raw/" \
      "45d5f0b339d4abf3359fe673fcd2976374ed61b8/dates.txt"

for line in urlopen(url):
  print deserialize(line)

Upvotes: 2

Related Questions