Paul Fleming

Reputation: 536

Read large csv file with non-standard timestamp column

I'm trying to read a csv file with data like this:

data.csv:

time,value
12/31/2015 7:00:00 PM,275.9271
12/31/2015 7:00:02 PM,278.559
12/31/2015 7:00:10 PM,277.5949
12/31/2015 7:00:12 PM,270.0564

My problem is that the file is large. If I read it while telling the reader to parse the time column, like this:

df = pd.read_csv('data.csv', parse_dates = ['time'])

It is very slow. If on the other hand I read the data, and then try to convert the time column, it is also slow:

df['time'] = pd.to_datetime(df.time)

After looking around the web, the solution seems to be to specify the format of the time, but this fails because the months and hours are missing zero-padding:

pd.to_datetime(df.time,format='%b/%d/%Y %H:%M:%S %p')

ValueError: time data '10/26/2016 7:00:01 PM' does not match format '%b/%d/%Y %H:%M:%S %p' (match)

Is there a way to make a format string to match non-standard data? Or else to speed up one of the auto-recognition methods? Thank you!

Upvotes: 2

Views: 320

Answers (1)

G_M

Reputation: 3382

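I think the problem is the directives rather than the zero-padding: %b should be %m, and since the times use a 12-hour clock with AM/PM, %H should be %I. strptime accepts unpadded numbers such as 7 for these fields.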

%b Month as locale’s abbreviated name.

%m Month as a zero-padded decimal number.

>>> from datetime import datetime
>>> d = datetime.strptime('12/31/2015 7:00:00 PM', '%b/%d/%Y %H:%M:%S %p')
ValueError: time data '12/31/2015 7:00:00 PM' does not match format '%b/%d/%Y %H:%M:%S %p'
>>> d = datetime.strptime('12/31/2015 7:00:00 PM', '%m/%d/%Y %I:%M:%S %p')
>>> d
datetime.datetime(2015, 12, 31, 19, 0)
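
A small sketch of why %I matters here, using only the standard library. It assumes an English or C locale so that "PM" is recognized by %p; the variable names are just for illustration. With %H the string still parses, but %p is then ignored and the hour stays in the morning:

```python
from datetime import datetime

stamp = '12/31/2015 7:00:00 PM'

# %I (12-hour clock) lets %p shift the hour: 7 PM becomes 19.
with_12h = datetime.strptime(stamp, '%m/%d/%Y %I:%M:%S %p')

# %H (24-hour clock) also parses without error, but %p only affects
# the result when the hour was parsed with %I, so the hour stays 7.
with_24h = datetime.strptime(stamp, '%m/%d/%Y %H:%M:%S %p')

print(with_12h.hour, with_24h.hour)
```

So swapping only %b for %m would stop the ValueError but would silently give wrong afternoon times.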

I find myself going back to this site frequently: http://strftime.org/

Or just the Python docs: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

Example

import pandas as pd

df = pd.DataFrame({'time': {0: '12/31/2015 7:00:00 PM',
                            1: '12/31/2015 7:00:02 PM',
                            2: '12/31/2015 7:00:10 PM',
                            3: '12/31/2015 7:00:12 PM'},
                   'value': {0: 275.9271, 1: 278.559, 2: 277.5949, 3: 270.0564}})

df['time'] = pd.to_datetime(df['time'], format='%m/%d/%Y %I:%M:%S %p')

#                  time     value
# 0 2015-12-31 19:00:00  275.9271
# 1 2015-12-31 19:00:02  278.5590
# 2 2015-12-31 19:00:10  277.5949
# 3 2015-12-31 19:00:12  270.0564
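
For the original file-based case, something like the following should work: read the column as plain strings first, then convert it once with the explicit format so pandas does not have to guess the layout per value. This is only a sketch; the in-memory csv_text stands in for data.csv, and I have not timed it on a large file:

```python
import io

import pandas as pd

# Stand-in for data.csv from the question (an assumption for the demo).
csv_text = """time,value
12/31/2015 7:00:00 PM,275.9271
12/31/2015 7:00:02 PM,278.559
12/31/2015 7:00:10 PM,277.5949
12/31/2015 7:00:12 PM,270.0564
"""

# Read without parse_dates so the time column stays as strings.
df = pd.read_csv(io.StringIO(csv_text))

# Convert in one pass with an explicit format string.
df['time'] = pd.to_datetime(df['time'], format='%m/%d/%Y %I:%M:%S %p')
```

With a real file, replace io.StringIO(csv_text) with 'data.csv'.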

Upvotes: 2
