Legend_Ari
Legend_Ari

Reputation: 321

Pandas DateTimeIndex

I need to have a DateTimeIndex for my dataframe. Problem is my source file. The Date header is Date(dd-mm-yy), but the actual date data has the format dd:mm:yy (24:06:1970) etc. I have lots of source files so manually changing the header would be tedious and not good programing practice. How would one go about addressing this from within python?

Perhaps creating a copy of the source file opening it, searching for the date header, changing it and then closing it? I'm new to python so I'm not exactly sure if this is the best way to go about doing things and if it is, how do I implement such code?

Currently I have this;

df = pd.read_csv('test.csv',
                    skiprows = 4,
                    parse_dates = {'stamp':[0,1]},
                    na_values = 'NaN',                    
                    index_col = 'stamp'
                 )

Where column 0 is the date column in question and column 1 is the time column. I don't get any error messages just erroneous data.

Sorry, I should have added a snippet of the csv file in question.I've now provided it below;

some stuff I dont want 
some stuff I dont want 
some stuff I dont want 
some stuff I dont want 
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000

Upvotes: 1

Views: 796

Answers (2)

U2EF1
U2EF1

Reputation: 13261

Your dates are really weird, you should just fix them all. If you really can't fix them on disk for some reason, I guess you can do it inline:

import re
from StringIO import StringIO
s = open('test.csv').read()
def rep(m):
    return '%s-%s-%sT' % (m.group('YY'), m.group('mm'), m.group('dd'))
s = re.sub(r'^(?P<dd>\d\d):(?P<mm>\d\d):(?P<YY>\d{4}),', rep, s, flags=re.M)
df = pd.read_csv(StringIO(s), skiprows=5, index_col=0, 
                 names=['time', 'Julian_Day', 'col_2', 'col_3'])

This just takes the weird dates like 01:07:2013,05:40:41 and formats them ISO style like 2013-07-01T05:40:41. Then pandas can treat them normally. Bear in mind that these are going to be in UTC.

Upvotes: 0

Laurence Billingham
Laurence Billingham

Reputation: 803

I think the main problem is that the header line Date(dd-mm-yy), Time(hh:mm:ss), Julian_Day only appears to specify some of the column names. Pandas cannot infer what to do with the other data.

Try skipping the file's column name line and passing pandas a list of column names and defining your own date_parser:

def my_parser(date, time):
    import datetime
    DATE_FORMAT = '%d:%m:%Y'
    TIME_FORMAT = '%H:%M:%S'
    date = datetime.datetime.strptime(date, DATE_FORMAT)
    time_weird_date = datetime.datetime.strptime(time, TIME_FORMAT)
    return datetime.datetime.combine(date, time_weird_date.time())

import pandas as pd
from cStringIO import StringIO

data = """\
some stuff I dont want 
some stuff I dont want 
some stuff I dont want 
some stuff I dont want 
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000
"""  

pd.read_csv(StringIO(data), skiprows=5, index_col=0, 
            parse_dates={'datetime':['date', 'time']}, 
            names=['date','time', 'Julian_Day', 'col_2', 'col_3'],
            date_parser=my_parser)

This should give you what you want.

As you said you are new to python, I should add that the from cStringIO import StringIO, data = """..., and StringIO(data) parts are just so I could include the data directly in this answer in a runnable form. You just need pd.read_csv(my_data_filename, ... in your own code

Upvotes: 2

Related Questions