Reputation: 321
I need to have a DateTimeIndex for my dataframe. Problem is my source file. The Date header is Date(dd-mm-yy), but the actual date data has the format dd:mm:yy (24:06:1970) etc. I have lots of source files so manually changing the header would be tedious and not good programing practice. How would one go about addressing this from within python?
Perhaps creating a copy of the source file opening it, searching for the date header, changing it and then closing it? I'm new to python so I'm not exactly sure if this is the best way to go about doing things and if it is, how do I implement such code?
Currently I have this;
df = pd.read_csv('test.csv',
skiprows = 4,
parse_dates = {'stamp':[0,1]},
na_values = 'NaN',
index_col = 'stamp'
)
Where column 0 is the date column in question and column 1 is the time column. I don't get any error messages just erroneous data.
Sorry, I should have added a snippet of the csv file in question.I've now provided it below;
some stuff I dont want
some stuff I dont want
some stuff I dont want
some stuff I dont want
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000
Upvotes: 1
Views: 796
Reputation: 13261
Your dates are really weird, you should just fix them all. If you really can't fix them on disk for some reason, I guess you can do it inline:
import re
from StringIO import StringIO
s = open('test.csv').read()
def rep(m):
return '%s-%s-%sT' % (m.group('YY'), m.group('mm'), m.group('dd'))
s = re.sub(r'^(?P<dd>\d\d):(?P<mm>\d\d):(?P<YY>\d{4}),', rep, s, flags=re.M)
df = pd.read_csv(StringIO(s), skiprows=5, index_col=0,
names=['time', 'Julian_Day', 'col_2', 'col_3'])
This just takes the weird dates like 01:07:2013,05:40:41
and formats them ISO style like 2013-07-01T05:40:41
. Then pandas can treat them normally. Bear in mind that these are going to be in UTC.
Upvotes: 0
Reputation: 803
I think the main problem is that the header line Date(dd-mm-yy), Time(hh:mm:ss), Julian_Day
only appears to specify some of the column names. Pandas
cannot infer what to do with the other data.
Try skipping the file's column name line and passing pandas
a list of column names
and defining your own date_parser
:
def my_parser(date, time):
import datetime
DATE_FORMAT = '%d:%m:%Y'
TIME_FORMAT = '%H:%M:%S'
date = datetime.datetime.strptime(date, DATE_FORMAT)
time_weird_date = datetime.datetime.strptime(time, TIME_FORMAT)
return datetime.datetime.combine(date, time_weird_date.time())
import pandas as pd
from cStringIO import StringIO
data = """\
some stuff I dont want
some stuff I dont want
some stuff I dont want
some stuff I dont want
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000
"""
pd.read_csv(StringIO(data), skiprows=5, index_col=0,
parse_dates={'datetime':['date', 'time']},
names=['date','time', 'Julian_Day', 'col_2', 'col_3'],
date_parser=my_parser)
This should give you what you want.
As you said you are new to python, I should add that the from cStringIO import StringIO
, data = """...
, and StringIO(data)
parts are just so I could include the data directly in this answer in a runnable form. You just need pd.read_csv(my_data_filename, ...
in your own code
Upvotes: 2