Rachel
Rachel

Reputation: 1967

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calory intake. It has 4 columns: cal, day, month, year. It looks like this:

cal        month  year  day
3668.4333      1  2002   10
3652.2498      1  2002   11
3647.8662      1  2002   12
3646.6843      1  2002   13
...
3661.9414      2  2003   14

# data types
cal      float64
month      int64
year       int64
day        int64

I am trying to do some simple time series analysis. I hence would like to parse month, year, and day to a single column. I tried the following using pandas:

import pandas as pd
from pandas import Series, DataFrame, Panel

data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)

My questions are: (1) How do I parse the data and (2) define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here) - but I can't make it work so far.

Upvotes: 2

Views: 1029

Answers (1)

jezrael
jezrael

Reputation: 862511

You can use parameter parse_dates where define column names in list in read_csv:

import pandas as pd
import numpy as np
import io

temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])

print (df)
  year_month_day        cal
0     2002-01-10  3668.4333
1     2002-01-11  3652.2498
2     2002-01-12  3647.8662
3     2002-01-13  3646.6843
4     2003-02-14  3661.9414

print (df.dtypes)
year_month_day    datetime64[ns]
cal                      float64
dtype: object

Then you can rename column:

df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
        date        cal
0 2002-01-10  3668.4333
1 2002-01-11  3652.2498
2 2002-01-12  3647.8662
3 2002-01-13  3646.6843
4 2003-02-14  3661.9414

Or better is pass dictionary with new column name to parse_dates:

df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})

print (df)
       dates        cal
0 2002-01-10  3668.4333
1 2002-01-11  3652.2498
2 2002-01-12  3647.8662
3 2002-01-13  3646.6843
4 2003-02-14  3661.9414

Upvotes: 2

Related Questions